From: Tim Delaney
To: wxjmfauth@gmail.com
Cc: Python-List
Newsgroups: comp.lang.python
Subject: Re: Blog "about python 3"
Date: Wed, 8 Jan 2014 09:38:51 +1100

On 8 January 2014 00:34, <wxjmfauth@gmail.com> wrote:

> Point 2: This Flexible String Representation does no
> "effectuate" any memory optimization. It only succeeds
> to do the opposite of what a corrrect usage of utf*
> do.

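UTF-8 is a variable-width encoding that uses less memory to encode code points with lower numerical values, on a per-character basis: a code point <= U+007F takes a single byte, one <= U+07FF takes two bytes, one <= U+FFFF takes three, and anything above that, up to the Unicode maximum of U+10FFFF, takes four. (The original UTF-8 design allowed sequences of up to 6 bytes for code points >= U+4000000, but RFC 3629 limits UTF-8 to the Unicode range, so 4 bytes is the maximum in practice.)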
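
A small check of the per-character widths, as a sketch only: the four sample code points are my own picks, one from each range, and Python's encoder follows RFC 3629, so it never emits more than 4 bytes for a single code point.

```python
# Encoded length in bytes of individual code points under UTF-8.
# The sample code points are arbitrary illustrations, one per width range.
samples = {
    "U+0041": "\u0041",       # <= U+007F (ASCII)
    "U+00E9": "\u00e9",       # <= U+07FF
    "U+20AC": "\u20ac",       # <= U+FFFF
    "U+1F600": "\U0001f600",  # above U+FFFF
}

utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

for name, length in utf8_lengths.items():
    print(name, length)
```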

The FSR (Flexible String Representation) is a variable-width memory structure that sizes every character to fit the code point with the highest numerical value in the string: if all code points in the string are <= U+00FF a single byte will be used per character; if all code points are <= U+FFFF two bytes will be used per character; and in all other cases 4 bytes will be used per character.
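
The width rule can be modelled in a few lines. This is only a sketch of the rule as described here, not CPython's implementation; `fsr_width` and `fsr_payload` are hypothetical helper names, and the object header is ignored.

```python
def fsr_width(s):
    """Bytes per character the rule above would pick for s: 1, 2 or 4."""
    highest = max(map(ord, s), default=0)  # an empty string has no code points
    if highest <= 0xFF:
        return 1
    if highest <= 0xFFFF:
        return 2
    return 4


def fsr_payload(s):
    """Bytes needed for the character data alone (no object header)."""
    return len(s) * fsr_width(s)


for sample in ("abc", "caf\u00e9", "\u0101bc", "\U0001f600bc"):
    print(ascii(sample), fsr_width(sample), fsr_payload(sample))
```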

In terms of memory usage the difference is that UTF-8 varies its width per-character, whereas the FSR varies its width per-string. For any particular string, UTF-8 may well result in using less memory than the FSR, but in other (quite common) cases the FSR will use less memory than UTF-8 e.g. if the string contains only code points <= U+00FF, but some are between U+0080 and U+00FF (inclusive).
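
To make the per-character versus per-string difference concrete, here is a rough comparison of the character data only. `payload_sizes` is a hypothetical helper and the two sample strings are my own; real objects carry a header on top of these figures.

```python
def payload_sizes(s):
    """Return (utf8_bytes, fsr_bytes) for the character data only."""
    highest = max(map(ord, s), default=0)
    width = 1 if highest <= 0xFF else 2 if highest <= 0xFFFF else 4
    return len(s.encode("utf-8")), len(s) * width


latin1_text = "\u00e9" * 1000        # every character lies in U+0080..U+00FF
mostly_ascii = "a" * 999 + "\u20ac"  # a single character above U+00FF

# UTF-8 pays 2 bytes for each Latin-1 character; the FSR pays 1.
print(payload_sizes(latin1_text))
# One wide character widens the whole FSR string; UTF-8 pays only for that one.
print(payload_sizes(mostly_ascii))
```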

In most cases the FSR uses the same or less memory than earlier versions of Python 3 and correctly handles all code points (just like UTF-8). In the cases where the FSR uses more memory than previously, the previous behaviour was incorrect.
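
One way to see what "incorrect" can mean, assuming the earlier behaviour in question is that of the narrow builds before Python 3.3, which stored strings as 16-bit units: a code point above U+FFFF then occupied two units, and length and indexing counted units rather than characters. A narrow build cannot be run here, so the sketch below only counts UTF-16 code units on a current Python; `utf16_code_units` is a hypothetical helper.

```python
def utf16_code_units(s):
    """Number of 16-bit code units needed to store s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2


s = "a\U0001f600b"  # three code points, one of them above U+FFFF

code_points = len(s)               # what Python 3.3+ reports
code_units = utf16_code_units(s)   # what a 16-bit representation has to store

print(code_points, code_units)
```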

No matter which representation is used, there will be a certain amount of overhead (which is the majority of what most of your examples have shown). Here are examples which demonstrate cases where UTF-8 uses less memory, cases where the FSR uses less memory, and cases where they use the same amount of memory (accounting for the minimum amount of overhead required for each).

Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>>
>>> fsr = u""
>>> utf8 = fsr.encode("utf-8")
>>> min_fsr_overhead = sys.getsizeof(fsr)
>>> min_utf8_overhead = sys.getsizeof(utf8)
>>> min_fsr_overhead
49
>>> min_utf8_overhead
33
>>>
>>> fsr = u"\u0001" * 1000
>>> utf8 = fsr.encode("utf-8")
>>> sys.getsizeof(fsr) - min_fsr_overhead
1000
>>> sys.getsizeof(utf8) - min_utf8_overhead
1000
>>>
>>> fsr = u"\u0081" * 1000
>>> utf8 = fsr.encode("utf-8")
>>> sys.getsizeof(fsr) - min_fsr_overhead
1024
>>> sys.getsizeof(utf8) - min_utf8_overhead
2000
>>>
>>> fsr = u"\u0001\u0081" * 1000
>>> utf8 = fsr.encode("utf-8")
>>> sys.getsizeof(fsr) - min_fsr_overhead
2024
>>> sys.getsizeof(utf8) - min_utf8_overhead
3000
>>>
>>> fsr = u"\u0101" * 1000
>>> utf8 = fsr.encode("utf-8")
>>> sys.getsizeof(fsr) - min_fsr_overhead
2025
>>> sys.getsizeof(utf8) - min_utf8_overhead
2000
>>>
>>> fsr = u"\u0101\u0081" * 1000
>>> utf8 = fsr.encode("utf-8")
>>> sys.getsizeof(fsr) - min_fsr_overhead
4025
>>> sys.getsizeof(utf8) - min_utf8_overhead
4000
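
For anyone who wants to repeat the measurement, the same session can be written as a script. This is a sketch assuming CPython 3.3 or later; `net_sizes` and the case labels are my own names, and the exact figures on the str side depend on the CPython version and platform, so they may not match the session above byte for byte.

```python
import sys


def net_sizes(text):
    """Return (fsr_bytes, utf8_bytes) with the empty-object overhead removed."""
    fsr_net = sys.getsizeof(text) - sys.getsizeof("")
    utf8_net = sys.getsizeof(text.encode("utf-8")) - sys.getsizeof(b"")
    return fsr_net, utf8_net


# The same five strings as in the interactive session.
cases = {
    "ASCII only": "\u0001" * 1000,
    "Latin-1 only": "\u0081" * 1000,
    "ASCII + Latin-1": "\u0001\u0081" * 1000,
    "two-byte only": "\u0101" * 1000,
    "two-byte + Latin-1": "\u0101\u0081" * 1000,
}

results = {label: net_sizes(text) for label, text in cases.items()}

for label, (fsr_net, utf8_net) in results.items():
    print("%-20s FSR %5d   UTF-8 %5d" % (label, fsr_net, utf8_net))
```

The few extra bytes on the str side for non-ASCII strings come from a larger object header, not from the character data.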

Indexing a character in UTF-8 is O(N) - you have to traverse the string up to the character being indexed. Indexing a character in the FSR is O(1). In all cases the FSR has better performance characteristics for indexing and slicing than UTF-8.
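
The difference in cost shows up as soon as you try to locate a character by hand. The two functions below are hypothetical helpers written for illustration (non-negative indexes only): the first has to scan the bytes and count code point starts, the second is a single multiplication.

```python
def utf8_offset(data, index):
    """Byte offset of the index-th code point in UTF-8 bytes: an O(N) scan."""
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; anything else starts a code point.
        if (byte & 0xC0) != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError("code point index out of range")


def fixed_width_offset(index, width):
    """Byte offset of the index-th code point in a fixed-width buffer: O(1)."""
    return index * width


data = "a\u00e9\u20acz".encode("utf-8")  # 1 + 2 + 3 + 1 bytes
print([utf8_offset(data, i) for i in range(4)])
print([fixed_width_offset(i, 2) for i in range(4)])
```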

There are tradeoffs with both UTF-8 and the FSR. The Python developers decided the priorities for Unicode handling in Python were:

1. Correctness
  a. all code points must be handled correctly;
  b. it must not be possible to obtain part of a code point (e.g. the first byte only of a multi-byte code point);

2. No change in the Big O characteristics of string operations e.g. indexing must remain O(1);

3. Reduced memory use in most cases.
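
Criterion 1b is easy to illustrate with raw UTF-8 bytes, where a slice can land in the middle of a code point. A minimal sketch; the sample string is my own:

```python
text = "\u20ac100"              # euro sign followed by ASCII digits
encoded = text.encode("utf-8")  # the euro sign occupies the first 3 bytes

first_byte = encoded[:1]        # byte-level slice: only part of a code point
try:
    first_byte.decode("utf-8")
    partial_is_valid = True
except UnicodeDecodeError:
    partial_is_valid = False

first_char = text[:1]           # str-level slice: always a whole code point

print(partial_is_valid)
print(first_char == "\u20ac")
```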

It is impossible for UTF-8 to meet both criteria 1b and 2 without additional auxiliary data (which uses more memory and increases complexity of the implementation). The FSR meets all 3 criteria.
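
To give an idea of what such auxiliary data might look like, here is a sketch of UTF-8 storage with a table of code point start offsets. `IndexedUTF8` is a hypothetical class written for this post, not something Python provides, and a plain list is used for the table purely for readability.

```python
class IndexedUTF8:
    """UTF-8 bytes plus a table of code point start offsets (hypothetical)."""

    def __init__(self, text):
        self.data = text.encode("utf-8")
        # One entry per code point plus a final sentinel: this is the extra memory.
        self.starts = [i for i, b in enumerate(self.data) if (b & 0xC0) != 0x80]
        self.starts.append(len(self.data))

    def __len__(self):
        return len(self.starts) - 1

    def __getitem__(self, index):
        # Only non-negative integer indexes are handled in this sketch.
        if not 0 <= index < len(self):
            raise IndexError("code point index out of range")
        start, end = self.starts[index], self.starts[index + 1]
        return self.data[start:end].decode("utf-8")


s = IndexedUTF8("a\u00e9\u20acz")
print(len(s), ascii(s[2]))
```

Indexing is now a table lookup and always returns a whole code point, but the table costs at least one offset per character on top of the encoded bytes, which is the memory and complexity tradeoff mentioned above.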

Tim Delaney