Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!feeder.erje.net!eu.feeder.erje.net!xlned.com!feeder3.xlned.com!newsfeed.xs4all.nl!newsfeed3.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <2fbf4f89-caaa-4fab-8d7e-ff7ef84029a2@googlegroups.com>
References: <mailman.4724.1388432539.18130.python-list@python.org> <52c1dc4c$0$2877$c3e8da3$76491128@news.astraweb.com> <l9soij$ck6$1@ger.gmane.org> <52C1F5EC.3020808@stoneleaf.us> <mailman.4748.1388478161.18130.python-list@python.org> <52c29416$0$29987$c3e8da3$5496439d@news.astraweb.com> <mailman.4753.1388499265.18130.python-list@python.org> <roy-4C0A2F.10412731122013@news.panix.com> <mailman.4797.1388684229.18130.python-list@python.org> <52c6415c$0$29972$c3e8da3$5496439d@news.astraweb.com> <la5u8j$hqr$1@ger.gmane.org> <52C6AD00.5050000@chamonix.reportlab.co.uk> <la7btn$u5$1@ger.gmane.org> <mailman.4882.1388808283.18130.python-list@python.org> <roy-1820F1.08551004012014@news.panix.com> <mailman.4905.1388845063.18130.python-list@python.org> <3519f85e-0909-4f5a-9a6e-09b6fd4c312d@googlegroups.com> <mailman.4915.1388875627.18130.python-list@python.org> <d8438ee4-1429-4855-9d78-b833f4f2748f@googlegroups.com> <mailman.4976.1388960067.18130.python-list@python.org> <2fbf4f89-caaa-4fab-8d7e-ff7ef84029a2@googlegroups.com>
Date: Wed, 8 Jan 2014 09:38:51 +1100
Subject: Re: Blog "about python 3"
From: Tim Delaney <timothy.c.delaney@gmail.com>
To: wxjmfauth@gmail.com
Content-Type: multipart/alternative; boundary=047d7b5d5fba6847ac04ef690a23
Cc: Python-List <python-list@python.org>
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.5152.1389134341.18130.python-list@python.org>
Lines: 211
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:63451

--047d7b5d5fba6847ac04ef690a23
Content-Type: text/plain; charset=UTF-8

On 8 January 2014 00:34, <wxjmfauth@gmail.com> wrote:

>
> Point 2: This Flexible String Representation does no
> "effectuate" any memory optimization. It only succeeds
> to do the opposite of what a corrrect usage of utf*
> do.
>

UTF-8 is a variable-width encoding that uses less memory to encode code
points with lower numerical values, on a per-character basis e.g. if a code
point <= U+007F it will use a single byte to encode; if <= U+07FF two bytes
will be used; if <= U+FFFF three bytes; and four bytes for code points from
U+10000 up to U+10FFFF, the highest code point Unicode defines.
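
As a quick check of where those width boundaries fall, here is a minimal
sketch. It assumes Python 3 and UTF-8 as currently defined (RFC 3629, which
stops at U+10FFFF); utf8_width is just a name made up for illustration.

```python
def utf8_width(code_point):
    """Return how many bytes UTF-8 needs to encode a single code point."""
    return len(chr(code_point).encode("utf-8"))


# Code points on either side of each width boundary.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    print("U+{:04X}: {} byte(s)".format(cp, utf8_width(cp)))
```

The width goes up by one byte at U+0080, U+0800 and U+10000, and never
exceeds four.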

FSR is a variable-width memory structure that uses the width of the code
point with the highest numerical value in the string e.g. if all code
points in the string are <= U+00FF a single byte will be used per
character; if all code points are <= U+FFFF two bytes will be used per
character; and in all other cases 4 bytes will be used per character.

In terms of memory usage the difference is that UTF-8 varies its width
per-character, whereas the FSR varies its width per-string. For any
particular string, UTF-8 may well result in using less memory than the FSR,
but in other (quite common) cases the FSR will use less memory than UTF-8
e.g. if the string contains only code points <= U+00FF, but some
are between U+0080 and U+00FF (inclusive).
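
The per-string versus per-character difference can be sketched without
touching the interpreter internals. This is only a model of the rule
described above, not CPython's actual code; fsr_width and payload_sizes are
hypothetical helper names, and object header overhead is deliberately
ignored.

```python
def fsr_width(s):
    """Bytes per character under the per-string rule described above."""
    highest = max(map(ord, s), default=0)  # highest code point in the string
    if highest <= 0xFF:
        return 1
    if highest <= 0xFFFF:
        return 2
    return 4


def payload_sizes(s):
    """Return (fsr_bytes, utf8_bytes) for the character data only."""
    return len(s) * fsr_width(s), len(s.encode("utf-8"))


print(payload_sizes("\u0081" * 1000))             # FSR smaller
print(payload_sizes("\u0101" * 1000))             # same size
print(payload_sizes("a" * 999 + "\U0001F600"))    # UTF-8 smaller
```

The last case is the FSR's worst case: a single code point above U+FFFF
forces four bytes for every character in the string. Figures measured with
sys.getsizeof will differ from these by the fixed header, which is slightly
larger for non-ASCII strings.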

In most cases the FSR uses the same or less memory than earlier versions of
Python 3 and correctly handles all code points (just like UTF-8). In the
cases where the FSR uses more memory than previously, the previous
behaviour was incorrect: narrow builds stored code points above U+FFFF as
surrogate pairs, so len() and indexing could return the wrong result.

No matter which representation is used, there will be a certain amount of
overhead (which is the majority of what most of your examples have shown).
Here are examples which demonstrate cases where UTF-8 uses less memory,
cases where the FSR uses less memory, and cases where they use the same
amount of memory (accounting for the minimum amount of overhead required
for each).

Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>>
>>> fsr = u""
>>> utf8 = fsr.encode("utf-8")
>>> min_fsr_overhead = sys.getsizeof(fsr)
>>> min_utf8_overhead = sys.getsizeof(utf8)
>>> min_fsr_overhead
49
>>> min_utf8_overhead
33
>>>
>>> fsr = u"\u0001" * 1000
>>> utf8 = fsr.encode("utf-8")
>>> sys.getsizeof(fsr) - min_fsr_overhead
1000
>>> sys.getsizeof(utf8) - min_utf8_overhead
1000
>>>
>>> fsr = u"\u0081" * 1000
>>> utf8 = fsr.encode("utf-8")
>>> sys.getsizeof(fsr) - min_fsr_overhead
1024
>>> sys.getsizeof(utf8) - min_utf8_overhead
2000
>>>
>>> fsr = u"\u0001\u0081" * 1000
>>> utf8 = fsr.encode("utf-8")
>>> sys.getsizeof(fsr) - min_fsr_overhead
2024
>>> sys.getsizeof(utf8) - min_utf8_overhead
3000
>>>
>>> fsr = u"\u0101" * 1000
>>> utf8 = fsr.encode("utf-8")
>>> sys.getsizeof(fsr) - min_fsr_overhead
2025
>>> sys.getsizeof(utf8) - min_utf8_overhead
2000
>>>
>>> fsr = u"\u0101\u0081" * 1000
>>> utf8 = fsr.encode("utf-8")
>>> sys.getsizeof(fsr) - min_fsr_overhead
4025
>>> sys.getsizeof(utf8) - min_utf8_overhead
4000

Indexing a character in UTF-8 is O(N) - you have to traverse the string
up to the character being indexed. Indexing a character in the FSR is O(1).
In all cases the FSR has better performance characteristics for indexing
and slicing than UTF-8.

There are tradeoffs with both UTF-8 and the FSR. The Python developers
decided the priorities for Unicode handling in Python were:

1. Correctness
  a. all code points must be handled correctly;
  b. it must not be possible to obtain part of a code point (e.g. the
     first byte only of a multi-byte code point);

2. No change in the Big O characteristics of string operations e.g.
indexing must remain O(1);

3. Reduced memory use in most cases.

It is impossible for UTF-8 to meet both criteria 1b and 2 without
additional auxiliary data (which uses more memory and increases complexity
of the implementation). The FSR meets all 3 criteria.
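
For illustration, one possible form of that auxiliary data is a table of
byte offsets, one per code point. This is only a sketch of the idea under
that assumption, not something Python does or a scheme anyone proposed in
this thread; IndexedUtf8 is a hypothetical class name.

```python
from array import array


class IndexedUtf8:
    """UTF-8 bytes plus an offset table so that indexing is O(1)."""

    def __init__(self, text):
        self.data = text.encode("utf-8")
        # offsets[i] is the byte position where code point i starts;
        # one extra entry at the end marks the end of the data.
        self.offsets = array("L")
        pos = 0
        for ch in text:
            self.offsets.append(pos)
            pos += len(ch.encode("utf-8"))
        self.offsets.append(pos)

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        start, end = self.offsets[i], self.offsets[i + 1]
        return self.data[start:end].decode("utf-8")


s = IndexedUtf8("a\u0081\u0101\U0001F600z")
print(len(s), s[3] == "\U0001F600")
print(s.offsets.itemsize, "bytes per table entry on this platform")
```

Each table entry here costs at least four bytes, on top of the UTF-8 data
itself, so this naive version always uses more memory than the FSR's worst
case of four bytes per character. A sparser table would use less memory, at
the price of a short scan on each lookup and a more complex implementation.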

Tim Delaney

--047d7b5d5fba6847ac04ef690a23--