Path: csiph.com!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!newsfeed.xs4all.nl!newsfeed3.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
Date: Sun, 28 Jul 2013 21:55:50 +0200
From: Antoon Pardon <antoon.pardon@rece.vub.ac.be>
User-Agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130704 Icedove/17.0.7
MIME-Version: 1.0
To: python-list@python.org
Subject: Re: FSR and unicode compliance - was Re: RE Module Performance
References: <mailman.4618.1373613834.3114.python-list@python.org> <571a6dfe-fd66-42cf-92fc-8b97cbe6e9e4@googlegroups.com> <51DFDE65.5040001@Gmail.com> <CAN1F8qUFP3uX57HhiiUPaYqO3h_HiT8Q_YD=vCYky3EAWsdE7Q@mail.gmail.com> <mailman.4666.1373670835.3114.python-list@python.org> <4f1067f6-bc99-42ad-9166-37fb228b90e8@googlegroups.com> <mailman.5094.1374759404.3114.python-list@python.org> <51f14395$0$29971$c3e8da3$5496439d@news.astraweb.com> <mailman.5106.1374766576.3114.python-list@python.org> <51f15e03$0$29971$c3e8da3$5496439d@news.astraweb.com> <mailman.5127.1374808181.3114.python-list@python.org> <8203e802-9dc5-44c5-9547-6e1947ee224b@googlegroups.com> <mailman.5160.1374890711.3114.python-list@python.org> <f4bb2528-930e-4c0a-820e-66f00ac2b5b6@googlegroups.com> <mailman.5191.1375026785.3114.python-list@python.org> <c5eed93b-bfa1-44fe-9a8f-67a7d9380b20@googlegroups.com>
In-Reply-To: <c5eed93b-bfa1-44fe-9a8f-67a7d9380b20@googlegroups.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 8bit
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.5204.1375041352.3114.python-list@python.org>
Lines: 154
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:51404

On 28-07-13 21:23, wxjmfauth@gmail.com wrote:
> On Sunday, 28 July 2013 17:52:47 UTC+2, Michael Torrie wrote:
>> On 07/27/2013 12:21 PM, wxjmfauth@gmail.com wrote:
>>> Good point. FSR, nice tool for those who wish to teach
>>> Unicode. It is not every day, one has such an opportunity.
>>
>> I had a long e-mail composed, but decided to chop it down, but still too
>> long.  so I ditched a lot of the context, which jmf also seems to do.
>> Apologies.
>>
>> 1. FSR *is* UTF-32 so it is as unicode compliant as UTF-32, since UTF-32
>> is an official encoding.  FSR only differs from UTF-32 in that the
>> padding zeros are stripped off such that it is stored in the most
>> compact form that can handle all the characters in string, which is
>> always known at string creation time.  Now you can argue many things,
>> but to say FSR is not unicode compliant is quite a stretch!  What
>> unicode entities or characters cannot be stored in strings using FSR?
>> What sequences of bytes in FSR result in invalid Unicode entities?
>>
>> 2. strings in Python *never change*.  They are immutable.  The +
>> operator always copies strings character by character into a new string
>> object, even if Python had used UTF-8 internally.  If you're doing a lot
>> of string concatenations, perhaps you're using the wrong data type.  A
>> byte buffer might be better for you, where you can stuff utf-8 sequences
>> into it to your heart's content.
>>
>> 3. UTF-8 and UTF-16 encodings, being variable width encodings, mean that
>> slicing a string would be very very slow, and that's unacceptable for
>> the use cases of python strings.  I'm assuming you understand big O
>> notation, as you talk of experience in many languages over the years.
>> FSR and UTF-32 both are O(1) for slicing and lookups.  UTF-8, 16 and any
>> variable-width encoding are always O(n).  A lot slower!
>>
>> 4. Unicode is, well, unicode.  You seem to hop all over the place from
>> talking about code points to bytes to bits, using them all
>> interchangeably.  And now you seem to be claiming that a particular byte
>> encoding standard is by definition unicode (UTF-8).  Or at least that's
>> how it sounds.  And also claim FSR is not compliant with unicode
>> standards, which appears to me to be completely false.
>>
>> Is my understanding of these things wrong?
>
> ------
>
> Compare these (a BDFL example, where I'm using a non-ascii char)
>
> Py 3.2 (narrow build)
>
> >>> timeit.timeit("a = 'hundred'; 'x' in a")
> 0.09897159682121348
> >>> timeit.timeit("a = 'hundre€'; 'x' in a")
> 0.09079501961732461
> >>> sys.getsizeof('d')
> 32
> >>> sys.getsizeof('€')
> 32
> >>> sys.getsizeof('dd')
> 34
> >>> sys.getsizeof('d€')
> 34
>
>
> Py3.3
>
> >>> timeit.timeit("a = 'hundred'; 'x' in a")
> 0.12183182740848858
> >>> timeit.timeit("a = 'hundre€'; 'x' in a")
> 0.2365732969632326
> >>> sys.getsizeof('d')
> 26
> >>> sys.getsizeof('€')
> 40
> >>> sys.getsizeof('dd')
> 27
> >>> sys.getsizeof('d€')
> 42
>
> Tell me which one seems to be more "unicode compliant"?

Can't tell; you give no relevant information on which one could
decide this question.
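
A minimal sketch of the kind of information that would be relevant,
assuming "compliant" is taken to mean that every Unicode scalar value
can be stored in a str and survives a round trip through the standard
UTF encodings. The helper names (sample_scalar_values, round_trips) are
mine and purely illustrative, and the check only samples the code space:

```python
# Behaviour-level check, independent of how the interpreter stores strings.
# Assumption: one str item is one code point (true for 3.3+ and wide builds).

def sample_scalar_values(step=257):
    """Yield a sample of code points, skipping the surrogate range."""
    for cp in range(0, 0x110000, step):
        if not 0xD800 <= cp <= 0xDFFF:
            yield cp
    yield 0x10FFFF  # always include the last code point


def round_trips(cp):
    """True if chr(cp) is one item and is preserved by UTF-8/16/32."""
    ch = chr(cp)
    if len(ch) != 1 or ord(ch) != cp:
        return False
    return all(ch.encode(enc).decode(enc) == ch
               for enc in ("utf-8", "utf-16", "utf-32"))


failures = [cp for cp in sample_scalar_values() if not round_trips(cp)]
print("code points that failed:", failures)
```

Timings and getsizeof figures play no part in a check like this.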

> The goal of Unicode is to handle every char "equally".

Not to this level of detail; you are looking at irrelevant
implementation details.

> Now, the problem: memory. Do not forget that à la "FSR"
> mechanism for a non-ascii user is *irrelevant*. As
> soon as one uses one single non-ascii, your ascii feature
> is lost. (That is why we have all these dedicated coding
> schemes, utfs included).

So? Why should that trouble me? As far as I understand,
whether I have an ASCII string or not is totally irrelevant
to the application programmer. Within the application I
just process strings and let the programming environment
keep track of these details in a transparent way. That only
changes when you start looking at things like getsizeof, which
gives you implementation details that are mostly irrelevant
in deciding whether the behaviour is compliant or not.
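
To illustrate what I mean by transparent, here is a small sketch that
applies the same operations to three strings that differ only in their
widest character. The sample strings are my own; the assumption is a
Python where one str item is one code point (3.3, or a wide build):

```python
# The same operations on strings whose widest character differs.
samples = {
    "ascii": "hundred",
    "bmp": "hundre\u20ac",         # ends with the euro sign
    "astral": "hundre\U00010010",  # ends with a non-BMP character
}

for name, s in samples.items():
    # Length, indexing, slicing and membership do not depend on
    # how many bytes the interpreter uses per character internally.
    print(name, len(s), hex(ord(s[-1])), s[:6] == "hundre", "x" in s)
```

Nothing in that loop needs to know which internal representation was
chosen for each string.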

> >>> sys.getsizeof('abc' * 1000 + 'z')
> 3026
> >>> sys.getsizeof('abc' * 1000 + '\U00010010')
> 12044
>
> A little secret: the larger a repertoire of characters
> is, the more bits you need.
> Secret #2: you cannot escape from this.

And totally unimportant for deciding compliance.
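
For what it is worth, the per-character cost behind those two numbers
can be estimated directly. This is only a sketch, assuming CPython 3.3
or later, where sys.getsizeof reports the size of the string object
itself; bytes_per_char is a name I made up for the occasion:

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate storage per character by growing a one-character string.

    Assumption: CPython. Taking the difference of two sizes cancels
    the fixed header overhead of the string object.
    """
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) / n


for label, ch in [("ASCII", "d"), ("BMP", "\u20ac"), ("non-BMP", "\U00010010")]:
    print(label, bytes_per_char(ch))
```

The figure grows with the widest character in the string, which is a
statement about memory use and says nothing about whether the strings
behave correctly.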

-- 
Antoon Pardon