Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder4.news.weretis.net!rt.uk.eu.org!newsfeed.xs4all.nl!newsfeed3.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
Date: Thu, 28 Mar 2013 21:50:14 +0000
From: MRAB <python@mrabarnett.plus.com>
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:17.0) Gecko/20130307 Thunderbird/17.0.4
MIME-Version: 1.0
To: python-list@python.org
Subject: Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]
References: <mailman.3703.1364248275.2939.python-list@python.org> <kit1kg$g2u$1@ger.gmane.org> <nad-98F0A4.17004226032013@news.gmane.org> <kitdqr$4m4$2@ger.gmane.org> <nad-8CB9C0.18315026032013@news.gmane.org> <mailman.3805.1364385073.2939.python-list@python.org> <5153a12d$0$29998$c3e8da3$5496439d@news.astraweb.com> <mailman.3845.1364441182.2939.python-list@python.org> <d2cc443a-e049-42ed-abc6-66b5ea600fe7@j1g2000pbq.googlegroups.com> <mailman.3860.1364451682.2939.python-list@python.org> <987c4bd9-0e5e-4387-9c78-1075a77d3c47@c6g2000yqh.googlegroups.com> <mailman.3868.1364466636.2939.python-list@python.org> <b3808ea9-03fa-4781-aefe-af428899ee9c@5g2000yqz.googlegroups.com> <mailman.3901.1364488470.2939.python-list@python.org> <7f993624-8105-4055-a268-3417e5fe21dc@g4g2000yqd.googlegroups.com> <mailman.3913.1364502948.2939.python-list@python.org> <691c604c-b643-4d66-a2ea-c5c52d603316@c6g2000yqh.googlegroups.com>
In-Reply-To: <691c604c-b643-4d66-a2ea-c5c52d603316@c6g2000yqh.googlegroups.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 8bit
Precedence: list
Reply-To: python-list@python.org
Newsgroups: comp.lang.python
Message-ID: <mailman.3916.1364507414.2939.python-list@python.org>
Lines: 102
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:42196

On 28/03/2013 21:11, jmfauth wrote:
> On 28 mar, 21:29, Benjamin Kaplan <benjamin.kap...@case.edu> wrote:
>> On Thu, Mar 28, 2013 at 10:48 AM, jmfauth <wxjmfa...@gmail.com> wrote:
>> > On 28 mar, 17:33, Ian Kelly <ian.g.ke...@gmail.com> wrote:
>> >> On Thu, Mar 28, 2013 at 7:34 AM, jmfauth <wxjmfa...@gmail.com> wrote:
>> >> > The flexible string representation takes the problem from the
>> >> > other side, it attempts to work with the characters by using
>> >> > their representations and it (can only) fails...
>>
>> >> This is false.  As I've pointed out to you before, the FSR does not
>> >> divide characters up by representation.  It divides them up by
>> >> codepoint -- more specifically, by the *bit-width* of the codepoint.
>> >> We call the internal format of the string "ASCII" or "Latin-1" or
>> >> "UCS-2" for conciseness and a point of reference, but fundamentally
>> >> all of the FSR formats are simply byte arrays of *codepoints* -- you
>> >> know, those things you keep harping on.  The major optimization
>> >> performed by the FSR is to consistently truncate the leading zero
>> >> bytes from each codepoint when it is possible to do so safely.  But
>> >> regardless of to what extent this truncation is applied, the string is
>> >> *always* internally just an array of codepoints, and the same
>> >> algorithms apply for all representations.
>>
>> > -----
>>
>> > You know, we can discuss this ad nauseam. What is important
>> > is Unicode.
>>
>> > You have transformed Python back into an ASCII-oriented product.
>>
>> > If Python had implemented Unicode correctly, there would
>> > be no difference in using an "a", "é", "€" or any character,
>> > what the narrow builds did.
>>
>> > If I am practically the only one who speaks/discusses about
>> > this, I can ensure you, this has been noticed.
>>
>> > Now, it's time to prepare the Asparagus, the "jambon cru"
>> > and a good bottle of dry white wine.
>>
>> > jmf
>>
>> You still have yet to explain how Python's string representation is
>> wrong. Just how it isn't optimal for one specific case. Here's how I
>> understand it:
>>
>> 1) Strings are sequences of stuff. Generally, we talk about strings as
>> either sequences of bytes or sequences of characters.
>>
>> 2) Unicode is a format used to represent characters. Therefore,
>> Unicode strings are character strings, not byte strings.
>>
>> 2) Encodings  are functions that map characters to bytes. They
>> typically also define an inverse function that converts from bytes
>> back to characters.
>>
>> 3) UTF-8 IS NOT UNICODE. It is an encoding- one of those functions I
>> mentioned in the previous point. It happens to be one of the five
>> standard encodings that is defined for all characters in the Unicode
>> standard (the others being the little and big endian variants of
>> UTF-16 and UTF-32).
>>
>> 4) The internal representation of a character string DOES NOT MATTER.
>> All that matters is that the API represents it as a string of
>> characters, regardless of the representation. We could implement
>> character strings by putting the Unicode code-points in binary-coded
>> decimal and it would be a Unicode character string.
>>
>> 5) The String type that .NET and Java (and unicode type in Python
>> narrow builds) use is not a character string. It is a string of
>> shorts, each of which corresponds to a UTF-16 code unit. I know this
>> is the case because in all of these, the length of "\U0001f435" is 2 even
>> though it only consists of one character.
>>
>> 6) The new string representation in Python 3.3 can successfully
>> represent all characters in the Unicode standard. The actual number of
>> bytes that each character consumes is invisible to the user.
>
> ----------
>
>
> I showed enough examples. As soon as you are using non-Latin-1 chars
> your "optimization" just became irrelevant and not only this, you
> are penalized.
>
> I'm sorry, saying Python now is just covering the whole unicode
> range is not a valuable excuse. I prefer a "correct" version with
> a narrower range of chars, especially if this range represents
> the "daily used chars".
>
> I can go a step further, if I wish to write an application for
> Western European users, I'm better served if I'm using a coding
> scheme covering all these languages/scripts. What about cp1252 [*]?
> Does this not remind you of something?
>
> Python can do better, it only succeeds in doing worse!
>
> [*] yes, I know, internally ....
>
If you're that concerned about it, why don't you modify the source code so
that the string representation chooses between only 2 bytes and 4 bytes per
codepoint, and then see whether you prefer that situation? How do
the memory usage and speed compare?
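
As a starting point for the memory half of that comparison, here is a rough
sketch that measures what an unmodified interpreter does today. It assumes
CPython 3.3 or later (other implementations may not support sys.getsizeof in
the same way), and the helper name bytes_per_codepoint is made up for this
example. It estimates the storage cost per codepoint by differencing the
sizes of two strings so that the fixed object header cancels out.

```python
import sys

def bytes_per_codepoint(sample_char, n=1000):
    """Estimate per-codepoint storage (hypothetical helper, CPython 3.3+ assumed).

    Builds two strings of n and 2*n copies of sample_char and divides the
    size difference by n, so the fixed per-object overhead cancels out.
    """
    short = sample_char * n
    longer = sample_char * (2 * n)
    return (sys.getsizeof(longer) - sys.getsizeof(short)) / n

samples = [
    ("ASCII 'a'", "a"),
    ("Latin-1 e-acute", "\xe9"),
    ("BMP euro sign", "\u20ac"),
    ("non-BMP U+1F435", "\U0001F435"),
]

for label, ch in samples:
    # len() counts codepoints, independent of the internal width.
    print(label, "len =", len(ch), "bytes/codepoint =", bytes_per_codepoint(ch))
```

This only covers memory. For speed, the timeit module with the same sample
strings would be the obvious next step, run against both the stock build and
the modified one.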