Date: Fri, 17 Aug 2012 23:30:22 -0400
From: Dave Angel <d@davea.name>
To: Ian Kelly <ian.g.kelly@gmail.com>
Subject: Re: How do I display unicode value stored in a string variable using ord()
Cc: Python <python-list@python.org>
Newsgroups: comp.lang.python

On 08/17/2012 08:21 PM, Ian Kelly wrote:
> On Aug 17, 2012 2:58 PM, "Dave Angel" <d@davea.name> wrote:
>> The internal coding described in PEP 393 has nothing to do with latin-1
>> encoding.
> It certainly does. PEP 393 provides for Unicode strings to be represented
> internally as any of Latin-1, UCS-2, or UCS-4, whichever is smallest and
> sufficient to contain the data. I understand the complaint to be that while
> the change is great for strings that happen to fit in Latin-1, it is less
> efficient than previous versions for strings that do not.

That's not the way I interpreted PEP 393.  It takes a pure Unicode
string, finds the largest code point in that string, and chooses 1, 2 or
4 bytes for every character, based on how many bits it'd take for that
largest code point.  Further, I read it to mean that only 00 bytes would
be dropped in the process; no other bytes would be changed.  I take it
as a coincidence that the 1-byte form happens to match Latin-1; Unicode's
first 256 code points were taken from Latin-1 historically, and that is
not Python's fault.  Am I reading it wrong?
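
A rough sketch of that reading, for illustration only.  This is not
CPython's actual code; the helper names pep393_width and narrow are made
up here, and big-endian byte order is picked just to make the output easy
to compare (the real implementation uses native byte order).

```python
def pep393_width(s):
    """Bytes per character that the reading above would choose for s."""
    top = max(map(ord, s), default=0)   # largest code point in the string
    if top < 0x100:
        return 1
    if top < 0x10000:
        return 2
    return 4


def narrow(s):
    """Store every code point in the chosen fixed width (hypothetical helper).

    Only leading 00 bytes are dropped; no other byte changes.
    """
    width = pep393_width(s)
    return b"".join(ord(ch).to_bytes(width, "big") for ch in s)


# The 1-byte form coincides with Latin-1 because Unicode's first
# 256 code points are the Latin-1 characters.
print(narrow("caf\u00e9") == "caf\u00e9".encode("latin-1"))
```

If the sketch is right, the Latin-1 match falls out of the code point
numbering alone; no encoding step is involved.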

I also figure this is going to be at least as space efficient as Python
3.2 for any string which has a max code point of 65535 or less (in
Windows), or 0x10FFFF or less, which is every string (in real systems).
So unless French has code points over 64k, I can't figure that anything
is lost.
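
One way to check the per-character cost, assuming CPython 3.3 or later
(where PEP 393 is in effect).  The helper name bytes_per_char is made up
for this sketch.  It measures only the growth per extra character, not
the fixed object header, and it says nothing about speed.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Extra bytes a str needs per additional copy of ch (assumes CPython 3.3+)."""
    # Subtracting the one-character string cancels the fixed header,
    # which differs between the 1-, 2- and 4-byte forms.
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n


for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    print("U+%04X" % ord(ch), bytes_per_char(ch))
```

For comparison, 3.2 spent a fixed 2 bytes per code unit on narrow builds
and 4 on wide builds, so French text, which stays well below 64k, should
not come out larger.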

I have no idea about the times involved, so I wanted a more specific
complaint.

> I don't know how much merit there is to this claim. It would seem to me
> that even in non-western locales, most strings are likely to be Latin-1 or
> even ASCII, e.g.  class and attribute and function names.
>
>

The jmfauth rant I was responding to was saying that French isn't
efficiently encoded, and that the performance of some vague operations
was somehow reduced severalfold.  I was just trying to get him to be
more specific.



-- 

DaveA