MIME-Version: 1.0
In-Reply-To: <CALwzidnyROMzPGb1vFRiQ8+8Tx0_vniNsHPUSMSRxeMqmLdp2g@mail.gmail.com>
References: <f801e06f-f7b2-4aca-b352-66856a939746@googlegroups.com> <mailman.3406.1345161591.4697.python-list@python.org> <a6c030b2-25da-47a2-97b5-1e349394d762@googlegroups.com> <mailman.3422.1345227697.4697.python-list@python.org> <253ddd61-4bb5-4f46-b58c-525e55b27558@googlegroups.com> <502EAFB2.7050405@davea.name> <CALwzidkhc+Lf=A8bMWye=Cy36f91fhcwYt8rBB52AgTE=V1HKw@mail.gmail.com> <mailman.3440.1345260650.4697.python-list@python.org> <502f15b5$0$29978$c3e8da3$5496439d@news.astraweb.com> <CALwzidnyROMzPGb1vFRiQ8+8Tx0_vniNsHPUSMSRxeMqmLdp2g@mail.gmail.com>
From: Ian Kelly <ian.g.kelly@gmail.com>
Date: Sat, 18 Aug 2012 09:18:39 -0600
Subject: Re: How do I display unicode value stored in a string variable using ord()
To: Python <python-list@python.org>
Content-Type: text/plain; charset=ISO-8859-1
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.3452.1345303152.4697.python-list@python.org>
Lines: 36
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:27297

(Resending this to the list because I previously sent it only to
Steven by mistake.  Also showing off a case where top-posting is
reasonable, since this bit requires no context. :-)

On Sat, Aug 18, 2012 at 1:41 AM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
>
> On Aug 17, 2012 10:17 PM, "Steven D'Aprano"
> <steve+comp.lang.python@pearwood.info> wrote:
>>
>> Unicode strings are not represented as Latin-1 internally. Latin-1 is a
>> byte encoding, not a unicode internal format. Perhaps you mean to say
>> that they are represented as a single byte format?
>
> They are represented as a single-byte format that happens to be equivalent
> to Latin-1, because Latin-1 is a proper subset of Unicode; every character
> representable in Latin-1 has a byte value equal to its Unicode codepoint.
> This talk of whether it's a byte encoding or a 1-byte Unicode representation
> is then just semantics. Even the PEP refers to the 1-byte representation as
> Latin-1.
>
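
To make the Latin-1 point concrete, here is a small Python 3 sketch (mine,
not anything from the PEP; the helper name fits_latin1 is made up for
illustration). It checks that each of the 256 byte values decodes under
Latin-1 to the code point with the same number, and that anything above
U+00FF does not fit:

```python
# Each byte value 0..255 decodes under Latin-1 to the code point
# with the same number, and encodes back to the same single byte.
for b in range(256):
    ch = bytes([b]).decode("latin-1")
    assert ord(ch) == b
    assert ch.encode("latin-1") == bytes([b])


def fits_latin1(s):
    """Hypothetical helper: True if every code point in s is below 256."""
    return all(ord(c) < 256 for c in s)


assert fits_latin1("caf\u00e9")      # e-acute is U+00E9, inside Latin-1
assert not fits_latin1("\u20ac")     # the euro sign is U+20AC, outside it
```
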
>>
>> >> I understand the complaint
>> >> to be that while the change is great for strings that happen to fit in
>> >> Latin-1, it is less efficient than previous versions for strings that
>> >> do not.
>> >
>> > That's not the way I interpreted PEP 393.  It takes a pure unicode
>> > string, finds the largest code point in that string, and chooses 1, 2 or
>> > 4 bytes for every character, based on how many bits it'd take for that
>> > largest code point.
>>
>> That's how I interpret it too.
>
> I don't see how this is any different from what I described. Using all 4
> bytes of the code point, you get UCS-4. Truncating to 2 bytes, you get
> UCS-2. Truncating to 1 byte, you get Latin-1.
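
The truncation claim can be checked the same way. This is another rough
sketch, assuming Python 3.4 or later and strings without lone surrogates;
truncate_units and narrowest_width are hypothetical names. It only mimics
the arithmetic using codecs and says nothing about how CPython actually
lays out string memory:

```python
def truncate_units(s, width):
    """Keep the low `width` bytes of each 4-byte little-endian code point."""
    ucs4 = s.encode("utf-32-le")  # 4 bytes per code point, no BOM
    return b"".join(ucs4[i:i + width] for i in range(0, len(ucs4), 4))


def narrowest_width(s):
    """Bytes per character needed for the largest code point in s."""
    largest = max(map(ord, s), default=0)
    if largest < 0x100:
        return 1
    if largest < 0x10000:
        return 2
    return 4


latin = "caf\u00e9"          # largest code point is U+00E9
bmp = "\u20acuro"            # largest code point is U+20AC
astral = "hi \U0001F600"     # largest code point is U+1F600

# Truncating to 1 byte gives exactly the Latin-1 encoding.
assert narrowest_width(latin) == 1
assert truncate_units(latin, 1) == latin.encode("latin-1")

# Truncating to 2 bytes gives UCS-2 (same as UTF-16-LE for BMP-only text).
assert narrowest_width(bmp) == 2
assert truncate_units(bmp, 2) == bmp.encode("utf-16-le")

# Keeping all 4 bytes is UCS-4.
assert narrowest_width(astral) == 4
assert truncate_units(astral, 4) == astral.encode("utf-32-le")
```

Truncation is only lossless when the width matches narrowest_width(s);
cutting the euro string down to 1 byte per character would silently throw
away the high byte of U+20AC.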