Re: unicode by default

From Terry Reedy <tjreedy@udel.edu>
Subject Re: unicode by default
Date 2011-05-12 16:42 -0400
References (5 earlier) <mailman.1439.1305167541.9059.python-list@python.org> <874o50k1eb.fsf@benfinney.id.au> <U8Lyp.1000$dL5.14@newsfe08.iad> <3ae7c960dc8cf622fcf95aa48ed9df40.squirrel@webmail.lexicon.net> <BANLkTi=CJVPX+w=VzHoWHmd9GoE3o7DFeA@mail.gmail.com>
Newsgroups comp.lang.python
Message-ID <mailman.1497.1305232982.9059.python-list@python.org>

On 5/12/2011 12:17 PM, Ian Kelly wrote:
> On Thu, May 12, 2011 at 1:58 AM, John Machin<sjmachin@lexicon.net>  wrote:
>> On Thu, May 12, 2011 4:31 pm, harrismh777 wrote:
>>
>>>
>>> So, the UTF-16 UTF-32 is INTERNAL only, for Python
>>
>> NO. See one of my previous messages. UTF-16 and UTF-32, like UTF-8 are
>> encodings for the EXTERNAL representation of Unicode characters in byte
>> streams.
>
> Right.  *Under the hood* Python uses UCS-2 (which is not exactly the
> same thing as UTF-16, by the way) to represent Unicode strings.

I know some people say that, but according to the definitions of the
Unicode Consortium, that is wrong! The earlier UCS-2 *cannot* represent
chars in the Supplementary Planes. The later (1996) UTF-16, which Python
uses, can. The standard has long considered 'UCS-2' obsolete. See

https://secure.wikimedia.org/wikipedia/en/wiki/UTF-16/UCS-2
or http://www.unicode.org/faq/basic_q.html#14

The latter says: "Q: What is the difference between UCS-2 and UTF-16?
A: UCS-2 is obsolete terminology which refers to a Unicode 
implementation up to Unicode 1.1, before surrogate code points and 
UTF-16 were added to Version 2.0 of the standard. This term should now 
be avoided."

It goes on: "Sometimes in the past an implementation has been labeled 
"UCS-2" to indicate that it does not support supplementary characters 
and doesn't interpret pairs of surrogate code points as characters. Such 
an implementation would not handle processing of character properties, 
code point boundaries, collation, etc. for supplementary characters."

I know that 16-bit (narrow-build) Python *does* use surrogate pairs for
supplementary chars, and at least some properties work for them. I am
not sure exactly what the rest means.
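
As a rough illustration of what "uses surrogate pairs" means, the sketch
below computes the UTF-16 surrogate pair for a supplementary code point by
hand and compares it with what Python's own utf-16-le codec emits. The
helper names to_surrogate_pair and utf16_code_units are made up for this
example, and U+1D11E (MUSICAL SYMBOL G CLEF) is just a convenient test
character.

```python
import struct


def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000      # 20 significant bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low (trail) surrogate
    return high, low


def utf16_code_units(text):
    """Return the 16-bit code units of text's UTF-16 (little-endian) encoding."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


g_clef = "\U0001D11E"
print([hex(u) for u in utf16_code_units(g_clef)])    # units from the codec
print([hex(u) for u in to_surrogate_pair(0x1D11E)])  # units computed by hand
```

Both lines should print the same pair, 0xd834 and 0xdd1e. A UCS-2-only
implementation in the FAQ's sense would treat those two units as two
unrelated values rather than as one character.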

> However, this is entirely transparent.  To the Python programmer, a
> unicode string is just an abstraction of a sequence of code-points.
> You don't need to think about UCS-2 at all.  The only times you need
> to worry about encodings are when you're encoding unicode characters
> to byte strings, or decoding bytes to unicode characters, or opening a
> stream in text mode; and in those cases the only encoding that matters
> is the external one.

If one uses Unicode chars in the Supplementary Planes above the BMP (the
first 2**16 code points), which require surrogate pairs in 16-bit
Unicode (UTF-16), then the abstraction leaks.
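
A minimal sketch of how the leak shows up, assuming a build that stores
strings as 16-bit code units. Other builds and later Python versions may
use a different internal representation, so the sketch does not rely on
the built-in len(); it counts units through the utf-16-le and utf-32-le
codecs instead. The function names are hypothetical.

```python
def code_unit_count(text):
    """Number of 16-bit units: what len() would report on a 16-bit build."""
    return len(text.encode("utf-16-le")) // 2


def code_point_count(text):
    """Number of code points: what the abstraction promises."""
    return len(text.encode("utf-32-le")) // 4


bmp_only = "abc"
with_supplementary = "a\U0001D11Eb"  # 'a', U+1D11E, 'b'

for s in (bmp_only, with_supplementary):
    # Prints code points first, then 16-bit code units.
    print(code_point_count(s), code_unit_count(s))
```

For the BMP-only string the two counts agree. For the second string they
differ by one, and on a 16-bit build indexes 1 and 2 would each hold one
half of the surrogate pair, so slicing between them splits a character.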

-- 
Terry Jan Reedy
