Re: unicode by default

References	(6 earlier) <874o50k1eb.fsf@benfinney.id.au> <U8Lyp.1000$dL5.14@newsfe08.iad> <3ae7c960dc8cf622fcf95aa48ed9df40.squirrel@webmail.lexicon.net> <BANLkTi=CJVPX+w=VzHoWHmd9GoE3o7DFeA@mail.gmail.com> <iqhgo6$uar$1@dough.gmane.org>
From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2011-05-12 16:25 -0600
Subject	Re: unicode by default
Newsgroups	comp.lang.python
Message-ID	<mailman.1500.1305239156.9059.python-list@python.org>


On Thu, May 12, 2011 at 2:42 PM, Terry Reedy <tjreedy@udel.edu> wrote:
> On 5/12/2011 12:17 PM, Ian Kelly wrote:
>> Right.  *Under the hood* Python uses UCS-2 (which is not exactly the
>> same thing as UTF-16, by the way) to represent Unicode strings.
>
> I know some people say that, but according to the definitions of the unicode
> consortium, that is wrong! The earlier UCS-2 *cannot* represent chars in the
> Supplementary Planes. The later (1996) UTF-16, which Python uses, can. The
> standard considers 'UCS-2' obsolete long ago. See
>
> https://secure.wikimedia.org/wikipedia/en/wiki/UTF-16/UCS-2
> or http://www.unicode.org/faq/basic_q.html#14

At the first link, in the section _Use in major operating systems and
environments_, it states: "The Python language environment officially
only uses UCS-2 internally since version 2.1, but the UTF-8 decoder to
'Unicode' produces correct UTF-16. Python can be compiled to use UCS-4
(UTF-32) but this is commonly only done on Unix systems."

PEP 100 says:

    The internal format for Unicode objects should use a Python
    specific fixed format <PythonUnicode> implemented as 'unsigned
    short' (or another unsigned numeric type having 16 bits).  Byte
    order is platform dependent.

    This format will hold UTF-16 encodings of the corresponding
    Unicode ordinals.  The Python Unicode implementation will address
    these values as if they were UCS-2 values. UCS-2 and UTF-16 are
    the same for all currently defined Unicode character points.
    UTF-16 without surrogates provides access to about 64k characters
    and covers all characters in the Basic Multilingual Plane (BMP) of
    Unicode.

    It is the Codec's responsibility to ensure that the data they pass
    to the Unicode object constructor respects this assumption.  The
    constructor does not check the data for Unicode compliance or use
    of surrogates.

I'm getting out of my depth here, but that implies to me that while
Python stores UTF-16 and can correctly encode/decode it to UTF-8,
other codecs might only work correctly with UCS-2, and the unicode
class itself ignores surrogate pairs.
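
To make the UCS-2 versus UTF-16 distinction concrete, here is a minimal
sketch of my own (not taken from PEP 100), written for a current Python 3.
The helper names `utf16_units` and `to_surrogates` are made up for
illustration. It splits a string into the 16-bit code units that a UTF-16
encoding produces, and computes the surrogate pair for a code point above
U+FFFF by hand:

```python
import struct

def utf16_units(s):
    """Return the 16-bit code units of s in its UTF-16 encoding."""
    raw = s.encode("utf-16-be")  # big-endian, no byte order mark
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))

def to_surrogates(cp):
    """Compute the UTF-16 surrogate pair for a code point above U+FFFF."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000
    # High surrogate carries the top 10 bits, low surrogate the bottom 10.
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

bmp = "\u20ac"         # EURO SIGN, inside the BMP
astral = "\U0001d11e"  # MUSICAL SYMBOL G CLEF, outside the BMP

print([hex(u) for u in utf16_units(bmp)])     # a single code unit
print([hex(u) for u in utf16_units(astral)])  # two code units
print([hex(u) for u in to_surrogates(0x1D11E)])
```

Code that addresses each 16-bit unit "as if it were UCS-2" would see two
items for the second string, neither of which is a valid character on its
own.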

I'm not sure how much of this has changed since the original
implementation, though, especially in Python 3.
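
One way to check what a particular interpreter exposes is
`sys.maxunicode`. This is only a rough sketch: the function name
`build_kind` is hypothetical, and the "narrow" branch reflects my
understanding of 16-bit builds rather than something I have verified on
every platform. On Python 3.3 and later the value is 0x10FFFF regardless
of platform, so the narrow case should only show up on older
interpreters.

```python
import sys

def build_kind():
    """Classify the interpreter by the largest code point it reports."""
    return "narrow" if sys.maxunicode == 0xFFFF else "wide"

# The u prefix keeps this literal a Unicode string on Python 2 as well.
ch = u"\U0001d11e"  # MUSICAL SYMBOL G CLEF, outside the BMP

# On a narrow build len(ch) counts code units, so the surrogate pair
# shows up as length 2; on a wide build it is a single item.
print(build_kind(), hex(sys.maxunicode), len(ch))
```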

