| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | Thu, 12 May 2011 16:25:24 -0600 |
| Subject | Re: unicode by default |
| To | Python <python-list@python.org> |
| Newsgroups | comp.lang.python |
| In-Reply-To | <iqhgo6$uar$1@dough.gmane.org> |
| Message-ID | <mailman.1500.1305239156.9059.python-list@python.org> |
On Thu, May 12, 2011 at 2:42 PM, Terry Reedy <tjreedy@udel.edu> wrote:
> On 5/12/2011 12:17 PM, Ian Kelly wrote:
>> Right. *Under the hood* Python uses UCS-2 (which is not exactly the
>> same thing as UTF-16, by the way) to represent Unicode strings.
>
> I know some people say that, but according to the definitions of the unicode
> consortium, that is wrong! The earlier UCS-2 *cannot* represent chars in the
> Supplementary Planes. The later (1996) UTF-16, which Python uses, can. The
> standard considers 'UCS-2' obsolete long ago. See
>
> https://secure.wikimedia.org/wikipedia/en/wiki/UTF-16/UCS-2
> or http://www.unicode.org/faq/basic_q.html#14

At the first link, in the section _Use in major operating systems and
environments_ it states, "The Python language environment officially
only uses UCS-2 internally since version 2.1, but the UTF-8 decoder to
"Unicode" produces correct UTF-16. Python can be compiled to use UCS-4
(UTF-32) but this is commonly only done on Unix systems."

PEP 100 says:

    The internal format for Unicode objects should use a Python
    specific fixed format <PythonUnicode> implemented as 'unsigned
    short' (or another unsigned numeric type having 16 bits). Byte
    order is platform dependent.

    This format will hold UTF-16 encodings of the corresponding
    Unicode ordinals. The Python Unicode implementation will address
    these values as if they were UCS-2 values. UCS-2 and UTF-16 are
    the same for all currently defined Unicode character points.
    UTF-16 without surrogates provides access to about 64k characters
    and covers all characters in the Basic Multilingual Plane (BMP) of
    Unicode.

    It is the Codec's responsibility to ensure that the data they pass
    to the Unicode object constructor respects this assumption. The
    constructor does not check the data for Unicode compliance or use
    of surrogates.
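
To make the BMP/surrogate distinction concrete, here is a minimal sketch, assuming Python 3 (for `int.from_bytes` and `str.encode`). The helper names `to_surrogate_pair` and `utf16_code_units` are made up for illustration and are not part of any Python API. It shows that a BMP character takes one 16-bit code unit in UTF-16 (so UCS-2 and UTF-16 agree), while a character outside the BMP takes two, which UCS-2 has no way to express.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low


def utf16_code_units(text):
    """Return the 16-bit code units of the UTF-16 encoding of text."""
    data = text.encode("utf-16-be")      # explicit byte order, no BOM
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]


bmp_char = "\u00e9"           # U+00E9, inside the BMP
astral_char = "\U0001D11E"    # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

print([hex(u) for u in utf16_code_units(bmp_char)])     # one code unit
print([hex(u) for u in utf16_code_units(astral_char)])  # two code units
print([hex(u) for u in to_surrogate_pair(0x1D11E)])     # same pair, by hand
```

The hand-computed pair and the codec's output should match, which is the sense in which the storage "holds UTF-16" even if the string type treats each 16-bit unit as a separate UCS-2 value.
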

I'm getting out of my depth here, but that implies to me that while
Python stores UTF-16 and can correctly encode/decode it to UTF-8,
other codecs might only work correctly with UCS-2, and the unicode
class itself ignores surrogate pairs.

I'm not sure how much this might have changed since the original
implementation, though, especially for Python 3.
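
One way to see what a given interpreter actually does is to ask it. The following is a rough sketch, assuming Python 3; the function name `is_narrow_build` is hypothetical. `sys.maxunicode` is 0xFFFF on a narrow (16-bit) build and 0x10FFFF on a wide build. As far as I know, Python 3.3 and later always report 0x10FFFF, since PEP 393 removed the narrow/wide distinction.

```python
import sys

astral_char = "\U0001D11E"    # one code point outside the BMP


def is_narrow_build():
    """True if this interpreter stores strings as 16-bit code units."""
    return sys.maxunicode == 0xFFFF


print(hex(sys.maxunicode))
# On a narrow build the string type counts the two surrogates separately,
# so len() reports 2; otherwise it reports 1.
print(len(astral_char))
print(is_narrow_build())
```

If `len(astral_char)` comes back as 2, the string type is counting code units rather than characters, which is the "ignores surrogate pairs" behaviour described above.
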