Path: csiph.com!x330-a1.tempe.blueboxinc.net!usenet.pasdenom.info!aioe.org!feeder.news-service.com!xlned.com!feeder5.xlned.com!newsfeed.xs4all.nl!newsfeed6.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <OkDyp.2983$M61.450@newsfe07.iad>
References: <OkDyp.2983$M61.450@newsfe07.iad>
Date: Wed, 11 May 2011 15:34:02 -0700
Subject: Re: unicode by default
From: Benjamin Kaplan <benjamin.kaplan@case.edu>
To: python-list@python.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.1434.1305153267.9059.python-list@python.org>
Lines: 59
NNTP-Posting-Host: 82.94.164.166
Xref: x330-a1.tempe.blueboxinc.net comp.lang.python:5171

On Wed, May 11, 2011 at 2:37 PM, harrismh777 <harrismh777@charter.net> wrote:
> hi folks,
>   I am puzzled by unicode generally, and within the context of python
> specifically. For one thing, what do we mean that unicode is used in python
> 3.x by default. (I know what default means, I mean, what changed?)
>
>   I think part of my problem is that I'm spoiled (American, ascii heritage)
> and have been either stuck in ascii knowingly, or UTF-8 without knowing
> (just because the code points lined up). I am confused by the implications
> for using 3.x, because I am reading that there are significant things to be
> aware of... what?
>
>   On my installation 2.6  sys.maxunicode comes up with 1114111, and my 2.7
> and 3.2 installs come up with 65535 each. So, I am assuming that 2.6 was
> compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that the
> default compile option for 2.7 & 3.2 (I didn't change anything) is set for
> UCS-2 (UTF-16) or 2 byte unicode(?).   Do I understand this much correctly?
>

Not really sure about that, but it mostly doesn't matter anyway. Even
though the string is stored internally as either a UCS-2 or a UCS-4
string, you almost never see that. You just see the string as a sequence
of characters. (The exception is a UCS-2 build, where a character outside
the Basic Multilingual Plane shows up as two code units.) If you want to
turn it into a sequence of bytes, you have to use an encoding.
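
Here is a rough sketch of the characters-versus-bytes point, assuming
Python 3, where str is the character type and bytes is the byte type.
The sample string is made up for illustration only:

```python
# Python 3: str is a sequence of characters, bytes is a sequence of bytes.
text = "caf\u00e9"                   # 4 characters; the last one is U+00E9

# Going from characters to bytes always takes an encoding.
as_utf8 = text.encode("utf-8")       # U+00E9 becomes two bytes in UTF-8
as_latin1 = text.encode("latin-1")   # U+00E9 becomes one byte in Latin-1

# Going back requires the same encoding that produced the bytes.
round_trip = as_utf8.decode("utf-8")

print(len(text), len(as_utf8), len(as_latin1))
print(round_trip == text)
```

The same four characters come out as a different number of bytes
depending on the encoding you pick, which is why the internal storage
format is not something you normally need to care about.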

>   The books say that the .py sources are UTF-8 by default... and that 3.x is
> either UCS-2 or UCS-4.  If I use the file handling capabilities of Python in
> 3.x (by default) what encoding will be used, and how will that affect the
> output?
>
>   If I do not specify any code points above ascii 0xFF does any of this
> matter anyway?

ASCII only goes up to 0x7F. If you were using UTF-8 byte strings, then
there is a difference for anything above that range. A byte string is a
sequence of bytes. A unicode string is a sequence of these mythical
abstractions called characters. So a unicode string u'\u00a0' will
have a length of 1. Encode that to UTF-8 and you'll find it has a
length of 2, because UTF-8 needs at least 2 bytes for every code point
from 0x80 up (2 bytes through U+07FF, 3 through U+FFFF, 4 beyond that).
The top bit of a byte is set whenever that byte is part of a multi-byte
sequence.
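
To see the length difference for yourself, something like the following
should work. It is only a sketch, assuming Python 3.3 or later; the
helper name utf8_len and the sample characters are my own choices:

```python
# Compare the character count with the UTF-8 byte count for a few code points.
samples = ["A", "\u00a0", "\u20ac", "\U0001F600"]

def utf8_len(ch):
    """Return the number of bytes ch occupies once encoded as UTF-8."""
    return len(ch.encode("utf-8"))

for ch in samples:
    # Each sample is a single character, whatever its encoded size.
    print("U+%04X" % ord(ch), "is", len(ch), "character,",
          utf8_len(ch), "byte(s) in UTF-8")
```

Every sample has a length of 1 as a string, while the encoded sizes are
1, 2, 3 and 4 bytes respectively.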

If you want the history behind the whole encoding mess, Joel Spolsky
wrote a rather amusing article explaining how this all came about:
http://www.joelonsoftware.com/articles/Unicode.html

And the biggest reason to use Unicode is so that you don't have to
worry about your program messing up because someone hands you input in
a different encoding than you used.