MIME-Version: 1.0
In-Reply-To: <k0s0p3$rm7$1@ger.gmane.org>
References: <f801e06f-f7b2-4aca-b352-66856a939746@googlegroups.com> <308df2af-abe7-4043-b199-0a39f440e0ab@googlegroups.com> <502f8a2a$0$29978$c3e8da3$5496439d@news.astraweb.com> <7xehn4vyya.fsf@ruckus.brouhaha.com> <5030832d$0$29978$c3e8da3$5496439d@news.astraweb.com> <7x8vdbmho6.fsf@ruckus.brouhaha.com> <k0r830$mb9$1@ger.gmane.org> <CAPTjJmrFNceZuLy9V2JBTH2gR_yPHXo7f2UbDqzT1LuROaMsqw@mail.gmail.com> <k0s0p3$rm7$1@ger.gmane.org>
Date: Mon, 20 Aug 2012 14:07:39 +1000
Subject: Re: How do I display unicode value stored in a string variable using ord()
From: Chris Angelico <rosuav@gmail.com>
To: python-list@python.org
Content-Type: text/plain; charset=ISO-8859-1
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.3536.1345435662.4697.python-list@python.org>

On Mon, Aug 20, 2012 at 10:35 AM, Terry Reedy <tjreedy@udel.edu> wrote:
> On 8/19/2012 6:42 PM, Chris Angelico wrote:
>> However, Python goes a bit further by making it VERY clear that this
>> is a mere optimization, and that Unicode strings and bytes strings are
>> completely different beasts. In Pike, it's possible to forget to
>> encode something before (say) writing it to a socket. Everything works
>> fine while you have only ASCII characters in the string, and then
>> breaks when you have a >255 codepoint - or perhaps worse, when you
>> have a 127<x<256, and the other end misinterprets it.
>
> Python writes strings to file objects, including open sockets, without
> creating a bytes object -- IF the file is opened in text mode, which always
> has an associated encoding, even if the default 'ascii'. From what you say,
> this is what Pike is missing.

In text mode, the library does the encoding, but an encoding still happens.
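To illustrate, here is a minimal sketch in Python 3. The BytesIO object is only a stand-in for a real file or socket, and the choice of UTF-8 is my assumption; the point is that the text layer hands bytes to the layer underneath.

```python
import io

# Stand-in for a file or socket opened in binary mode.
raw = io.BytesIO()

# The text layer wraps the binary layer and carries the encoding.
text = io.TextIOWrapper(raw, encoding="utf-8")

# We write a one-character str; no bytes object appears in our code.
text.write("\u20ac")
text.flush()

# The underlying stream nevertheless received encoded bytes.
encoded = raw.getvalue()
print(type(encoded), len(encoded))
```

The one-character string arrives as three bytes, so an encoding step did happen inside the library.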

> I am pretty sure that the obvious optimization has already been done. The
> internal bytes of all-ascii text can safely be sent to a file with ascii (or
> ascii-compatible) encoding without intermediate 'decoding'. I remember
> several patches of that sort. If a string is internally ucs2 and the file is
> declared usc2 or utf-16 encoding, then again, pairs of bytes can go directly
> [sic: "usc2" should read "ucs2"]
> (possibly with a byte swap).

It may not require any change in memory, but there is still a data type
change. A Unicode string cannot be sent over the network as-is; it has
to be encoded to bytes first.
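A quick sketch of that type boundary in Python 3, using socket.socketpair() as a stand-in for a real network connection (the UTF-8 choice is my assumption, not something the socket dictates):

```python
import socket

# A connected pair of sockets, standing in for a network link.
a, b = socket.socketpair()

# Sending a str is refused outright: sockets only accept bytes.
try:
    a.send("\u20ac")
    error = None
except TypeError as exc:
    error = exc

# After an explicit encode, the same text goes through.
a.send("\u20ac".encode("utf-8"))
received = b.recv(16)

a.close()
b.close()

print(type(error).__name__, received)
```

The failure is a TypeError regardless of which characters the string contains, so even pure-ASCII text cannot slip through unencoded.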

In Pike, I can take a string like "\x20AC" (or "\u20ac" or
"\U000020ac", same thing) and manipulate it as a one-character string,
but I cannot write it to a file or file-like object. I can, however,
pass it through a codec (and there's string_to_utf8() for the
convenience of the common case), and get back something like
"\xe2\x82\xac", which is a three-byte string. The thing is, though,
that this new string is of exactly the same data type as the original:
'string'. Which means that I could have a string containing Latin-1
but not ASCII characters, and Pike will happily write it to a socket
without raising a compile-time or run-time error. Python, under the
same circumstances, would either raise an error or quietly (and
correctly) encode the data.
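Both Python outcomes can be sketched like this, for the Latin-1-but-not-ASCII case. The BytesIO objects stand in for sockets or files, and the two encodings are just ones I picked for the demonstration:

```python
import io

s = "\xe9"  # one character: in Latin-1 range, but not ASCII

# Outcome 1: a text stream declared as ASCII raises an error.
strict = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
try:
    strict.write(s)
    strict.flush()
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# Outcome 2: a text stream declared as UTF-8 quietly encodes.
raw = io.BytesIO()
utf8 = io.TextIOWrapper(raw, encoding="utf-8")
utf8.write(s)
utf8.flush()
utf8_bytes = raw.getvalue()

# Either way, the encoded result is a different type from the text.
latin1_bytes = s.encode("latin-1")
print(type(s).__name__, type(latin1_bytes).__name__, utf8_bytes)
```

Note that the Latin-1 and UTF-8 encodings of the same character differ (one byte versus two), which is exactly the ambiguity the other end trips over when the types are not kept apart.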

But this is a relatively trivial point in the scheme of things.
Python now has an excellent model for handling Unicode strings, and I
would STRONGLY recommend that everyone upgrade to 3.3.

ChrisA