MIME-Version: 1.0
In-Reply-To: <52CAC780.1010204@stoneleaf.us>
References: <lablra$1mc$2@ger.gmane.org> <52C9FD02.3080109@stoneleaf.us> <CAGGBd_qBA0OBELxgzERO4Tfs6quK7oYq8v_2idA=K2ycoiO6Dg@mail.gmail.com> <52CAC780.1010204@stoneleaf.us>
Date: Tue, 7 Jan 2014 02:46:08 +1100
Subject: Re: "More About Unicode in Python 2 and 3"
From: Chris Angelico <rosuav@gmail.com>
Cc: Python <python-list@python.org>
Content-Type: text/plain; charset=UTF-8
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.5023.1389023179.18130.python-list@python.org>

On Tue, Jan 7, 2014 at 2:10 AM, Ethan Furman <ethan@stoneleaf.us> wrote:
> On 01/05/2014 06:55 PM, Chris Angelico wrote:
>>
>>
>> It can't be both things. It's either bytes or it's text.
>
>
> Of course it can be:
>
> 0000000: 0372 0106 0000 0000 6100 1d00 0000 0000  .r......a.......
> 0000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 0000020: 4e41 4d45 0000 0000 0000 0043 0100 0000  NAME.......C....
> 0000030: 1900 0000 0000 0000 0000 0000 0000 0000  ................
> 0000040: 4147 4500 0000 0000 0000 004e 1a00 0000  AGE........N....
> 0000050: 0300 0000 0000 0000 0000 0000 0000 0000  ................
> 0000060: 0d1a 0a                                  ...
>
> And there we are, mixed bytes and ascii data.  As I said earlier, my example
> is minimal, but still very frustrating in that normal operations no longer
> work.  Incidentally, if you were thinking that NAME and AGE were part of the
> ascii text, you'd be wrong -- the field names are also encoded, as are the
> Character and Memo fields.

That's alternating between encoded text and non-text bytes. Each
individual piece is either text or non-text, not both. The ideal way
to manipulate it would most likely be a simple decode operation that
turns this into (probably) a dictionary, decoding both the
structure/layout and UTF-8 in a single operation. A less ideal (but
more convenient) solution might involve what's currently under
discussion elsewhere: a (possibly partial) percent-formatting or
.format() method for bytes.
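
Here is a minimal sketch of what that single decode step could look
like. It assumes the dump above follows a dBase III style layout (a
32-byte file header, 32-byte field descriptors, and a 0x0D terminator),
which is my reading of the bytes and not something stated in this
thread. The name parse_header, the dictionary keys, and the UTF-8
default are likewise just illustrative choices.

```python
import struct

# Bytes copied from the hex dump quoted above (fromhex skips the spaces).
RAW = bytes.fromhex(
    "0372 0106 0000 0000 6100 1d00 0000 0000"
    "0000 0000 0000 0000 0000 0000 0000 0000"
    "4e41 4d45 0000 0000 0000 0043 0100 0000"
    "1900 0000 0000 0000 0000 0000 0000 0000"
    "4147 4500 0000 0000 0000 004e 1a00 0000"
    "0300 0000 0000 0000 0000 0000 0000 0000"
    "0d1a 0a"
)


def parse_header(raw, encoding="utf-8"):
    """Decode layout and text in one pass, returning a plain dict.

    Hypothetical helper; the field offsets are assumed, not confirmed.
    """
    # Non-text part: version, date (3 bytes), record count, two lengths.
    version, yy, mm, dd, n_records, header_len, record_len = struct.unpack_from(
        "<4BIHH", raw, 0
    )

    fields = []
    offset = 32  # assumed: field descriptors start right after the header
    while offset < len(raw) and raw[offset] != 0x0D:
        # 11-byte padded name, 1-byte type, 4 skipped bytes, length, decimals.
        name, ftype, length, decimals = struct.unpack_from("<11sc4xBB", raw, offset)
        fields.append(
            {
                # Text part: these bytes are encoded text, so decode them.
                "name": name.split(b"\x00", 1)[0].decode(encoding),
                "type": ftype.decode(encoding),
                # Non-text part: plain integers, no decoding involved.
                "length": length,
                "decimals": decimals,
            }
        )
        offset += 32

    return {
        "version": version,
        "last_update": (1900 + yy, mm, dd),  # assumed: year is offset from 1900
        "records": n_records,
        "header_length": header_len,
        "record_length": record_len,
        "fields": fields,
    }


header = parse_header(RAW)
for field in header["fields"]:
    print(field)
```

After this step the field names are str and the lengths are int, and
nothing downstream has to care which bytes held which.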

None of this changes the fact that there are bytes used to
store/transmit stuff, and abstract concepts used to manipulate them.
Just like nobody expects to be able to write a dict to a file without
some form of encoding (pickle, JSON, whatever), you shouldn't expect
to write a character string without first turning it into bytes.
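
To make that concrete, here is a small sketch using an in-memory
binary stream in place of a real file. The record contents and the
choice of JSON plus UTF-8 are just assumptions for illustration; any
serialisation and encoding pair would make the same point.

```python
import io
import json

record = {"NAME": "Zoë", "AGE": 30}  # made-up sample data

# dict -> text: pick a serialisation (JSON here).
text = json.dumps(record, ensure_ascii=False, sort_keys=True)

# text -> bytes: pick an encoding (UTF-8 here).
payload = text.encode("utf-8")

stream = io.BytesIO()  # stands in for a file opened in binary mode
stream.write(payload)

# Handing the binary stream a str instead of bytes is refused outright.
try:
    stream.write(text)
    rejected = False
except TypeError:
    rejected = True

# Reading back reverses both steps, in the opposite order.
restored = json.loads(stream.getvalue().decode("utf-8"))
print(rejected, restored == record)
```

Note that payload is one byte longer than text has characters, since
the one non-ASCII character encodes to two bytes in UTF-8.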

ChrisA