
Re: "More About Unicode in Python 2 and 3"

Started by	Chris Angelico <rosuav@gmail.com>
First post	2014-01-07 02:46 +1100
Last post	2014-01-07 02:46 +1100
Articles	1 — 1 participant


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.


#63289 — Re: "More About Unicode in Python 2 and 3"

From	Chris Angelico <rosuav@gmail.com>
Date	2014-01-07 02:46 +1100
Subject	Re: "More About Unicode in Python 2 and 3"
Message-ID	<mailman.5023.1389023179.18130.python-list@python.org>

On Tue, Jan 7, 2014 at 2:10 AM, Ethan Furman <ethan@stoneleaf.us> wrote:
> On 01/05/2014 06:55 PM, Chris Angelico wrote:
>>
>>
>> It can't be both things. It's either bytes or it's text.
>
>
> Of course it can be:
>
> 0000000: 0372 0106 0000 0000 6100 1d00 0000 0000  .r......a.......
> 0000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 0000020: 4e41 4d45 0000 0000 0000 0043 0100 0000  NAME.......C....
> 0000030: 1900 0000 0000 0000 0000 0000 0000 0000  ................
> 0000040: 4147 4500 0000 0000 0000 004e 1a00 0000  AGE........N....
> 0000050: 0300 0000 0000 0000 0000 0000 0000 0000  ................
> 0000060: 0d1a 0a                                  ...
>
> And there we are, mixed bytes and ascii data.  As I said earlier, my example
> is minimal, but still very frustrating in that normal operations no longer
> work.  Incidentally, if you were thinking that NAME and AGE were part of the
> ascii text, you'd be wrong -- the field names are also encoded, as are the
> Character and Memo fields.

That's alternating between encoded text and non-text bytes. Each
individual piece is either text or non-text, not both. The ideal way
to manipulate it would most likely be a simple decode operation that
turns this into (probably) a dictionary, decoding both the
structure/layout and UTF-8 in a single operation. But a less ideal
(and more convenient) solution might involve what's currently
under discussion elsewhere: a (possibly partial) percent-formatting or
.format() method for bytes.
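
As a rough illustration of the "single decode operation" idea, here is a minimal sketch that turns the quoted dump into a dictionary. It assumes a dBase-style header layout (a 32-byte file header, then 32-byte field descriptors ending at a 0x0D byte); the thread does not name the format, so the offsets, the `decode_header` name, and the dictionary keys are all illustrative assumptions, as is the choice of ASCII for the encoded names.

```python
import struct

# The 99 bytes shown in the quoted hex dump, rebuilt by hand.
RAW = (
    bytes.fromhex("0372010600000000") + bytes.fromhex("61001d0000000000")
    + bytes(16)
    + b"NAME" + bytes(7) + b"C" + bytes.fromhex("01000000")
    + bytes.fromhex("19") + bytes(15)
    + b"AGE" + bytes(8) + b"N" + bytes.fromhex("1a000000")
    + bytes.fromhex("03") + bytes(15)
    + bytes.fromhex("0d1a0a")
)


def decode_header(raw, encoding="ascii"):
    """Hypothetical helper: split structure from text in one pass.

    Assumed layout (not confirmed by the thread): little-endian integers,
    a 32-byte file header, 32-byte field descriptors, 0x0D terminator.
    """
    # Non-text part: version byte, 3 skipped bytes, then three integers.
    version, n_records, header_len, record_len = struct.unpack_from(
        "<B3xIHH", raw, 0
    )
    fields = []
    offset = 32
    while raw[offset] != 0x0D:
        # 11-byte NUL-padded name, 1-byte type, 4 skipped bytes, two sizes.
        name, ftype, length, decimals = struct.unpack_from(
            "<11sc4xBB", raw, offset
        )
        fields.append({
            # Text parts are decoded to str; the numbers stay numbers.
            "name": name.rstrip(b"\x00").decode(encoding),
            "type": ftype.decode(encoding),
            "length": length,
            "decimals": decimals,
        })
        offset += 32
    return {
        "version": version,
        "records": n_records,
        "header_length": header_len,
        "record_length": record_len,
        "fields": fields,
    }


header = decode_header(RAW)
```

After this call, every value in `header` is either an `int` or a `str`; no piece is both bytes and text, and the caller never touches the raw buffer again.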

None of this changes the fact that there are bytes used to
store/transmit stuff, and abstract concepts used to manipulate them.
Just like nobody expects to be able to write a dict to a file without
some form of encoding (pickle, JSON, whatever), you shouldn't expect
to write a character string without first turning it into bytes.
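
The dict analogy can be sketched in Python 3 as follows. This is only an illustration: the sample record, the `to_bytes` and `from_bytes` helper names, and the choice of JSON plus UTF-8 are assumptions, and an in-memory `io.BytesIO` stands in for a file opened in binary mode.

```python
import io
import json


def to_bytes(record, encoding="utf-8"):
    """Hypothetical helper: dict -> text (JSON) -> bytes (encoding)."""
    text = json.dumps(record, ensure_ascii=False)  # abstract object -> str
    return text.encode(encoding)                   # str -> bytes


def from_bytes(data, encoding="utf-8"):
    """Reverse of to_bytes: bytes -> text -> dict."""
    return json.loads(data.decode(encoding))


record = {"NAME": "Zo\u00eb", "AGE": 30}  # made-up sample data
buf = io.BytesIO()                        # stand-in for a binary file

# Neither a dict nor a str can be written directly: both need encoding.
rejected = []
for obj in (record, "plain text"):
    try:
        buf.write(obj)
    except TypeError:
        rejected.append(type(obj).__name__)

# Once the record has been turned into bytes, the write succeeds.
buf.write(to_bytes(record))
restored = from_bytes(buf.getvalue())
```

Here `rejected` collects the names of the types the binary stream refused, and `restored` compares equal to the original `record` after the round trip.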

ChrisA

