Groups > comp.lang.python > #63287 > unrolled thread

Re: "More About Unicode in Python 2 and 3"

Started by	Ethan Furman <ethan@stoneleaf.us>
First post	2014-01-06 07:10 -0800
Last post	2014-01-07 10:03 +1100
Articles	12 — 6 participants


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: "More About Unicode in Python 2 and 3" Ethan Furman <ethan@stoneleaf.us> - 2014-01-06 07:10 -0800
    Re: "More About Unicode in Python 2 and 3" Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-07 04:27 +1100
      Re: "More About Unicode in Python 2 and 3" Ethan Furman <ethan@stoneleaf.us> - 2014-01-06 10:34 -0800
        Re: "More About Unicode in Python 2 and 3" Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-07 11:42 +1100
      Re: "More About Unicode in Python 2 and 3" Mark Janssen <dreamingforward@gmail.com> - 2014-01-06 13:30 -0600
      Re: "More About Unicode in Python 2 and 3" Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-01-06 19:36 +0000
      Re: "More About Unicode in Python 2 and 3" Mark Janssen <dreamingforward@gmail.com> - 2014-01-06 13:44 -0600
        Re: "More About Unicode in Python 2 and 3" Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-07 11:54 +1100
      Re: "More About Unicode in Python 2 and 3" Ned Batchelder <ned@nedbatchelder.com> - 2014-01-06 16:14 -0500
      Re: "More About Unicode in Python 2 and 3" Mark Janssen <dreamingforward@gmail.com> - 2014-01-06 15:23 -0600
      Re: "More About Unicode in Python 2 and 3" Mark Janssen <dreamingforward@gmail.com> - 2014-01-06 15:32 -0600
      Re: "More About Unicode in Python 2 and 3" Chris Angelico <rosuav@gmail.com> - 2014-01-07 10:03 +1100

#63287 — Re: "More About Unicode in Python 2 and 3"

From	Ethan Furman <ethan@stoneleaf.us>
Date	2014-01-06 07:10 -0800
Subject	Re: "More About Unicode in Python 2 and 3"
Message-ID	<mailman.5022.1389022306.18130.python-list@python.org>

On 01/05/2014 06:37 PM, Dan Stromberg wrote:
>
> The argument seems to be "3.x doesn't work the way I'm accustomed to,
> so I'm not going to use it, and I'm going to shout about it until
> others agree with me."

The argument is that a very important, if small, subset of data manipulation became very painful in Py3.  Not impossible, 
and not difficult, but painful because the mental model and the contortions needed to get things to work don't sync up 
anymore.  Painful because Python is, at heart, a simple and elegant language, but with the use-case of embedded ascii in 
binary data that elegance went right out the window.

On 01/05/2014 06:55 PM, Chris Angelico wrote:
>
> It can't be both things. It's either bytes or it's text.

Of course it can be:

0000000: 0372 0106 0000 0000 6100 1d00 0000 0000  .r......a.......
0000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000020: 4e41 4d45 0000 0000 0000 0043 0100 0000  NAME.......C....
0000030: 1900 0000 0000 0000 0000 0000 0000 0000  ................
0000040: 4147 4500 0000 0000 0000 004e 1a00 0000  AGE........N....
0000050: 0300 0000 0000 0000 0000 0000 0000 0000  ................
0000060: 0d1a 0a                                  ...

And there we are, mixed bytes and ascii data.  As I said earlier, my example is minimal, but still very frustrating in 
that normal operations no longer work.  Incidentally, if you were thinking that NAME and AGE were part of the ascii 
text, you'd be wrong -- the field names are also encoded, as are the Character and Memo fields.

--
~Ethan~


#63311

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-01-07 04:27 +1100
Message-ID	<52cae78d$0$29971$c3e8da3$5496439d@news.astraweb.com>
In reply to	#63287

Ethan Furman wrote:

> On 01/05/2014 06:37 PM, Dan Stromberg wrote:
>>
>> The argument seems to be "3.x doesn't work the way I'm accustomed to,
>> so I'm not going to use it, and I'm going to shout about it until
>> others agree with me."
> 
> The argument is that a very important, if small, subset a data
> manipulation become very painful in Py3.  Not impossible, and not
> difficult, but painful because the mental model and the contortions needed
> to get things to work don't sync up
> anymore.  Painful because Python is, at heart, a simple and elegant
> language, but with the use-case of embedded ascii in binary data that
> elegance went right out the window.
> 
> On 01/05/2014 06:55 PM, Chris Angelico wrote:
>>
>> It can't be both things. It's either bytes or it's text.
> 
> Of course it can be:
> 
> 0000000: 0372 0106 0000 0000 6100 1d00 0000 0000  .r......a.......
> 0000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 0000020: 4e41 4d45 0000 0000 0000 0043 0100 0000  NAME.......C....
> 0000030: 1900 0000 0000 0000 0000 0000 0000 0000  ................
> 0000040: 4147 4500 0000 0000 0000 004e 1a00 0000  AGE........N....
> 0000050: 0300 0000 0000 0000 0000 0000 0000 0000  ................
> 0000060: 0d1a 0a                                  ...
> 
> And there we are, mixed bytes and ascii data.  

Chris didn't say "bytes and ascii data", he said "bytes and TEXT".
Text != "ascii data", and the fact that some people apparently think it
does is pretty much the heart of the problem.

I see no mixed bytes and text. I see bytes. Since the above comes from a
file, it cannot be anything else but bytes. Do you think that a file that
happens to be a JPEG contains pixels? No. It contains bytes which, after
decoding, represent pixels. Same with text, ascii or otherwise.

Now, it is true that some of those bytes happen to fall into the same range
of values as ASCII-encoded text. They may even represent text after
decoding, but since we don't know what the file contents mean, we can't
know that. It might be a mere coincidence that the four bytes starting at
hex offset 40 are the C long 1095189760 which happens to look like "AGE"
with a null at the end. For historical reasons, your hexdump utility
performs that decoding step for you, which is why you can see "NAME"
and "AGE" in the right-hand block, but that doesn't mean the file contains
text. It contains bytes, some of which represent text after decoding.

If you (generic you) don't get that, you'll have a bad time. I mean *really*
get it, deep down in the bone. The long, bad habit of thinking of
ASCII-encoded bytes as text is the problem here. The average programmer has
years and years of experience thinking about decoding bytes to numbers and
back (just not by that name), so it doesn't lead to any cognitive
dissonance to think of hex 4147 4500 as either four bytes, two double-byte
ints, or a single four-byte int. But as soon as "text" comes into the
picture, the average programmer has equally many years of thinking that the
byte 41 "just is" the letter "A", and that's simply *wrong*.
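To make the point concrete, here is a minimal Python 3 sketch (not part of the original post) that reads the same four bytes, hex 41 47 45 00, in several ways with the standard struct module. Big-endian byte order is an assumption, chosen because it reproduces the 1095189760 quoted above.

```python
import struct

raw = bytes.fromhex("41474500")  # the four bytes at hex offset 40 in the dump

as_bytes = list(raw)                           # four separate byte values
as_shorts = struct.unpack(">HH", raw)          # two big-endian 16-bit ints
(as_long,) = struct.unpack(">I", raw)          # one big-endian 32-bit int
as_text = raw.rstrip(b"\x00").decode("ascii")  # text, but only after decoding

print(as_bytes)   # [65, 71, 69, 0]
print(as_shorts)  # (16711, 17664)
print(as_long)    # 1095189760
print(as_text)    # AGE
```

None of these readings is more "real" than the others; which one applies depends on the file format, not on the bytes themselves.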

> As I said earlier, my 
> example is minimal, but still very frustrating in
> that normal operations no longer work.  Incidentally, if you were thinking
> that NAME and AGE were part of the ascii text, you'd be wrong -- the field
> names are also encoded, as are the Character and Memo fields.

What Character and Memo fields? Are you trying to say that the NAME and AGE
are *not* actually ASCII text, but a mere coincidence, like my example of
1095189760? Or are you referring to the fact that they're actually encoded
as ASCII? If not, I have no idea what you are trying to say.

-- 
Steven


#63319

From	Ethan Furman <ethan@stoneleaf.us>
Date	2014-01-06 10:34 -0800
Message-ID	<mailman.5041.1389034608.18130.python-list@python.org>
In reply to	#63311

On 01/06/2014 09:27 AM, Steven D'Aprano wrote:
> Ethan Furman wrote:
>
> Chris didn't say "bytes and ascii data", he said "bytes and TEXT".
> Text != "ascii data", and the fact that some people apparently think it
> does is pretty much the heart of the problem.

The heart of a different problem, not this one.  The problem I refer to is that many binary formats have well-defined 
ascii-encoded text tidbits.  These tidbits were quite easy to work with in Py2, not difficult but not elegant in Py3, 
and even worse if you have to support both 2 and 3.

> Now, it is true that some of those bytes happen to fall into the same range
> of values as ASCII-encoded text. They may even represent text after
> decoding, but since we don't know what the file contents mean, we can't
> know that.

Of course we can -- we're the programmer, after all.  This is not a random bunch of bytes but a well defined format for 
storing data.

> It might be a mere coincidence that the four bytes starting at
> hex offset 40 is the C long 1095189760 which happens to look like "AGE"
> with a null at the end. For historical reasons, your hexdump utility
> performs that decoding step for you, which is why you can see "NAME"
> and "AGE" in the right-hand block, but that doesn't mean the file contains
> text. It contains bytes, some of which represents text after decoding.

As it happens, 'NAME' and 'AGE' are encoded, and will be decoded.  They could just as easily have contained tildes, 
accents, umlauts, and other strange (to me) characters.  It's actually the 'C' and the 'N' that bug me (like I said, my 
example is minimal, especially compared to a network protocol).

And you're right -- it is easy to say FIELD_TYPE = slice(15,16), and it was also easy to say FIELD_TYPE = 15, but there 
is a critical difference -- can you spot it?

..
..
..
In case you didn't:  both work in Py2, only the slice version works (correctly) in Py3, but the worst part is why do I 
have to use a slice to take a single byte when a simple index should work?  Because the bytes type lies.  It shows, for 
example, b'\r\n\x12\x08N\x00' but when I try to access that N to see if this is a Numeric field I get:

--> b'\r\n\x12\x08N\x00'[4]
78

This is a cognitive dissonance that one does not expect in Python.
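A minimal sketch of the difference, using the bytes from the example above. The names FIELD_TYPE_INDEX and FIELD_TYPE_SLICE are hypothetical, and the offset 4 is chosen to match this sample rather than the real field layout. The comments describe Python 3 unless noted.

```python
record = b'\r\n\x12\x08N\x00'        # the example bytes from the post

FIELD_TYPE_INDEX = 4                 # plain index (hypothetical name)
FIELD_TYPE_SLICE = slice(4, 5)       # one-byte slice (hypothetical name)

by_index = record[FIELD_TYPE_INDEX]  # Py3: the int 78; Py2: 'N'
by_slice = record[FIELD_TYPE_SLICE]  # b'N' on both versions

print(by_slice == b'N')              # True on Py2 and Py3
print(by_index == b'N')              # False on Py3 (78 != b'N'), True on Py2
print(by_index == ord(b'N'))         # True on Py3
```

The slice form is the one that behaves the same on both versions, which is why it ends up in code that has to support 2 and 3.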

> If you (generic you) don't get that, you'll have a bad time. I mean *really*
> get it, deep down in the bone. The long, bad habit of thinking as
> ASCII-encoded bytes as text is the problem here.

Different problem.  The problem here is that bytes and byte literals don't compare equal.

> the average programmer has equally many years of thinking that the
> byte 41 "just is" the letter "A", and that's simply *wrong*.

Agreed.  But byte 41 != b'A', and that is equally wrong.

>> As I said earlier, my
>> example is minimal, but still very frustrating in
>> that normal operations no longer work.  Incidentally, if you were thinking
>> that NAME and AGE were part of the ascii text, you'd be wrong -- the field
>> names are also encoded, as are the Character and Memo fields.
>
> What Character and Memo fields? Are you trying to say that the NAME and AGE
> are *not* actually ASCII text, but a mere coincidence, like my example of
> 1095189760? Or are you referring to the fact that they're actually encoded
> as ASCII? If not, I have no idea what you are trying to say.

Yes, NAME and AGE are *not* ASCII text, but latin-1 encoded.  The C and the N are ASCII, meaningful as-is.  The actual 
data stored in a Character (NAME in this case) or Memo (not shown) field would also be latin-1 encoded.  (And before you 
ask, the encoding is stored in the file header.)
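A rough Python 3 sketch of that decoding step. The helper decode_field_name is a hypothetical name, the NUL padding is an assumption read off the dump, and "latin-1" is hard-coded here even though the real encoding would come from the file header. The accented sample is made up for illustration and does not come from the file.

```python
def decode_field_name(raw, encoding="latin-1"):
    """Drop everything from the first NUL byte on, then decode to text."""
    return raw.split(b"\x00", 1)[0].decode(encoding)

# NAME followed by seven NUL bytes, as seen at offset 0x20 in the dump.
name_bytes = bytes.fromhex("4e414d45" + "00" * 7)
name = decode_field_name(name_bytes)
print(name)                          # NAME

# Made-up field name with a non-ASCII byte (0xD1 is N-with-tilde in latin-1).
accented = decode_field_name(b"A\xd1O" + b"\x00" * 8)
print(accented == "A\u00d1O")        # True
```

The single-byte type codes such as C and N never go through this step; they are compared as bytes.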

--
~Ethan~


#63384

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-01-07 11:42 +1100
Message-ID	<52cb4d82$0$29979$c3e8da3$5496439d@news.astraweb.com>
In reply to	#63319

Ethan Furman wrote:

> On 01/06/2014 09:27 AM, Steven D'Aprano wrote:
>> Ethan Furman wrote:
>>
>> Chris didn't say "bytes and ascii data", he said "bytes and TEXT".
>> Text != "ascii data", and the fact that some people apparently think it
>> does is pretty much the heart of the problem.
> 
> The heart of a different problem, not this one.  The problem I refer to is
> that many binary formats have well-defined
> ascii-encoded text tidbits.  These tidbits were quite easy to work with in
> Py2, not difficult but not elegant in Py3, and even worse if you have to
> support both 2 and 3.

Many things are more difficult if you have to support a large range of
versions. That's life, for a programmer.

>> Now, it is true that some of those bytes happen to fall into the same
>> range of values as ASCII-encoded text. They may even represent text after
>> decoding, but since we don't know what the file contents mean, we can't
>> know that.
> 
> Of course we can -- we're the programmer, after all.  This is not a random
> bunch of bytes but a well defined format for storing data.

No, you misunderstand me. *You* may know what the data represents, but *we*
don't, because you just drop a hex dump in our laps with no explanation.

>> It might be a mere coincidence that the four bytes starting at
>> hex offset 40 is the C long 1095189760 which happens to look like "AGE"
>> with a null at the end. For historical reasons, your hexdump utility
>> performs that decoding step for you, which is why you can see "NAME"
>> and "AGE" in the right-hand block, but that doesn't mean the file
>> contains text. It contains bytes, some of which represents text after
>> decoding.
> 
> As it happens, 'NAME' and 'AGE' are encoded, and will be decoded.

You're either saying something utterly trivial, or something utterly
profound, and I can't tell which.

Of course they are encoded. The file doesn't contain the letter "N", it
contains the byte 0x4E. So what are you actually trying to say?

> They could just as easily have contained tilde's,
> accents, umlauts, and other strange (to me) characters.

I'm especially confused here because tildes are included in the ASCII
character set. Here's one here: ~ 

> It's actually the 
> 'C' and the 'N' that bug me (like I said, my example is minimal,
> especially compared to a network protocol).
> 
> And you're right -- it is easy to say FIELD_TYPE = slice(15,16), and it
> was also easy to say FIELD_TYPE = 15, but there is a critical difference
> -- can you spot it?
> 
> ..
> ..
> ..
> In case you didn't:  both work in Py2, only the slice version works
> (correctly) in Py3,

I accept that using the slice is inelegant. But lots of things are inelegant
when you do them the wrong way. Treating your textual data as bytes is the
wrong way. You apparently know that your data is encoded text, you
apparently know the encoding... so why don't you just decode it and treat
it as text instead of insisting on dealing with the raw bytes?

Are you worried about performance? I'd be sympathetic if you were writing
some low-level network protocol stuff where performance is vital, but you
keep saying that your application is "minimal", which I interpret as
performance not being critical. So what's the deal?

> but the worst part is why do I
> have to use a slice to take a single byte when a simple index should work?

I don't understand the rationale for having byte indexing return an int
instead of a one-byte substring. Especially since we still have a perfectly
good way to extract the numeric value from a one-byte byte-string:

py> ord(b'N')
78

> Because the bytes type lies.  It shows, for example, b'\r\n\x12\x08N\x00'
> but when I try to access that N to see if this is a Numeric field I get:
> 
> --> b'\r\n\x12\x08N\x00'[4]
> 78
> 
> This is a cognitive dissonance that one does not expect in Python.

Yes, I agree. I think it was a terrible mistake to have bytes continue to
pretend to be ASCII. Having this occur:

py> print(b'\x4E')
b'N'

does nothing but muddy the water. I think it would be too much to
disallow using ASCII literals in byte strings, but we shouldn't
*display* byte strings as ASCII.

py> print(b'N')  # This would be better.
b'\x4E'
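The proposed display can be approximated today with a small helper. This is only a sketch: hex_repr is a hypothetical name, not part of Python, and it assumes Python 3, where iterating over bytes yields ints.

```python
def hex_repr(data):
    """Render bytes with every byte as a \\xNN escape, never as ASCII."""
    return "b'" + "".join("\\x%02X" % byte for byte in data) + "'"

print(repr(b'N'))                      # b'N'
print(hex_repr(b'N'))                  # b'\x4E'
print(hex_repr(b'\r\n\x12\x08N\x00'))  # b'\x0D\x0A\x12\x08\x4E\x00'
```

The built-in repr is unchanged; the helper only shows what an all-hex display would look like.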

[...]
> Different problem.  The problem here is that bytes and byte literals don't
> compare equal.

Right! Now I get where you are coming from.

-- 
Steven


#63324

From	Mark Janssen <dreamingforward@gmail.com>
Date	2014-01-06 13:30 -0600
Message-ID	<mailman.5045.1389036656.18130.python-list@python.org>
In reply to	#63311

>> Chris didn't say "bytes and ascii data", he said "bytes and TEXT".
>> Text != "ascii data", and the fact that some people apparently think it
>> does is pretty much the heart of the problem.
>
> The heart of a different problem, not this one.  The problem I refer to is
> that many binary formats have well-defined ascii-encoded text tidbits.

Really?  If people are using binary with "well-defined ascii-encoded
tidbits", they're doing something wrong.  Perhaps you think escape
characters "\n" are "well defined tidbits", but YOU WOULD BE WRONG.
The purpose of binary is to keep things raw.  WTF?  You guys are so
strange.

>
>> If you (generic you) don't get that, you'll have a bad time. I mean
>> *really*
>> get it, deep down in the bone. The long, bad habit of thinking as
>> ASCII-encoded bytes as text is the problem here.

I think the whole forking community is confused because of your own
arrogance.  Foo(l)s.

markj


#63325

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2014-01-06 19:36 +0000
Message-ID	<mailman.5046.1389037011.18130.python-list@python.org>
In reply to	#63311

On 06/01/2014 19:30, Mark Janssen wrote:
>>> Chris didn't say "bytes and ascii data", he said "bytes and TEXT".
>>> Text != "ascii data", and the fact that some people apparently think it
>>> does is pretty much the heart of the problem.
>>
>> The heart of a different problem, not this one.  The problem I refer to is
>> that many binary formats have well-defined ascii-encoded text tidbits.
>
> Really?  If people are using binary with "well-defined ascii-encoded
> tidbits", they're doing something wrong.  Perhaps you think escape
> characters "\n" are "well defined tidbits", but YOU WOULD BE WRONG.
> The purpose of binary is to keep things raw.  WTF?  You guys are so
> strange.
>
>>
>>> If you (generic you) don't get that, you'll have a bad time. I mean
>>> *really*
>>> get it, deep down in the bone. The long, bad habit of thinking as
>>> ASCII-encoded bytes as text is the problem here.
>
> I think the whole forking community is confused at because of your own
> arrogance.  Foo(l)s.
>
> markj
>

Looks like another bad batch, time to change your dealer again.

-- 
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.

Mark Lawrence


#63327

From	Mark Janssen <dreamingforward@gmail.com>
Date	2014-01-06 13:44 -0600
Message-ID	<mailman.5048.1389037462.18130.python-list@python.org>
In reply to	#63311

> Looks like another bad batch, time to change your dealer again.

??? Strange, when the debate hits bottom, accusations about doing
drugs come up.  This is like the third reference (and I don't even
drink alcohol).

mark


#63388

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-01-07 11:54 +1100
Message-ID	<52cb502d$0$30002$c3e8da3$5496439d@news.astraweb.com>
In reply to	#63327

Mark Janssen wrote:

>> Looks like another bad batch, time to change your dealer again.
> 
> ??? Strange, when the debate hits bottom, accusations about doing
> drugs come up.  This is like the third reference (and I don't even
> drink alcohol).

It is an oblique reference to the fact that your posts are incoherent and
confused. It is considered more socially polite to attribute that to
external substances such as drugs or alcohol (which you could, in
principle, do something about) than to explicitly say that you and your
views are disconnected from reality, i.e. crazy.

People aren't actually debating you. We've tried. You respond with insults
and don't give any evidence for your irrational assertions, so don't think
that this is a debate. Until you can (1) explain your thoughts in detail
rather than in vague terms that don't make sense, (2) demonstrate at least
a minimal level of competence rather than making utter n00b mistakes while
insisting that you know so much more than experts in the field, and (3)
give actual evidence for your assertions, this will not be a debate.

-- 
Steven


#63345

From	Ned Batchelder <ned@nedbatchelder.com>
Date	2014-01-06 16:14 -0500
Message-ID	<mailman.5063.1389042912.18130.python-list@python.org>
In reply to	#63311

On 1/6/14 2:30 PM, Mark Janssen wrote:
>>> Chris didn't say "bytes and ascii data", he said "bytes and TEXT".
>>> Text != "ascii data", and the fact that some people apparently think it
>>> does is pretty much the heart of the problem.
>>
>> The heart of a different problem, not this one.  The problem I refer to is
>> that many binary formats have well-defined ascii-encoded text tidbits.
>
> Really?  If people are using binary with "well-defined ascii-encoded
> tidbits", they're doing something wrong.  Perhaps you think escape
> characters "\n" are "well defined tidbits", but YOU WOULD BE WRONG.
> The purpose of binary is to keep things raw.  WTF?  You guys are so
> strange.
>
>>
>>> If you (generic you) don't get that, you'll have a bad time. I mean
>>> *really*
>>> get it, deep down in the bone. The long, bad habit of thinking as
>>> ASCII-encoded bytes as text is the problem here.
>
> I think the whole forking community is confused at because of your own
> arrogance.  Foo(l)s.
>
> markj
>

If you want to participate in this discussion, do so.  Calling people 
strange, arrogant, and fools with no technical content is just rude. 
Typing "YOU WOULD BE WRONG" in all caps doesn't count as technical content.

-- 
Ned Batchelder, http://nedbatchelder.com


#63347

From	Mark Janssen <dreamingforward@gmail.com>
Date	2014-01-06 15:23 -0600
Message-ID	<mailman.5065.1389043396.18130.python-list@python.org>
In reply to	#63311

>> Really?  If people are using binary with "well-defined ascii-encoded
>> tidbits", they're doing something wrong.  Perhaps you think escape
>> characters "\n" are "well defined tidbits", but YOU WOULD BE WRONG.
>> The purpose of binary is to keep things raw.  WTF?

> If you want to participate in this discussion, do so.  Calling people
> strange, arrogant, and fools with no technical content is just rude. Typing
> "YOU WOULD BE WRONG" in all caps doesn't count as technical content.

Ned -- IF


#63349

From	Mark Janssen <dreamingforward@gmail.com>
Date	2014-01-06 15:32 -0600
Message-ID	<mailman.5067.1389043963.18130.python-list@python.org>
In reply to	#63311

>> Really?  If people are using binary with "well-defined ascii-encoded
>> tidbits", they're doing something wrong.  Perhaps you think escape
>> characters "\n" are "well defined tidbits", but YOU WOULD BE WRONG.
>> The purpose of binary is to keep things raw.  WTF?
>
> If you want to participate in this discussion, do so.  Calling people
> strange, arrogant, and fools with no technical content is just rude. Typing
> "YOU WOULD BE WRONG" in all caps doesn't count as technical content.

Ned -- IF YOU'RE A REAL PERSON -- you will see that several words
prior to that declaration, you'll find (or be able to arrange) the
proposition: "Escape characters are well-defined tidbits of binary
data is FALSE".

Now that is a technical point that I'm saying is simply the "way
things are" coming from the mass of experience held by the OS
community and the C programming community which is responsible for
much of the world's computer systems.  Do you have an argument against
it, or do you piss off and argue against anything I say?? Perhaps I
said it too loudly, and I take responsibility for that, but don't
claim I'm not making a technical point which seems to be at the heart
of all the confusion regarding python/python3 and str/unicode/bytes.

mark


#63368

From	Chris Angelico <rosuav@gmail.com>
Date	2014-01-07 10:03 +1100
Message-ID	<mailman.5086.1389049410.18130.python-list@python.org>
In reply to	#63311

On Tue, Jan 7, 2014 at 8:32 AM, Mark Janssen <dreamingforward@gmail.com> wrote:
>>> Really?  If people are using binary with "well-defined ascii-encoded
>>> tidbits", they're doing something wrong.  Perhaps you think escape
>>> characters "\n" are "well defined tidbits", but YOU WOULD BE WRONG.
>>> The purpose of binary is to keep things raw.  WTF?
>>
>> If you want to participate in this discussion, do so.  Calling people
>> strange, arrogant, and fools with no technical content is just rude. Typing
>> "YOU WOULD BE WRONG" in all caps doesn't count as technical content.
>
> Ned -- [chomp verbiage]

Mark, please watch your citations. Several (all?) of your posts in
this thread have omitted the line(s) at the top saying who you're
quoting. Have a look at my post here, and then imagine how confused
Mark Lawrence would be if I hadn't made it clear that I wasn't
addressing him.

Thanks!

ChrisA

