Groups > comp.lang.python > #63287 > unrolled thread
| Started by | Ethan Furman <ethan@stoneleaf.us> |
|---|---|
| First post | 2014-01-06 07:10 -0800 |
| Last post | 2014-01-07 10:03 +1100 |
| Articles | 12 — 6 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: "More About Unicode in Python 2 and 3" Ethan Furman <ethan@stoneleaf.us> - 2014-01-06 07:10 -0800
Re: "More About Unicode in Python 2 and 3" Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-07 04:27 +1100
Re: "More About Unicode in Python 2 and 3" Ethan Furman <ethan@stoneleaf.us> - 2014-01-06 10:34 -0800
Re: "More About Unicode in Python 2 and 3" Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-07 11:42 +1100
Re: "More About Unicode in Python 2 and 3" Mark Janssen <dreamingforward@gmail.com> - 2014-01-06 13:30 -0600
Re: "More About Unicode in Python 2 and 3" Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-01-06 19:36 +0000
Re: "More About Unicode in Python 2 and 3" Mark Janssen <dreamingforward@gmail.com> - 2014-01-06 13:44 -0600
Re: "More About Unicode in Python 2 and 3" Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-07 11:54 +1100
Re: "More About Unicode in Python 2 and 3" Ned Batchelder <ned@nedbatchelder.com> - 2014-01-06 16:14 -0500
Re: "More About Unicode in Python 2 and 3" Mark Janssen <dreamingforward@gmail.com> - 2014-01-06 15:23 -0600
Re: "More About Unicode in Python 2 and 3" Mark Janssen <dreamingforward@gmail.com> - 2014-01-06 15:32 -0600
Re: "More About Unicode in Python 2 and 3" Chris Angelico <rosuav@gmail.com> - 2014-01-07 10:03 +1100
| From | Ethan Furman <ethan@stoneleaf.us> |
|---|---|
| Date | 2014-01-06 07:10 -0800 |
| Subject | Re: "More About Unicode in Python 2 and 3" |
| Message-ID | <mailman.5022.1389022306.18130.python-list@python.org> |
On 01/05/2014 06:37 PM, Dan Stromberg wrote:
>
> The argument seems to be "3.x doesn't work the way I'm accustomed to,
> so I'm not going to use it, and I'm going to shout about it until
> others agree with me."

The argument is that a very important, if small, subset of data manipulation became very painful in Py3. Not impossible, and not difficult, but painful because the mental model and the contortions needed to get things to work don't sync up anymore. Painful because Python is, at heart, a simple and elegant language, but with the use-case of embedded ascii in binary data that elegance went right out the window.

On 01/05/2014 06:55 PM, Chris Angelico wrote:
>
> It can't be both things. It's either bytes or it's text.

Of course it can be:

    0000000: 0372 0106 0000 0000 6100 1d00 0000 0000  .r......a.......
    0000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
    0000020: 4e41 4d45 0000 0000 0000 0043 0100 0000  NAME.......C....
    0000030: 1900 0000 0000 0000 0000 0000 0000 0000  ................
    0000040: 4147 4500 0000 0000 0000 004e 1a00 0000  AGE........N....
    0000050: 0300 0000 0000 0000 0000 0000 0000 0000  ................
    0000060: 0d1a 0a                                  ...

And there we are, mixed bytes and ascii data. As I said earlier, my example is minimal, but still very frustrating in that normal operations no longer work. Incidentally, if you were thinking that NAME and AGE were part of the ascii text, you'd be wrong -- the field names are also encoded, as are the Character and Memo fields.

--
~Ethan~
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-01-07 04:27 +1100 |
| Message-ID | <52cae78d$0$29971$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #63287 |
Ethan Furman wrote:

> On 01/05/2014 06:37 PM, Dan Stromberg wrote:
>>
>> The argument seems to be "3.x doesn't work the way I'm accustomed to,
>> so I'm not going to use it, and I'm going to shout about it until
>> others agree with me."
>
> The argument is that a very important, if small, subset of data
> manipulation became very painful in Py3. Not impossible, and not
> difficult, but painful because the mental model and the contortions needed
> to get things to work don't sync up
> anymore. Painful because Python is, at heart, a simple and elegant
> language, but with the use-case of embedded ascii in binary data that
> elegance went right out the window.
>
> On 01/05/2014 06:55 PM, Chris Angelico wrote:
>>
>> It can't be both things. It's either bytes or it's text.
>
> Of course it can be:
>
>     0000000: 0372 0106 0000 0000 6100 1d00 0000 0000  .r......a.......
>     0000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
>     0000020: 4e41 4d45 0000 0000 0000 0043 0100 0000  NAME.......C....
>     0000030: 1900 0000 0000 0000 0000 0000 0000 0000  ................
>     0000040: 4147 4500 0000 0000 0000 004e 1a00 0000  AGE........N....
>     0000050: 0300 0000 0000 0000 0000 0000 0000 0000  ................
>     0000060: 0d1a 0a                                  ...
>
> And there we are, mixed bytes and ascii data.

Chris didn't say "bytes and ascii data", he said "bytes and TEXT". Text != "ascii data", and the fact that some people apparently think it does is pretty much the heart of the problem.

I see no mixed bytes and text. I see bytes. Since the above comes from a file, it cannot be anything else but bytes. Do you think that a file that happens to be a JPEG contains pixels? No. It contains bytes which, after decoding, represent pixels. Same with text, ascii or otherwise.

Now, it is true that some of those bytes happen to fall into the same range of values as ASCII-encoded text. They may even represent text after decoding, but since we don't know what the file contents mean, we can't know that. It might be a mere coincidence that the four bytes starting at hex offset 40 are the C long 1095189760, which happens to look like "AGE" with a null at the end. For historical reasons, your hexdump utility performs that decoding step for you, which is why you can see "NAME" and "AGE" in the right-hand block, but that doesn't mean the file contains text. It contains bytes, some of which represent text after decoding.

If you (generic you) don't get that, you'll have a bad time. I mean *really* get it, deep down in the bone. The long, bad habit of thinking of ASCII-encoded bytes as text is the problem here. The average programmer has years and years of experience thinking about decoding bytes to numbers and back (just not by that name), so it doesn't lead to any cognitive dissonance to think of hex 4147 4500 as either four bytes, two double-byte ints, or a single four-byte int. But as soon as "text" comes into the picture, the average programmer has equally many years of thinking that the byte 41 "just is" the letter "A", and that's simply *wrong*.

> As I said earlier, my
> example is minimal, but still very frustrating in
> that normal operations no longer work. Incidentally, if you were thinking
> that NAME and AGE were part of the ascii text, you'd be wrong -- the field
> names are also encoded, as are the Character and Memo fields.

What Character and Memo fields? Are you trying to say that the NAME and AGE are *not* actually ASCII text, but a mere coincidence, like my example of 1095189760? Or are you referring to the fact that they're actually encoded as ASCII? If not, I have no idea what you are trying to say.

--
Steven
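Editor's sketch: the point about hex 4147 4500 being readable as four bytes, two double-byte ints, one four-byte int, or text can be checked directly. This is a minimal illustration for Python 3, assuming big-endian byte order (which is what makes the integer come out as the 1095189760 quoted in the post); the variable names are made up for the example.

```python
import struct

# The four bytes found at hex offset 0x40 of the dump.
raw = bytes.fromhex("41474500")

as_bytes = list(raw)                            # four separate byte values
as_shorts = struct.unpack(">2H", raw)           # two big-endian 16-bit ints
as_long = int.from_bytes(raw, "big")            # one big-endian 32-bit int
as_text = raw.rstrip(b"\x00").decode("ascii")   # text, but only after decoding

print(as_bytes, as_shorts, as_long, repr(as_text))
```

Nothing in the bytes themselves says which of these readings is the intended one; that knowledge comes from the file format, which is the disagreement in this thread.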
| From | Ethan Furman <ethan@stoneleaf.us> |
|---|---|
| Date | 2014-01-06 10:34 -0800 |
| Message-ID | <mailman.5041.1389034608.18130.python-list@python.org> |
| In reply to | #63311 |
On 01/06/2014 09:27 AM, Steven D'Aprano wrote:
> Ethan Furman wrote:
>
> Chris didn't say "bytes and ascii data", he said "bytes and TEXT".
> Text != "ascii data", and the fact that some people apparently think it
> does is pretty much the heart of the problem.

The heart of a different problem, not this one. The problem I refer to is that many binary formats have well-defined ascii-encoded text tidbits. These tidbits were quite easy to work with in Py2, not difficult but not elegant in Py3, and even worse if you have to support both 2 and 3.

> Now, it is true that some of those bytes happen to fall into the same range
> of values as ASCII-encoded text. They may even represent text after
> decoding, but since we don't know what the file contents mean, we can't
> know that.

Of course we can -- we're the programmer, after all. This is not a random bunch of bytes but a well defined format for storing data.

> It might be a mere coincidence that the four bytes starting at
> hex offset 40 is the C long 1095189760 which happens to look like "AGE"
> with a null at the end. For historical reasons, your hexdump utility
> performs that decoding step for you, which is why you can see "NAME"
> and "AGE" in the right-hand block, but that doesn't mean the file contains
> text. It contains bytes, some of which represents text after decoding.

As it happens, 'NAME' and 'AGE' are encoded, and will be decoded. They could just as easily have contained tildes, accents, umlauts, and other strange (to me) characters. It's actually the 'C' and the 'N' that bug me (like I said, my example is minimal, especially compared to a network protocol).

And you're right -- it is easy to say FIELD_TYPE = slice(15,16), and it was also easy to say FIELD_TYPE = 15, but there is a critical difference -- can you spot it?

..
..
..

In case you didn't: both work in Py2, only the slice version works (correctly) in Py3, but the worst part is why do I have to use a slice to take a single byte when a simple index should work? Because the bytes type lies. It shows, for example, b'\r\n\x12\x08N\x00' but when I try to access that N to see if this is a Numeric field I get:

    --> b'\r\n\x12\x08N\x00'[4]
    78

This is a cognitive dissonance that one does not expect in Python.

> If you (generic you) don't get that, you'll have a bad time. I mean *really*
> get it, deep down in the bone. The long, bad habit of thinking as
> ASCII-encoded bytes as text is the problem here.

Different problem. The problem here is that bytes and byte literals don't compare equal.

> the average programmer has equally many years of thinking that the
> byte 41 "just is" the letter "A", and that's simply *wrong*.

Agreed. But byte 41 != b'A', and that is equally wrong.

>> As I said earlier, my
>> example is minimal, but still very frustrating in
>> that normal operations no longer work. Incidentally, if you were thinking
>> that NAME and AGE were part of the ascii text, you'd be wrong -- the field
>> names are also encoded, as are the Character and Memo fields.
>
> What Character and Memo fields? Are you trying to say that the NAME and AGE
> are *not* actually ASCII text, but a mere coincidence, like my example of
> 1095189760? Or are you referring to the fact that they're actually encoded
> as ASCII? If not, I have no idea what you are trying to say.

Yes, NAME and AGE are *not* ASCII text, but latin-1 encoded. The C and the N are ASCII, meaningful as-is. The actual data stored in a Character (NAME in this case) or Memo (not shown) field would also be latin-1 encoded. (And before you ask, the encoding is stored in the file header.)

--
~Ethan~
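Editor's sketch: the index-versus-slice difference described in this post can be reproduced in a few lines. This is a minimal illustration for Python 3 only. The sample bytes are the ones from the post; the offset stored in `FIELD_TYPE` is picked to fit that sample and is not taken from any real file layout.

```python
data = b'\r\n\x12\x08N\x00'

by_index = data[4]      # Python 3: indexing bytes yields an int
by_slice = data[4:5]    # Python 3: slicing yields a length-1 bytes object

# Hypothetical field-type offset, valid for this sample only.
FIELD_TYPE = slice(4, 5)
is_numeric = data[FIELD_TYPE] == b'N'

print(by_index, by_slice, by_index == b'N', is_numeric)
```

If the integer form is what you have, comparing against `ord(b'N')` works as well. Which spelling reads better is a matter of taste and is the complaint being made in the post.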
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-01-07 11:42 +1100 |
| Message-ID | <52cb4d82$0$29979$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #63319 |
Ethan Furman wrote:

> On 01/06/2014 09:27 AM, Steven D'Aprano wrote:
>> Ethan Furman wrote:
>>
>> Chris didn't say "bytes and ascii data", he said "bytes and TEXT".
>> Text != "ascii data", and the fact that some people apparently think it
>> does is pretty much the heart of the problem.
>
> The heart of a different problem, not this one. The problem I refer to is
> that many binary formats have well-defined
> ascii-encoded text tidbits. These tidbits were quite easy to work with in
> Py2, not difficult but not elegant in Py3, and even worse if you have to
> support both 2 and 3.

Many things are more difficult if you have to support a large range of versions. That's life, for a programmer.

>> Now, it is true that some of those bytes happen to fall into the same
>> range of values as ASCII-encoded text. They may even represent text after
>> decoding, but since we don't know what the file contents mean, we can't
>> know that.
>
> Of course we can -- we're the programmer, after all. This is not a random
> bunch of bytes but a well defined format for storing data.

No, you misunderstand me. *You* may know what the data represents, but *we* don't, because you just drop a hex dump in our laps with no explanation.

>> It might be a mere coincidence that the four bytes starting at
>> hex offset 40 is the C long 1095189760 which happens to look like "AGE"
>> with a null at the end. For historical reasons, your hexdump utility
>> performs that decoding step for you, which is why you can see "NAME"
>> and "AGE" in the right-hand block, but that doesn't mean the file
>> contains text. It contains bytes, some of which represents text after
>> decoding.
>
> As it happens, 'NAME' and 'AGE' are encoded, and will be decoded.

You're either saying something utterly trivial, or something utterly profound, and I can't tell which. Of course they are encoded. The file doesn't contain the letter "N", it contains the byte 0x4E. So what are you actually trying to say?

> They could just as easily have contained tildes,
> accents, umlauts, and other strange (to me) characters.

I'm especially confused here because tildes are included in the ASCII character set. Here's one here: ~

> It's actually the
> 'C' and the 'N' that bug me (like I said, my example is minimal,
> especially compared to a network protocol).
>
> And you're right -- it is easy to say FIELD_TYPE = slice(15,16), and it
> was also easy to say FIELD_TYPE = 15, but there is a critical difference
> -- can you spot it?
>
> ..
> ..
> ..
> In case you didn't: both work in Py2, only the slice version works
> (correctly) in Py3,

I accept that using the slice is inelegant. But lots of things are inelegant when you do them the wrong way. Treating your textual data as bytes is the wrong way. You apparently know that your data is encoded text, you apparently know the encoding... so why don't you just decode it and treat it as text instead of insisting on dealing with the raw bytes?

Are you worried about performance? I'd be sympathetic if you were writing some low-level network protocol stuff where performance is vital, but you keep saying that your application is "minimal", which I interpret as performance not being critical. So what's the deal?

> but the worst part is why do I
> have to use a slice to take a single byte when a simple index should work?

I don't understand the rationale for having byte indexing return an int instead of a one-byte substring. Especially since we still have a perfectly good way to extract the numeric value from a one-byte byte-string:

    py> ord(b'N')
    78

> Because the bytes type lies. It shows, for example, b'\r\n\x12\x08N\x00'
> but when I try to access that N to see if this is a Numeric field I get:
>
> --> b'\r\n\x12\x08N\x00'[4]
> 78
>
> This is a cognitive dissonance that one does not expect in Python.

Yes, I agree. I think it was a terrible mistake to have bytes continue to pretend to be ASCII. Having this occur:

    py> print(b'\x4E')
    b'N'

does nothing but muddy the water. I think it would be too much to disallow using ASCII literals in byte strings, but we shouldn't *display* byte strings as ASCII.

    py> print(b'N')  # This would be better.
    b'\x4E'

[...]

> Different problem. The problem here is that bytes and byte literals don't
> compare equal.

Right! Now I get where you are coming from.

--
Steven
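Editor's sketch: the "decode it and treat it as text" suggestion, applied to the 32 bytes shown at offsets 0x20 to 0x3f of the hex dump earlier in the thread. Several things here are assumptions made for illustration: the split into an 11-byte NUL-padded name followed by a one-byte type code is read off the dump rather than taken from a format specification, the encoding is hard-coded as latin-1 (Ethan says the real one is stored in the file header), and all names are invented for the example.

```python
# Bytes copied from the dump; bytes.fromhex ignores the spaces.
descriptor = bytes.fromhex(
    "4e41 4d45 0000 0000 0000 0043 0100 0000"
    " 1900 0000 0000 0000 0000 0000 0000 0000"
)

ENCODING = "latin-1"            # assumed here; normally read from the header

NAME_FIELD = slice(0, 11)       # layout inferred from the dump, not a spec
TYPE_FIELD = slice(11, 12)

# Decode once at the boundary, then work with str everywhere else.
name = descriptor[NAME_FIELD].rstrip(b"\x00").decode(ENCODING)
field_type = descriptor[TYPE_FIELD].decode("ascii")

print(name, field_type, field_type == "C")
```

After decoding, the comparison is between two `str` values, so the int-versus-bytes question from the previous posts does not come up. Whether decoding a type code that is never displayed as text is the right model is the point the two posters disagree on.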
| From | Mark Janssen <dreamingforward@gmail.com> |
|---|---|
| Date | 2014-01-06 13:30 -0600 |
| Message-ID | <mailman.5045.1389036656.18130.python-list@python.org> |
| In reply to | #63311 |
>> Chris didn't say "bytes and ascii data", he said "bytes and TEXT".
>> Text != "ascii data", and the fact that some people apparently think it
>> does is pretty much the heart of the problem.
>
> The heart of a different problem, not this one. The problem I refer to is
> that many binary formats have well-defined ascii-encoded text tidbits.

Really? If people are using binary with "well-defined ascii-encoded tidbits", they're doing something wrong. Perhaps you think escape characters "\n" are "well defined tidbits", but YOU WOULD BE WRONG. The purpose of binary is to keep things raw. WTF? You guys are so strange.

>
>> If you (generic you) don't get that, you'll have a bad time. I mean
>> *really*
>> get it, deep down in the bone. The long, bad habit of thinking as
>> ASCII-encoded bytes as text is the problem here.

I think the whole forking community is confused at because of your own arrogance. Foo(l)s.

markj
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2014-01-06 19:36 +0000 |
| Message-ID | <mailman.5046.1389037011.18130.python-list@python.org> |
| In reply to | #63311 |
On 06/01/2014 19:30, Mark Janssen wrote:
>>> Chris didn't say "bytes and ascii data", he said "bytes and TEXT".
>>> Text != "ascii data", and the fact that some people apparently think it
>>> does is pretty much the heart of the problem.
>>
>> The heart of a different problem, not this one. The problem I refer to is
>> that many binary formats have well-defined ascii-encoded text tidbits.
>
> Really? If people are using binary with "well-defined ascii-encoded
> tidbits", they're doing something wrong. Perhaps you think escape
> characters "\n" are "well defined tidbits", but YOU WOULD BE WRONG.
> The purpose of binary is to keep things raw. WTF? You guys are so
> strange.
>
>>
>>> If you (generic you) don't get that, you'll have a bad time. I mean
>>> *really*
>>> get it, deep down in the bone. The long, bad habit of thinking as
>>> ASCII-encoded bytes as text is the problem here.
>
> I think the whole forking community is confused at because of your own
> arrogance. Foo(l)s.
>
> markj
>

Looks like another bad batch, time to change your dealer again.

--
My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language.

Mark Lawrence
| From | Mark Janssen <dreamingforward@gmail.com> |
|---|---|
| Date | 2014-01-06 13:44 -0600 |
| Message-ID | <mailman.5048.1389037462.18130.python-list@python.org> |
| In reply to | #63311 |
> Looks like another bad batch, time to change your dealer again.

??? Strange, when the debate hits bottom, accusations about doing drugs come up. This is like the third reference (and I don't even drink alcohol).

mark
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-01-07 11:54 +1100 |
| Message-ID | <52cb502d$0$30002$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #63327 |
Mark Janssen wrote:

>> Looks like another bad batch, time to change your dealer again.
>
> ??? Strange, when the debate hits bottom, accusations about doing
> drugs come up. This is like the third reference (and I don't even
> drink alcohol).

It is an oblique reference to the fact that your posts are incoherent and confused. It is considered more socially polite to attribute that to external substances such as drugs or alcohol (which you could, in principle, do something about) than to explicitly say that you and your views are disconnected from reality, i.e. crazy.

People aren't actually debating you. We've tried. You respond with insults and don't give any evidence for your irrational assertions, so don't think that this is a debate. Until you can (1) explain your thoughts in detail rather than in vague terms that don't make sense, (2) demonstrate at least a minimal level of competence rather than making utter n00b mistakes while insisting that you know so much more than experts in the field, and (3) give actual evidence for your assertions, this will not be a debate.

--
Steven
| From | Ned Batchelder <ned@nedbatchelder.com> |
|---|---|
| Date | 2014-01-06 16:14 -0500 |
| Message-ID | <mailman.5063.1389042912.18130.python-list@python.org> |
| In reply to | #63311 |
On 1/6/14 2:30 PM, Mark Janssen wrote:
>>> Chris didn't say "bytes and ascii data", he said "bytes and TEXT".
>>> Text != "ascii data", and the fact that some people apparently think it
>>> does is pretty much the heart of the problem.
>>
>> The heart of a different problem, not this one. The problem I refer to is
>> that many binary formats have well-defined ascii-encoded text tidbits.
>
> Really? If people are using binary with "well-defined ascii-encoded
> tidbits", they're doing something wrong. Perhaps you think escape
> characters "\n" are "well defined tidbits", but YOU WOULD BE WRONG.
> The purpose of binary is to keep things raw. WTF? You guys are so
> strange.
>
>>
>>> If you (generic you) don't get that, you'll have a bad time. I mean
>>> *really*
>>> get it, deep down in the bone. The long, bad habit of thinking as
>>> ASCII-encoded bytes as text is the problem here.
>
> I think the whole forking community is confused at because of your own
> arrogance. Foo(l)s.
>
> markj
>

If you want to participate in this discussion, do so. Calling people strange, arrogant, and fools with no technical content is just rude. Typing "YOU WOULD BE WRONG" in all caps doesn't count as technical content.

--
Ned Batchelder, http://nedbatchelder.com
| From | Mark Janssen <dreamingforward@gmail.com> |
|---|---|
| Date | 2014-01-06 15:23 -0600 |
| Message-ID | <mailman.5065.1389043396.18130.python-list@python.org> |
| In reply to | #63311 |
>> Really? If people are using binary with "well-defined ascii-encoded
>> tidbits", they're doing something wrong. Perhaps you think escape
>> characters "\n" are "well defined tidbits", but YOU WOULD BE WRONG.
>> The purpose of binary is to keep things raw. WTF?

> If you want to participate in this discussion, do so. Calling people
> strange, arrogant, and fools with no technical content is just rude. Typing
> "YOU WOULD BE WRONG" in all caps doesn't count as technical content.

Ned -- IF
| From | Mark Janssen <dreamingforward@gmail.com> |
|---|---|
| Date | 2014-01-06 15:32 -0600 |
| Message-ID | <mailman.5067.1389043963.18130.python-list@python.org> |
| In reply to | #63311 |
>> Really? If people are using binary with "well-defined ascii-encoded
>> tidbits", they're doing something wrong. Perhaps you think escape
>> characters "\n" are "well defined tidbits", but YOU WOULD BE WRONG.
>> The purpose of binary is to keep things raw. WTF?
>
> If you want to participate in this discussion, do so. Calling people
> strange, arrogant, and fools with no technical content is just rude. Typing
> "YOU WOULD BE WRONG" in all caps doesn't count as technical content.

Ned -- IF YOU'RE A REAL PERSON -- you will see that several words prior to that declaration, you'll find (or be able to arrange) the proposition: "Escape characters are well-defined tidbits of binary data is FALSE".

Now that is a technical point that i'm saying is simply the "way things are" coming from the mass of experience held by the OS community and the C programming community which is responsible for much of the world's computer systems. Do you have an argument against it, or do you piss off and argue against anything I say??

Perhaps I said it too loudly, and I take responsibility for that, but don't claim I'm not making a technical point which seems to be at the heart of all the confusion regarding python/python3 and str/unicode/bytes.

mark
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-01-07 10:03 +1100 |
| Message-ID | <mailman.5086.1389049410.18130.python-list@python.org> |
| In reply to | #63311 |
On Tue, Jan 7, 2014 at 8:32 AM, Mark Janssen <dreamingforward@gmail.com> wrote:
>>> Really? If people are using binary with "well-defined ascii-encoded
>>> tidbits", they're doing something wrong. Perhaps you think escape
>>> characters "\n" are "well defined tidbits", but YOU WOULD BE WRONG.
>>> The purpose of binary is to keep things raw. WTF?
>>
>> If you want to participate in this discussion, do so. Calling people
>> strange, arrogant, and fools with no technical content is just rude. Typing
>> "YOU WOULD BE WRONG" in all caps doesn't count as technical content.
>
> Ned -- [chomp verbiage]

Mark, please watch your citations. Several (all?) of your posts in this thread have omitted the line(s) at the top saying who you're quoting. Have a look at my post here, and then imagine how confused Mark Lawrence would be if I hadn't made it clear that I wasn't addressing him.

Thanks!

ChrisA