Groups > comp.lang.python > #59510 > unrolled thread
| Started by | Robin Becker <robin@reportlab.com> |
|---|---|
| First post | 2013-11-15 11:28 +0000 |
| Last post | 2013-11-15 14:23 -0500 |
| Articles | 20 on this page of 31 — 14 participants |
python 3.3 repr Robin Becker <robin@reportlab.com> - 2013-11-15 11:28 +0000
Re: python 3.3 repr Ned Batchelder <ned@nedbatchelder.com> - 2013-11-15 03:38 -0800
Re: python 3.3 repr Robin Becker <robin@reportlab.com> - 2013-11-15 12:16 +0000
Re: python 3.3 repr Ned Batchelder <ned@nedbatchelder.com> - 2013-11-15 05:54 -0800
Re: python 3.3 repr Robin Becker <robin@reportlab.com> - 2013-11-15 14:29 +0000
Re: python 3.3 repr Serhiy Storchaka <storchaka@gmail.com> - 2013-11-15 16:40 +0200
Re: python 3.3 repr Robin Becker <robin@reportlab.com> - 2013-11-15 14:52 +0000
Re: python 3.3 repr Roy Smith <roy@panix.com> - 2013-11-15 09:25 -0500
Re: python 3.3 repr Robin Becker <robin@reportlab.com> - 2013-11-15 14:43 +0000
Re: python 3.3 repr Ned Batchelder <ned@nedbatchelder.com> - 2013-11-15 07:08 -0800
Re: python 3.3 repr Robin Becker <robin@reportlab.com> - 2013-11-15 15:39 +0000
Re: python 3.3 repr Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-11-15 16:49 +0100
Re: python 3.3 repr Chris Angelico <rosuav@gmail.com> - 2013-11-16 03:01 +1100
Re: python 3.3 repr Neil Cerutti <neilc@norwich.edu> - 2013-11-15 17:47 +0000
Re: python 3.3 repr Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-16 01:09 +0000
Re: python 3.3 repr Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-15 17:10 +0000
Re: python 3.3 repr Chris Angelico <rosuav@gmail.com> - 2013-11-16 04:29 +1100
Re: python 3.3 repr Cousin Stanley <cousinstanley@gmail.com> - 2013-11-15 10:45 -0700
Re: python 3.3 repr Joel Goldstick <joel.goldstick@gmail.com> - 2013-11-15 09:50 -0500
Re: python 3.3 repr Robin Becker <robin@reportlab.com> - 2013-11-15 15:03 +0000
Re: python 3.3 repr Joel Goldstick <joel.goldstick@gmail.com> - 2013-11-15 10:07 -0500
Re: python 3.3 repr Chris Angelico <rosuav@gmail.com> - 2013-11-16 02:08 +1100
Re: python 3.3 repr Robin Becker <robin@reportlab.com> - 2013-11-15 15:18 +0000
Re: python 3.3 repr Roy Smith <roy@panix.com> - 2013-11-15 10:32 -0500
Re: python 3.3 repr William Ray Wing <wrw@mac.com> - 2013-11-15 11:30 -0500
Re: python 3.3 repr Zero Piraeus <z@etiol.net> - 2013-11-15 14:06 -0300
Re: python 3.3 repr Chris Angelico <rosuav@gmail.com> - 2013-11-16 04:11 +1100
Re: python 3.3 repr Serhiy Storchaka <storchaka@gmail.com> - 2013-11-15 19:37 +0200
Re: python 3.3 repr Gene Heskett <gheskett@wdtv.com> - 2013-11-15 11:36 -0500
Re: python 3.3 repr Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-11-15 17:58 +0000
Re: python 3.3 repr Gene Heskett <gheskett@wdtv.com> - 2013-11-15 14:23 -0500
| From | Robin Becker <robin@reportlab.com> |
|---|---|
| Date | 2013-11-15 11:28 +0000 |
| Subject | python 3.3 repr |
| Message-ID | <mailman.2646.1384514912.18130.python-list@python.org> |
I'm trying to understand what's going on with this simple program
if __name__=='__main__':
    print("repr=%s" % repr(u'\xc1'))
    print("%%r=%r" % u'\xc1')
On my windows XP box this fails miserably if run directly at a terminal
C:\tmp> \Python33\python.exe bang.py
Traceback (most recent call last):
  File "bang.py", line 2, in <module>
    print("repr=%s" % repr(u'\xc1'))
  File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xc1' in position 6:
character maps to <undefined>
If I run the program redirected into a file then no error occurs and the
result looks like this
C:\tmp>cat fff
repr='┴'
%r='┴'
and if I run it into a pipe it works as though into a file.
It seems that repr thinks it can render u'\xc1' directly which is a problem
since print then seems to want to convert that to cp437 if directed into a terminal.
I find the idea that print knows what it's printing to a bit dangerous, but it's
the repr behaviour that strikes me as bad.
What is responsible for defining the repr function's 'printable' so that repr
would give me say an Ascii rendering?
-confused-ly yrs-
Robin Becker
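For readers following along, both symptoms in this post can be reproduced without a Windows console. The sketch below is Python 3 and rests on one assumption the post does not state: that the redirected output was written in the Windows ANSI code page cp1252 and then displayed by `cat` as cp437. The variable names are illustrative only.

```python
# Python 3 sketch. This is the text that print() has to encode for the console.
text = "repr=%s" % repr('\xc1')

# cp437 (the console code page named in the traceback) has no A-acute,
# so encoding fails, and it fails at the index the traceback reports.
try:
    text.encode('cp437')
    failed_at = None
except UnicodeEncodeError as exc:
    failed_at = exc.start

# Assumption: the redirected file was written as cp1252, where A-acute
# is the single byte 0xC1.
file_bytes = text.encode('cp1252')

# Reading those bytes back as cp437 turns 0xC1 into a box-drawing character.
seen_in_console = file_bytes.decode('cp437')

print(failed_at)
print(ascii(seen_in_console))
```

If that assumption holds, repr() produced the same string in both runs; only the encoding chosen for the output stream differed.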
| From | Ned Batchelder <ned@nedbatchelder.com> |
|---|---|
| Date | 2013-11-15 03:38 -0800 |
| Message-ID | <b6db8982-feac-4036-8ec4-2dc720d41a4b@googlegroups.com> |
| In reply to | #59510 |
On Friday, November 15, 2013 6:28:15 AM UTC-5, Robin Becker wrote:
> I'm trying to understand what's going on with this simple program
>
> if __name__=='__main__':
>     print("repr=%s" % repr(u'\xc1'))
>     print("%%r=%r" % u'\xc1')
>
> On my windows XP box this fails miserably if run directly at a terminal
>
> C:\tmp> \Python33\python.exe bang.py
> Traceback (most recent call last):
>   File "bang.py", line 2, in <module>
>     print("repr=%s" % repr(u'\xc1'))
>   File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
>     return codecs.charmap_encode(input,self.errors,encoding_map)[0]
> UnicodeEncodeError: 'charmap' codec can't encode character '\xc1' in position 6:
> character maps to <undefined>
>
> If I run the program redirected into a file then no error occurs and the
> result looks like this
>
> C:\tmp>cat fff
> repr='┴'
> %r='┴'
>
> and if I run it into a pipe it works as though into a file.
>
> It seems that repr thinks it can render u'\xc1' directly which is a problem
> since print then seems to want to convert that to cp437 if directed into a terminal.
>
> I find the idea that print knows what it's printing to a bit dangerous, but it's
> the repr behaviour that strikes me as bad.
>
> What is responsible for defining the repr function's 'printable' so that repr
> would give me say an Ascii rendering?
> -confused-ly yrs-
> Robin Becker
In Python3, repr() will return a Unicode string, and will preserve existing Unicode characters in its arguments. This has been controversial. To get the Python 2 behavior of a pure-ascii representation, there is the new builtin ascii(), and a corresponding %a format string.
--Ned.
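A quick way to see the difference described above, as a Python 3 sketch (ascii() and %a do not exist in Python 2; the variable names are only for illustration):

```python
# Python 3 only.
s = '\xc1'  # LATIN CAPITAL LETTER A WITH ACUTE

as_repr = repr(s)              # keeps the character itself between the quotes
as_ascii = ascii(s)            # escapes every non-ASCII character
via_percent_a = "%a" % s       # %-format spelling of ascii()
via_bang_a = "{!a}".format(s)  # str.format spelling of the same conversion

# ascii() is used here so the demo itself prints safely on any console.
print(ascii(as_repr), as_ascii, via_percent_a, via_bang_a)
```

The three ASCII spellings give the same result, so the choice between them is only a matter of which formatting style the surrounding code already uses.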
| From | Robin Becker <robin@reportlab.com> |
|---|---|
| Date | 2013-11-15 12:16 +0000 |
| Message-ID | <mailman.2648.1384517826.18130.python-list@python.org> |
| In reply to | #59511 |
On 15/11/2013 11:38, Ned Batchelder wrote:
..........
>
> In Python3, repr() will return a Unicode string, and will preserve existing Unicode characters in its arguments. This has been controversial. To get the Python 2 behavior of a pure-ascii representation, there is the new builtin ascii(), and a corresponding %a format string.
>
> --Ned.
>
thanks for this, it doesn't make the split across python2 - 3 any easier.
--
Robin Becker
| From | Ned Batchelder <ned@nedbatchelder.com> |
|---|---|
| Date | 2013-11-15 05:54 -0800 |
| Message-ID | <edbfd521-c595-453e-a019-8d39c79437fb@googlegroups.com> |
| In reply to | #59513 |
On Friday, November 15, 2013 7:16:52 AM UTC-5, Robin Becker wrote:
> On 15/11/2013 11:38, Ned Batchelder wrote:
> ..........
> >
> > In Python3, repr() will return a Unicode string, and will preserve existing Unicode characters in its arguments. This has been controversial. To get the Python 2 behavior of a pure-ascii representation, there is the new builtin ascii(), and a corresponding %a format string.
> >
> > --Ned.
> >
>
> thanks for this, it doesn't make the split across python2 - 3 any easier.
> --
> Robin Becker
No, but I've found that significant programs that run on both 2 and 3 need to have some shims to make the code work anyway. You could do this:
try:
    repr = ascii
except NameError:
    pass
and then use repr throughout.
--Ned.
| From | Robin Becker <robin@reportlab.com> |
|---|---|
| Date | 2013-11-15 14:29 +0000 |
| Message-ID | <mailman.2656.1384525772.18130.python-list@python.org> |
| In reply to | #59520 |
On 15/11/2013 13:54, Ned Batchelder wrote:
.........
>
> No, but I've found that significant programs that run on both 2 and 3 need to have some shims to make the code work anyway. You could do this:
>
>     try:
>         repr = ascii
>     except NameError:
>         pass
....
yes I tried that, but it doesn't affect %r which is inlined in unicodeobject.c, for me it seems easier to fix windows to use something like a standard encoding of utf8 ie cp65001, but that's quite hard to do globally.
It seems sitecustomize is too late to set os.environ['PYTHONIOENCODING'], perhaps I can stuff that into one of the global environment vars and have it work for all python invocations.
--
Robin Becker
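The point about %r can be checked directly. A minimal Python 3 sketch: rebinding the name repr changes explicit calls, but %-formatting never looks the name up, so %r output is unchanged.

```python
# Python 3 sketch: shadow the builtin at module level, as in the shim.
repr = ascii

s = '\xc1'
called_directly = repr(s)  # goes through the rebound name
formatted = "%r" % s       # ignores the rebound name

print(called_directly)
print(ascii(formatted))
```

So the shim only helps code that calls repr() by name; every %r in a format string would still have to be changed to %a by hand.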
| From | Serhiy Storchaka <storchaka@gmail.com> |
|---|---|
| Date | 2013-11-15 16:40 +0200 |
| Message-ID | <mailman.2658.1384526450.18130.python-list@python.org> |
| In reply to | #59520 |
On 15.11.13 15:54, Ned Batchelder wrote:
> No, but I've found that significant programs that run on both 2 and 3 need to have some shims to make the code work anyway. You could do this:
>
>     try:
>         repr = ascii
>     except NameError:
>         pass
>
> and then use repr throughout.
Or rather
try:
    ascii
except NameError:
    ascii = repr
and then use ascii throughout.
| From | Robin Becker <robin@reportlab.com> |
|---|---|
| Date | 2013-11-15 14:52 +0000 |
| Message-ID | <mailman.2663.1384527134.18130.python-list@python.org> |
| In reply to | #59520 |
On 15/11/2013 14:40, Serhiy Storchaka wrote:
......
>> and then use repr throughout.
>
> Or rather
>
>     try:
>         ascii
>     except NameError:
>         ascii = repr
>
> and then use ascii throughout.
>
>
apparently you can import ascii from future_builtins and the print() function is available as
from __future__ import print_function
nothing fixes all those %r formats to be %a though :(
--
Robin Becker
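One possible workaround for existing %r formats, sketched here for Python 3 and not suggested anywhere in the thread, is to escape the finished string afterwards with the backslashreplace error handler. The helper name ascii_safe is hypothetical.

```python
def ascii_safe(text):
    """Escape every non-ASCII character in an already formatted string.

    Hypothetical helper, not part of the standard library.
    """
    return text.encode('ascii', 'backslashreplace').decode('ascii')


# The %r format is left alone; the result is escaped after formatting.
line = ascii_safe("%%r=%r" % '\xc1')
print(line)
```

Note that this escapes non-ASCII characters anywhere in the string, including literal text in the format string, so it is broader than switching %r to %a.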
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-11-15 09:25 -0500 |
| Message-ID | <mailman.2655.1384525556.18130.python-list@python.org> |
| In reply to | #59511 |
In article <b6db8982-feac-4036-8ec4-2dc720d41a4b@googlegroups.com>,
Ned Batchelder <ned@nedbatchelder.com> wrote:
> In Python3, repr() will return a Unicode string, and will preserve existing
> Unicode characters in its arguments. This has been controversial. To get
> the Python 2 behavior of a pure-ascii representation, there is the new
> builtin ascii(), and a corresponding %a format string.
I'm still stuck on Python 2, and while I can understand the controversy ("It breaks my Python 2 code!"), this seems like the right thing to have done. In Python 2, unicode is an add-on. One of the big design drivers in Python 3 was to make unicode the standard.
The idea behind repr() is to provide a "just plain text" representation of an object. In P2, "just plain text" means ascii, so escaping non-ascii characters makes sense. In P3, "just plain text" means unicode, so escaping non-ascii characters no longer makes sense.
Some of us have been doing this long enough to remember when "just plain text" meant only a single case of the alphabet (and a subset of ascii punctuation). On an ASR-33, your C program would print like:
MAIN() \(
PRINTF("HELLO, ASCII WORLD");
\)
because ASR-33's didn't have curly braces (or lower case).
Having P3's repr() escape non-ascii characters today makes about as much sense as expecting P2's repr() to escape curly braces (and vertical bars, and a few others) because not every terminal can print those.
--
Roy Smith
roy@panix.com
| From | Robin Becker <robin@reportlab.com> |
|---|---|
| Date | 2013-11-15 14:43 +0000 |
| Message-ID | <mailman.2660.1384526610.18130.python-list@python.org> |
| In reply to | #59511 |
..........
> I'm still stuck on Python 2, and while I can understand the controversy ("It breaks my Python 2 code!"), this seems like the right thing to have done. In Python 2, unicode is an add-on. One of the big design drivers in Python 3 was to make unicode the standard.
>
> The idea behind repr() is to provide a "just plain text" representation of an object. In P2, "just plain text" means ascii, so escaping non-ascii characters makes sense. In P3, "just plain text" means unicode, so escaping non-ascii characters no longer makes sense.
>
unfortunately the word 'printable' got into the definition of repr; it's clear
that printability is not the same as unicode at least as far as the print
function is concerned. In my opinion it would have been better to leave the old
behaviour as that would have eased the compatibility.
The python gods don't count that sort of thing as important enough so we get the
mess that is the python2/3 split. ReportLab has to do both so it's a real issue;
in addition swapping the str - unicode pair to bytes str doesn't help one's
mental models either :(
Things went wrong when utf8 was not adopted as the standard encoding thus
requiring two string types, it would have been easier to have a len function to
count bytes as before and a glyphlen to count glyphs. Now as I understand it we
have a complicated mess under the hood for unicode objects so they have a
variable representation to approximate an 8 bit representation when suitable etc
etc etc.
> Some of us have been doing this long enough to remember when "just plain text" meant only a single case of the alphabet (and a subset of ascii punctuation). On an ASR-33, your C program would print like:
>
> MAIN() \(
> PRINTF("HELLO, ASCII WORLD");
> \)
>
> because ASR-33's didn't have curly braces (or lower case).
>
> Having P3's repr() escape non-ascii characters today makes about as much sense as expecting P2's repr() to escape curly braces (and vertical bars, and a few others) because not every terminal can print those.
>
.....
I can certainly remember those days, how we cried and laughed when 8 bits became
popular.
--
Robin Becker
| From | Ned Batchelder <ned@nedbatchelder.com> |
|---|---|
| Date | 2013-11-15 07:08 -0800 |
| Message-ID | <0d383a3c-247f-4b6a-9a18-7e7fadeb6047@googlegroups.com> |
| In reply to | #59526 |
On Friday, November 15, 2013 9:43:17 AM UTC-5, Robin Becker wrote:
> Things went wrong when utf8 was not adopted as the standard encoding thus
> requiring two string types, it would have been easier to have a len function to
> count bytes as before and a glyphlen to count glyphs. Now as I understand it we
> have a complicated mess under the hood for unicode objects so they have a
> variable representation to approximate an 8 bit representation when suitable etc
> etc etc.
>
Dealing with bytes and Unicode is complicated, and the 2->3 transition is not easy, but let's please not spread the misunderstanding that somehow the Flexible String Representation is at fault. However you store Unicode code points, they are different than bytes, and it is complex having to deal with both. You can't somehow make the dichotomy go away, you can only choose where you want to think about it.
--Ned.
> --
> Robin Becker
| From | Robin Becker <robin@reportlab.com> |
|---|---|
| Date | 2013-11-15 15:39 +0000 |
| Message-ID | <mailman.2671.1384529961.18130.python-list@python.org> |
| In reply to | #59533 |
.........
>
> Dealing with bytes and Unicode is complicated, and the 2->3 transition is not easy, but let's please not spread the misunderstanding that somehow the Flexible String Representation is at fault. However you store Unicode code points, they are different than bytes, and it is complex having to deal with both. You can't somehow make the dichotomy go away, you can only choose where you want to think about it.
>
> --Ned.
.......
I don't think that's what I said; the flexible representation is just an added complexity that has come about because of the wish to store strings in a compact way. The requirement for such complexity is the unicode type itself (especially the storage requirements) which necessitated some remedial action.
There's no point in fighting the change to using unicode. The type wasn't required for any technical reason as other languages didn't go this route and are reasonably ok, but there's no doubt the change made things more difficult.
--
Robin Becker
| From | Antoon Pardon <antoon.pardon@rece.vub.ac.be> |
|---|---|
| Date | 2013-11-15 16:49 +0100 |
| Message-ID | <mailman.2673.1384530577.18130.python-list@python.org> |
| In reply to | #59533 |
On 15-11-13 16:39, Robin Becker wrote:
> .........
>>
>> Dealing with bytes and Unicode is complicated, and the 2->3 transition
>> is not easy, but let's please not spread the misunderstanding that
>> somehow the Flexible String Representation is at fault. However you
>> store Unicode code points, they are different than bytes, and it is
>> complex having to deal with both. You can't somehow make the
>> dichotomy go away, you can only choose where you want to think about it.
>>
>> --Ned.
> .......
> I don't think that's what I said; the flexible representation is just an
> added complexity ...
No it is not, at least not for python programmers. (It of course is for the python implementors). The python programmer doesn't have to care about the flexible representation, just as the python programmer doesn't have to care about the internal representation of (long) integers. It is an implementation detail that is mostly ignorable.
--
Antoon Pardon
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-11-16 03:01 +1100 |
| Message-ID | <mailman.2674.1384531302.18130.python-list@python.org> |
| In reply to | #59533 |
On Sat, Nov 16, 2013 at 2:39 AM, Robin Becker <robin@reportlab.com> wrote:
>> Dealing with bytes and Unicode is complicated, and the 2->3 transition is
>> not easy, but let's please not spread the misunderstanding that somehow the
>> Flexible String Representation is at fault. However you store Unicode code
>> points, they are different than bytes, and it is complex having to deal with
>> both. You can't somehow make the dichotomy go away, you can only choose
>> where you want to think about it.
>>
>> --Ned.
>
> .......
> I don't think that's what I said; the flexible representation is just an
> added complexity that has come about because of the wish to store strings in
> a compact way. The requirement for such complexity is the unicode type
> itself (especially the storage requirements) which necessitated some
> remedial action.
>
> There's no point in fighting the change to using unicode. The type wasn't
> required for any technical reason as other languages didn't go this route
> and are reasonably ok, but there's no doubt the change made things more
> difficult.
There's no perceptible difference between a 3.2 wide build and the 3.3 flexible representation. (Differences with narrow builds are bugs, and have now been fixed.) As far as your script's concerned, Python 3.3 always stores strings in UTF-32, four bytes per character. It just happens to be way more efficient on memory, most of the time.
Other languages _have_ gone for at least some sort of Unicode support. Unfortunately quite a few have done a half-way job and use UTF-16 as their internal representation. That means there's no difference between U+0012, U+0123, and U+1234, but U+12345 suddenly gets handled differently. ECMAScript actually specifies the perverse behaviour of treating codepoints >U+FFFF as two elements in a string, because it's just too costly to change.
There are a small number of languages that guarantee correct Unicode handling. I believe bash scripts get this right (though I haven't tested; string manipulation in bash isn't nearly as rich as a proper text parsing language, so I don't dig into it much); Pike is a very Python-like language, and PEP 393 made Python even more Pike-like, because Pike's string has been variable width for as long as I've known it. A handful of other languages also guarantee UTF-32 semantics. All of them are really easy to work with; instead of writing your code and then going "Oh, I wonder what'll happen if I give this thing weird characters?", you just write your code, safe in the knowledge that there is no such thing as a "weird character" (except for a few in the ASCII set... you may find that code breaks if given a newline in the middle of something, or maybe the slash confuses you).
Definitely don't fight the change to Unicode, because it's not a change at all... it's just fixing what was buggy. You already had a difference between bytes and characters, you just thought you could ignore it.
ChrisA
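The UTF-16 point can be illustrated from Python itself. This is a sketch for Python 3.3 or later; utf16_units is a hypothetical helper that counts 16-bit code units by encoding, which is a stand-in for what a UTF-16-based string type would store.

```python
def utf16_units(ch):
    """Number of 16-bit code units needed to store `ch` in UTF-16."""
    return len(ch.encode('utf-16-le')) // 2


# The four code points mentioned above.
code_points = [0x12, 0x123, 0x1234, 0x12345]

units = [utf16_units(chr(cp)) for cp in code_points]  # storage in UTF-16
lengths = [len(chr(cp)) for cp in code_points]        # what Python's len() reports

print(units)
print(lengths)
```

Only the last code point needs a surrogate pair in UTF-16, while len() reports one character for each of them.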
| From | Neil Cerutti <neilc@norwich.edu> |
|---|---|
| Date | 2013-11-15 17:47 +0000 |
| Message-ID | <ben50lFd19tU2@mid.individual.net> |
| In reply to | #59544 |
On 2013-11-15, Chris Angelico <rosuav@gmail.com> wrote:
> Other languages _have_ gone for at least some sort of Unicode
> support. Unfortunately quite a few have done a half-way job and
> use UTF-16 as their internal representation. That means there's
> no difference between U+0012, U+0123, and U+1234, but U+12345
> suddenly gets handled differently. ECMAScript actually
> specifies the perverse behaviour of treating codepoints >U+FFFF
> as two elements in a string, because it's just too costly to
> change.
The unicode support I'm learning in Go is, "Everything is utf-8, right? RIGHT?!?" It also has the interesting behavior that indexing strings retrieves bytes, while iterating over them results in a sequence of runes.
It comes with support for no encodings save utf-8 (natively) and utf-16 (if you work at it). Is that really enough?
--
Neil Cerutti
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-11-16 01:09 +0000 |
| Message-ID | <5286c5c8$0$29975$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #59556 |
On Fri, 15 Nov 2013 17:47:01 +0000, Neil Cerutti wrote:
> The unicode support I'm learning in Go is, "Everything is utf-8, right?
> RIGHT?!?" It also has the interesting behavior that indexing strings
> retrieves bytes, while iterating over them results in a sequence of
> runes.
>
> It comes with support for no encodings save utf-8 (natively) and utf-16
> (if you work at it). Is that really enough?
Only if you never need to handle data created by other applications.
--
Steven
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-11-15 17:10 +0000 |
| Message-ID | <52865594$0$29975$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #59526 |
On Fri, 15 Nov 2013 14:43:17 +0000, Robin Becker wrote:
> Things went wrong when utf8 was not adopted as the standard encoding
> thus requiring two string types, it would have been easier to have a len
> function to count bytes as before and a glyphlen to count glyphs. Now as
> I understand it we have a complicated mess under the hood for unicode
> objects so they have a variable representation to approximate an 8 bit
> representation when suitable etc etc etc.
No no no! Glyphs are *pictures*, you know the little blocks of pixels
that you see on your monitor or printed on a page. Before you can count
glyphs in a string, you need to know which typeface ("font") is being
used, since fonts generally lack glyphs for some code points.
[Aside: there's another complication. Some fonts define alternate glyphs
for the same code point, so that the design of (say) the letter "a" may
vary within the one string according to whatever typographical rules the
font supports and the application calls. So the question is, when you
"count glyphs", should you count "a" and "alternate a" as a single glyph
or two?]
You don't actually mean count glyphs, you mean counting code points
(think characters, only with some complications that aren't important for
the purposes of this discussion).
UTF-8 is utterly unsuited for in-memory storage of text strings, I don't
care how many languages (Go, Haskell?) make that mistake. When you're
dealing with text strings, the fundamental unit is the character, not the
byte. Why do you care how many bytes a text string has? If you really
need to know how much memory an object is using, that's where you use
sys.getsizeof(), not len().
We don't say len({42: None}) to discover that the dict requires 136
bytes, why would you use len("heåvy") to learn that it uses 23 bytes?
UTF-8 is variable width encoding, which means it's *rubbish* for the in-
memory representation of strings. Counting characters is slow. Slicing is
slow. If you have mutable strings, deleting or inserting characters is
slow. Every operation has to effectively start at the beginning of the
string and count forward, lest it split bytes in the middle of a UTF
unit. Or worse, the language doesn't give you any protection from this at
all, so rather than slow string routines you have unsafe string routines,
and it's your responsibility to detect UTF boundaries yourself.
In case you aren't familiar with what I'm talking about, here's an
example using Python 3.2, starting with a Unicode string and treating it
as UTF-8 bytes:
py> u = "heåvy"
py> s = u.encode('utf-8')
py> for c in s:
...     print(chr(c))
...
h
e
Ã
¥
v
y
"Ã¥"? It didn't take long to get moji-bake in our output, and all I did
was print the (byte) string one "character" at a time. It gets worse: we
can easily end up with invalid UTF-8:
py> a, b = s[:len(s)//2], s[len(s)//2:] # split the string in half
py> a.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 2:
unexpected end of data
py> b.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0:
invalid start byte
No, UTF-8 is okay for writing to files, but it's not suitable for text
strings. The in-memory representation of text strings should be constant
width, based on characters not bytes, and should prevent the caller from
accidentally ending up with moji-bake or invalid strings.
--
Steven
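To make the "count forward from the beginning" cost concrete, here is a Python 3 sketch of what indexing by code point looks like when the storage is raw UTF-8 bytes. The function nth_code_point is hypothetical and written only for illustration; it is not how any particular language implements its strings.

```python
def nth_code_point(data, n):
    """Return the n-th code point of UTF-8 `data` by scanning from the start."""
    count = -1
    start = None
    for i, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; anything else starts a code point.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == n:
                start = i
            elif count == n + 1:
                return data[start:i].decode('utf-8')
    if start is None:
        raise IndexError(n)
    # The requested code point was the last one in the data.
    return data[start:].decode('utf-8')


data = "he\xe5vy".encode('utf-8')  # the same string as in the example above
third = nth_code_point(data, 2)
print(ascii(third))
```

Every lookup walks the bytes from index 0, so the cost grows with the position requested, whereas a fixed-width representation can jump straight to the element.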
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-11-16 04:29 +1100 |
| Message-ID | <mailman.2679.1384536561.18130.python-list@python.org> |
| In reply to | #59549 |
On Sat, Nov 16, 2013 at 4:10 AM, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> No, UTF-8 is okay for writing to files, but it's not suitable for text
> strings.
Correction: It's _great_ for writing to files (and other fundamentally byte-oriented streams, like network connections). Does a superb job as the default encoding for all sorts of situations. But, as you say, it sucks if you want to find the Nth character.
ChrisA
| From | Cousin Stanley <cousinstanley@gmail.com> |
|---|---|
| Date | 2013-11-15 10:45 -0700 |
| Message-ID | <l65mkk$f31$1@dont-email.me> |
| In reply to | #59549 |
> ....
> We don't say len({42: None}) to discover
> that the dict requires 136 bytes,
> why would you use len("heåvy")
> to learn that it uses 23 bytes ?
> ....
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
illustrate the difference in length of python objects
and the size of their system storage
"""
import sys
s = "heåvy"
d = { 42 : None }
print
print ' s : %s' % s
print ' len( s ) : %d' % len( s )
print ' sys.getsizeof( s ) : %s ' % sys.getsizeof( s )
print
print
print ' d : ' , d
print ' len( d ) : %d' % len( d )
print ' sys.getsizeof( d ) : %d ' % sys.getsizeof( d )
--
Stanley C. Kitching
Human Being
Phoenix, Arizona
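The script above is Python 2, where a plain "heåvy" literal in a UTF-8 source file is a byte string. For comparison, a Python 3 sketch of the same three measurements follows; the variable names are illustrative, and the getsizeof figure depends on the interpreter and build, so no particular value is claimed.

```python
import sys

s = "he\xe5vy"  # "heåvy", written with an escape to keep the source ASCII

n_code_points = len(s)                 # what len() counts in Python 3
n_utf8_bytes = len(s.encode('utf-8'))  # the a-ring takes two bytes in UTF-8
n_object_bytes = sys.getsizeof(s)      # whole object, implementation-dependent

print(n_code_points, n_utf8_bytes, n_object_bytes)
```

The three numbers answer three different questions, which is the distinction the quoted passage draws between len() and sys.getsizeof().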
| From | Joel Goldstick <joel.goldstick@gmail.com> |
|---|---|
| Date | 2013-11-15 09:50 -0500 |
| Message-ID | <mailman.2661.1384527032.18130.python-list@python.org> |
| In reply to | #59511 |
>> Some of us have been doing this long enough to remember when "just plain
>> text" meant only a single case of the alphabet (and a subset of ascii
>> punctuation). On an ASR-33, your C program would print like:
>>
>> MAIN() \(
>> PRINTF("HELLO, ASCII WORLD");
>> \)
>>
>> because ASR-33's didn't have curly braces (or lower case).
>>
>> Having P3's repr() escape non-ascii characters today makes about as much
>> sense as expecting P2's repr() to escape curly braces (and vertical bars,
>> and a few others) because not every terminal can print those.
>>
> .....
> I can certainly remember those days, how we cried and laughed when 8 bits
> became popular.
>
Really? you cried and laughed over 7 vs. 8 bits? That's lovely (?).
;). That eighth bit sure was less confusing than codepoint
translations
> --
> Robin Becker
> --
> https://mail.python.org/mailman/listinfo/python-list
--
Joel Goldstick
http://joelgoldstick.com
| From | Robin Becker <robin@reportlab.com> |
|---|---|
| Date | 2013-11-15 15:03 +0000 |
| Message-ID | <mailman.2664.1384527843.18130.python-list@python.org> |
| In reply to | #59511 |
...........
>> became popular.
>>
> Really? you cried and laughed over 7 vs. 8 bits? That's lovely (?).
> ;). That eighth bit sure was less confusing than codepoint
> translations
no we had 6 bits in 60 bit words as I recall; extracting the nth character involved division by 6; smart people did tricks with inverted multiplications etc etc :(
--
Robin Becker