


python 3.3 repr

Started by: Robin Becker <robin@reportlab.com>
First post: 2013-11-15 11:28 +0000
Last post: 2013-11-15 14:23 -0500
Articles 20 on this page of 31 — 14 participants



Contents

  python 3.3 repr Robin Becker <robin@reportlab.com> - 2013-11-15 11:28 +0000
    Re: python 3.3 repr Ned Batchelder <ned@nedbatchelder.com> - 2013-11-15 03:38 -0800
      Re: python 3.3 repr Robin Becker <robin@reportlab.com> - 2013-11-15 12:16 +0000
        Re: python 3.3 repr Ned Batchelder <ned@nedbatchelder.com> - 2013-11-15 05:54 -0800
          Re: python 3.3 repr Robin Becker <robin@reportlab.com> - 2013-11-15 14:29 +0000
          Re: python 3.3 repr Serhiy Storchaka <storchaka@gmail.com> - 2013-11-15 16:40 +0200
          Re: python 3.3 repr Robin Becker <robin@reportlab.com> - 2013-11-15 14:52 +0000
      Re: python 3.3 repr Roy Smith <roy@panix.com> - 2013-11-15 09:25 -0500
      Re: python 3.3 repr Robin Becker <robin@reportlab.com> - 2013-11-15 14:43 +0000
        Re: python 3.3 repr Ned Batchelder <ned@nedbatchelder.com> - 2013-11-15 07:08 -0800
          Re: python 3.3 repr Robin Becker <robin@reportlab.com> - 2013-11-15 15:39 +0000
          Re: python 3.3 repr Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-11-15 16:49 +0100
          Re: python 3.3 repr Chris Angelico <rosuav@gmail.com> - 2013-11-16 03:01 +1100
            Re: python 3.3 repr Neil Cerutti <neilc@norwich.edu> - 2013-11-15 17:47 +0000
              Re: python 3.3 repr Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-16 01:09 +0000
        Re: python 3.3 repr Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-15 17:10 +0000
          Re: python 3.3 repr Chris Angelico <rosuav@gmail.com> - 2013-11-16 04:29 +1100
          Re: python 3.3 repr Cousin Stanley <cousinstanley@gmail.com> - 2013-11-15 10:45 -0700
      Re: python 3.3 repr Joel Goldstick <joel.goldstick@gmail.com> - 2013-11-15 09:50 -0500
      Re: python 3.3 repr Robin Becker <robin@reportlab.com> - 2013-11-15 15:03 +0000
      Re: python 3.3 repr Joel Goldstick <joel.goldstick@gmail.com> - 2013-11-15 10:07 -0500
      Re: python 3.3 repr Chris Angelico <rosuav@gmail.com> - 2013-11-16 02:08 +1100
      Re: python 3.3 repr Robin Becker <robin@reportlab.com> - 2013-11-15 15:18 +0000
      Re: python 3.3 repr Roy Smith <roy@panix.com> - 2013-11-15 10:32 -0500
      Re: python 3.3 repr William Ray Wing <wrw@mac.com> - 2013-11-15 11:30 -0500
      Re: python 3.3 repr Zero Piraeus <z@etiol.net> - 2013-11-15 14:06 -0300
      Re: python 3.3 repr Chris Angelico <rosuav@gmail.com> - 2013-11-16 04:11 +1100
      Re: python 3.3 repr Serhiy Storchaka <storchaka@gmail.com> - 2013-11-15 19:37 +0200
    Re: python 3.3 repr Gene Heskett <gheskett@wdtv.com> - 2013-11-15 11:36 -0500
    Re: python 3.3 repr Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-11-15 17:58 +0000
    Re: python 3.3 repr Gene Heskett <gheskett@wdtv.com> - 2013-11-15 14:23 -0500



#59510 — python 3.3 repr

From: Robin Becker <robin@reportlab.com>
Date: 2013-11-15 11:28 +0000
Subject: python 3.3 repr
Message-ID: <mailman.2646.1384514912.18130.python-list@python.org>

I'm trying to understand what's going on with this simple program

if __name__=='__main__':
	print("repr=%s" % repr(u'\xc1'))
	print("%%r=%r" % u'\xc1')

On my windows XP box this fails miserably if run directly at a terminal

C:\tmp> \Python33\python.exe bang.py
Traceback (most recent call last):
   File "bang.py", line 2, in <module>
     print("repr=%s" % repr(u'\xc1'))
   File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
     return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xc1' in position 6: 
character maps to <undefined>

If I run the program redirected into a file then no error occurs and the 
result looks like this

C:\tmp>cat fff
repr='┴'
%r='┴'

and if I run it into a pipe it works as though into a file.

It seems that repr thinks it can render u'\xc1' directly which is a problem 
since print then seems to want to convert that to cp437 if directed into a terminal.

I find the idea that print knows what it's printing to a bit dangerous, but it's 
the repr behaviour that strikes me as bad.

What is responsible for defining the repr function's 'printable' so that repr 
would give me say an Ascii rendering?
-confused-ly yrs-
Robin Becker



#59511

From: Ned Batchelder <ned@nedbatchelder.com>
Date: 2013-11-15 03:38 -0800
Message-ID: <b6db8982-feac-4036-8ec4-2dc720d41a4b@googlegroups.com>
In reply to: #59510

On Friday, November 15, 2013 6:28:15 AM UTC-5, Robin Becker wrote:
> I'm trying to understand what's going on with this simple program
> 
> if __name__=='__main__':
> 	print("repr=%s" % repr(u'\xc1'))
> 	print("%%r=%r" % u'\xc1')
> 
> On my windows XP box this fails miserably if run directly at a terminal
> 
> C:\tmp> \Python33\python.exe bang.py
> Traceback (most recent call last):
>    File "bang.py", line 2, in <module>
>      print("repr=%s" % repr(u'\xc1'))
>    File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
>      return codecs.charmap_encode(input,self.errors,encoding_map)[0]
> UnicodeEncodeError: 'charmap' codec can't encode character '\xc1' in position 6: 
> character maps to <undefined>
> 
> If I run the program redirected into a file then no error occurs and the 
> result looks like this
> 
> C:\tmp>cat fff
> repr='┴'
> %r='┴'
> 
> and if I run it into a pipe it works as though into a file.
> 
> It seems that repr thinks it can render u'\xc1' directly which is a problem 
> since print then seems to want to convert that to cp437 if directed into a terminal.
> 
> I find the idea that print knows what it's printing to a bit dangerous, but it's 
> the repr behaviour that strikes me as bad.
> 
> What is responsible for defining the repr function's 'printable' so that repr 
> would give me say an Ascii rendering?
> -confused-ly yrs-
> Robin Becker

In Python3, repr() will return a Unicode string, and will preserve existing Unicode characters in its arguments.  This has been controversial.  To get the Python 2 behavior of a pure-ascii representation, there is the new builtin ascii(), and a corresponding %a format string.

--Ned.
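
The difference Ned describes can be sketched as follows. This is a minimal illustration for Python 3 only (ascii() and %a do not exist in Python 2); the variable names are made up for the example.

```python
# Python 3 only: compare repr()/%r with ascii()/%a for a non-ASCII character.
s = '\xc1'  # LATIN CAPITAL LETTER A WITH ACUTE, as in the original post

as_repr = repr(s)     # keeps the printable non-ASCII character as-is
as_ascii = ascii(s)   # escapes everything outside ASCII

fmt_r = "%r" % s      # formats like repr(s)
fmt_a = "%a" % s      # formats like ascii(s)

# The ascii() forms contain only ASCII characters, so printing them
# does not depend on what the terminal encoding can represent.
print(as_ascii)
print(fmt_a)
```

Printing as_repr or fmt_r, by contrast, may still fail on a console whose encoding lacks the character, which is the situation in the original traceback.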



#59513

From: Robin Becker <robin@reportlab.com>
Date: 2013-11-15 12:16 +0000
Message-ID: <mailman.2648.1384517826.18130.python-list@python.org>
In reply to: #59511

On 15/11/2013 11:38, Ned Batchelder wrote:
..........
>
> In Python3, repr() will return a Unicode string, and will preserve existing Unicode characters in its arguments.  This has been controversial.  To get the Python 2 behavior of a pure-ascii representation, there is the new builtin ascii(), and a corresponding %a format string.
>
> --Ned.
>

thanks for this, it doesn't make the split across python2 - 3 any easier.
-- 
Robin Becker



#59520

From: Ned Batchelder <ned@nedbatchelder.com>
Date: 2013-11-15 05:54 -0800
Message-ID: <edbfd521-c595-453e-a019-8d39c79437fb@googlegroups.com>
In reply to: #59513

On Friday, November 15, 2013 7:16:52 AM UTC-5, Robin Becker wrote:
> On 15/11/2013 11:38, Ned Batchelder wrote:
> ..........
> >
> > In Python3, repr() will return a Unicode string, and will preserve existing Unicode characters in its arguments.  This has been controversial.  To get the Python 2 behavior of a pure-ascii representation, there is the new builtin ascii(), and a corresponding %a format string.
> >
> > --Ned.
> >
> 
> thanks for this, it doesn't make the split across python2 - 3 any easier.
> -- 
> Robin Becker

No, but I've found that significant programs that run on both 2 and 3 need to have some shims to make the code work anyway.  You could do this:

    try:
        repr = ascii
    except NameError:
        pass

and then use repr throughout.

--Ned.



#59524

From: Robin Becker <robin@reportlab.com>
Date: 2013-11-15 14:29 +0000
Message-ID: <mailman.2656.1384525772.18130.python-list@python.org>
In reply to: #59520

On 15/11/2013 13:54, Ned Batchelder wrote:
.........
>
> No, but I've found that significant programs that run on both 2 and 3 need to have some shims to make the code work anyway.  You could do this:
>
>      try:
>          repr = ascii
>      except NameError:
>          pass
....
yes I tried that, but it doesn't affect %r which is inlined in unicodeobject.c, 
for me it seems easier to fix windows to use something like a standard encoding 
of utf8 ie cp65001, but that's quite hard to do globally. It seems sitecustomize 
is too late to set os.environ['PYTHONIOENCODING'], perhaps I can stuff that into 
one of the global environment vars and have it work for all python invocations.
-- 
Robin Becker
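
The console failure can be reproduced without a terminal by encoding explicitly. The following is a rough sketch for Python 3, assuming cp437 as the console encoding (as in the traceback) and assuming the redirected output was written in an encoding such as cp1252, where the character is the single byte 0xC1.

```python
# Python 3 sketch: why printing fails on a cp437 console, and what an
# error handler changes. cp437 and cp1252 are assumptions taken from
# the traceback and a typical Western Windows setup.
text = "repr=%r" % '\xc1'

try:
    text.encode('cp437')          # what print() effectively does on the console
    failed = False
except UnicodeEncodeError:
    failed = True                 # cp437 has no mapping for U+00C1

# With an error handler the unencodable character is escaped instead.
escaped = text.encode('cp437', errors='backslashreplace')

# The byte written to the file, displayed by a cp437 tool such as cat,
# shows up as a box-drawing character.
seen_in_cp437 = '\xc1'.encode('cp1252').decode('cp437')
```

PYTHONIOENCODING accepts the form encoding:errorhandler (for example cp437:backslashreplace), so the same error handler can be applied to the standard streams, provided the variable is set before the interpreter starts.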



#59525

From: Serhiy Storchaka <storchaka@gmail.com>
Date: 2013-11-15 16:40 +0200
Message-ID: <mailman.2658.1384526450.18130.python-list@python.org>
In reply to: #59520

15.11.13 15:54, Ned Batchelder wrote:
> No, but I've found that significant programs that run on both 2 and 3 need to have some shims to make the code work anyway.  You could do this:
>
>      try:
>          repr = ascii
>      except NameError:
>          pass
>
> and then use repr throughout.

Or rather

     try:
         ascii
     except NameError:
         ascii = repr

and then use ascii throughout.



#59528

From: Robin Becker <robin@reportlab.com>
Date: 2013-11-15 14:52 +0000
Message-ID: <mailman.2663.1384527134.18130.python-list@python.org>
In reply to: #59520

On 15/11/2013 14:40, Serhiy Storchaka wrote:
......


>> and then use repr throughout.
>
> Or rather
>
>      try:
>          ascii
>      except NameError:
>          ascii = repr
>
> and then use ascii throughout.
>
>

apparently you can import ascii from future_builtins and the print() function is 
available as

from __future__ import print_function

nothing fixes all those %r formats to be %a though :(
-- 
Robin Becker
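
One possible workaround for the %r problem, sketched here under the assumption that the format strings can be edited, is to replace %r with %s and call ascii() explicitly, combined with the shim quoted above. The name msg is only for illustration.

```python
# Shim from the thread: make ascii() available on Python 2 as well.
try:
    ascii
except NameError:
    ascii = repr  # Python 2: repr() already escapes non-ASCII characters

# Instead of "value=%r" % x, format the escaped form with %s.
msg = "value=%s" % ascii(u'\xc1')
print(msg)
```

On Python 3 this yields an ASCII-only string. On Python 2 the result is also ASCII-only but carries the u prefix of the unicode repr, so the two outputs are not byte-for-byte identical.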



#59523

From: Roy Smith <roy@panix.com>
Date: 2013-11-15 09:25 -0500
Message-ID: <mailman.2655.1384525556.18130.python-list@python.org>
In reply to: #59511


In article <b6db8982-feac-4036-8ec4-2dc720d41a4b@googlegroups.com>,
Ned Batchelder <ned@nedbatchelder.com> wrote:

> In Python3, repr() will return a Unicode string, and will preserve existing 
> Unicode characters in its arguments.  This has been controversial.  To get 
> the Python 2 behavior of a pure-ascii representation, there is the new 
> builtin ascii(), and a corresponding %a format string.

I'm still stuck on Python 2, and while I can understand the controversy ("It breaks my Python 2 code!"), this seems like the right thing to have done.  In Python 2, unicode is an add-on.  One of the big design drivers in Python 3 was to make unicode the standard.

The idea behind repr() is to provide a "just plain text" representation of an object.  In P2, "just plain text" means ascii, so escaping non-ascii characters makes sense.  In P3, "just plain text" means unicode, so escaping non-ascii characters no longer makes sense.

Some of us have been doing this long enough to remember when "just plain text" meant only a single case of the alphabet (and a subset of ascii punctuation).  On an ASR-33, your C program would print like:

MAIN() \(
	PRINTF("HELLO, ASCII WORLD");
\)

because ASR-33's didn't have curly braces (or lower case).

Having P3's repr() escape non-ascii characters today makes about as much sense as expecting P2's repr() to escape curly braces (and vertical bars, and a few others) because not every terminal can print those.

--
Roy Smith
roy@panix.com



#59526

From: Robin Becker <robin@reportlab.com>
Date: 2013-11-15 14:43 +0000
Message-ID: <mailman.2660.1384526610.18130.python-list@python.org>
In reply to: #59511

..........
> I'm still stuck on Python 2, and while I can understand the controversy ("It breaks my Python 2 code!"), this seems like the right thing to have done.  In Python 2, unicode is an add-on.  One of the big design drivers in Python 3 was to make unicode the standard.
>
> The idea behind repr() is to provide a "just plain text" representation of an object.  In P2, "just plain text" means ascii, so escaping non-ascii characters makes sense.  In P3, "just plain text" means unicode, so escaping non-ascii characters no longer makes sense.
>

unfortunately the word 'printable' got into the definition of repr; it's clear 
that printability is not the same as unicode at least as far as the print 
function is concerned. In my opinion it would have been better to leave the old 
behaviour as that would have eased the compatibility.

The python gods don't count that sort of thing as important enough so we get the 
mess that is the python2/3 split. ReportLab has to do both so it's a real issue; 
in addition swapping the str - unicode pair to bytes str doesn't help one's 
mental models either :(

Things went wrong when utf8 was not adopted as the standard encoding thus 
requiring two string types, it would have been easier to have a len function to 
count bytes as before and a glyphlen to count glyphs. Now as I understand it we 
have a complicated mess under the hood for unicode objects so they have a 
variable representation to approximate an 8 bit representation when suitable etc 
etc etc.

> Some of us have been doing this long enough to remember when "just plain text" meant only a single case of the alphabet (and a subset of ascii punctuation).  On an ASR-33, your C program would print like:
>
> MAIN() \(
> 	PRINTF("HELLO, ASCII WORLD");
> \)
>
> because ASR-33's didn't have curly braces (or lower case).
>
> Having P3's repr() escape non-ascii characters today makes about as much sense as expecting P2's repr() to escape curly braces (and vertical bars, and a few others) because not every terminal can print those.
>
.....
I can certainly remember those days, how we cried and laughed when 8 bits became 
popular.
-- 
Robin Becker



#59533

From: Ned Batchelder <ned@nedbatchelder.com>
Date: 2013-11-15 07:08 -0800
Message-ID: <0d383a3c-247f-4b6a-9a18-7e7fadeb6047@googlegroups.com>
In reply to: #59526

On Friday, November 15, 2013 9:43:17 AM UTC-5, Robin Becker wrote:
> Things went wrong when utf8 was not adopted as the standard encoding thus 
> requiring two string types, it would have been easier to have a len function to 
> count bytes as before and a glyphlen to count glyphs. Now as I understand it we 
> have a complicated mess under the hood for unicode objects so they have a 
> variable representation to approximate an 8 bit representation when suitable etc 
> etc etc.
> 

Dealing with bytes and Unicode is complicated, and the 2->3 transition is not easy, but let's please not spread the misunderstanding that somehow the Flexible String Representation is at fault.  However you store Unicode code points, they are different than bytes, and it is complex having to deal with both.  You can't somehow make the dichotomy go away, you can only choose where you want to think about it.

--Ned.

> -- 
> Robin Becker



#59539

From: Robin Becker <robin@reportlab.com>
Date: 2013-11-15 15:39 +0000
Message-ID: <mailman.2671.1384529961.18130.python-list@python.org>
In reply to: #59533

.........
>
> Dealing with bytes and Unicode is complicated, and the 2->3 transition is not easy, but let's please not spread the misunderstanding that somehow the Flexible String Representation is at fault.  However you store Unicode code points, they are different than bytes, and it is complex having to deal with both.  You can't somehow make the dichotomy go away, you can only choose where you want to think about it.
>
> --Ned.
.......
I don't think that's what I said; the flexible representation is just an added 
complexity that has come about because of the wish to store strings in a compact 
way. The requirement for such complexity is the unicode type itself (especially 
the storage requirements) which necessitated some remedial action.

There's no point in fighting the change to using unicode. The type wasn't 
required for any technical reason as other languages didn't go this route and 
are reasonably ok, but there's no doubt the change made things more difficult.
-- 
Robin Becker



#59541

From: Antoon Pardon <antoon.pardon@rece.vub.ac.be>
Date: 2013-11-15 16:49 +0100
Message-ID: <mailman.2673.1384530577.18130.python-list@python.org>
In reply to: #59533

On 15-11-13 16:39, Robin Becker wrote:
> .........
>>
>> Dealing with bytes and Unicode is complicated, and the 2->3 transition
>> is not easy, but let's please not spread the misunderstanding that
>> somehow the Flexible String Representation is at fault.  However you
>> store Unicode code points, they are different than bytes, and it is
>> complex having to deal with both.  You can't somehow make the
>> dichotomy go away, you can only choose where you want to think about it.
>>
>> --Ned.
> .......
> I don't think that's what I said; the flexible representation is just an
> added complexity ...

No it is not, at least not for python programmers. (It of course is for
the python implementors). The python programmer doesn't have to care
about the flexible representation, just as the python programmer doesn't
have to care about the internal representation of (long) integers. It
is an implementation detail that is mostly ignorable.

-- 
Antoon Pardon



#59544

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-11-16 03:01 +1100
Message-ID: <mailman.2674.1384531302.18130.python-list@python.org>
In reply to: #59533

On Sat, Nov 16, 2013 at 2:39 AM, Robin Becker <robin@reportlab.com> wrote:
>> Dealing with bytes and Unicode is complicated, and the 2->3 transition is
>> not easy, but let's please not spread the misunderstanding that somehow the
>> Flexible String Representation is at fault.  However you store Unicode code
>> points, they are different than bytes, and it is complex having to deal with
>> both.  You can't somehow make the dichotomy go away, you can only choose
>> where you want to think about it.
>>
>> --Ned.
>
> .......
> I don't think that's what I said; the flexible representation is just an
> added complexity that has come about because of the wish to store strings in
> a compact way. The requirement for such complexity is the unicode type
> itself (especially the storage requirements) which necessitated some
> remedial action.
>
> There's no point in fighting the change to using unicode. The type wasn't
> required for any technical reason as other languages didn't go this route
> and are reasonably ok, but there's no doubt the change made things more
> difficult.

There's no perceptible difference between a 3.2 wide build and the 3.3
flexible representation. (Differences with narrow builds are bugs, and
have now been fixed.) As far as your script's concerned, Python 3.3
always stores strings in UTF-32, four bytes per character. It just
happens to be way more efficient on memory, most of the time.

Other languages _have_ gone for at least some sort of Unicode support.
Unfortunately quite a few have done a half-way job and use UTF-16 as
their internal representation. That means there's no difference
between U+0012, U+0123, and U+1234, but U+12345 suddenly gets handled
differently. ECMAScript actually specifies the perverse behaviour of
treating codepoints >U+FFFF as two elements in a string, because it's
just too costly to change.

There are a small number of languages that guarantee correct Unicode
handling. I believe bash scripts get this right (though I haven't
tested; string manipulation in bash isn't nearly as rich as a proper
text parsing language, so I don't dig into it much); Pike is a very
Python-like language, and PEP 393 made Python even more Pike-like,
because Pike's string has been variable width for as long as I've
known it. A handful of other languages also guarantee UTF-32
semantics. All of them are really easy to work with; instead of
writing your code and then going "Oh, I wonder what'll happen if I
give this thing weird characters?", you just write your code, safe in
the knowledge that there is no such thing as a "weird character"
(except for a few in the ASCII set... you may find that code breaks if
given a newline in the middle of something, or maybe the slash
confuses you).

Definitely don't fight the change to Unicode, because it's not a
change at all... it's just fixing what was buggy. You already had a
difference between bytes and characters, you just thought you could
ignore it.

ChrisA
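
The UTF-16 point can be checked from Python 3.3 or later with a small sketch. It counts 16-bit code units for the four code points mentioned above; the dictionary name is made up for the example.

```python
# Python 3.3+ sketch: how many UTF-16 code units each code point needs.
code_points = (0x12, 0x123, 0x1234, 0x12345)

utf16_units = {}
for cp in code_points:
    ch = chr(cp)
    # Each UTF-16 code unit is two bytes; 'utf-16-le' adds no byte order mark.
    utf16_units[cp] = len(ch.encode('utf-16-le')) // 2

for cp in code_points:
    print("U+%04X: len=%d, UTF-16 code units=%d"
          % (cp, len(chr(cp)), utf16_units[cp]))
```

Every one of these is a single element of a Python 3.3 string, but U+12345 lies outside the Basic Multilingual Plane and needs a surrogate pair in UTF-16, which is why languages that expose UTF-16 code units treat it as two elements.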



#59556

From: Neil Cerutti <neilc@norwich.edu>
Date: 2013-11-15 17:47 +0000
Message-ID: <ben50lFd19tU2@mid.individual.net>
In reply to: #59544

On 2013-11-15, Chris Angelico <rosuav@gmail.com> wrote:
> Other languages _have_ gone for at least some sort of Unicode
> support. Unfortunately quite a few have done a half-way job and
> use UTF-16 as their internal representation. That means there's
> no difference between U+0012, U+0123, and U+1234, but U+12345
> suddenly gets handled differently. ECMAScript actually
> specifies the perverse behaviour of treating codepoints >U+FFFF
> as two elements in a string, because it's just too costly to
> change.

The unicode support I'm learning in Go is, "Everything is utf-8,
right? RIGHT?!?" It also has the interesting behavior that
indexing strings retrieves bytes, while iterating over them
results in a sequence of runes.

It comes with support for no encodings save utf-8 (natively) and
utf-16 (if you work at it). Is that really enough?

-- 
Neil Cerutti



#59579

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-11-16 01:09 +0000
Message-ID: <5286c5c8$0$29975$c3e8da3$5496439d@news.astraweb.com>
In reply to: #59556

On Fri, 15 Nov 2013 17:47:01 +0000, Neil Cerutti wrote:

> The unicode support I'm learning in Go is, "Everything is utf-8, right?
> RIGHT?!?" It also has the interesting behavior that indexing strings
> retrieves bytes, while iterating over them results in a sequence of
> runes.
> 
> It comes with support for no encodings save utf-8 (natively) and utf-16
> (if you work at it). Is that really enough?

Only if you never need to handle data created by other applications.



-- 
Steven



#59549

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-11-15 17:10 +0000
Message-ID: <52865594$0$29975$c3e8da3$5496439d@news.astraweb.com>
In reply to: #59526

On Fri, 15 Nov 2013 14:43:17 +0000, Robin Becker wrote:

> Things went wrong when utf8 was not adopted as the standard encoding
> thus requiring two string types, it would have been easier to have a len
> function to count bytes as before and a glyphlen to count glyphs. Now as
> I understand it we have a complicated mess under the hood for unicode
> objects so they have a variable representation to approximate an 8 bit
> representation when suitable etc etc etc.

No no no! Glyphs are *pictures*, you know the little blocks of pixels 
that you see on your monitor or printed on a page. Before you can count 
glyphs in a string, you need to know which typeface ("font") is being 
used, since fonts generally lack glyphs for some code points.

[Aside: there's another complication. Some fonts define alternate glyphs 
for the same code point, so that the design of (say) the letter "a" may 
vary within the one string according to whatever typographical rules the 
font supports and the application calls. So the question is, when you 
"count glyphs", should you count "a" and "alternate a" as a single glyph 
or two?]

You don't actually mean count glyphs, you mean counting code points 
(think characters, only with some complications that aren't important for 
the purposes of this discussion).

UTF-8 is utterly unsuited for in-memory storage of text strings, I don't 
care how many languages (Go, Haskell?) make that mistake. When you're 
dealing with text strings, the fundamental unit is the character, not the 
byte. Why do you care how many bytes a text string has? If you really 
need to know how much memory an object is using, that's where you use 
sys.getsizeof(), not len().

We don't say len({42: None}) to discover that the dict requires 136 
bytes, why would you use len("heåvy") to learn that it uses 23 bytes?

UTF-8 is variable width encoding, which means it's *rubbish* for the in-
memory representation of strings. Counting characters is slow. Slicing is 
slow. If you have mutable strings, deleting or inserting characters is 
slow. Every operation has to effectively start at the beginning of the 
string and count forward, lest it split bytes in the middle of a UTF 
unit. Or worse, the language doesn't give you any protection from this at 
all, so rather than slow string routines you have unsafe string routines, 
and it's your responsibility to detect UTF boundaries yourself. 

In case you aren't familiar with what I'm talking about, here's an 
example using Python 3.2, starting with a Unicode string and treating it 
as UTF-8 bytes:

py> u = "heåvy"
py> s = u.encode('utf-8')
py> for c in s:
...     print(chr(c))
...
h
e
Ã
¥
v
y


"Ã¥"? It didn't take long to get moji-bake in our output, and all I did 
was print the (byte) string one "character" at a time. It gets worse: we 
can easily end up with invalid UTF-8:

py> a, b = s[:len(s)//2], s[len(s)//2:]  # split the string in half
py> a.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 2: 
unexpected end of data
py> b.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: 
invalid start byte


No, UTF-8 is okay for writing to files, but it's not suitable for text 
strings. The in-memory representation of text strings should be constant 
width, based on characters not bytes, and should prevent the caller from 
accidentally ending up with moji-bake or invalid strings.


-- 
Steven
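
The distinction between code points, encoded bytes, and memory use can be sketched in Python 3 as follows. The variable names are illustrative, and the value reported by sys.getsizeof() depends on the interpreter version and build, so no particular number is assumed here.

```python
import sys
import unicodedata

u = "he\xe5vy"  # the example string from the post, with a precomposed a-ring

code_points = len(u)                   # counts code points, not bytes
utf8_bytes = len(u.encode('utf-8'))    # size of the UTF-8 encoding
in_memory = sys.getsizeof(u)           # implementation-specific object size

# The same visible text in decomposed form: 'a' followed by a combining ring.
decomposed = unicodedata.normalize('NFD', u)
decomposed_points = len(decomposed)

print(code_points, utf8_bytes, decomposed_points, in_memory)
```

The decomposed string displays the same way but has one more code point, which is one of the complications that separates code points from what a reader would call characters.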



#59552

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-11-16 04:29 +1100
Message-ID: <mailman.2679.1384536561.18130.python-list@python.org>
In reply to: #59549

On Sat, Nov 16, 2013 at 4:10 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> No, UTF-8 is okay for writing to files, but it's not suitable for text
> strings.

Correction: It's _great_ for writing to files (and other fundamentally
byte-oriented streams, like network connections). Does a superb job as
the default encoding for all sorts of situations. But, as you say, it
sucks if you want to find the Nth character.

ChrisA



#59555

From: Cousin Stanley <cousinstanley@gmail.com>
Date: 2013-11-15 10:45 -0700
Message-ID: <l65mkk$f31$1@dont-email.me>
In reply to: #59549

> ....
> We don't say len({42: None}) to discover 
> that the dict requires 136 bytes, 
> why would you use len("heåvy") 
> to learn that it uses 23 bytes ?
> ....

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
    illustrate the difference in length of python objects
    and the size of their system storage
"""

import sys

s = "heåvy"

d = { 42 :  None }

print
print '                   s :  %s' % s
print '            len( s ) :  %d' % len( s )
print '  sys.getsizeof( s ) :  %s ' % sys.getsizeof( s )
print
print
print '                   d : ' , d
print '            len( d ) :  %d' % len( d )
print '  sys.getsizeof( d ) :  %d ' % sys.getsizeof( d )


-- 
Stanley C. Kitching
Human Being
Phoenix, Arizona



#59527

From: Joel Goldstick <joel.goldstick@gmail.com>
Date: 2013-11-15 09:50 -0500
Message-ID: <mailman.2661.1384527032.18130.python-list@python.org>
In reply to: #59511

>> Some of us have been doing this long enough to remember when "just plain
>> text" meant only a single case of the alphabet (and a subset of ascii
>> punctuation).  On an ASR-33, your C program would print like:
>>
>> MAIN() \(
>>         PRINTF("HELLO, ASCII WORLD");
>> \)
>>
>> because ASR-33's didn't have curly braces (or lower case).
>>
>> Having P3's repr() escape non-ascii characters today makes about as much
>> sense as expecting P2's repr() to escape curly braces (and vertical bars,
>> and a few others) because not every terminal can print those.
>>
> .....
> I can certainly remember those days, how we cried and laughed when 8 bits
> became popular.
>
Really? you cried and laughed over 7 vs. 8 bits?  That's lovely (?).
;).  That eighth bit sure was less confusing than codepoint
translations


> --
> Robin Becker
> --
> https://mail.python.org/mailman/listinfo/python-list



-- 
Joel Goldstick
http://joelgoldstick.com



#59531

From: Robin Becker <robin@reportlab.com>
Date: 2013-11-15 15:03 +0000
Message-ID: <mailman.2664.1384527843.18130.python-list@python.org>
In reply to: #59511

...........
>> became popular.
>>
> Really? you cried and laughed over 7 vs. 8 bits?  That's lovely (?).
> ;).  That eighth bit sure was less confusing than codepoint
> translations


no we had 6 bits in 60 bit words as I recall; extracting the nth character 
involved division by 6; smart people did tricks with inverted multiplications 
etc etc  :(
-- 
Robin Becker


