


How do I display unicode value stored in a string variable using ord()

Started by: Charles Jensen <hopefullycharles@gmail.com>
First post: 2012-08-16 15:09 -0700
Last post: 2012-08-20 17:20 -0400
Articles 20 on this page of 145 — 26 participants






#27204 — How do I display unicode value stored in a string variable using ord()

From: Charles Jensen <hopefullycharles@gmail.com>
Date: 2012-08-16 15:09 -0700
Subject: How do I display unicode value stored in a string variable using ord()
Message-ID: <f801e06f-f7b2-4aca-b352-66856a939746@googlegroups.com>

Everyone knows that the python command

     ord(u'…')

will output the number 8230 which is the unicode character for the horizontal ellipsis.

How would I use ord() to find the unicode value of a string stored in a variable?  

So the following 2 lines of code will give me the ascii value of the variable a.  How do I specify ord to give me the unicode value of a?

     a = '…'
     ord(a)



#27205

From: Chris Angelico <rosuav@gmail.com>
Date: 2012-08-17 08:20 +1000
Message-ID: <mailman.3397.1345155618.4697.python-list@python.org>
In reply to: #27204

On Fri, Aug 17, 2012 at 8:09 AM, Charles Jensen
<hopefullycharles@gmail.com> wrote:
> How would I use ord() to find the unicode value of a string stored in a variable?
>
> So the following 2 lines of code will give me the ascii value of the variable a.  How do I specify ord to give me the unicode value of a?
>
>      a = '…'
>      ord(a)

I presume you're talking about Python 2, because in Python 3 your
string variable is a Unicode string and will behave as you describe
above.

You'll need to look into what the encoding is, and figure it out from there.
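A minimal sketch of that idea, written for Python 3, where byte strings are a separate `bytes` type. It assumes the bytes came from a UTF-8 source (a file or terminal); the variable names are only illustrative.

```python
# Assumption: the byte string holds the UTF-8 encoding of U+2026.
raw = b"\xe2\x80\xa6"

# The undecoded data is three separate bytes, not one character.
print(len(raw))            # 3

# Decode with the encoding the data actually uses, then call ord().
text = raw.decode("utf-8")
print(ord(text))           # 8230
```

In Python 2 the same steps apply to a plain `str` holding bytes: decode it to `unicode` first, then pass the one-character result to `ord()`.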

ChrisA



#27210

From: Dave Angel <d@davea.name>
Date: 2012-08-16 18:47 -0400
Message-ID: <mailman.3401.1345157258.4697.python-list@python.org>
In reply to: #27204

On 08/16/2012 06:09 PM, Charles Jensen wrote:
> Everyone knows that the python command
>
>      ord(u'…')
>
> will output the number 8230 which is the unicode character for the horizontal ellipsis.
>
> How would I use ord() to find the unicode value of a string stored in a variable?  
>
> So the following 2 lines of code will give me the ascii value of the variable a.  How do I specify ord to give me the unicode value of a?
>
>      a = '…'
>      ord(a)

You omitted the print statement.  You also didn't specify what version
of Python you're using;  I'll assume Python 2.x because in Python 3.x,
the u"xx" notation would have been a syntax error.

To get the ord of a unicode variable, you do it the same as a unicode
literal:

       a = u"j"         # note: for this to work reliably, you probably
                        # need the correct Unicode declaration in line 2 of the file
       print ord(a)

But if you have a byte string containing some binary bits, and you want
to get a unicode character value out of it, you'll need to explicitly
convert it to unicode.

First, decide which encoding the byte string uses.  If you specify
the wrong encoding, you're likely to get an exception, or maybe just a
nonsense answer.

       a = "\xc1\xc1"            # I just made this value up;  it's not valid utf8
       b = a.decode("utf-8")
       print ord(b)
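
The made-up bytes above are not valid UTF-8, so the decode step would raise rather than print a value. A hedged Python 3 sketch of both failure modes follows; `code_point` is a hypothetical helper, not anything from the standard library, and the sample bytes are assumptions chosen for illustration.

```python
def code_point(raw, encoding):
    """Decode raw bytes and return the code point of the single character."""
    text = raw.decode(encoding)          # may raise UnicodeDecodeError
    if len(text) != 1:
        raise ValueError("expected exactly one character, got %d" % len(text))
    return ord(text)

# Correct encoding: the UTF-8 bytes of the horizontal ellipsis.
print(code_point(b"\xe2\x80\xa6", "utf-8"))    # 8230

# The same byte read with two different single-byte encodings:
print(code_point(b"\x85", "cp1252"))           # 8230 (the ellipsis)
print(code_point(b"\x85", "latin-1"))          # 133 (a control character)

# Bytes that are not valid UTF-8 raise an exception.
try:
    code_point(b"\xc1\xc1", "utf-8")
except UnicodeDecodeError as exc:
    print("cannot decode:", exc.reason)
```

The latin-1 case is the "nonsense answer": latin-1 can decode any byte, so a wrong guess goes unnoticed unless the result is checked.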



-- 

DaveA



#27215

From: Terry Reedy <tjreedy@udel.edu>
Date: 2012-08-16 19:59 -0400
Message-ID: <mailman.3406.1345161591.4697.python-list@python.org>
In reply to: #27204

a = '…'
print(ord(a))
 >>>
8230
Most things with unicode are easier in 3.x, and some are even better in 
3.3. The current beta is good enough for most informal work. 3.3.0 will 
be out in a month.

-- 
Terry Jan Reedy



#27248

From: wxjmfauth@gmail.com
Date: 2012-08-17 10:49 -0700
Message-ID: <a6c030b2-25da-47a2-97b5-1e349394d762@googlegroups.com>
In reply to: #27215

On Friday, 17 August 2012 01:59:31 UTC+2, Terry Reedy wrote:
> a = '…'
> print(ord(a))
>  >>>
> 8230
> Most things with unicode are easier in 3.x, and some are even better in
> 3.3. The current beta is good enough for most informal work. 3.3.0 will
> be out in a month.
>
> --
> Terry Jan Reedy

Slightly off topic.

The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
is one of these characters existing in the cp1252, mac-roman
coding schemes and not in iso-8859-1 (latin-1) and obviously
not in ascii. It causes Py3.3 to work a few 100% slower
than Py<3.3 versions due to the flexible string representation
(ascii/latin-1/ucs-2/ucs-4) (I found cases up to 1000%).

>>> '…'.encode('cp1252')
b'\x85'
>>> '…'.encode('mac-roman')
b'\xc9'
>>> '…'.encode('iso-8859-1') # latin-1
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2026'
in position 0: ordinal not in range(256)

If one could neglect this (typographically important) glyph, what
to say about the characters of the European scripts (languages)
present in cp1252 or in mac-roman but not in latin-1 (eg. the
French script/language)?

Very nice. Python 2 was built for ascii user, now Python 3 is
*optimized* for, let say, ascii user!

The future is bright for Python. French users are better
served with Apple or MS products, simply because these
corporates know you can not write French with iso-8859-1.

PS When "TeX" moved from the ascii encoding to iso-8859-1
and the so called Cork encoding, "they" know this and provided
all the complementary packages to circumvent this. It was
in 199? (Python was not even born).

Ditto for the foundries (Adobe, Linotype, ...)

jmf



#27254

From: Jerry Hill <malaclypse2@gmail.com>
Date: 2012-08-17 14:21 -0400
Message-ID: <mailman.3422.1345227697.4697.python-list@python.org>
In reply to: #27248

On Fri, Aug 17, 2012 at 1:49 PM,  <wxjmfauth@gmail.com> wrote:
> The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
> is one of these characters existing in the cp1252, mac-roman
> coding schemes and not in iso-8859-1 (latin-1) and obviously
> not in ascii. It causes Py3.3 to work a few 100% slower
> than Py<3.3 versions due to the flexible string representation
> (ascii/latin-1/ucs-2/ucs-4) (I found cases up to 1000%).
>
>>>> '…'.encode('cp1252')
> b'\x85'
>>>> '…'.encode('mac-roman')
> b'\xc9'
>>>> '…'.encode('iso-8859-1') # latin-1
> Traceback (most recent call last):
>   File "<eta last command>", line 1, in <module>
> UnicodeEncodeError: 'latin-1' codec can't encode character '\u2026'
> in position 0: ordinal not in range(256)
>
> If one could neglect this (typographically important) glyph, what
> to say about the characters of the European scripts (languages)
> present in cp1252 or in mac-roman but not in latin-1 (eg. the
> French script/language)?

So... python should change the longstanding definition of the latin-1
character set?  This isn't some sort of python limitation, it's just
the reality of legacy encodings that actually exist in the real world.


> Very nice. Python 2 was built for ascii user, now Python 3 is
> *optimized* for, let say, ascii user!
>
> The future is bright for Python. French users are better
> served with Apple or MS products, simply because these
> corporates know you can not write French with iso-8859-1.
>
> PS When "TeX" moved from the ascii encoding to iso-8859-1
> and the so called Cork encoding, "they" know this and provided
> all the complementary packages to circumvent this. It was
> in 199? (Python was not even born).
>
> Ditto for the foundries (Adobe, Linotype, ...)


I don't understand what any of this has to do with Python.  Just
output your text in UTF-8 like any civilized person in the 21st
century, and none of that is a problem at all.  Python makes that easy.
It also makes it easy to interoperate with older encodings if you
have to.
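
A short Python 3 sketch of that workflow, assuming the input arrived as cp1252 bytes (the sample text is made up for illustration): decode once at the boundary, work with `str`, and encode as UTF-8 on the way out.

```python
# Assumption: these bytes came from a legacy Windows (cp1252) source.
legacy = b"Et voil\xe0\x85"

# Decode once at the boundary and work with str internally.
text = legacy.decode("cp1252")
print(ascii(text))               # 'Et voil\xe0\u2026'

# Encode as UTF-8 for output; every Unicode character is representable.
utf8 = text.encode("utf-8")
print(utf8)                      # b'Et voil\xc3\xa0\xe2\x80\xa6'

# Going back to latin-1 needs an explicit policy for characters it lacks.
latin1 = text.encode("latin-1", errors="replace")
print(latin1)                    # b'Et voil\xe0?'
```

Without `errors="replace"` the last call would raise `UnicodeEncodeError`, which is the same error shown in the quoted session.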

-- 
Jerry



#27256

From: wxjmfauth@gmail.com
Date: 2012-08-17 11:45 -0700
Message-ID: <mailman.3423.1345229106.4697.python-list@python.org>
In reply to: #27254

On Friday, 17 August 2012 20:21:34 UTC+2, Jerry Hill wrote:
> On Fri, Aug 17, 2012 at 1:49 PM,  <wxjmfauth@gmail.com> wrote:
> > The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
> > is one of these characters existing in the cp1252, mac-roman
> > coding schemes and not in iso-8859-1 (latin-1) and obviously
> > not in ascii. It causes Py3.3 to work a few 100% slower
> > than Py<3.3 versions due to the flexible string representation
> > (ascii/latin-1/ucs-2/ucs-4) (I found cases up to 1000%).
> >
> >>>> '…'.encode('cp1252')
> > b'\x85'
> >>>> '…'.encode('mac-roman')
> > b'\xc9'
> >>>> '…'.encode('iso-8859-1') # latin-1
> > Traceback (most recent call last):
> >   File "<eta last command>", line 1, in <module>
> > UnicodeEncodeError: 'latin-1' codec can't encode character '\u2026'
> > in position 0: ordinal not in range(256)
> >
> > If one could neglect this (typographically important) glyph, what
> > to say about the characters of the European scripts (languages)
> > present in cp1252 or in mac-roman but not in latin-1 (eg. the
> > French script/language)?
>
> So... python should change the longstanding definition of the latin-1
> character set?  This isn't some sort of python limitation, it's just
> the reality of legacy encodings that actually exist in the real world.
>
> > Very nice. Python 2 was built for ascii user, now Python 3 is
> > *optimized* for, let say, ascii user!
> >
> > The future is bright for Python. French users are better
> > served with Apple or MS products, simply because these
> > corporates know you can not write French with iso-8859-1.
> >
> > PS When "TeX" moved from the ascii encoding to iso-8859-1
> > and the so called Cork encoding, "they" know this and provided
> > all the complementary packages to circumvent this. It was
> > in 199? (Python was not even born).
> >
> > Ditto for the foundries (Adobe, Linotype, ...)
>
> I don't understand what any of this has to do with Python.  Just
> output your text in UTF-8 like any civilized person in the 21st
> century, and none of that is a problem at all.  Python make that easy.
>  It also makes it easy to interoperate with older encodings if you
> have to.

Sorry, you missed the point.

My comment had nothing to do with the code source coding,
the coding of a Python "string" in the code source or with
the display of a Python3 <str>.
I wrote about the *internal* Python "coding", the
way Python keeps "strings" in memory. See PEP 393.
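
For readers unfamiliar with PEP 393, the storage width it describes can be observed from Python itself. This is a rough sketch that assumes CPython 3.3 or later; `bytes_per_char` is a hypothetical helper, and `sys.getsizeof` reports implementation-specific numbers, so other interpreters may differ. It only illustrates memory layout and makes no claim about speed.

```python
import sys

def bytes_per_char(ch, short=100, extra=1000):
    """Estimate storage per character by comparing two string lengths."""
    small = ch * short
    large = ch * (short + extra)
    # The fixed object header cancels out in the difference.
    return (sys.getsizeof(large) - sys.getsizeof(small)) // extra

samples = [
    ("ascii 'a'", "a"),
    ("latin-1 U+00E9", "\u00e9"),
    ("BMP U+2026", "\u2026"),
    ("astral U+1F600", "\U0001F600"),
]
for label, ch in samples:
    print(label, bytes_per_char(ch))
```

On CPython 3.3+ this should report 1, 1, 2 and 4 bytes per character: a string is stored at the width of its widest code point, which is why a single '…' moves a string from the 1-byte to the 2-byte form.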

jmf



#27257

From: wxjmfauth@gmail.com
Date: 2012-08-17 11:45 -0700
Message-ID: <253ddd61-4bb5-4f46-b58c-525e55b27558@googlegroups.com>
In reply to: #27254

Le vendredi 17 août 2012 20:21:34 UTC+2, Jerry Hill a écrit :
> On Fri, Aug 17, 2012 at 1:49 PM,  <wxjmfauth@gmail.com> wrote:
> > The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
> > is one of these characters existing in the cp1252, mac-roman
> > coding schemes and not in iso-8859-1 (latin-1) and obviously
> > not in ascii. It causes Py3.3 to work a few 100% slower
> > than Py<3.3 versions due to the flexible string representation
> > (ascii/latin-1/ucs-2/ucs-4) (I found cases up to 1000%).
> >
> > >>> '…'.encode('cp1252')
> > b'\x85'
> > >>> '…'.encode('mac-roman')
> > b'\xc9'
> > >>> '…'.encode('iso-8859-1') # latin-1
> > Traceback (most recent call last):
> >   File "<eta last command>", line 1, in <module>
> > UnicodeEncodeError: 'latin-1' codec can't encode character '\u2026'
> > in position 0: ordinal not in range(256)
> >
> > If one could neglect this (typographically important) glyph, what
> > to say about the characters of the European scripts (languages)
> > present in cp1252 or in mac-roman but not in latin-1 (eg. the
> > French script/language)?
>
> So... python should change the longstanding definition of the latin-1
> character set?  This isn't some sort of python limitation, it's just
> the reality of legacy encodings that actually exist in the real world.
>
> > Very nice. Python 2 was built for ascii user, now Python 3 is
> > *optimized* for, let say, ascii user!
> >
> > The future is bright for Python. French users are better
> > served with Apple or MS products, simply because these
> > corporates know you can not write French with iso-8859-1.
> >
> > PS When "TeX" moved from the ascii encoding to iso-8859-1
> > and the so called Cork encoding, "they" know this and provided
> > all the complementary packages to circumvent this. It was
> > in 199? (Python was not even born).
> >
> > Ditto for the foundries (Adobe, Linotype, ...)
>
> I don't understand what any of this has to do with Python.  Just
> output your text in UTF-8 like any civilized person in the 21st
> century, and none of that is a problem at all.  Python make that easy.
> It also makes it easy to interoperate with older encodings if you
> have to.

Sorry, you missed the point.

My comment had nothing to do with the source code encoding,
the encoding of a Python "string" in the source code, or
the display of a Python 3 <str>.
I wrote about the *internal* Python "coding", the
way Python keeps "strings" in memory. See PEP 393.

jmf



#27265

From: Dave Angel <d@davea.name>
Date: 2012-08-17 16:55 -0400
Message-ID: <mailman.3431.1345236951.4697.python-list@python.org>
In reply to: #27257
On 08/17/2012 02:45 PM, wxjmfauth@gmail.com wrote:
> On Friday, 17 August 2012 20:21:34 UTC+2, Jerry Hill wrote:
>> <SNIP>
>>
>> I don't understand what any of this has to do with Python.  Just
>>
>> output your text in UTF-8 like any civilized person in the 21st
>>
>> century, and none of that is a problem at all.  Python make that easy.
>>
>>  It also makes it easy to interoperate with older encodings if you
>>
>> have to.
>>
> Sorry, you missed the point.
>
> My comment had nothing to do with the code source coding,
> the coding of a Python "string" in the code source or with
> the display of a Python3 <str>.
> I wrote about the *internal* Python "coding", the
> way Python keeps "strings" in memory. See PEP 393.
>
> jmf

The internal coding described in PEP 393 has nothing to do with latin-1
encoding.  So what IS your point?  Make it clearly, without all the
snide side-comments.



-- 

DaveA



#27279

From: Dave Angel <d@davea.name>
Date: 2012-08-17 23:30 -0400
Message-ID: <mailman.3440.1345260650.4697.python-list@python.org>
In reply to: #27257
On 08/17/2012 08:21 PM, Ian Kelly wrote:
> On Aug 17, 2012 2:58 PM, "Dave Angel" <d@davea.name> wrote:
>> The internal coding described in PEP 393 has nothing to do with latin-1
>> encoding.
> It certainly does. PEP 393 provides for Unicode strings to be represented
> internally as any of Latin-1, UCS-2, or UCS-4, whichever is smallest and
> sufficient to contain the data. I understand the complaint to be that while
> the change is great for strings that happen to fit in Latin-1, it is less
> efficient than previous versions for strings that do not.

That's not the way I interpreted PEP 393.  It takes a pure Unicode
string, finds the largest code point in that string, and chooses 1, 2 or
4 bytes for every character, based on how many bits it'd take for that
largest code point.  Further, I read it to mean that only 00 bytes would
be dropped in the process; no other bytes would be changed.  I take it
as a coincidence that it happens to match latin-1;  that's the way
Unicode happened historically, and is not Python's fault.  Am I reading
it wrong?
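
A rough sketch of the selection rule as described in the paragraph above. This is plain Python for illustration only, not CPython's actual code, and `storage_width` is a hypothetical helper name:

```python
def storage_width(s):
    """Bytes per character implied by the largest code point in s."""
    largest = max(map(ord, s), default=0)
    if largest < 0x100:      # fits in one byte (ASCII and the Latin-1 range)
        return 1
    if largest < 0x10000:    # fits in two bytes (the rest of the BMP)
        return 2
    return 4                 # anything beyond the BMP

for text in ("abc", "café", "ab…", "ab\U0001F600"):
    print(repr(text), storage_width(text))
```

Under this rule a single '…' (U+2026) in an otherwise ASCII string moves the whole string to 2 bytes per character, which seems to be the case the complaint is about.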

I also figure this is going to be more space efficient than Python 3.2
for any string which had a max code point of 65535 or less (in Windows),
or 4 billion or less (in real systems).  So unless French has code points
over 64k, I can't figure that anything is lost.

I have no idea about the times involved, so I wanted a more specific
complaint.

> I don't know how much merit there is to this claim. It would seem to me
> that even in non-western locales, most strings are likely to be Latin-1 or
> even ASCII, e.g.  class and attribute and function names.
>
>

The jmfauth rant I was responding to was saying that French isn't
efficiently encoded, and that performance of some vague operations was
somehow reduced severalfold.  I was just trying to get him to be
more specific.



-- 

DaveA



#27281

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2012-08-18 04:10 +0000
Message-ID: <502f15b5$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to: #27279
On Fri, 17 Aug 2012 23:30:22 -0400, Dave Angel wrote:

> On 08/17/2012 08:21 PM, Ian Kelly wrote:
>> On Aug 17, 2012 2:58 PM, "Dave Angel" <d@davea.name> wrote:
>>> The internal coding described in PEP 393 has nothing to do with
>>> latin-1 encoding.
>> It certainly does. PEP 393 provides for Unicode strings to be
>> represented internally as any of Latin-1, UCS-2, or UCS-4, whichever is
>> smallest and sufficient to contain the data. 

Unicode strings are not represented as Latin-1 internally. Latin-1 is a 
byte encoding, not a unicode internal format. Perhaps you mean to say 
that they are represented as a single byte format?

>> I understand the complaint
>> to be that while the change is great for strings that happen to fit in
>> Latin-1, it is less efficient than previous versions for strings that
>> do not.
> 
> That's not the way I interpreted the PEP 393.  It takes a pure unicode
> string, finds the largest code point in that string, and chooses 1, 2 or
> 4 bytes for every character, based on how many bits it'd take for that
> largest code point.

That's how I interpret it too.


> Further i read it to mean that only 00 bytes would
> be dropped in the process, no other bytes would be changed.

Just to clarify, you aren't talking about the \0 character, but only about 
extraneous "padding" 00 bytes.


> I also figure this is going to be more space efficient than Python 3.2
> for any string which had a max code point of 65535 or less (in Windows),
> or 4billion or less (in real systems).  So unless French has code points
> over 64k, I can't figure that anything is lost.

I think that on narrow builds, it won't make terribly much difference. 
The big savings are for wide builds.


-- 
Steven



#27297

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2012-08-18 09:18 -0600
Message-ID: <mailman.3452.1345303152.4697.python-list@python.org>
In reply to: #27281
(Resending this to the list because I previously sent it only to
Steven by mistake.  Also showing off a case where top-posting is
reasonable, since this bit requires no context. :-)

On Sat, Aug 18, 2012 at 1:41 AM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
>
> On Aug 17, 2012 10:17 PM, "Steven D'Aprano"
> <steve+comp.lang.python@pearwood.info> wrote:
>>
>> Unicode strings are not represented as Latin-1 internally. Latin-1 is a
>> byte encoding, not a unicode internal format. Perhaps you mean to say
>> that they are represented as a single byte format?
>
> They are represented as a single-byte format that happens to be equivalent
> to Latin-1, because Latin-1 is a proper subset of Unicode; every character
> representable in Latin-1 has a byte value equal to its Unicode codepoint.
> This talk of whether it's a byte encoding or a 1-byte Unicode representation
> is then just semantics. Even the PEP refers to the 1-byte representation as
> Latin-1.
>
>>
>> >> I understand the complaint
>> >> to be that while the change is great for strings that happen to fit in
>> >> Latin-1, it is less efficient than previous versions for strings that
>> >> do not.
>> >
>> > That's not the way I interpreted the PEP 393.  It takes a pure unicode
>> > string, finds the largest code point in that string, and chooses 1, 2 or
>> > 4 bytes for every character, based on how many bits it'd take for that
>> > largest code point.
>>
>> That's how I interpret it too.
>
> I don't see how this is any different from what I described. Using all 4
> bytes of the code point, you get UCS-4. Truncating to 2 bytes, you get
> UCS-2. Truncating to 1 byte, you get Latin-1.
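
The equivalence claimed above can be checked from Python itself. A small sketch, assuming little-endian codecs purely for convenience; the sample strings are arbitrary:

```python
import struct

narrow = "déjà vu"    # every code point is below 256
wide = "ab…"          # contains U+2026, still inside the BMP

# Each Latin-1 byte value equals the corresponding code point.
latin1_values = list(narrow.encode("latin-1"))
print(latin1_values == [ord(ch) for ch in narrow])

# With 2 or 4 bytes per character the stored units are again the code points.
units16 = struct.unpack("<%dH" % len(wide), wide.encode("utf-16-le"))
units32 = struct.unpack("<%dI" % len(wide), wide.encode("utf-32-le"))
print(list(units16) == [ord(ch) for ch in wide])
print(list(units32) == [ord(ch) for ch in wide])
```

Note that UTF-16 only coincides with UCS-2 while the string stays inside the BMP, which is why `wide` contains no astral characters here.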



#27280

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2012-08-18 03:59 +0000
Message-ID: <502f1333$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to: #27257
On Fri, 17 Aug 2012 11:45:02 -0700, wxjmfauth wrote:

> On Friday, 17 August 2012 20:21:34 UTC+2, Jerry Hill wrote:
>> On Fri, Aug 17, 2012 at 1:49 PM,  <wxjmfauth@gmail.com> wrote:
>> 
>> > The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
>> > is one of these characters existing in the cp1252, mac-roman
>> > coding schemes and not in iso-8859-1 (latin-1) and obviously
>> > not in ascii. It causes Py3.3 to work a few 100% slower
>> > than Py<3.3 versions due to the flexible string representation
>> > (ascii/latin-1/ucs-2/ucs-4) (I found cases up to 1000%).
[...]
> Sorry, you missed the point.
> 
> My comment had nothing to do with the code source coding, the coding of
> a Python "string" in the code source or with the display of a Python3
> <str>.
> I wrote about the *internal* Python "coding", the way Python keeps
> "strings" in memory. See PEP 393.


The PEP does not support your claim that flexible string storage is 100% 
to 1000% slower. It claims 1% - 30% slowdown, with a saving of up to 60% 
of the memory used for strings.

I don't really understand what message you are trying to give here. Are 
you saying that PEP 393 is a good thing or a bad thing?

In Python 1.x, there was no support for Unicode at all. You could only 
work with pure byte strings. Support for non-ascii characters like … ∞ é ñ
£ π Ж ش was purely by accident -- if your terminal happened to be set to 
an encoding that supported a character, and you happened to use the 
appropriate byte value, you might see the character you wanted.

In Python 2.0, Python gained support for Unicode. You could now guarantee 
support for any Unicode character in the Basic Multilingual Plane (BMP) 
by writing your strings using the u"..." style. In Python 3, you no 
longer need the leading u; all strings are Unicode.

But there is a problem: if your Python interpreter is a "narrow build", 
it *only* supports Unicode characters in the BMP. When Python is a "wide 
build", compiled with support for the additional character planes, then 
strings take much more memory, even if they are in the BMP, or are simple 
ASCII strings.

PEP 393 fixes this problem and gets rid of the distinction between narrow 
and wide builds. From Python 3.3 onwards, all Python builds will have 
the same support for Unicode, rather than most being BMP-only. Each 
individual string's internal storage will use only as many bytes per 
character as needed to store the largest character in the string.

This will save a lot of memory for programs whose strings are mostly ASCII 
or Latin-1 with only a few wider characters. While the increased complexity 
causes a small slowdown, the increased functionality makes it well worthwhile.
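
One way to observe the per-string storage width, sketched under the assumption of a CPython 3.3 or later interpreter (`sys.getsizeof` reports implementation details, so other interpreters may behave differently). `bytes_per_char` is a hypothetical helper:

```python
import sys

def bytes_per_char(ch, n=1000):
    """Growth in object size when the string gets one character longer."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

samples = [("ASCII", "a"), ("Latin-1", "é"), ("BMP", "…"), ("non-BMP", "\U0001F600")]
for label, ch in samples:
    print(label, bytes_per_char(ch))
```

Measuring the difference between two lengths, rather than one absolute size, keeps the fixed object header out of the result.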



-- 
Steven



#27250

From: wxjmfauth@gmail.com
Date: 2012-08-17 10:49 -0700
Message-ID: <mailman.3421.1345226504.4697.python-list@python.org>
In reply to: #27215
On Friday, 17 August 2012 01:59:31 UTC+2, Terry Reedy wrote:
> a = '…'
> print(ord(a))
>  >>>
> 8230
>
> Most things with unicode are easier in 3.x, and some are even better in
> 3.3. The current beta is good enough for most informal work. 3.3.0 will
> be out in a month.
>
> --
> Terry Jan Reedy

Slightly off topic.

The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
is one of those characters that exist in the cp1252 and mac-roman
coding schemes but not in iso-8859-1 (latin-1), and obviously
not in ascii. It causes Py3.3 to run a few 100% slower
than Py<3.3 versions due to the flexible string representation
(ascii/latin-1/ucs-2/ucs-4) (I found cases of up to 1000%).

>>> '…'.encode('cp1252')
b'\x85'
>>> '…'.encode('mac-roman')
b'\xc9'
>>> '…'.encode('iso-8859-1') # latin-1
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2026'
in position 0: ordinal not in range(256)

Even if one could neglect this (typographically important) glyph, what
about the characters of the European scripts (languages)
present in cp1252 or in mac-roman but not in latin-1 (e.g. for the
French script/language)?
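
To make the comparison concrete, here is a small sketch that checks a handful of characters against both codecs. The sample characters are my own choice for illustration and `encodable` is a hypothetical helper:

```python
def encodable(ch, codec):
    """Return True if the codec can represent the character."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# A few accented letters, then characters found in cp1252 only.
for ch in "éàç" + "œŒŸ€…":
    print(ch, "U+%04X" % ord(ch),
          "latin-1:", encodable(ch, "latin-1"),
          "cp1252:", encodable(ch, "cp1252"))
```

The characters that latin-1 rejects are exactly those with code points of 256 or more.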

Very nice. Python 2 was built for ASCII users; now Python 3 is
*optimized* for, let's say, ASCII users!

The future is bright for Python. French users are better
served by Apple or MS products, simply because these
corporations know you cannot write French with iso-8859-1.

PS When "TeX" moved from the ASCII encoding to iso-8859-1
and the so-called Cork encoding, "they" knew this and provided
all the complementary packages to circumvent it. It was
in 199? (Python was not even born).

Ditto for the foundries (Adobe, Linotype, ...)

jmf



#27223

From: Alister <alister.ware@ntlworld.com>
Date: 2012-08-17 06:30 +0000
Message-ID: <lylXr.960568$gC5.364193@fx10.am4>
In reply to: #27204
On Thu, 16 Aug 2012 15:09:47 -0700, Charles Jensen wrote:

> Everyone knows that the python command
> 
>      ord(u'…')
> 
> will output the number 8230 which is the unicode character for the
> horizontal ellipsis.
> 
> How would I use ord() to find the unicode value of a string stored in a
> variable?
> 
> So the following 2 lines of code will give me the ascii value of the
> variable a.  How do I specify ord to give me the unicode value of a?
> 
>      a = '…'
>      ord(a)





The same way you did in your original example, by defining the string as 
Unicode:
a = u'…'
ord(a)
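
For reference, a sketch of the same lookup in Python 3, where every str is already Unicode. The last part assumes the original bytes were UTF-8, which the question does not state:

```python
import unicodedata

a = "…"                      # no u prefix needed in Python 3
print(ord(a))                # 8230
print(hex(ord(a)))           # 0x2026
print(unicodedata.name(a))   # HORIZONTAL ELLIPSIS

# A Python 2 byte string would have held encoded bytes instead of one character.
raw = a.encode("utf-8")
print(len(raw))                  # 3
print(ord(raw.decode("utf-8")))  # decode first, then ord() works again
```

That a plain Python 2 string holds several bytes rather than one character would explain why `ord(a)` fails without the u prefix.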
-- 
Keep on keepin' on.



#27288

From: wxjmfauth@gmail.com
Date: 2012-08-18 01:09 -0700
Message-ID: <308df2af-abe7-4043-b199-0a39f440e0ab@googlegroups.com>
In reply to: #27204
>>> sys.version
'3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]'
>>> timeit.timeit("('ab…' * 1000).replace('…', '……')")
37.32762490493721
timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
0.8158757139801764

>>> sys.version
'3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:02:36) [MSC v.1600 32 bit 
(Intel)]'
>>> imeit.timeit("('ab…' * 1000).replace('…', '……')")
61.919225272152346
>>> timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
1.2918679017971044

timeit.timeit("('ab…' * 10).replace('…', '€…')")
1.2484133226156757

* I intuitively and empirically noticed that this happens for
cp1252 or mac-roman characters and not for characters which are
elements of the latin-1 coding scheme.

* Bad luck, such characters are common in French text
(and in some other European languages).

* I do not recall the extreme cases I found. Believe me, when
I speak about a few 100%, I do not lie.

My take on the subject.

This is a typical Python disease. Do not solve a problem, but
find a way, a workaround, which is expected to solve a problem
and which finally solves nothing. As far as I know, to break
the "BMP limit", the tools are here. They are called utf-8 or
ucs-4/utf-32.

One day, I came across a very, very old mail message, dating from
the time of the introduction of the unicode type in Python 2.
If I recall correctly it was from Victor Stinner. He wrote
something like this: "Let's go with ucs-4, and the problems
are solved for ever". He was so right.

I have been following the dev-list for years; my feeling is that
there is always a latent and permanent conflict between
"ascii users" and "non ascii users" (see the unicode
literal reintroduction).

Please, do not get me wrong. As a non-computer scientist,
I'm very happy with Python. But when I try to take a step
back, I become more and more sceptical.

PS Py3.3b2 is still crashing, silently exiting, with
cp65001.

jmf



#27291

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2012-08-18 12:27 +0000
Message-ID: <502f8a2a$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to: #27288
On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote:

>>>> sys.version
> '3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]'
>>>> timeit.timeit("('ab…' * 1000).replace('…', '……')")
> 37.32762490493721
> timeit.timeit("('ab…' * 10).replace('…', 'œ…')") 0.8158757139801764
> 
>>>> sys.version
> '3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:02:36) [MSC v.1600 32
> bit (Intel)]'
>>>> imeit.timeit("('ab…' * 1000).replace('…', '……')")
> 61.919225272152346

"imeit"?

It is hard to take your results seriously when you have so obviously 
edited your timing results, not just copied and pasted them.


Here are my results, on my laptop running Debian Linux. First, testing on 
Python 3.2:

steve@runes:~$ python3.2 -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 50.2 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', '……')"
10000 loops, best of 3: 45.3 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', 'x…')"
10000 loops, best of 3: 51.3 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', 'œ…')"
10000 loops, best of 3: 47.6 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', '€…')"
10000 loops, best of 3: 45.9 usec per loop
steve@runes:~$ python3.2 -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 57.5 usec per loop
steve@runes:~$ python3.2 -m timeit "('XYZ' * 1000).replace('Y', 'πЖ')"
10000 loops, best of 3: 49.7 usec per loop


As you can see, the timing results are all consistently around 50 
microseconds per loop, regardless of which characters I use, whether they 
are in Latin-1 or not. The differences between one test and another are 
not meaningful.


Now I do them again using Python 3.3:

steve@runes:~$ python3.3 -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 64.3 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', '……')"
10000 loops, best of 3: 67.8 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', 'x…')"
10000 loops, best of 3: 66 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', 'œ…')"
10000 loops, best of 3: 67.6 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', '€…')"
10000 loops, best of 3: 68.3 usec per loop
steve@runes:~$ python3.3 -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 67.9 usec per loop
steve@runes:~$ python3.3 -m timeit "('XYZ' * 1000).replace('Y', 'πЖ')"
10000 loops, best of 3: 66.9 usec per loop

The results are all consistently around 67 microseconds. So Python 3.3's 
string handling is about 30% slower in the examples shown here.
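
The same comparison can be scripted so that the numbers are produced rather than retyped. A minimal sketch using the standard `timeit` module; `best_usec` and the choice of statements are assumptions for illustration, and the timings will of course differ per machine and interpreter:

```python
import timeit

STATEMENTS = [
    "('abc' * 1000).replace('c', 'de')",
    "('ab…' * 1000).replace('…', '……')",
    "('ab…' * 1000).replace('…', 'œ…')",
    "('XYZ' * 1000).replace('X', 'éç')",
]

def best_usec(stmt, number=1000, repeat=3):
    """Best-of-`repeat` time per loop, in microseconds."""
    timings = timeit.repeat(stmt, number=number, repeat=repeat)
    return min(timings) / number * 1e6

for stmt in STATEMENTS:
    print("%8.1f usec  %s" % (best_usec(stmt), stmt))
```

Running the identical script under each interpreter version gives directly comparable figures for a bug report.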

If you can consistently replicate a 100% to 1000% slowdown in string 
handling, please report it as a performance bug:


http://bugs.python.org/

Don't forget to report your operating system.



> My take of the subject.
> 
> This is a typical Python desease. Do not solve a problem, but find a
> way, a workaround, which is expecting to solve a problem and which
> finally solves nothing. As far as I know, to break the "BMP limit", the
> tools are here. They are called utf-8 or ucs-4/utf-32.

The problem with UCS-4 is that every character requires four bytes. 
Every. Single. One.

So under UCS-4, the pure-ascii string "hello world" takes 44 bytes plus 
the object overhead. Under UCS-2, it takes half that space: 22 bytes, but 
of course UCS-2 can only represent characters in the BMP. A pure ASCII 
string would only take 11 bytes, but we're not going back to pure ASCII.
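
The arithmetic is easy to confirm by encoding the string with fixed-width codecs (byte-order-specific variants are used here so that no BOM is added to the count):

```python
s = "hello world"
print(len(s))                        # 11 characters
print(len(s.encode("utf-32-le")))    # 44 bytes at 4 bytes per character
print(len(s.encode("utf-16-le")))    # 22 bytes at 2 bytes per character
print(len(s.encode("ascii")))        # 11 bytes at 1 byte per character
```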

(There is an extension to UCS-2, UTF-16, which encodes non-BMP characters 
using two 16-bit code units, a surrogate pair. This is fragile and doesn't 
work very well, because string-handling methods can break the surrogate 
pairs apart, leaving you with an invalid Unicode string. Not good.)
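
A short sketch of what a surrogate pair looks like and how splitting one goes wrong. The character U+1F600 is an arbitrary non-BMP example chosen for illustration:

```python
import struct

ch = "\U0001F600"                       # one code point outside the BMP
encoded = ch.encode("utf-16-le")
units = struct.unpack("<2H", encoded)   # two 16-bit code units
print([hex(u) for u in units])

# The same pair computed by hand from the code point.
offset = ord(ch) - 0x10000
high = 0xD800 + (offset >> 10)
low = 0xDC00 + (offset & 0x3FF)
print(units == (high, low))

# Cutting the encoded form between the two units leaves a lone surrogate.
try:
    encoded[:2].decode("utf-16-le")
except UnicodeDecodeError as exc:
    print("cannot decode half a pair:", exc.reason)
```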

The difference between 44 bytes and 22 bytes for one little string is not 
very important, but when you double the memory required for every single 
string it becomes huge. Remember that every class, function and method 
has a name, which is a string; every attribute and variable has a name, 
all strings; functions and classes have doc strings, all strings. Strings 
are used everywhere in Python, and doubling the memory needed by Python 
means that it will perform worse.

With PEP 393, each Python string will be stored in the most efficient 
format possible:

- if it only contains ASCII or Latin-1 characters (code points below 
256), it will be stored using 1 byte per character;

- if it only contains characters in the BMP, it will be stored using 
UCS-2 (2 bytes per character);

- if it contains non-BMP characters, the string will be stored using 
UCS-4 (4 bytes per character).



-- 
Steven



#27296

From: wxjmfauth@gmail.com
Date: 2012-08-18 08:07 -0700
Message-ID: <d575737d-c1e3-47db-9c7b-10fe0300cba7@googlegroups.com>
In reply to: #27291
On Saturday, 18 August 2012 14:27:23 UTC+2, Steven D'Aprano wrote:
> [...]
> The problem with UCS-4 is that every character requires four bytes. 
> [...]

I'm aware of this (and all the blah blah blah you are
explaining). It is always the same song. Memory.

Let me ask. Is Python an "American" product for US users
or is it a tool for everybody [*]?
Is there any reason why non-ASCII users are somehow penalized
compared to ASCII users?

This flexible string representation is a regression (ASCII users
or not).

I recognize that in practice the real impact is close to zero
for many users (including me), but I have shown (I think) that
this flexible representation is, by design, not as optimal
as it is supposed to be. This is in my mind the relevant point.

[*] This is not even true, if we consider the €uro currency
symbol used all around the world (banking, accounting
applications).

jmf



#27299

From: Mark Lawrence <breamoreboy@yahoo.co.uk>
Date: 2012-08-18 16:25 +0100
Message-ID: <mailman.3453.1345303500.4697.python-list@python.org>
In reply to: #27296
On 18/08/2012 16:07, wxjmfauth@gmail.com wrote:
> On Saturday, 18 August 2012 14:27:23 UTC+2, Steven D'Aprano wrote:
>> [...]
>> The problem with UCS-4 is that every character requires four bytes.
>> [...]
>
> I'm aware of this (and all the blah blah blah you are
> explaining). This always the same song. Memory.
>
> Let me ask. Is Python an 'american" product for us-users
> or is it a tool for everybody [*]?
> Is there any reason why non ascii users are somehow penalized
> compared to ascii users?
>
> This flexible string representation is a regression (ascii users
> or not).
>
> I recognize in practice the real impact is for many users
> closed to zero (including me) but I have shown (I think) that
> this flexible representation is, by design, not as optimal
> as it is supposed to be. This is in my mind the relevant point.
>
> [*] This not even true, if we consider the €uro currency
> symbol used all around the world (banking, accounting
> applications).
>
> jmf
>

Sorry but you've got me completely baffled.  Could you please explain in 
words of one syllable or less so I can attempt to grasp what the hell 
you're on about?

-- 
Cheers.

Mark Lawrence.



#27301

From: Chris Angelico <rosuav@gmail.com>
Date: 2012-08-19 01:36 +1000
Message-ID: <mailman.3454.1345304165.4697.python-list@python.org>
In reply to: #27296
On Sun, Aug 19, 2012 at 1:07 AM,  <wxjmfauth@gmail.com> wrote:
> I'm aware of this (and all the blah blah blah you are
> explaining). This always the same song. Memory.
>
> Let me ask. Is Python an 'american" product for us-users
> or is it a tool for everybody [*]?
> Is there any reason why non ascii users are somehow penalized
> compared to ascii users?

Regardless of your own native language, "len" is the name of a popular
Python function. And "dict" is a well-used class. Both those names are
representable in ASCII, even if every quoted string in your code
requires more bytes to store.
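
A quick check of that observation against the interpreter's own built-in names. This is only a sketch; it looks at the public names in `builtins` and says nothing about user code:

```python
import builtins

# Public built-in names such as len, dict, print, ValueError ...
names = [name for name in dir(builtins) if not name.startswith("_")]

# True when every character of every name is below code point 128.
ascii_only = all(ord(ch) < 128 for name in names for ch in name)

print(len(names), "public built-in names, all ASCII:", ascii_only)
print("len" in names, "dict" in names)
```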

And memory usage has significance in many other areas, too. CPU cache
utilization turns a space saving into a time saving. That's why
structure packing still exists, even though member alignment has other
advantages.

You'd be amazed how many non-USA strings still fit inside seven bits,
too. Are you appending a space to something? Splitting on newlines?
You'll have lots of strings that are now going to be space-optimized.
Of course, the performance gains from shortening some of the strings
may be offset by costs when comparing one-byte and multi-byte strings,
but presumably that's all been gone into in great detail elsewhere.

ChrisA
