How do I display unicode value stored in a string variable using ord()

Started by	Charles Jensen <hopefullycharles@gmail.com>
First post	2012-08-16 15:09 -0700
Last post	2012-08-20 17:20 -0400
Articles	20 on this page of 145 — 26 participants

  How do I display unicode value stored in a string variable using ord() Charles Jensen <hopefullycharles@gmail.com> - 2012-08-16 15:09 -0700
    Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-17 08:20 +1000
    Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-16 18:47 -0400
    Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-16 19:59 -0400
      Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 10:49 -0700
        Re: How do I display unicode value stored in a string variable using ord() Jerry Hill <malaclypse2@gmail.com> - 2012-08-17 14:21 -0400
          Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 11:45 -0700
          Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 11:45 -0700
            Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-17 16:55 -0400
            Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-17 23:30 -0400
              Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 04:10 +0000
                Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-18 09:18 -0600
            Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 03:59 +0000
      Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 10:49 -0700
    Re: How do I display unicode value stored in a string variable using ord() Alister <alister.ware@ntlworld.com> - 2012-08-17 06:30 +0000
    Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 01:09 -0700
      Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 12:27 +0000
        Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 08:07 -0700
          Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 16:25 +0100
          Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 01:36 +1000
          Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-18 09:51 -0600
            Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 09:38 -0700
              Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 02:57 +1000
              Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 18:28 +0100
                Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 11:05 -0700
                  Re: How do I display unicode value stored in a string variable using ord() MRAB <python@mrabarnett.plus.com> - 2012-08-18 19:34 +0100
                    Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:35 +0000
                      New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord() Peter Otten <__peter__@web.de> - 2012-08-19 09:43 +0200
                        Re: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 08:56 +0000
                          Re: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 02:24 -0700
                          Re: New internal string format in 3.3 Peter Otten <__peter__@web.de> - 2012-08-19 11:37 +0200
                            Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 03:19 -0700
                              Re: New internal string format in 3.3 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 13:33 +0000
                            Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 03:19 -0700
                              Re: New internal string format in 3.3 Chris Angelico <rosuav@gmail.com> - 2012-08-19 20:26 +1000
                                Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:14 -0700
                                  Re: New internal string format in 3.3 Dave Angel <d@davea.name> - 2012-08-19 08:29 -0400
                                    Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:59 -0700
                                      Re: New internal string format in 3.3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 14:46 +0100
                                        Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 07:09 -0700
                                        Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 07:09 -0700
                                          Re: New internal string format in 3.3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 15:48 +0100
                                            Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 09:19 -0700
                                            Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 09:19 -0700
                                          Re: New internal string format in 3.3 Terry Reedy <tjreedy@udel.edu> - 2012-08-19 13:48 -0400
                                            Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 10:51 -0700
                                              Re: New internal string format in 3.3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 19:09 +0100
                                              Re: New internal string format in 3.3 Chris Angelico <rosuav@gmail.com> - 2012-08-20 07:50 +1000
                                              Re: New internal string format in 3.3 Michael Torrie <torriem@gmail.com> - 2012-08-19 23:38 -0600
                                                Re: New internal string format in 3.3 Roy Smith <roy@panix.com> - 2012-08-20 09:17 -0400
                                                  Re: New internal string format in 3.3 Michael Torrie <torriem@gmail.com> - 2012-08-20 22:18 -0600
                                                    Re: New internal string format in 3.3 Roy Smith <roy@panix.com> - 2012-08-21 07:48 -0400
                                            Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 10:51 -0700
                                      Re: New internal string format in 3.3 Terry Reedy <tjreedy@udel.edu> - 2012-08-19 13:56 -0400
                                    Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:59 -0700
                                  Re: New internal string format in 3.3 Dave Angel <d@davea.name> - 2012-08-19 08:35 -0400
                                Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:14 -0700
                  Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:30 +0000
                Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 11:05 -0700
              Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-18 16:09 -0400
              Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-18 23:12 -0400
            Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 09:38 -0700
            Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:33 +0000
              Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-19 11:50 -0600
                Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 11:20 -0700
                  Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-19 12:31 -0600
                    Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 12:23 -0700
                Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 20:16 +0000
              Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-19 12:46 -0600
          Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 17:59 +0000
            Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 11:30 -0700
              Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 20:45 +0100
              Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:13 +0000
            Re: How do I display unicode value stored in a string variable using ord() rusi <rustompmody@gmail.com> - 2012-08-18 11:40 -0700
              Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 20:50 +0100
              Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 13:22 -0700
                Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 22:37 +0100
        Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 11:26 -0700
          Re: How do I display unicode value stored in a string variable using ord() MRAB <python@mrabarnett.plus.com> - 2012-08-18 19:59 +0100
            Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 07:17 +0000
          Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 10:46 +1000
            Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 19:11 -0700
              Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 12:19 +1000
                Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 19:35 -0700
                  Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 13:01 +1000
                    Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 20:10 -0700
                      Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 13:31 +1000
                        Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 22:58 -0700
                  Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 08:01 +0000
                    Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 01:11 -0700
                      Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 18:24 +1000
                        Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 01:44 -0700
                          Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 01:54 -0700
                            Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 11:46 +0100
                            Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 12:31 -0400
                      Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 10:51 +0000
                        Re: How do I display unicode value stored in a string variable using ord() Neil Hodgson <nhodgson@iinet.net.au> - 2012-08-21 17:03 +1000
          Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:09 +0000
            Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 01:04 -0700
              Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 13:25 +0000
                Re: How do I display unicode value stored in a string variable using ord() DJC <djc@news.invalid> - 2012-08-19 17:32 +0200
              Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 13:34 -0400
                Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 10:48 -0700
                  Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 11:11 -0700
                    Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 19:50 +0100
                    Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 17:59 -0400
                    Re: How do I display unicode value stored in a string variable using ord() rusi <rustompmody@gmail.com> - 2012-08-19 23:13 -0700
                  Abuse of Big Oh notation [was Re: How do I display unicode value stored in a string variable using ord()] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 20:15 +0000
                    Re: Abuse of Big Oh notation Paul Rubin <no.email@nospam.invalid> - 2012-08-19 16:42 -0700
                      Re: Abuse of Big Oh notation Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-08-20 09:24 +0100
                        Re: Abuse of Big Oh notation Paul Rubin <no.email@nospam.invalid> - 2012-08-20 09:01 -0700
                          Re: Abuse of Big Oh notation Chris Angelico <rosuav@gmail.com> - 2012-08-21 02:09 +1000
                          Re: Abuse of Big Oh notation Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-20 11:12 -0600
                            Re: Abuse of Big Oh notation Paul Rubin <no.email@nospam.invalid> - 2012-08-20 12:29 -0700
                              Re: Abuse of Big Oh notation 88888 Dihedral <dihedral88888@googlemail.com> - 2012-08-20 15:16 -0700
                              Re: Abuse of Big Oh notation 88888 Dihedral <dihedral88888@googlemail.com> - 2012-08-20 15:20 -0700
                            Re: Abuse of Big Oh notation Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-21 09:53 +0000
                        Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-20 11:42 -0700
                          Re: Abuse of Big Oh notation Ned Deily <nad@acm.org> - 2012-08-20 18:19 -0700
                          Abuse of subject, was Re: Abuse of Big Oh notation Peter Otten <__peter__@web.de> - 2012-08-21 09:52 +0200
                            Re: Abuse of subject, was Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-21 10:16 -0700
                            Re: Abuse of subject, was Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-21 10:16 -0700
                        Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-20 11:42 -0700
                  Re: How do I display unicode value stored in a string variable using ord() Hans Mulder <hansmu@xs4all.nl> - 2012-08-22 20:53 +0200
              Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-20 08:42 +1000
                Re: How do I display unicode value stored in a string variable using ord() Roy Smith <roy@panix.com> - 2012-08-19 19:24 -0400
                  Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-20 04:21 +0000
                    Re: How do I display unicode value stored in a string variable using ord() Roy Smith <roy@panix.com> - 2012-08-20 00:44 -0400
                      Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-20 05:56 +0000
                        Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 23:24 -0700
                    Re: How do I display unicode value stored in a string variable using ord() Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-08-20 12:58 -0400
              Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 20:35 -0400
              Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-20 14:07 +1000
            Re: How do I display unicode value stored in a string variable using ord() lipska the kat <lipskathekat@yahoo.co.uk> - 2012-08-19 11:13 +0100
              Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 20:19 +1000
                Re: How do I display unicode value stored in a string variable using ord() lipska the kat <lipskathekat@yahoo.co.uk> - 2012-08-19 11:49 +0100
        Re: How do I display unicode value stored in a string variable using ord() "Blind Anagram" <noname@nowhere.com> - 2012-08-19 18:03 +0100
          Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 10:33 -0700
            Re: How do I display unicode value stored in a string variable using ord() "Blind Anagram" <noname@nowhere.com> - 2012-08-19 19:04 +0100
          Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-19 14:05 -0400
            Re: How do I display unicode value stored in a string variable using ord() "Blind Anagram" <noname@nowhere.com> - 2012-08-19 19:18 +0100
          Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 20:31 +0000
          Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 17:03 -0400
          Re: How do I display unicode value stored in a string variable using ord() 88888 Dihedral <dihedral88888@googlemail.com> - 2012-08-19 17:32 -0700
          Re: How do I display unicode value stored in a string variable using ord() Piet van Oostrum <piet@vanoostrum.org> - 2012-08-20 17:20 -0400

#27393 — Re: New internal string format in 3.3

From	wxjmfauth@gmail.com
Date	2012-08-19 07:09 -0700
Subject	Re: New internal string format in 3.3
Message-ID	<f6de81c6-2965-42dd-a789-0770a019c038@googlegroups.com>
In reply to	#27391

On Sunday, 19 August 2012 15:46:34 UTC+2, Mark Lawrence wrote:
> On 19/08/2012 13:59, wxjmfauth@gmail.com wrote:
> > On Sunday, 19 August 2012 14:29:17 UTC+2, Dave Angel wrote:
> >> On 08/19/2012 08:14 AM, wxjmfauth@gmail.com wrote:
> >>> On Sunday, 19 August 2012 12:26:44 UTC+2, Chris Angelico wrote:
> >>>> On Sun, Aug 19, 2012 at 8:19 PM,  <wxjmfauth@gmail.com> wrote:
> >>>>> This is precisely the weak point of this flexible
> >>>>> representation. It uses latin-1 and latin-1 is for
> >>>>> most users simply unusable.
> >>>>
> >>>> No, it uses Unicode, and as an optimization, attempts to store the
> >>>> codepoints in less than four bytes for most strings. The fact that a
> >>>> one-byte storage format happens to look like latin-1 is rather
> >>>> coincidental.
> >>>
> >>> And this is the common basic mistake. You do not push your
> >>> argumentation far enough. A character may "fall" accidentally in a latin-1.
> >>> The problem lies in these european characters, which can not fall in this
> >>> coding. This *is* the cause of the negative side effects.
> >>> If you are using a correct coding scheme, like cp1252, mac-roman or
> >>> iso-8859-15, you will never see such a negative side effect.
> >>> Again, the problem is not the result, the encoded character. The critical
> >>> part is the character which may cause this side effect.
> >>> You should think "character set" and not encoded "code point", considering
> >>> this kind of expression has a sense in 8-bits coding scheme.
> >>>
> >>> jmf
> >>
> >> But that choice was made decades ago when Unicode picked its second 128
> >> characters.  The internal form used in this PEP is simply the low-order
> >> byte of the Unicode code point.  Trying to scan the string deciding if
> >> converting to cp1252 (for example) would be a much more expensive
> >> operation than seeing how many bytes it'd take for the largest code point.
> >
> > You are absolutely right. (I'm quite comfortable with Unicode).
> > If Python wishes to perpetuate this, let's call it, design mistake
> > or annoyance, it will continue to live with problems.
>
> Please give a precise description of the design mistake and what you
> would do to correct it.
>
> > People (tools) who chose pure utf-16 or utf-32 are not suffering
> > from this issue.
> >
> > *My* final comment on this thread.
> >
> > In August 2012, after 20 years of development, Python is not
> > able to display a piece of text correctly on a Windows console
> > (eg cp65001).
>
> Examples please.
>
> > I downloaded the go language, zero experience, I did not succeed
> > in displaying a piece of text incorrectly. (This is by the way *the*
> > reason why I tested it). Where the problems are coming from, I have
> > no idea.
> >
> > I find this situation quite comic. Python is able to
> > produce this:
> >
> >>>> (1.1).hex()
> > '0x1.199999999999ap+0'
> >
> > but it is not able to display a piece of text!
>
> So you keep saying, but when asked for examples or evidence nothing gets
> produced.
>
> > Try to convince end users IEEE 754 is more important than the
> > ability to read/write a piece of text, a 6-year-old kid has learned
> > at school :-)
> >
> > (I'm not suffering from this kind of effect, as a Windows user,
> > I'm always working via gui, it still remains, the problem exists.
>
> Windows is a law unto itself.  Its problems are hardly specific to Python.
>
> > Regards,
> > jmf
>
> Now two or three times you've said you're going but have come back.  If
> you come again could you please provide examples and or evidence of what
> you're on about, because you still have me baffled.
>
> --
> Cheers.
>
> Mark Lawrence.

Yesterday, I went to bed.
More seriously.

I can not give you more numbers than those I gave.
As an end user, I noticed and experimented my random tests
are always slower in Py3.3 than in Py3.2 on my Windows platform.

It is up to you, the core developers to give an explanation
about this behaviour.

As I understand a little bit the coding of the characters,
I pointed out, this is most probably due to this flexible
string representation (with arguments appearing randomly
in the misc. messages, mainly latin-1).

I can not do more.

(I stupidly spoke about factors 0.1 to ..., you should
read of course, 1.1,  to ...)

jmf
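
For readers following the latin-1 argument, here is a minimal sketch, assuming CPython 3.3 or later (PEP 393), of the rule described in the quoted posts: the storage width follows from the largest code point in the string. It also shows that the characters œ (U+0153) and € (U+20AC) can be encoded in cp1252 but not in latin-1. The helper names `bytes_per_char` and `fits` are made up for illustration and are not CPython APIs.

```python
def bytes_per_char(s):
    """Per-character width (1, 2 or 4 bytes) implied by the largest code point.

    Hypothetical helper: it mirrors the rule described in the thread,
    not the actual C implementation.
    """
    largest = max(map(ord, s), default=0)
    if largest < 0x100:      # fits in one byte (the latin-1 range)
        return 1
    if largest < 0x10000:    # fits in two bytes (Basic Multilingual Plane)
        return 2
    return 4                 # anything beyond the BMP


def fits(s, encoding):
    """Return True if every character of s can be encoded with `encoding`."""
    try:
        s.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# "\u00e9" is e-acute, "\u0153\u20ac" is the oe ligature plus the euro sign,
# "\U0001F600" is a character outside the BMP.
for sample in ("abc", "\u00e9", "\u0153\u20ac", "\U0001F600"):
    print(ascii(sample), bytes_per_char(sample),
          "latin-1" if fits(sample, "latin-1") else "-",
          "cp1252" if fits(sample, "cp1252") else "-")
```

Under this rule a single œ or € is enough to move a string from the one-byte form to the two-byte form, which is the case the benchmark later in the thread exercises.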


#27394 — Re: New internal string format in 3.3

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2012-08-19 15:48 +0100
Subject	Re: New internal string format in 3.3
Message-ID	<mailman.3504.1345387683.4697.python-list@python.org>
In reply to	#27393

On 19/08/2012 15:09, wxjmfauth@gmail.com wrote:

>
> I can not give you more numbers than those I gave.
> As an end user, I noticed and experimented my random tests
> are always slower in Py3.3 than in Py3.2 on my Windows platform.

Once again you refuse to supply anything to back up what you say.

>
> It is up to you, the core developers to give an explanation
> about this behaviour.

Core developers cannot give an explanation for something that doesn't 
exist, except in your imagination.  Unless you can produce the evidence 
that supports your claims, including details of OS, benchmarks used and 
so on and so forth.

>
> As I understand a little bit the coding of the characters,
> I pointed out, this is most probably due to this flexible
> string representation (with arguments appearing randomly
> in the misc. messages, mainly latin-1).
>
> I can not do more.
>
> (I stupidly spoke about factors 0.1 to ..., you should
> read of course, 1.1,  to ...)
>
> jmf
>

I suspect that I'll be dead and buried long before you can produce 
anything concrete in the way of evidence.  I've thrown down the gauntlet 
several times, do you now have the courage to pick it up, or are you 
going to resort to the FUD approach that you've been using throughout 
this thread?

-- 
Cheers.

Mark Lawrence.


#27396 — Re: New internal string format in 3.3

From	wxjmfauth@gmail.com
Date	2012-08-19 09:19 -0700
Subject	Re: New internal string format in 3.3
Message-ID	<mailman.3507.1345393173.4697.python-list@python.org>
In reply to	#27394

On Sunday, 19 August 2012 16:48:48 UTC+2, Mark Lawrence wrote:
> On 19/08/2012 15:09, wxjmfauth@gmail.com wrote:
>
> > I can not give you more numbers than those I gave.
> > As an end user, I noticed and experimented my random tests
> > are always slower in Py3.3 than in Py3.2 on my Windows platform.
>
> Once again you refuse to supply anything to back up what you say.
>
> > It is up to you, the core developers to give an explanation
> > about this behaviour.
>
> Core developers cannot give an explanation for something that doesn't
> exist, except in your imagination.  Unless you can produce the evidence
> that supports your claims, including details of OS, benchmarks used and
> so on and so forth.
>
> > As I understand a little bit the coding of the characters,
> > I pointed out, this is most probably due to this flexible
> > string representation (with arguments appearing randomly
> > in the misc. messages, mainly latin-1).
> >
> > I can not do more.
> >
> > (I stupidly spoke about factors 0.1 to ..., you should
> > read of course, 1.1,  to ...)
> >
> > jmf
>
> I suspect that I'll be dead and buried long before you can produce
> anything concrete in the way of evidence.  I've thrown down the gauntlet
> several times, do you now have the courage to pick it up, or are you
> going to resort to the FUD approach that you've been using throughout
> this thread?
>
> --
> Cheers.
>
> Mark Lawrence.

I do not remember the tests I did at the time of the 1st alpha
release. It was with an interactive interpreter. I paid particular
attention to testing the chars you can find in the range 128..256
in all 8-bit coding schemes. Chars I suspected to be problematic.

Here is a short test again, a random single test, the first
idea that came to my mind.

Py 3.2.3
>>> timeit.timeit("('aœ€'*100).replace('a', 'œ€é')")
4.99396356635981

Py 3.3b2
>>> timeit.timeit("('aœ€'*100).replace('a', 'œ€é')")
7.560455708007855

Maybe not so demonstrative. It shows at least that we
are far away from the 10-30% "announced".

>>> 7.56 / 5
1.512
>>> 5 / (7.56 - 5) * 100
195.31250000000003


jmf
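
A hedged sketch of how the comparison above could be made easier to reproduce: it runs the same statement several times with `timeit.repeat`, keeps the fastest run, and expresses the difference as a relative slowdown, (new - old) / old. The helper names and the `number`/`repeat` values are arbitrary choices for illustration, not something taken from the thread. The non-ASCII characters are written as escapes only so the snippet stays pure ASCII.

```python
import timeit

# Same statement as in the post: 'a' + oe ligature + euro sign, repeated 100
# times, with each 'a' replaced by oe ligature + euro sign + e-acute.
STMT = "('a\u0153\u20ac'*100).replace('a', '\u0153\u20ac\u00e9')"


def best_time(stmt, number=10_000, repeat=3):
    """Fastest of `repeat` runs, each executing `stmt` `number` times."""
    return min(timeit.repeat(stmt, number=number, repeat=repeat))


def percent_slower(old, new):
    """Relative slowdown of `new` compared with `old`, in percent."""
    return (new - old) / old * 100


# Timings quoted in the post (seconds per 1,000,000 executions).
py32, py33 = 4.99396356635981, 7.560455708007855
print("ratio:    %.3f" % (py33 / py32))
print("slowdown: %.1f%%" % percent_slower(py32, py33))

# Timing on whatever interpreter runs this script.
print("local best of 3: %.4f s per 10,000 runs" % best_time(STMT))
```

With the figures from the post this gives a slowdown of roughly 51%, i.e. 3.3b2 took about 1.5 times as long for this one statement. Running the script under both interpreters on the same machine, and reporting the OS and build, would supply the details asked for earlier in the thread.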


#27397 — Re: New internal string format in 3.3

From	wxjmfauth@gmail.com
Date	2012-08-19 09:19 -0700
Subject	Re: New internal string format in 3.3
Message-ID	<dafd57a5-6070-4ff2-9fe3-b3816e1e43b3@googlegroups.com>
In reply to	#27394

On Sunday, 19 August 2012 16:48:48 UTC+2, Mark Lawrence wrote:
> On 19/08/2012 15:09, wxjmfauth@gmail.com wrote:
> >
> > I can not give you more numbers than those I gave.
> > As an end user, I noticed in my experiments that my random tests
> > are always slower in Py3.3 than in Py3.2 on my Windows platform.
>
> Once again you refuse to supply anything to back up what you say.
>
> > It is up to you, the core developers to give an explanation
> > about this behaviour.
>
> Core developers cannot give an explanation for something that doesn't
> exist, except in your imagination.  Unless you can produce the evidence
> that supports your claims, including details of OS, benchmarks used and
> so on and so forth.
>
> > As I understand a little bit the coding of the characters,
> > I pointed out, this is most probably due to this flexible
> > string representation (with arguments appearing randomly
> > in the misc. messages, mainly latin-1).
> >
> > I can not do more.
> >
> > (I stupidly spoke about factors 0.1 to ..., you should
> > read of course, 1.1,  to ...)
> >
> > jmf
>
> I suspect that I'll be dead and buried long before you can produce
> anything concrete in the way of evidence.  I've thrown down the gauntlet
> several times, do you now have the courage to pick it up, or are you
> going to resort to the FUD approach that you've been using throughout
> this thread?
>
> -- 
> Cheers.
>
> Mark Lawrence.

I do not remember the tests I did at the time of the 1st alpha
release. It was with an interactive interpreter. I paid particular
attention to testing those chars you can find in the range 128..256
in all 8-bit coding schemes. Chars I suspected to be problematic.

Here is a short test again, a random single test, the first
idea that came to my mind.

Py 3.2.3
>>> timeit.timeit("('aœ€'*100).replace('a', 'œ€é')")
4.99396356635981

Py 3.3b2
>>> timeit.timeit("('aœ€'*100).replace('a', 'œ€é')")
7.560455708007855

Maybe not so demonstrative. It shows, at least, that we
are far away from the 10-30% "announced".

>>> 7.56 / 5
1.512
>>> 5 / (7.56 - 5) * 100
195.31250000000003


jmf
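
For reference, a minimal sketch of the arithmetic, using only the two
timings quoted above. The helper name slowdown_percent is made up for this
illustration; it expresses how much longer the new run took relative to
the old one, which is one common way to state a slowdown as a percentage.

```python
def slowdown_percent(old_seconds, new_seconds):
    """Return how much slower the new timing is, as a percentage of the old one."""
    return (new_seconds - old_seconds) / old_seconds * 100.0


# Timings copied from the two interpreter sessions above.
py32 = 4.99396356635981
py33 = 7.560455708007855

print(round(py33 / py32, 2))                   # ratio of the two timings
print(round(slowdown_percent(py32, py33), 1))  # slowdown relative to 3.2, in percent
```

A single run on a single machine still says little about the general case;
the numbers only describe this one measurement.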

#27406 — Re: New internal string format in 3.3

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-08-19 13:48 -0400
Subject	Re: New internal string format in 3.3
Message-ID	<mailman.3512.1345398545.4697.python-list@python.org>
In reply to	#27393

On 8/19/2012 10:09 AM, wxjmfauth@gmail.com wrote:

> I can not give you more numbers than those I gave.
> As an end user, I noticed in my experiments that my random tests
> are always slower in Py3.3 than in Py3.2 on my Windows platform.

And I gave other examples where 3.3 is *faster* on my Windows, which you 
have thus far not even acknowledged, let alone tried.

> It is up to you, the core developers to give an explanation
> about this behaviour.

System variation, unimportance of sub-microsecond variations, and 
attention to more important issues.

Other developers say 3.3 is generally faster on their systems (OSX 10.8,
and unspecified). To talk about speed sensibly, one 
must run the full stringbench.py benchmark and real applications on 
multiple Windows, *nix, and Mac systems. Python is not optimized for 
your particular current computer.

-- 
Terry Jan Reedy

#27408 — Re: New internal string format in 3.3

From	wxjmfauth@gmail.com
Date	2012-08-19 10:51 -0700
Subject	Re: New internal string format in 3.3
Message-ID	<5570714c-59e7-4149-b2bd-89d7628774e3@googlegroups.com>
In reply to	#27406

Just for the story.

Five minutes after I closed my interactive interpreter windows,
the day I tested this stuff, I thought:
"Too bad I did not note the extremely bad cases I found, I'm pretty
sure this problem will arrive on the table".

jmf

#27414 — Re: New internal string format in 3.3

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2012-08-19 19:09 +0100
Subject	Re: New internal string format in 3.3
Message-ID	<mailman.3518.1345399568.4697.python-list@python.org>
In reply to	#27408

On 19/08/2012 18:51, wxjmfauth@gmail.com wrote:
> Just for the story.
>
> Five minutes after I closed my interactive interpreter windows,
> the day I tested this stuff, I thought:
> "Too bad I did not note the extremely bad cases I found, I'm pretty
> sure this problem will arrive on the table".
>
> jmf
>

How convenient.

-- 
Cheers.

Mark Lawrence.

#27435 — Re: New internal string format in 3.3

From	Chris Angelico <rosuav@gmail.com>
Date	2012-08-20 07:50 +1000
Subject	Re: New internal string format in 3.3
Message-ID	<mailman.3528.1345413053.4697.python-list@python.org>
In reply to	#27408

On Mon, Aug 20, 2012 at 4:09 AM, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
> On 19/08/2012 18:51, wxjmfauth@gmail.com wrote:
>>
>> Just for the story.
>>
>> Five minutes after I closed my interactive interpreter windows,
>> the day I tested this stuff, I thought:
>> "Too bad I did not note the extremely bad cases I found, I'm pretty
>> sure this problem will arrive on the table".
>
> How convenient.

Not really. Even if he HAD copied-and-pasted those worst-cases, it'd
prove nothing. Maybe his system just chose to glitch right then. It's
always possible to find statistical outliers that take way way longer
than everything else.

Watch this. Python 3.2 on Windows is optimized for adding 1 to numbers.

C:\Documents and Settings\M>\python32\python -m timeit -r 1 "a=1+1"
10000000 loops, best of 1: 0.0654 usec per loop

C:\Documents and Settings\M>\python32\python -m timeit -r 1 "a=1+1"
10000000 loops, best of 1: 0.0654 usec per loop

C:\Documents and Settings\M>\python32\python -m timeit -r 1 "a=1+1"
10000000 loops, best of 1: 0.0654 usec per loop

C:\Documents and Settings\M>\python32\python -m timeit -r 1 "a=1+2"
10000000 loops, best of 1: 0.0711 usec per loop

Now, as long as I don't tell you that during the last test I had quite
a few other processes running, including VLC playing a movie and two
Python processes running "while True: pass", this will look like a
significant performance difference. So now, I'm justified in
complaining about how suboptimal Python is when adding 2 to a number,
which I can assure you is a VERY common case.

ChrisA
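
The point about outliers can be sketched with timeit.repeat, which runs the
same measurement several times so that the spread becomes visible. This is
only an illustration: the helper name measure, the statement, the repeat
count and the loop count are arbitrary choices, not anything taken from the
sessions above.

```python
import timeit


def measure(stmt, repeat=5, number=100000):
    """Time stmt several times; return (best, worst) in seconds per loop."""
    totals = timeit.repeat(stmt, repeat=repeat, number=number)
    per_loop = [total / number for total in totals]
    return min(per_loop), max(per_loop)


best, worst = measure("a = 1 + 1")
print("best  %.4f usec per loop" % (best * 1e6))
print("worst %.4f usec per loop" % (worst * 1e6))
```

The minimum is usually the figure to compare, since larger values tend to
reflect other processes interfering rather than the code under test.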

#27462 — Re: New internal string format in 3.3

From	Michael Torrie <torriem@gmail.com>
Date	2012-08-19 23:38 -0600
Subject	Re: New internal string format in 3.3
Message-ID	<mailman.3538.1345442498.4697.python-list@python.org>
In reply to	#27408

On 08/19/2012 11:51 AM, wxjmfauth@gmail.com wrote:
> Five minutes after I closed my interactive interpreter windows,
> the day I tested this stuff, I thought:
> "Too bad I did not note the extremely bad cases I found, I'm pretty
> sure this problem will arrive on the table".

Reading through this thread (which is entertaining), I am reminded of
the old saying, "premature optimization is the root of all evil." This
"problem" that you have discovered, if fixed the way you propose,
(4-byte UCS-4 representation internally, always) would be just such a
premature optimization.  It would come at a high cost with very little
real-world impact.

As others have made abundantly clear, the overhead of changing internal
string representations is a cost that's only manifest during the
creation of the immutable string object.  If your code is doing a lot of
operations on immutable strings, which by definition creates new
immutable string objects, then the real speed problem is in your
algorithm.  If you are working on a string as if it were a buffer, doing
many searches, replaces, etc, then you need to work on an object
designed for IO, such as io.StringIO.  If implemented half correctly, I
imagine that StringIO uses internally the widest possible character
representation in the buffer.  I could be wrong here.
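
A small sketch of the buffer approach mentioned here, using io.StringIO
from the standard library. The function name build_text and the sample
pieces are invented for the example, and it makes no claim about how
StringIO stores its data internally.

```python
import io


def build_text(pieces):
    """Accumulate many small pieces in a buffer, then produce one final str."""
    buffer = io.StringIO()
    for piece in pieces:
        buffer.write(piece)      # appends to the buffer, no new str per step
    return buffer.getvalue()     # a single str object is created here


print(build_text(["a", "é", "€"]))
```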

As to your other problem, Python generally tries to follow unicode
encoding rules to the letter.  Thus if a piece of text cannot be
represented in the character set of the terminal, then Python will
properly err out.  Other languages you have tried likely fudge it
somehow: they display what they can, or something similar.  In general the
Windows command window is an outdated thing that no serious programmer
can rely on to display unicode text.  Use a proper GUI api, or use a
better terminal that can handle utf-8.
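
To illustrate the "err out" behaviour, here is a minimal sketch that uses
the ascii codec as a stand-in for a terminal with a limited character set
(that choice is an assumption made for the example). Encoding with the
default strict handler raises an error, while errors="replace" gives a
"fudged" result.

```python
text = "10 €"

try:
    text.encode("ascii")         # strict is the default error handler
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Opting in to lossy output: characters that cannot be encoded become "?".
fudged = text.encode("ascii", errors="replace").decode("ascii")

print(strict_failed, fudged)
```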

The TLDR version: You're right that converting string representations
internally incurs overhead, but if your program is slow because of this
you're doing it wrong.  It's not symptomatic of some python disease.

#27488 — Re: New internal string format in 3.3

From	Roy Smith <roy@panix.com>
Date	2012-08-20 09:17 -0400
Subject	Re: New internal string format in 3.3
Message-ID	<roy-87956C.09170720082012@news.panix.com>
In reply to	#27462

In article <mailman.3538.1345442498.4697.python-list@python.org>,
 Michael Torrie <torriem@gmail.com> wrote:

> Python generally tries to follow unicode
> encoding rules to the letter.  Thus if a piece of text cannot be
> represented in the character set of the terminal, then Python will
> properly err out.  Other languages you have tried, likely fudge it
> somehow.  

And if you want the "fudge it somehow" behavior (which is often very 
useful!), there's always http://pypi.python.org/pypi/Unidecode/

#27546 — Re: New internal string format in 3.3

From	Michael Torrie <torriem@gmail.com>
Date	2012-08-20 22:18 -0600
Subject	Re: New internal string format in 3.3
Message-ID	<mailman.3587.1345522727.4697.python-list@python.org>
In reply to	#27488

On 08/20/2012 07:17 AM, Roy Smith wrote:
> In article <mailman.3538.1345442498.4697.python-list@python.org>,
>  Michael Torrie <torriem@gmail.com> wrote:
> 
>> Python generally tries to follow unicode
>> encoding rules to the letter.  Thus if a piece of text cannot be
>> represented in the character set of the terminal, then Python will
>> properly err out.  Other languages you have tried, likely fudge it
>> somehow.  
> 
> And if you want the "fudge it somehow" behavior (which is often very 
> useful!), there's always http://pypi.python.org/pypi/Unidecode/

Sweet tip, thanks!  I often want to process text that has smart quotes,
emdashes, etc, and convert them to plain old ascii quotes, dashes,
ticks, etc.  This will do that for me without resorting to a bunch of
regexes.  Bravo.

#27564 — Re: New internal string format in 3.3

From	Roy Smith <roy@panix.com>
Date	2012-08-21 07:48 -0400
Subject	Re: New internal string format in 3.3
Message-ID	<roy-EF0527.07485121082012@news.panix.com>
In reply to	#27546

In article <mailman.3587.1345522727.4697.python-list@python.org>,
 Michael Torrie <torriem@gmail.com> wrote:

> > And if you want the "fudge it somehow" behavior (which is often very 
> > useful!), there's always http://pypi.python.org/pypi/Unidecode/
> 
> Sweet tip, thanks!  I often want to process text that has smart quotes,
> emdashes, etc, and convert them to plain old ascii quotes, dashes,
> ticks, etc.  This will do that for me without resorting to a bunch of
> regexes.  Bravo.

Yup, that's one of the things it's good for.  We mostly use it to help 
map search terms, i.e. if you search for "beyonce", you're probably 
expecting it to match "Beyoncé".

We also special-case some weird stuff like "kesha" matching "ke$ha", but 
we have to hand-code those.
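
A rough sketch of this kind of search-term folding, written with the
standard library's unicodedata module rather than Unidecode itself, so it
is a stand-in and not that package's API. The names search_key and
SPECIAL_CASES are hypothetical. Note that this approach only strips
combining accents; it would not turn smart quotes or dashes into ASCII.

```python
import unicodedata

# Hypothetical hand-coded exceptions, in the spirit of "ke$ha" above.
SPECIAL_CASES = {"ke$ha": "kesha"}


def search_key(term):
    """Fold case and strip accents so that similar search terms compare equal."""
    folded = term.casefold()
    if folded in SPECIAL_CASES:
        return SPECIAL_CASES[folded]
    # NFKD splits "é" into "e" plus a combining accent, which is then dropped.
    decomposed = unicodedata.normalize("NFKD", folded)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(search_key("Beyoncé") == search_key("beyonce"))
```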

#27410 — Re: New internal string format in 3.3

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-08-19 13:56 -0400
Subject	Re: New internal string format in 3.3
Message-ID	<mailman.3515.1345399008.4697.python-list@python.org>
In reply to	#27387

On 8/19/2012 8:59 AM, wxjmfauth@gmail.com wrote:

> In August 2012, after 20 years of development, Python is not able to
> display a piece of text correctly on a Windows console (eg cp65001).

cp65001 is known to not work right. It has been very frustrating. Bug 
Microsoft about it, and indeed their whole policy of still dividing the 
world into code page regions, even in their next version, instead of 
moving toward unicode and utf-8, at least as an option.

> I downloaded the go language, zero experience, and I did not succeed
> in displaying a piece of text incorrectly. (This is by the way *the*
> reason why I tested it). Where the problems are coming from, I have no
> idea.

If go can display all unicode chars on a Windows console, perhaps you 
can do some research and find out how they do so. Then we could consider 
copying it.

-- 
Terry Jan Reedy
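
As a possible workaround sketch for consoles whose code page cannot
represent every character: encode with the console's reported encoding and
the backslashreplace error handler before printing. The helper name
console_safe is made up for this example, and falling back to ascii when no
encoding is reported is an assumption.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the target encoding lacks shown as escapes."""
    target = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(target, errors="backslashreplace").decode(target)


print(console_safe("10 €"))            # uses whatever encoding stdout reports
print(console_safe("10 €", "ascii"))   # forces a limited encoding
```

This does not fix the console; it only avoids the exception by showing an
escape sequence in place of the missing character.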

#27388 — Re: New internal string format in 3.3

From	wxjmfauth@gmail.com
Date	2012-08-19 05:59 -0700
Subject	Re: New internal string format in 3.3
Message-ID	<mailman.3500.1345381786.4697.python-list@python.org>
In reply to	#27384

On Sunday, 19 August 2012 14:29:17 UTC+2, Dave Angel wrote:
> On 08/19/2012 08:14 AM, wxjmfauth@gmail.com wrote:
> > On Sunday, 19 August 2012 12:26:44 UTC+2, Chris Angelico wrote:
> >> On Sun, Aug 19, 2012 at 8:19 PM,  <wxjmfauth@gmail.com> wrote:
> >>> This is precisely the weak point of this flexible
> >>> representation. It uses latin-1 and latin-1 is for
> >>> most users simply unusable.
> >>
> >> No, it uses Unicode, and as an optimization, attempts to store the
> >> codepoints in less than four bytes for most strings. The fact that a
> >> one-byte storage format happens to look like latin-1 is rather
> >> coincidental.
> >>
> > And this is the common basic mistake. You do not push your
> > argumentation far enough. A character may "fall" accidentally in a latin-1.
> > The problem lies in these European characters, which can not fall in this
> > coding. This *is* the cause of the negative side effects.
> > If you are using a correct coding scheme, like cp1252, mac-roman or
> > iso-8859-15, you will never see such a negative side effect.
> > Again, the problem is not the result, the encoded character. The critical
> > part is the character which may cause this side effect.
> > You should think "character set" and not encoded "code point", considering
> > this kind of expression has a sense in 8-bits coding scheme.
> >
> > jmf
>
> But that choice was made decades ago when Unicode picked its second 128
> characters.  The internal form used in this PEP is simply the low-order
> byte of the Unicode code point.  Trying to scan the string deciding if
> converting to cp1252 (for example) would be a much more expensive
> operation than seeing how many bytes it'd take for the largest code point.

You are absolutely right. (I'm quite comfortable with Unicode).
If Python wishes to perpetuate this, let's call it, design mistake
or annoyance, it will continue to live with problems.

People (tools) who chose pure utf-16 or utf-32 are not suffering
from this issue.

*My* final comment on this thread.

In August 2012, after 20 years of development, Python is not
able to display a piece of text correctly on a Windows console
(eg cp65001).

I downloaded the go language, zero experience, and I did not succeed
in displaying a piece of text incorrectly. (This is by the way *the*
reason why I tested it). Where the problems are coming from, I have
no idea.

I find this situation quite comic. Python is able to
produce this:

>>> (1.1).hex()
'0x1.199999999999ap+0'

but it is not able to display a piece of text!

Try to convince end users IEEE 754 is more important than the
ability to read/write a piece of text, something a 6-year-old kid
has learned at school :-)

(I'm not suffering from this kind of effect; as a Windows user,
I'm always working via a GUI. Still, the problem exists.)

Regards,
jmf

#27385 — Re: New internal string format in 3.3

From	Dave Angel <d@davea.name>
Date	2012-08-19 08:35 -0400
Subject	Re: New internal string format in 3.3
Message-ID	<mailman.3498.1345379751.4697.python-list@python.org>
In reply to	#27382

(pardon the resend, but I accidentally omitted a couple of words)
On 08/19/2012 08:14 AM, wxjmfauth@gmail.com wrote:
> On Sunday, 19 August 2012 12:26:44 UTC+2, Chris Angelico wrote:
>> <SNIP>
>>
>>
>> No, it uses Unicode, and as an optimization, attempts to store the
>> codepoints in less than four bytes for most strings. The fact that a
>> one-byte storage format happens to look like latin-1 is rather
>> coincidental.
>>
> And this is the common basic mistake. You do not push your
> argumentation far enough. A character may "fall" accidentally in a latin-1.
> The problem lies in these European characters, which can not fall in this
> coding. This *is* the cause of the negative side effects.
> If you are using a correct coding scheme, like cp1252, mac-roman or
> iso-8859-15, you will never see such a negative side effect.
> Again, the problem is not the result, the encoded character. The critical
> part is the character which may cause this side effect.
> You should think "character set" and not encoded "code point", considering
> this kind of expression has a sense in 8-bits coding scheme.
>
> jmf

But that choice was made decades ago when Unicode picked its second 128
characters.  The internal form used in this PEP is simply the low-order
byte of the Unicode code point.  Trying to scan the string deciding if
converting to cp1252 (for example) would work, would be a much more
expensive operation than seeing how many bytes it'd take for the largest
code point.

The 8 bit form is used if all the code points are less than 256.  That
is a simple description, and simple code.  As several people have said,
the fact that this byte matches one of the DECODED forms is coincidence.

-- 

DaveA
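
The rule described above can be sketched in a few lines. This is a model of
the decision, not CPython's actual C code; the function name
bytes_per_code_point is invented for the illustration, and the 65536
threshold for the two-byte form comes from PEP 393 rather than from this
post.

```python
def bytes_per_code_point(s):
    """Width a PEP 393 style scheme would pick for s: 1, 2 or 4 bytes."""
    largest = max(map(ord, s), default=0)
    if largest < 256:
        return 1
    if largest < 65536:
        return 2
    return 4


for sample in ("abc", "é", "€", "\U0001F600"):
    print(repr(sample), bytes_per_code_point(sample))

# The one-byte form stores the code point itself, which coincides with
# latin-1 but not with cp1252:
print("é".encode("latin-1") == bytes([ord("é")]))
print("€".encode("cp1252"), hex(ord("€")))
```

Only the largest code point matters, so the check is a single pass over the
string with no codec lookup involved.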

#27383 — Re: New internal string format in 3.3

From	wxjmfauth@gmail.com
Date	2012-08-19 05:14 -0700
Subject	Re: New internal string format in 3.3
Message-ID	<mailman.3496.1345378464.4697.python-list@python.org>
In reply to	#27375

On Sunday, 19 August 2012 12:26:44 UTC+2, Chris Angelico wrote:
> On Sun, Aug 19, 2012 at 8:19 PM,  <wxjmfauth@gmail.com> wrote:
> > This is precisely the weak point of this flexible
> > representation. It uses latin-1 and latin-1 is for
> > most users simply unusable.
>
> No, it uses Unicode, and as an optimization, attempts to store the
> codepoints in less than four bytes for most strings. The fact that a
> one-byte storage format happens to look like latin-1 is rather
> coincidental.

And this is the common basic mistake. You do not push your
argumentation far enough. A character may "fall" accidentally in a latin-1.
The problem lies in these European characters, which can not fall in this
coding. This *is* the cause of the negative side effects.
If you are using a correct coding scheme, like cp1252, mac-roman or
iso-8859-15, you will never see such a negative side effect.
Again, the problem is not the result, the encoded character. The critical
part is the character which may cause this side effect.
You should think "character set" and not encoded "code point", considering
this kind of expression has a sense in 8-bits coding scheme.

jmf

#27351

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-08-19 06:30 +0000
Message-ID	<5030881d$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to	#27320

On Sat, 18 Aug 2012 11:05:07 -0700, wxjmfauth wrote:

> As I understand (I think) the underlying mechanism, I can only say, it is
> not a surprise that it happens.
> 
> Imagine an editor, I type an "a", internally the text is saved as ascii,
> then I type an "é", the text can only be saved in at least latin-1. Then
> I enter an "€", the text becomes an internal ucs-4 "string". Then remove
> the "€" and so on.

Firstly, that is not what Python does. For starters, € is in the BMP, and 
so is nearly every character you're ever going to use unless you are 
Asian or a historian using some obscure ancient script. NONE of the 
examples you have shown in your emails have included 4-byte characters, 
they have all been ASCII or UCS-2.

You are suffering from a misunderstanding about what is going on and 
misinterpreting what you have seen.

In *both* Python 3.2 and 3.3, both é and € are represented by two bytes. 
That will not change. There is a tiny amount of fixed overhead for 
strings, and that overhead is slightly different between the versions, 
but you'll never notice the difference.

Secondly, how a text editor or word processor chooses to store the text 
that you type is not the same as how Python does it. A text editor is not 
going to be creating a new immutable string after every key press. That 
will be slow slow SLOW. The usual way is to keep a buffer for each 
paragraph, and add and subtract characters from the buffer.

> Intuitively I expect there is some kind of slowdown between all these
> "string" conversions.

Your intuition is wrong. Strings are not converted from ASCII to UCS-2 to 
UCS-4 on the fly, they are converted once, when the string is created.

The tests we ran earlier, e.g.:

('ab…' * 1000).replace('…', 'œ…')

show the *worst possible case* for the new string handling, because all 
we do is create new strings. First we create a string 'ab…', then we 
create another string 'ab…'*1000, then we create two new strings '…' and 
'œ…', and finally we call replace and create yet another new string.

But in real applications, once you have created a string, you don't just 
immediately create a new one and throw the old one away. You likely do 
work with that string:

steve@runes:~$ python3.2 -m timeit "s = 'abcœ…'*1000; n = len(s); flag = s.startswith(('*', 'a'))"
100000 loops, best of 3: 2.41 usec per loop

steve@runes:~$ python3.3 -m timeit "s = 'abcœ…'*1000; n = len(s); flag = s.startswith(('*', 'a'))"
100000 loops, best of 3: 2.29 usec per loop

Once you start doing *real work* with the strings, the overhead of 
deciding whether they should be stored using 1, 2 or 4 bytes begins to 
fade into the noise.

> When I tested this flexible representation, a few months ago, at the
> first alpha release, this is precisely what I tested: string
> manipulations which force this internal change, and I concluded the
> result is not brilliant. Really, a factor 0.n up to 10.

Like I said, if you really think that there is a significant, repeatable 
slow-down on Windows, report it as a bug.

> Does anybody know a way to get the size of the internal "string" in
> bytes? 

sys.getsizeof(some_string)

steve@runes:~$ python3.2 -c "from sys import getsizeof as size; print(size('abcœ…'*1000))"
10030
steve@runes:~$ python3.3 -c "from sys import getsizeof as size; print(size('abcœ…'*1000))"
10038

As I said, there is a *tiny* overhead difference. But identifiers will 
generally be smaller:

steve@runes:~$ python3.2 -c "from sys import getsizeof as size; print(size(size.__name__))"
48
steve@runes:~$ python3.3 -c "from sys import getsizeof as size; print(size(size.__name__))"
34

You can check the object overhead by looking at the size of the empty 
string.

-- 
Steven
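
A short sketch of how sys.getsizeof can be used to separate the fixed
object overhead from the per-character cost. The helper name bytes_per_char
is invented here, and the results are a CPython implementation detail: on
CPython 3.3 or later one would expect 1, 1, 2 and 4 bytes for the four
samples, while other versions or implementations may report something else.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate per-character storage by comparing two string lengths.

    Subtracting the two sizes cancels the fixed per-object overhead.
    """
    short = sys.getsizeof(ch * n)
    longer = sys.getsizeof(ch * (2 * n))
    return (longer - short) / n


print("empty string overhead:", sys.getsizeof(""))
for ch in ("a", "é", "€", "\U0001F600"):
    print(repr(ch), bytes_per_char(ch))
```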

#27321

From	wxjmfauth@gmail.com
Date	2012-08-18 11:05 -0700
Message-ID	<mailman.3467.1345313116.4697.python-list@python.org>
In reply to	#27314

On Saturday, 18 August 2012 19:28:26 UTC+2, Mark Lawrence wrote:
> 
> Proof that is acceptable to everybody please, not just yourself.
> 
> 
I can't, I'm only facing the fact that it works slower on my
Windows platform.

As I understand (I think) the underlying mechanism, I
can only say, it is not a surprise that it happens.

Imagine an editor, I type an "a", internally the text is
saved as ascii, then I type an "é", the text can only
be saved in at least latin-1. Then I enter an "€", the text
becomes an internal ucs-4 "string". Then remove the "€" and so
on.

Intuitively I expect there is some kind of slowdown between
all these "string" conversions.

When I tested this flexible representation, a few months
ago, at the first alpha release, this is precisely what
I tested: string manipulations which force this internal
change, and I concluded the result is not brilliant. Really,
a factor 0.n up to 10.

These are simply my conclusions.

Related question.

Does anybody know a way to get the size of the internal
"string" in bytes? In the narrow or wide build it is easy,
I can encode with the "unicode_internal" codec. In Py 3.3,
I attempted to toy with sizeof and struct, but without
success.

jmf

#27329

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-08-18 16:09 -0400
Message-ID	<mailman.3472.1345320581.4697.python-list@python.org>
In reply to	#27310

On 8/18/2012 12:38 PM, wxjmfauth@gmail.com wrote:
> Sorry guys, I'm not stupid (I think). I can open IDLE with
> Py 3.2 or Py 3.3 and compare string manipulations. Py 3.3 is
> always slower. Period.

You have not tried enough tests ;-).

On my Win7-64 system:
from timeit import timeit

print(timeit(" 'a'*10000 "))
3.3.0b2: .5
3.2.3: .8

print(timeit("c in a", "c  = '…'; a = 'a'*10000"))
3.3: .05 (independent of len(a)!)
3.2: 5.8  100 times slower! Increase len(a) and the ratio can be made as 
high as one wants!

print(timeit("a.encode()", "a = 'a'*1000"))
3.2: 1.5
3.3:  .26

Similar with encoding='utf-8' added to call.

Jim, please stop the ranting. It does not help improve Python. utf-32 is 
not a panacea; it has problems of time, space, and system compatibility 
(Windows and others). Victor Stinner, whatever he may have once thought 
and said, put a *lot* of effort into making the new implementation both 
correct and fast.

On your replace example
 >>> timeit.timeit("('ab…' * 1000).replace('…', '……')")
 > 61.919225272152346
 >>> timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
 > 1.2918679017971044

I do not see the point of changing both length and replacement. For me, 
the time is about the same for either replacement. I do see about the 
same slowdown ratio for 3.3 versus 3.2. I also see it for pure search 
without replacement.

print(timeit("c in a", "c  = '…'; a = 'a'*1000+c"))
# .6 in 3.2.3, 1.2 in 3.3.0

This does not make sense to me and I will ask about it.

-- 
Terry Jan Reedy
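
The search timings above can be reproduced with a small script along these
lines. It is a sketch only: the helper name time_search and the loop count
are arbitrary, and the absolute numbers will differ from machine to machine
and from version to version.

```python
from timeit import timeit


def time_search(needle, haystack, number=10000):
    """Time `needle in haystack`, passing both via timeit's setup string."""
    setup = "c = %r; a = %r" % (needle, haystack)
    return timeit("c in a", setup, number=number)


absent = time_search("…", "a" * 1000)         # needle never occurs
at_end = time_search("…", "a" * 1000 + "…")   # needle is the last character
print("absent: %.4f s" % absent)
print("at end: %.4f s" % at_end)
```

Running the same script under both interpreters gives directly comparable
figures for the two cases discussed in this post.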
