
How do I display unicode value stored in a string variable using ord()

Started by	Charles Jensen <hopefullycharles@gmail.com>
First post	2012-08-16 15:09 -0700
Last post	2012-08-20 17:20 -0400
Articles	20 on this page of 145 — 26 participants


  How do I display unicode value stored in a string variable using ord() Charles Jensen <hopefullycharles@gmail.com> - 2012-08-16 15:09 -0700
    Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-17 08:20 +1000
    Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-16 18:47 -0400
    Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-16 19:59 -0400
      Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 10:49 -0700
        Re: How do I display unicode value stored in a string variable using ord() Jerry Hill <malaclypse2@gmail.com> - 2012-08-17 14:21 -0400
          Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 11:45 -0700
          Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 11:45 -0700
            Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-17 16:55 -0400
            Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-17 23:30 -0400
              Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 04:10 +0000
                Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-18 09:18 -0600
            Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 03:59 +0000
      Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 10:49 -0700
    Re: How do I display unicode value stored in a string variable using ord() Alister <alister.ware@ntlworld.com> - 2012-08-17 06:30 +0000
    Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 01:09 -0700
      Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 12:27 +0000
        Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 08:07 -0700
          Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 16:25 +0100
          Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 01:36 +1000
          Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-18 09:51 -0600
            Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 09:38 -0700
              Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 02:57 +1000
              Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 18:28 +0100
                Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 11:05 -0700
                  Re: How do I display unicode value stored in a string variable using ord() MRAB <python@mrabarnett.plus.com> - 2012-08-18 19:34 +0100
                    Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:35 +0000
                      New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord() Peter Otten <__peter__@web.de> - 2012-08-19 09:43 +0200
                        Re: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 08:56 +0000
                          Re: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 02:24 -0700
                          Re: New internal string format in 3.3 Peter Otten <__peter__@web.de> - 2012-08-19 11:37 +0200
                            Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 03:19 -0700
                              Re: New internal string format in 3.3 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 13:33 +0000
                            Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 03:19 -0700
                              Re: New internal string format in 3.3 Chris Angelico <rosuav@gmail.com> - 2012-08-19 20:26 +1000
                                Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:14 -0700
                                  Re: New internal string format in 3.3 Dave Angel <d@davea.name> - 2012-08-19 08:29 -0400
                                    Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:59 -0700
                                      Re: New internal string format in 3.3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 14:46 +0100
                                        Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 07:09 -0700
                                        Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 07:09 -0700
                                          Re: New internal string format in 3.3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 15:48 +0100
                                            Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 09:19 -0700
                                            Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 09:19 -0700
                                          Re: New internal string format in 3.3 Terry Reedy <tjreedy@udel.edu> - 2012-08-19 13:48 -0400
                                            Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 10:51 -0700
                                              Re: New internal string format in 3.3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 19:09 +0100
                                              Re: New internal string format in 3.3 Chris Angelico <rosuav@gmail.com> - 2012-08-20 07:50 +1000
                                              Re: New internal string format in 3.3 Michael Torrie <torriem@gmail.com> - 2012-08-19 23:38 -0600
                                                Re: New internal string format in 3.3 Roy Smith <roy@panix.com> - 2012-08-20 09:17 -0400
                                                  Re: New internal string format in 3.3 Michael Torrie <torriem@gmail.com> - 2012-08-20 22:18 -0600
                                                    Re: New internal string format in 3.3 Roy Smith <roy@panix.com> - 2012-08-21 07:48 -0400
                                            Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 10:51 -0700
                                      Re: New internal string format in 3.3 Terry Reedy <tjreedy@udel.edu> - 2012-08-19 13:56 -0400
                                    Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:59 -0700
                                  Re: New internal string format in 3.3 Dave Angel <d@davea.name> - 2012-08-19 08:35 -0400
                                Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:14 -0700
                  Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:30 +0000
                Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 11:05 -0700
              Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-18 16:09 -0400
              Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-18 23:12 -0400
            Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 09:38 -0700
            Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:33 +0000
              Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-19 11:50 -0600
                Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 11:20 -0700
                  Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-19 12:31 -0600
                    Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 12:23 -0700
                Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 20:16 +0000
              Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-19 12:46 -0600
          Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 17:59 +0000
            Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 11:30 -0700
              Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 20:45 +0100
              Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:13 +0000
            Re: How do I display unicode value stored in a string variable using ord() rusi <rustompmody@gmail.com> - 2012-08-18 11:40 -0700
              Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 20:50 +0100
              Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 13:22 -0700
                Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 22:37 +0100
        Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 11:26 -0700
          Re: How do I display unicode value stored in a string variable using ord() MRAB <python@mrabarnett.plus.com> - 2012-08-18 19:59 +0100
            Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 07:17 +0000
          Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 10:46 +1000
            Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 19:11 -0700
              Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 12:19 +1000
                Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 19:35 -0700
                  Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 13:01 +1000
                    Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 20:10 -0700
                      Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 13:31 +1000
                        Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 22:58 -0700
                  Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 08:01 +0000
                    Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 01:11 -0700
                      Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 18:24 +1000
                        Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 01:44 -0700
                          Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 01:54 -0700
                            Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 11:46 +0100
                            Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 12:31 -0400
                      Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 10:51 +0000
                        Re: How do I display unicode value stored in a string variable using ord() Neil Hodgson <nhodgson@iinet.net.au> - 2012-08-21 17:03 +1000
          Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:09 +0000
            Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 01:04 -0700
              Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 13:25 +0000
                Re: How do I display unicode value stored in a string variable using ord() DJC <djc@news.invalid> - 2012-08-19 17:32 +0200
              Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 13:34 -0400
                Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 10:48 -0700
                  Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 11:11 -0700
                    Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 19:50 +0100
                    Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 17:59 -0400
                    Re: How do I display unicode value stored in a string variable using ord() rusi <rustompmody@gmail.com> - 2012-08-19 23:13 -0700
                  Abuse of Big Oh notation [was Re: How do I display unicode value stored in a string variable using ord()] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 20:15 +0000
                    Re: Abuse of Big Oh notation Paul Rubin <no.email@nospam.invalid> - 2012-08-19 16:42 -0700
                      Re: Abuse of Big Oh notation Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-08-20 09:24 +0100
                        Re: Abuse of Big Oh notation Paul Rubin <no.email@nospam.invalid> - 2012-08-20 09:01 -0700
                          Re: Abuse of Big Oh notation Chris Angelico <rosuav@gmail.com> - 2012-08-21 02:09 +1000
                          Re: Abuse of Big Oh notation Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-20 11:12 -0600
                            Re: Abuse of Big Oh notation Paul Rubin <no.email@nospam.invalid> - 2012-08-20 12:29 -0700
                              Re: Abuse of Big Oh notation 88888 Dihedral <dihedral88888@googlemail.com> - 2012-08-20 15:16 -0700
                              Re: Abuse of Big Oh notation 88888 Dihedral <dihedral88888@googlemail.com> - 2012-08-20 15:20 -0700
                            Re: Abuse of Big Oh notation Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-21 09:53 +0000
                        Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-20 11:42 -0700
                          Re: Abuse of Big Oh notation Ned Deily <nad@acm.org> - 2012-08-20 18:19 -0700
                          Abuse of subject, was Re: Abuse of Big Oh notation Peter Otten <__peter__@web.de> - 2012-08-21 09:52 +0200
                            Re: Abuse of subject, was Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-21 10:16 -0700
                            Re: Abuse of subject, was Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-21 10:16 -0700
                        Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-20 11:42 -0700
                  Re: How do I display unicode value stored in a string variable using ord() Hans Mulder <hansmu@xs4all.nl> - 2012-08-22 20:53 +0200
              Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-20 08:42 +1000
                Re: How do I display unicode value stored in a string variable using ord() Roy Smith <roy@panix.com> - 2012-08-19 19:24 -0400
                  Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-20 04:21 +0000
                    Re: How do I display unicode value stored in a string variable using ord() Roy Smith <roy@panix.com> - 2012-08-20 00:44 -0400
                      Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-20 05:56 +0000
                        Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 23:24 -0700
                    Re: How do I display unicode value stored in a string variable using ord() Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-08-20 12:58 -0400
              Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 20:35 -0400
              Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-20 14:07 +1000
            Re: How do I display unicode value stored in a string variable using ord() lipska the kat <lipskathekat@yahoo.co.uk> - 2012-08-19 11:13 +0100
              Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 20:19 +1000
                Re: How do I display unicode value stored in a string variable using ord() lipska the kat <lipskathekat@yahoo.co.uk> - 2012-08-19 11:49 +0100
        Re: How do I display unicode value stored in a string variable using ord() "Blind Anagram" <noname@nowhere.com> - 2012-08-19 18:03 +0100
          Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 10:33 -0700
            Re: How do I display unicode value stored in a string variable using ord() "Blind Anagram" <noname@nowhere.com> - 2012-08-19 19:04 +0100
          Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-19 14:05 -0400
            Re: How do I display unicode value stored in a string variable usingord() "Blind Anagram" <noname@nowhere.com> - 2012-08-19 19:18 +0100
          Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 20:31 +0000
          Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 17:03 -0400
          Re: How do I display unicode value stored in a string variable using ord() 88888 Dihedral <dihedral88888@googlemail.com> - 2012-08-19 17:32 -0700
          Re: How do I display unicode value stored in a string variable using ord() Piet van Oostrum <piet@vanoostrum.org> - 2012-08-20 17:20 -0400


#27344

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-08-18 23:12 -0400
Message-ID	<mailman.3482.1345345997.4697.python-list@python.org>
In reply to	#27310

On 8/18/2012 4:09 PM, Terry Reedy wrote:

> print(timeit("c in a", "c  = '…'; a = 'a'*1000+c"))
> # .6 in 3.2.3, 1.2 in 3.3.0
>
> This does not make sense to me and I will ask about it.

I did ask on the pydev list, and the paraphrased responses include:
1. 'My system gives opposite ratios.'
2. 'With a default of 1000000 repetitions in a loop, the reported times 
are microseconds per operation and thus not practically significant.'
3. 'There is a stringbench.py with a large number of such micro benchmarks.'

I believe there are also whole-application benchmarks that try to mimic 
real-world mixtures of operations.

People making improvements must consider performance on multiple systems 
and multiple benchmarks. If someone wants to work on search speed, they 
cannot just optimize that one operation on one system.

-- 
Terry Jan Reedy
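
A minimal sketch of the kind of micro-benchmark discussed in this post, assuming CPython 3 and the standard timeit module. The helper name per_call_seconds and the loop counts are illustrative choices, not something from the original message. Taking the minimum over several repeats reduces noise, and dividing by the loop count makes it explicit that the result is a per-operation time.

```python
import timeit


def per_call_seconds(stmt, setup, number=20_000, repeat=3):
    """Return the best observed time for one execution of stmt, in seconds.

    per_call_seconds is a hypothetical helper for illustration only.
    """
    # timeit.repeat returns one total time per repeat, each covering
    # `number` executions of stmt.
    totals = timeit.repeat(stmt, setup, number=number, repeat=repeat)
    return min(totals) / number


# Same search as in the post: a non-ASCII character at the end of a
# 1000-character ASCII prefix.
setup = "c = '\u2026'; a = 'a' * 1000 + c"
best = per_call_seconds("c in a", setup)
print(f"c in a: {best * 1e6:.3f} microseconds per call")
```

Results from a single statement on a single machine say little on their own, which is the point made above about multiple systems and multiple benchmarks.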


#27311

From	wxjmfauth@gmail.com
Date	2012-08-18 09:38 -0700
Message-ID	<mailman.3459.1345307892.4697.python-list@python.org>
In reply to	#27304

Sorry guys, I'm not stupid (I think). I can open IDLE with
Py 3.2 or Py 3.3 and compare string manipulations. Py 3.3 is
always slower. Period.

Now, the reason. I think it is due to the "flexible representation".

Deeper reason. The "boss" does not wish to hear about a (pure)
ucs-4/utf-32 "engine" (this has been discussed I do not know
how many times).

jmf


#27352

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-08-19 06:33 +0000
Message-ID	<503088b7$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to	#27304

On Sat, 18 Aug 2012 09:51:37 -0600, Ian Kelly wrote about PEP 393:

> The change does not just benefit ASCII users.  It primarily benefits
> anybody using a wide unicode build with strings mostly containing only
> BMP characters.

Just to be clear:

If you have many strings which are *mostly* BMP, but have one or two non-
BMP characters in *each* string, you will see no benefit.

But if you have many strings which are all BMP, and only a few strings 
containing non-BMP characters, then you will see a big benefit.
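
The two cases above can be compared with a rough sketch like the following. It assumes CPython 3.3 or later, where sys.getsizeof reflects the PEP 393 storage width; the string contents and the counts (100 strings of 1000 characters) are made up for illustration.

```python
import sys

ASTRAL = "\U0001F600"       # a non-BMP character (code point above U+FFFF)
BMP_TEXT = "\u044f" * 999   # 999 Cyrillic letters, all inside the BMP


def total_size(strings):
    """Sum of sys.getsizeof over a list of strings (CPython-specific)."""
    return sum(sys.getsizeof(s) for s in strings)


# Case 1: every string is mostly BMP but contains one non-BMP character.
every_string_astral = [BMP_TEXT + ASTRAL for _ in range(100)]

# Case 2: 98 strings are pure BMP, only 2 contain a non-BMP character.
few_strings_astral = ([BMP_TEXT + "\u044f" for _ in range(98)]
                      + [BMP_TEXT + ASTRAL for _ in range(2)])

size_every = total_size(every_string_astral)
size_few = total_size(few_strings_astral)
print(size_every, size_few)
```

In the first case each string is widened to four bytes per character by its single non-BMP character, so the total should come out close to twice that of the second case.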


> Even for narrow build users, there is the benefit that
> with approximately the same amount of memory usage in most cases, they
> no longer have to worry about non-BMP characters sneaking in and
> breaking their code.

Yes! +1000 on that.


> There is some additional benefit for Latin-1 users, but this has nothing
> to do with Python.  If Python is going to have the option of a 1-byte
> representation (and as long as we have the flexible representation, I
> can see no reason not to), 

The PEP explicitly states that it only uses a 1-byte format for ASCII 
strings, not Latin-1:

"ASCII-only Unicode strings will again use only one byte per character"

and later:

"If the maximum character is less than 128, they use the PyASCIIObject 
structure"

and:

"The data and utf8 pointers point to the same memory if the string uses 
only ASCII characters (using only Latin-1 is not sufficient)."


> then it is going to be Latin-1 by definition,

Certainly not, either in fact or in principle. There are a large number 
of 1-byte encodings, Latin-1 is hardly the only one.


> because that's what 1-byte Unicode (UCS-1, if you will) is.  If you have
> an issue with that, take it up with the designers of Unicode.

The designers of Unicode have never created a standard "1-byte Unicode" 
or UCS-1, as far as I can determine.

The Unicode standard defines more than a million code points, far too 
many to fit in a single byte. There is some historical justification for 
using "Unicode" to mean UCS-2, but with the standard being extended 
beyond the BMP, that is no longer valid.

See http://www.cl.cam.ac.uk/~mgk25/unicode.html for more details.


I think what you are trying to say is that the Unicode designers 
deliberately matched the Latin-1 standard for Unicode's first 256 code 
points. That's not the same thing though: there is no Unicode standard 
mapping to a single byte format.


-- 
Steven
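
The relationship between Latin-1 and the first 256 code points described in this post can be checked with a short sketch, assuming only the standard codecs shipped with CPython ("latin-1" and "cp1252"). It illustrates the point being made and is not code from the thread.

```python
data = bytes(range(256))

# Latin-1 maps each byte value to the code point with the same number.
text = data.decode("latin-1")
same_numbers = [ord(ch) for ch in text] == list(data)
print(same_numbers)

# Other 1-byte encodings disagree: in Windows-1252, byte 0x80 is the
# euro sign, U+20AC, which lies outside the first 256 code points.
euro = bytes([0x80]).decode("cp1252")
print(hex(ord(euro)))

# The reverse direction shows that Latin-1 cannot stand in for Unicode
# as a whole: any code point above 255 cannot be encoded with it.
try:
    euro.encode("latin-1")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)
```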


#27407

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-08-19 11:50 -0600
Message-ID	<mailman.3513.1345398650.4697.python-list@python.org>
In reply to	#27352

On Sun, Aug 19, 2012 at 12:33 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Sat, 18 Aug 2012 09:51:37 -0600, Ian Kelly wrote about PEP 393:
>> There is some additional benefit for Latin-1 users, but this has nothing
>> to do with Python.  If Python is going to have the option of a 1-byte
>> representation (and as long as we have the flexible representation, I
>> can see no reason not to),
>
> The PEP explicitly states that it only uses a 1-byte format for ASCII
> strings, not Latin-1:

I think you misunderstand the PEP then, because that is empirically false.

Python 3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:23:35) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getsizeof(bytes(range(256)).decode('latin1'))
329

The constructed string contains all 256 Latin-1 characters, so if
Latin-1 strings must be stored in the 2-byte format, then the size
should be at least 512 bytes.  It is not, so I think it must be using
the 1-byte encoding.
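
The same check can be generalized with a rough sketch, assuming CPython 3.3 or later. The function names pep393_width and measured_width are hypothetical; the first models the width rule as described in PEP 393 (chosen from the highest code point in the string), and the second estimates the real width from sys.getsizeof with the fixed object header cancelled out.

```python
import sys


def pep393_width(s):
    """Bytes per character that PEP 393 selects for s: 1, 2 or 4."""
    highest = max(map(ord, s), default=0)
    if highest < 0x100:
        return 1
    if highest < 0x10000:
        return 2
    return 4


def measured_width(s):
    """Observed bytes per character, with the fixed header cancelled out.

    Doubling the string keeps the same maximum code point, so the size
    difference is len(s) times the per-character width.
    """
    return (sys.getsizeof(s * 2) - sys.getsizeof(s)) // len(s)


latin1 = bytes(range(256)).decode("latin1")
samples = {
    "ascii": "a" * 256,
    "latin1": latin1,
    "bmp": "\u20ac" * 256,
    "astral": "\U0001F600" * 256,
}
for name, s in samples.items():
    print(name, pep393_width(s), measured_width(s))
```

If the two columns agree for the Latin-1 sample, that supports the reading that non-ASCII Latin-1 strings use the 1-byte representation.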

> "ASCII-only Unicode strings will again use only one byte per character"

This says nothing one way or the other about non-ASCII Latin-1 strings.

> "If the maximum character is less than 128, they use the PyASCIIObject
> structure"

Note that this only describes the structure of "compact" string
objects, which I have to admit I do not fully understand from the PEP.
The wording suggests that it only uses the PyASCIIObject structure,
not the derived structures.  It then says that for compact ASCII
strings "the UTF-8 data, the UTF-8 length and the wstr length are the
same as the length of the ASCII data."  But these fields are part of
the PyCompactUnicodeObject structure, not the base PyASCIIObject
structure, so they would not exist if only PyASCIIObject were used.
It would also imply that compact non-ASCII strings are stored
internally as UTF-8, which would be surprising.

> and:
>
> "The data and utf8 pointers point to the same memory if the string uses
> only ASCII characters (using only Latin-1 is not sufficient)."

This says that if the data are ASCII, then the 1-byte representation
and the utf8 pointer will share the same memory.  It does not imply
that the 1-byte representation is not used for Latin-1, only that it
cannot also share memory with the utf8 pointer.


#27418

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-08-19 11:20 -0700
Message-ID	<7xobm6u4kk.fsf@ruckus.brouhaha.com>
In reply to	#27407

Ian Kelly <ian.g.kelly@gmail.com> writes:
>>>> sys.getsizeof(bytes(range(256)).decode('latin1'))
> 329

Please try:

   print (type(bytes(range(256)).decode('latin1')))

to make sure that what comes back is actually a unicode string rather
than a byte string.


#27419

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-08-19 12:31 -0600
Message-ID	<mailman.3520.1345401102.4697.python-list@python.org>
In reply to	#27418

On Sun, Aug 19, 2012 at 12:20 PM, Paul Rubin <no.email@nospam.invalid> wrote:
> Ian Kelly <ian.g.kelly@gmail.com> writes:
>>>>> sys.getsizeof(bytes(range(256)).decode('latin1'))
>> 329
>
> Please try:
>
>    print (type(bytes(range(256)).decode('latin1')))
>
> to make sure that what comes back is actually a unicode string rather
> than a byte string.

As I understand it, the decode method never returns a byte string in
Python 3, but if you insist:

>>> print (type(bytes(range(256)).decode('latin1')))
<class 'str'>


#27423

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-08-19 12:23 -0700
Message-ID	<7xsjbiele3.fsf@ruckus.brouhaha.com>
In reply to	#27419

Ian Kelly <ian.g.kelly@gmail.com> writes:
>>>> print (type(bytes(range(256)).decode('latin1')))
> <class 'str'>

Thanks.


#27427

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-08-19 20:16 +0000
Message-ID	<503149bb$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to	#27407

On Sun, 19 Aug 2012 11:50:12 -0600, Ian Kelly wrote:

> On Sun, Aug 19, 2012 at 12:33 AM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
[...]
>> The PEP explicitly states that it only uses a 1-byte format for ASCII
>> strings, not Latin-1:
> 
> I think you misunderstand the PEP then, because that is empirically
> false.

Yes I did misunderstand. Thank you for the clarification.



-- 
Steven


#27420

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-08-19 12:46 -0600
Message-ID	<mailman.3521.1345402019.4697.python-list@python.org>
In reply to	#27352

On Sun, Aug 19, 2012 at 11:50 AM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> Note that this only describes the structure of "compact" string
> objects, which I have to admit I do not fully understand from the PEP.
>  The wording suggests that it only uses the PyASCIIObject structure,
> not the derived structures.  It then says that for compact ASCII
> strings "the UTF-8 data, the UTF-8 length and the wstr length are the
> same as the length of the ASCII data."  But these fields are part of
> the PyCompactUnicodeObject structure, not the base PyASCIIObject
> structure, so they would not exist if only PyASCIIObject were used.
> It would also imply that compact non-ASCII strings are stored
> internally as UTF-8, which would be surprising.

Oh, now I get it.  I had missed the part where it says "character data
immediately follow the base structure".  And the bit about the "UTF-8
data, the UTF-8 length and the wstr length" are not describing the
contents of those fields, but rather where the data can be alternatively
found since the fields don't exist.


#27319

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-08-18 17:59 +0000
Message-ID	<502fd7f6$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to	#27296

On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:

> On Saturday, 18 August 2012 14:27:23 UTC+2, Steven D'Aprano wrote:
>> [...]
>> The problem with UCS-4 is that every character requires four bytes.
>> [...]
> 
> I'm aware of this (and all the blah blah blah you are explaining). This
> is always the same song. Memory.

Exactly. The reason it is always the same song is because it is an 
important song.

> Let me ask. Is Python an "American" product for US users or is it a tool
> for everybody [*]?

It is a product for everyone, which is exactly why PEP 393 is so 
important. PEP 393 means that users who have only a few non-BMP 
characters don't have to pay the cost of UCS-4 for every single string in 
their application, only for the ones that actually require it. PEP 393 
means that using Unicode strings is now cheaper for everybody.

You seem to be arguing that the way forward is not to make Unicode 
cheaper for everyone, but to make ASCII strings more expensive so that 
everyone suffers equally. I reject that idea.

> Is there any reason why non ascii users are somehow penalized compared
> to ascii users?

Of course there is a reason.

If you want to represent 1114111 different characters in a string, as 
Unicode supports, you can't use a single byte per character, or even two 
bytes. That is a fact of basic mathematics. Supporting 1114111 characters 
must be more expensive than supporting 128 of them.

But why should you carry the cost of 4-bytes per character just because 
someday you *might* need a non-BMP character?
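
The arithmetic can be checked directly. A small sketch, assuming Python 3.3 or later, where sys.maxunicode is always 1114111; that figure is the largest code point, so counting zero there are 1114112 code points in total.

```python
import sys

largest = sys.maxunicode        # 1114111 (0x10FFFF) on Python 3.3+
code_points = largest + 1       # code points run from 0 to 0x10FFFF

print(largest, code_points)
print(code_points > 2 ** 8)     # True: one byte per character is not enough
print(code_points > 2 ** 16)    # True: two bytes are not enough either
print(largest.bit_length())     # 21 bits are needed for the largest value
```

Since 21 bits do not fit in two bytes, a fixed-width format that covers everything needs the next convenient size, which is four bytes.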

> This flexible string representation is a regression (ascii users or
> not).

No it is not. It is a great step forward to more efficient Unicode.

And it means that now Python can correctly deal with non-BMP characters 
without the nonsense of UTF-16 surrogates:

steve@runes:~$ python3.3 -c "print(len(chr(1114000)))"  # Right!
1
steve@runes:~$ python3.2 -c "print(len(chr(1114000)))"  # Wrong!
2

without doubling the storage of every string.

This is an important step towards making the full range of Unicode 
available more widely.
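
The difference between the two results above can be reproduced on a single interpreter by counting UTF-16 code units explicitly. A minimal sketch, assuming Python 3.3 or later; the helper name utf16_code_units is hypothetical.

```python
def utf16_code_units(s):
    """Count the 16-bit units needed to encode s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2

s = chr(1114000)                # a code point outside the BMP

print(len(s))                   # 1 on Python 3.3+: one code point
print(utf16_code_units(s))      # 2: a surrogate pair in UTF-16
```

The second number is what a narrow build reported as the string length, because it stored and counted the two surrogates separately.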

> I recognize in practice the real impact is for many users close to zero

Then what's the problem?

> (including me) but I have shown (I think) that this flexible
> representation is, by design, not as optimal as it is supposed to be.

You have not shown any real problem at all. 

You have shown untrustworthy, edited timing results that don't match what 
other people are reporting.

Even if your timing results are genuine, you haven't shown that they make 
any difference for real code that does useful work.
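
For timings that other people can rerun and compare, something along these lines would do. This is only a sketch: the statement being timed and the sample strings are arbitrary placeholders, not the benchmark discussed earlier in the thread, and the helper name best_time is made up.

```python
import timeit

def best_time(stmt, setup, repeat=3, number=10000):
    """Return the fastest of several timing runs, in seconds."""
    return min(timeit.repeat(stmt, setup=setup, repeat=repeat, number=number))

# Placeholder inputs: one string for each PEP 393 storage width.
samples = {
    "1 byte/char": "s = 'a' * 1000",
    "2 bytes/char": "s = '\\u20ac' * 1000",
    "4 bytes/char": "s = '\\U0001F600' * 1000",
}

results = {}
for label, setup in sorted(samples.items()):
    results[label] = best_time("s + s", setup)
    print(label, results[label])
```

Running the unedited script under both 3.2 and 3.3 and posting the full output, along with the platform, would give everyone the same numbers to look at.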

-- 
Steven


#27323

From	wxjmfauth@gmail.com
Date	2012-08-18 11:30 -0700
Message-ID	<d09842f1-78b5-4c0b-8c4e-b8523a53c289@googlegroups.com>
In reply to	#27319

On Saturday, 18 August 2012 19:59:18 UTC+2, Steven D'Aprano wrote:
> On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
>
> > On Saturday, 18 August 2012 14:27:23 UTC+2, Steven D'Aprano wrote:
> >> [...]
> >> The problem with UCS-4 is that every character requires four bytes.
> >> [...]
> >
> > I'm aware of this (and all the blah blah blah you are explaining). This
> > is always the same song. Memory.
>
> Exactly. The reason it is always the same song is because it is an
> important song.

No offense here. But this is an *american* answer.

The same story as the coding of text files, where "utf-8 == ascii"
and the rest of the world doesn't count.

jmf


#27327

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2012-08-18 20:45 +0100
Message-ID	<mailman.3470.1345319111.4697.python-list@python.org>
In reply to	#27323

On 18/08/2012 19:30, wxjmfauth@gmail.com wrote:
> On Saturday, 18 August 2012 19:59:18 UTC+2, Steven D'Aprano wrote:
>> On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
>>
>>> On Saturday, 18 August 2012 14:27:23 UTC+2, Steven D'Aprano wrote:
>>>> [...]
>>>> The problem with UCS-4 is that every character requires four bytes.
>>>> [...]
>>>
>>> I'm aware of this (and all the blah blah blah you are explaining). This
>>> is always the same song. Memory.
>>
>> Exactly. The reason it is always the same song is because it is an
>> important song.
>>
> No offense here. But this is an *american* answer.
>
> The same story as the coding of text files, where "utf-8 == ascii"
> and the rest of the world doesn't count.
>
> jmf
>

Thinking about it I entirely agree with you.  Steven D'Aprano strikes me 
as typically American, in the same way that I'm typically Brazilian :)

-- 
Cheers.

Mark Lawrence.


#27350

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-08-19 06:13 +0000
Message-ID	<50308409$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to	#27323

On Sat, 18 Aug 2012 11:30:19 -0700, wxjmfauth wrote:

>> > I'm aware of this (and all the blah blah blah you are explaining).
>> > This is always the same song. Memory.
>> 
>> 
>> 
>> Exactly. The reason it is always the same song is because it is an
>> important song.
>> 
>> 
> No offense here. But this is an *american* answer.

I am not American.

I am not aware that computers outside of the USA, and Australia, have 
unlimited amounts of memory. You must be very lucky.


> The same story as the coding of text files, where "utf-8 == ascii" and
> the rest of the world doesn't count.

UTF-8 is not ASCII.
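
A short illustration, using an arbitrary sample word: the two encodings agree on pure ASCII text and part ways as soon as anything else appears.

```python
plain = "plain"
accented = "caf\xe9"            # "café", written with an escape

# For pure ASCII text the two encodings produce identical bytes.
print(plain.encode("utf-8") == plain.encode("ascii"))   # True

# UTF-8 handles the accented letter by using two bytes for it.
print(accented.encode("utf-8"))                         # b'caf\xc3\xa9'

# ASCII cannot represent it at all.
try:
    accented.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)                                         # False
```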



-- 
Steven


#27325

From	rusi <rustompmody@gmail.com>
Date	2012-08-18 11:40 -0700
Message-ID	<e19b8f04-05f0-43d2-8983-30622513fab3@j9g2000pbg.googlegroups.com>
In reply to	#27319

On Aug 18, 10:59 pm, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
> > Is there any reason why non ascii users are somehow penalized compared
> > to ascii users?
>
> Of course there is a reason.
>
> If you want to represent 1114111 different characters in a string, as
> Unicode supports, you can't use a single byte per character, or even two
> bytes. That is a fact of basic mathematics. Supporting 1114111 characters
> must be more expensive than supporting 128 of them.
>
> But why should you carry the cost of 4-bytes per character just because
> someday you *might* need a non-BMP character?

I am reminded of: http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605

Original above does not open for me but here's a copy that does:

http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html


#27328

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2012-08-18 20:50 +0100
Message-ID	<mailman.3471.1345319708.4697.python-list@python.org>
In reply to	#27325

On 18/08/2012 19:40, rusi wrote:
> On Aug 18, 10:59 pm, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
>> On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
>>> Is there any reason why non ascii users are somehow penalized compared
>>> to ascii users?
>>
>> Of course there is a reason.
>>
>> If you want to represent 1114111 different characters in a string, as
>> Unicode supports, you can't use a single byte per character, or even two
>> bytes. That is a fact of basic mathematics. Supporting 1114111 characters
>> must be more expensive than supporting 128 of them.
>>
>> But why should you carry the cost of 4-bytes per character just because
>> someday you *might* need a non-BMP character?
>
> I am reminded of: http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605
>
> Original above does not open for me but here's a copy that does:
>
> http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html
>

ROFLMAO doesn't adequately sum up how much I laughed.

-- 
Cheers.

Mark Lawrence.


#27331

From	wxjmfauth@gmail.com
Date	2012-08-18 13:22 -0700
Message-ID	<3e235732-39e4-4877-a860-466e433cde5e@googlegroups.com>
In reply to	#27325

On Saturday, 18 August 2012 20:40:23 UTC+2, rusi wrote:
> On Aug 18, 10:59 pm, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> > On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
> > > Is there any reason why non ascii users are somehow penalized compared
> > > to ascii users?
> >
> > Of course there is a reason.
> >
> > If you want to represent 1114111 different characters in a string, as
> > Unicode supports, you can't use a single byte per character, or even two
> > bytes. That is a fact of basic mathematics. Supporting 1114111 characters
> > must be more expensive than supporting 128 of them.
> >
> > But why should you carry the cost of 4-bytes per character just because
> > someday you *might* need a non-BMP character?
>
> I am reminded of: http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605
>
> Original above does not open for me but here's a copy that does:
>
> http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html

I think it's time to leave the discussion and to go to bed.

You can take the problem the way you wish, Python 3.3 is "slower"
than Python 3.2.

If you see the present status as an optimisation, I'm considering
this as a regression.

I'm pretty sure a pure ucs-4/utf-32 can only be, by nature,
the correct solution.

To be extreme, tools using pure utf-16 or utf-32 are, at least,
considering all the citizens on this planet in the same way.

jmf


#27334

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2012-08-18 22:37 +0100
Message-ID	<mailman.3475.1345325786.4697.python-list@python.org>
In reply to	#27331

On 18/08/2012 21:22, wxjmfauth@gmail.com wrote:
> On Saturday, 18 August 2012 20:40:23 UTC+2, rusi wrote:
>> On Aug 18, 10:59 pm, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
>>> On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
>>>> Is there any reason why non ascii users are somehow penalized compared
>>>> to ascii users?
>>>
>>> Of course there is a reason.
>>>
>>> If you want to represent 1114111 different characters in a string, as
>>> Unicode supports, you can't use a single byte per character, or even two
>>> bytes. That is a fact of basic mathematics. Supporting 1114111 characters
>>> must be more expensive than supporting 128 of them.
>>>
>>> But why should you carry the cost of 4-bytes per character just because
>>> someday you *might* need a non-BMP character?
>>
>> I am reminded of: http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605
>>
>> Original above does not open for me but here's a copy that does:
>>
>> http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html
>
> I think it's time to leave the discussion and to go to bed.

In plain English, duck out cos I'm losing.

>
> You can take the problem the way you wish, Python 3.3 is "slower"
> than Python 3.2.

I'll ask for the second time.  Provide proof that is acceptable to 
everybody and not just yourself.

>
> If you see the present status as an optimisation, I'm considering
> this as a regression.

Considering does not equate to proof.  Where are the figures which back 
up your claim?

>
> I'm pretty sure a pure ucs-4/utf-32 can only be, by nature,
> the correct solution.

I look forward to seeing your patch on the bug tracker.  If and only if 
you can find something that needs patching, which from the course of 
this thread I think is highly unlikely.


>
> To be extreme, tools using pure utf-16 or utf-32 are, at least,
> considering all the citizens on this planet in the same way.
>
> jmf
>


-- 
Cheers.

Mark Lawrence.


#27322

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-08-18 11:26 -0700
Message-ID	<7xehn4vyya.fsf@ruckus.brouhaha.com>
In reply to	#27291

Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
> (There is an extension to UCS-2, UTF-16, which encodes non-BMP characters 
> using two code points. This is fragile and doesn't work very well, 
> because string-handling methods can break the surrogate pairs apart, 
> leaving you with invalid unicode string. Not good.)
...
> With PEP 393, each Python string will be stored in the most efficient 
> format possible:

Can you explain the issue of "breaking surrogate pairs apart" a little
more?  Switching between encodings based on the string contents seems
silly at first glance.  Strings are immutable so I don't understand why
not use UTF-8 or UTF-16 for everything.  UTF-8 is more efficient in
Latin-based alphabets and UTF-16 may be more efficient for some other
languages.  I think even UCS-4 doesn't completely fix the surrogate pair
issue if it means the only thing I can think of.
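
To put rough numbers on the efficiency comparison, here is a small sketch that measures encoded sizes for two arbitrary sample strings. The helper name encoded_sizes is hypothetical, and the "-le" codec variants are used so that no byte order mark is counted.

```python
def encoded_sizes(text):
    """Return the encoded length in bytes for three Unicode encodings."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return dict((enc, len(text.encode(enc))) for enc in encodings)

latin = "hello"
japanese = "\u65e5\u672c\u8a9e"     # three BMP characters, written as escapes

print(encoded_sizes(latin))     # utf-8: 5, utf-16-le: 10, utf-32-le: 20
print(encoded_sizes(japanese))  # utf-8: 9, utf-16-le: 6, utf-32-le: 12
```

Neither UTF-8 nor UTF-16 is fixed-width, though, so the size advantage comes at the cost of not knowing where the n-th character starts without scanning.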


#27326

From	MRAB <python@mrabarnett.plus.com>
Date	2012-08-18 19:59 +0100
Message-ID	<mailman.3469.1345316373.4697.python-list@python.org>
In reply to	#27322

On 18/08/2012 19:26, Paul Rubin wrote:
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
>> (There is an extension to UCS-2, UTF-16, which encodes non-BMP characters
>> using two code points. This is fragile and doesn't work very well,
>> because string-handling methods can break the surrogate pairs apart,
>> leaving you with invalid unicode string. Not good.)
> ...
>> With PEP 393, each Python string will be stored in the most efficient
>> format possible:
>
> Can you explain the issue of "breaking surrogate pairs apart" a little
> more?  Switching between encodings based on the string contents seems
> silly at first glance.  Strings are immutable so I don't understand why
> not use UTF-8 or UTF-16 for everything.  UTF-8 is more efficient in
> Latin-based alphabets and UTF-16 may be more efficient for some other
> languages.  I think even UCS-4 doesn't completely fix the surrogate pair
> issue if it means the only thing I can think of.
>
On a narrow build, codepoints outside the BMP are stored as a surrogate
pair (2 codepoints). On a wide build, all codepoints can be represented
without the need for surrogate pairs.

The problem with strings containing surrogate pairs is that you could
inadvertently slice the string in the middle of the surrogate pair.
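
The slicing problem can be imitated on any build by working with a list of UTF-16 code units instead of a str. This is a sketch only; the helper names are made up and the list stands in for the internal storage of a narrow build.

```python
import struct

def to_utf16_units(s):
    """Encode s and return its UTF-16 code units as a list of ints."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def from_utf16_units(units):
    """Decode a list of UTF-16 code units back into a str."""
    return struct.pack("<%dH" % len(units), *units).decode("utf-16-le")

units = to_utf16_units("a\U0001F600b")
print([hex(u) for u in units])  # ['0x61', '0xd83d', '0xde00', '0x62']

# Slicing after two units cuts the surrogate pair in half.
head = units[:2]
try:
    from_utf16_units(head)
    sliced_ok = True
except UnicodeDecodeError:
    sliced_ok = False
print(sliced_ok)                # False: the slice is not valid UTF-16
```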


#27356

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-08-19 07:17 +0000
Message-ID	<503092f6$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to	#27326

On Sat, 18 Aug 2012 19:59:32 +0100, MRAB wrote:

> The problem with strings containing surrogate pairs is that you could
> inadvertently slice the string in the middle of the surrogate pair.

That's the *least* of the problems with surrogate pairs. That would be 
easy to fix: check the point of the slice, and back up or forward if 
you're on a surrogate pair. But that's not good enough, because the 
surrogates could be anywhere in the string. You have to touch every 
single character in order to know how many there are.

The problem with surrogate pairs is that they make basic string 
operations O(N) instead of O(1).
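
To make the cost concrete, here is a sketch of what indexing by character has to do when the storage is UTF-16 code units. The function names are hypothetical, and the code assumes well-formed input with no lone surrogates.

```python
import struct

def to_utf16_units(s):
    """Encode s and return its UTF-16 code units as a list of ints."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def nth_code_point(units, n):
    """Return the n-th code point by walking the units from the start."""
    i = 0
    for _ in range(n):
        # A high surrogate means this character occupies two units.
        i += 2 if 0xD800 <= units[i] <= 0xDBFF else 1
    unit = units[i]
    if 0xD800 <= unit <= 0xDBFF:
        # Combine the high and low surrogates into one code point.
        return 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
    return unit

units = to_utf16_units("a\U0001F600b")
print(hex(nth_code_point(units, 1)))    # 0x1f600
print(hex(nth_code_point(units, 2)))    # 0x62
```

The loop is the O(N) part: the position of character n depends on every character before it. With a fixed width per character, as in PEP 393, the same lookup is a single multiplication.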

-- 
Steven
