| Started by | Charles Jensen <hopefullycharles@gmail.com> |
|---|---|
| First post | 2012-08-16 15:09 -0700 |
| Last post | 2012-08-20 17:20 -0400 |
| Articles | 20 on this page of 145 — 26 participants |
How do I display unicode value stored in a string variable using ord() Charles Jensen <hopefullycharles@gmail.com> - 2012-08-16 15:09 -0700
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-17 08:20 +1000
Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-16 18:47 -0400
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-16 19:59 -0400
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 10:49 -0700
Re: How do I display unicode value stored in a string variable using ord() Jerry Hill <malaclypse2@gmail.com> - 2012-08-17 14:21 -0400
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 11:45 -0700
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 11:45 -0700
Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-17 16:55 -0400
Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-17 23:30 -0400
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 04:10 +0000
Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-18 09:18 -0600
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 03:59 +0000
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 10:49 -0700
Re: How do I display unicode value stored in a string variable using ord() Alister <alister.ware@ntlworld.com> - 2012-08-17 06:30 +0000
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 01:09 -0700
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 12:27 +0000
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 08:07 -0700
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 16:25 +0100
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 01:36 +1000
Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-18 09:51 -0600
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 09:38 -0700
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 02:57 +1000
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 18:28 +0100
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 11:05 -0700
Re: How do I display unicode value stored in a string variable using ord() MRAB <python@mrabarnett.plus.com> - 2012-08-18 19:34 +0100
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:35 +0000
New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord() Peter Otten <__peter__@web.de> - 2012-08-19 09:43 +0200
Re: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 08:56 +0000
Re: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 02:24 -0700
Re: New internal string format in 3.3 Peter Otten <__peter__@web.de> - 2012-08-19 11:37 +0200
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 03:19 -0700
Re: New internal string format in 3.3 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 13:33 +0000
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 03:19 -0700
Re: New internal string format in 3.3 Chris Angelico <rosuav@gmail.com> - 2012-08-19 20:26 +1000
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:14 -0700
Re: New internal string format in 3.3 Dave Angel <d@davea.name> - 2012-08-19 08:29 -0400
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:59 -0700
Re: New internal string format in 3.3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 14:46 +0100
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 07:09 -0700
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 07:09 -0700
Re: New internal string format in 3.3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 15:48 +0100
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 09:19 -0700
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 09:19 -0700
Re: New internal string format in 3.3 Terry Reedy <tjreedy@udel.edu> - 2012-08-19 13:48 -0400
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 10:51 -0700
Re: New internal string format in 3.3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 19:09 +0100
Re: New internal string format in 3.3 Chris Angelico <rosuav@gmail.com> - 2012-08-20 07:50 +1000
Re: New internal string format in 3.3 Michael Torrie <torriem@gmail.com> - 2012-08-19 23:38 -0600
Re: New internal string format in 3.3 Roy Smith <roy@panix.com> - 2012-08-20 09:17 -0400
Re: New internal string format in 3.3 Michael Torrie <torriem@gmail.com> - 2012-08-20 22:18 -0600
Re: New internal string format in 3.3 Roy Smith <roy@panix.com> - 2012-08-21 07:48 -0400
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 10:51 -0700
Re: New internal string format in 3.3 Terry Reedy <tjreedy@udel.edu> - 2012-08-19 13:56 -0400
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:59 -0700
Re: New internal string format in 3.3 Dave Angel <d@davea.name> - 2012-08-19 08:35 -0400
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:14 -0700
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:30 +0000
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 11:05 -0700
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-18 16:09 -0400
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-18 23:12 -0400
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 09:38 -0700
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:33 +0000
Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-19 11:50 -0600
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 11:20 -0700
Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-19 12:31 -0600
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 12:23 -0700
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 20:16 +0000
Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-19 12:46 -0600
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 17:59 +0000
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 11:30 -0700
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 20:45 +0100
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:13 +0000
Re: How do I display unicode value stored in a string variable using ord() rusi <rustompmody@gmail.com> - 2012-08-18 11:40 -0700
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 20:50 +0100
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 13:22 -0700
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 22:37 +0100
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 11:26 -0700
Re: How do I display unicode value stored in a string variable using ord() MRAB <python@mrabarnett.plus.com> - 2012-08-18 19:59 +0100
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 07:17 +0000
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 10:46 +1000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 19:11 -0700
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 12:19 +1000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 19:35 -0700
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 13:01 +1000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 20:10 -0700
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 13:31 +1000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 22:58 -0700
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 08:01 +0000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 01:11 -0700
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 18:24 +1000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 01:44 -0700
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 01:54 -0700
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 11:46 +0100
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 12:31 -0400
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 10:51 +0000
Re: How do I display unicode value stored in a string variable using ord() Neil Hodgson <nhodgson@iinet.net.au> - 2012-08-21 17:03 +1000
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:09 +0000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 01:04 -0700
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 13:25 +0000
Re: How do I display unicode value stored in a string variable using ord() DJC <djc@news.invalid> - 2012-08-19 17:32 +0200
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 13:34 -0400
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 10:48 -0700
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 11:11 -0700
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 19:50 +0100
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 17:59 -0400
Re: How do I display unicode value stored in a string variable using ord() rusi <rustompmody@gmail.com> - 2012-08-19 23:13 -0700
Abuse of Big Oh notation [was Re: How do I display unicode value stored in a string variable using ord()] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 20:15 +0000
Re: Abuse of Big Oh notation Paul Rubin <no.email@nospam.invalid> - 2012-08-19 16:42 -0700
Re: Abuse of Big Oh notation Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-08-20 09:24 +0100
Re: Abuse of Big Oh notation Paul Rubin <no.email@nospam.invalid> - 2012-08-20 09:01 -0700
Re: Abuse of Big Oh notation Chris Angelico <rosuav@gmail.com> - 2012-08-21 02:09 +1000
Re: Abuse of Big Oh notation Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-20 11:12 -0600
Re: Abuse of Big Oh notation Paul Rubin <no.email@nospam.invalid> - 2012-08-20 12:29 -0700
Re: Abuse of Big Oh notation 88888 Dihedral <dihedral88888@googlemail.com> - 2012-08-20 15:16 -0700
Re: Abuse of Big Oh notation 88888 Dihedral <dihedral88888@googlemail.com> - 2012-08-20 15:20 -0700
Re: Abuse of Big Oh notation Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-21 09:53 +0000
Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-20 11:42 -0700
Re: Abuse of Big Oh notation Ned Deily <nad@acm.org> - 2012-08-20 18:19 -0700
Abuse of subject, was Re: Abuse of Big Oh notation Peter Otten <__peter__@web.de> - 2012-08-21 09:52 +0200
Re: Abuse of subject, was Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-21 10:16 -0700
Re: Abuse of subject, was Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-21 10:16 -0700
Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-20 11:42 -0700
Re: How do I display unicode value stored in a string variable using ord() Hans Mulder <hansmu@xs4all.nl> - 2012-08-22 20:53 +0200
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-20 08:42 +1000
Re: How do I display unicode value stored in a string variable using ord() Roy Smith <roy@panix.com> - 2012-08-19 19:24 -0400
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-20 04:21 +0000
Re: How do I display unicode value stored in a string variable using ord() Roy Smith <roy@panix.com> - 2012-08-20 00:44 -0400
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-20 05:56 +0000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 23:24 -0700
Re: How do I display unicode value stored in a string variable using ord() Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-08-20 12:58 -0400
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 20:35 -0400
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-20 14:07 +1000
Re: How do I display unicode value stored in a string variable using ord() lipska the kat <lipskathekat@yahoo.co.uk> - 2012-08-19 11:13 +0100
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 20:19 +1000
Re: How do I display unicode value stored in a string variable using ord() lipska the kat <lipskathekat@yahoo.co.uk> - 2012-08-19 11:49 +0100
Re: How do I display unicode value stored in a string variable using ord() "Blind Anagram" <noname@nowhere.com> - 2012-08-19 18:03 +0100
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 10:33 -0700
Re: How do I display unicode value stored in a string variable using ord() "Blind Anagram" <noname@nowhere.com> - 2012-08-19 19:04 +0100
Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-19 14:05 -0400
Re: How do I display unicode value stored in a string variable usingord() "Blind Anagram" <noname@nowhere.com> - 2012-08-19 19:18 +0100
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 20:31 +0000
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 17:03 -0400
Re: How do I display unicode value stored in a string variable using ord() 88888 Dihedral <dihedral88888@googlemail.com> - 2012-08-19 17:32 -0700
Re: How do I display unicode value stored in a string variable using ord() Piet van Oostrum <piet@vanoostrum.org> - 2012-08-20 17:20 -0400
| From | Charles Jensen <hopefullycharles@gmail.com> |
|---|---|
| Date | 2012-08-16 15:09 -0700 |
| Subject | How do I display unicode value stored in a string variable using ord() |
| Message-ID | <f801e06f-f7b2-4aca-b352-66856a939746@googlegroups.com> |
Everyone knows that the python command
ord(u'…')
will output the number 8230 which is the unicode character for the horizontal ellipsis.
How would I use ord() to find the unicode value of a string stored in a variable?
So the following 2 lines of code will give me the ascii value of the variable a. How do I specify ord to give me the unicode value of a?
a = '…'
ord(a)
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-08-17 08:20 +1000 |
| Message-ID | <mailman.3397.1345155618.4697.python-list@python.org> |
| In reply to | #27204 |
On Fri, Aug 17, 2012 at 8:09 AM, Charles Jensen <hopefullycharles@gmail.com> wrote:
> How would I use ord() to find the unicode value of a string stored in a variable?
>
> So the following 2 lines of code will give me the ascii value of the variable a. How do I specify ord to give me the unicode value of a?
>
> a = '…'
> ord(a)
I presume you're talking about Python 2, because in Python 3 your string variable is a Unicode string and will behave as you describe above. You'll need to look into what the encoding is, and figure it out from there.
ChrisA
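The advice above (find out the encoding, then work from there) might look roughly like this in Python 3 terms. This is only a sketch: it assumes the variable holds raw bytes for exactly one character and that the encoding is already known, and the helper name `code_point_of` is made up for illustration.

```python
def code_point_of(raw: bytes, encoding: str) -> int:
    """Decode bytes that hold exactly one character and return its code point."""
    text = raw.decode(encoding)  # raises UnicodeDecodeError if the bytes don't fit
    if len(text) != 1:
        raise ValueError(f"expected one character, got {len(text)}")
    return ord(text)

# The horizontal ellipsis (U+2026) as stored under two different encodings.
utf8_bytes = "\u2026".encode("utf-8")     # three bytes
cp1252_bytes = "\u2026".encode("cp1252")  # one byte

print(code_point_of(utf8_bytes, "utf-8"))
print(code_point_of(cp1252_bytes, "cp1252"))
```

The same code point comes back in both cases; only the byte representation differs, which is why the encoding has to be known before ord() can give a meaningful answer.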
| From | Dave Angel <d@davea.name> |
|---|---|
| Date | 2012-08-16 18:47 -0400 |
| Message-ID | <mailman.3401.1345157258.4697.python-list@python.org> |
| In reply to | #27204 |
On 08/16/2012 06:09 PM, Charles Jensen wrote:
> Everyone knows that the python command
>
> ord(u'…')
>
> will output the number 8230 which is the unicode character for the horizontal ellipsis.
>
> How would I use ord() to find the unicode value of a string stored in a variable?
>
> So the following 2 lines of code will give me the ascii value of the variable a. How do I specify ord to give me the unicode value of a?
>
> a = '…'
> ord(a)
You omitted the print statement. You also didn't specify what version
of Python you're using; I'll assume Python 2.x because in Python 3.0
through 3.2, the u"xx" notation would have been a syntax error.
To get the ord of a unicode variable, you do it the same as a unicode
literal:
# Note: for this to work reliably with non-ASCII literals, you probably
# need the correct encoding declaration in line 1 or 2 of the file.
a = u"j"
print ord(a)
But if you have a byte string containing some binary bits, and you want
to get a unicode character value out of it, you'll need to explicitly
convert it to unicode.
First, determine which encoding was used to produce the byte string. If
you specify the wrong encoding, you're likely to get an exception, or
maybe just a nonsense answer.
a = "\xe2\x80\xa6"    # UTF-8 bytes for the horizontal ellipsis
b = a.decode("utf-8")
print ord(b)          # prints 8230
--
DaveA
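To make the "exception, or maybe just a nonsense answer" point concrete, here is a small Python 3 sketch (not Python 2, which the post above targets). The choice of latin-1 and ASCII as the "wrong" encodings is just an assumption for illustration.

```python
raw = "\u2026".encode("utf-8")  # b'\xe2\x80\xa6', the ellipsis in UTF-8

# Correct encoding: one character, code point 8230.
right = raw.decode("utf-8")

# Wrong but permissive encoding: latin-1 maps every byte to some character,
# so decoding succeeds and silently yields three unrelated characters.
nonsense = raw.decode("latin-1")

# Wrong and strict encoding: ASCII rejects any byte above 0x7F.
try:
    raw.decode("ascii")
    failed = False
except UnicodeDecodeError:
    failed = True

print(ord(right), [hex(ord(c)) for c in nonsense], failed)
```

The permissive case is the more dangerous one, since nothing signals that the result is wrong.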
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2012-08-16 19:59 -0400 |
| Message-ID | <mailman.3406.1345161591.4697.python-list@python.org> |
| In reply to | #27204 |
a = '…'
print(ord(a))
>>>
8230
Most things with unicode are easier in 3.x, and some are even better in 3.3. The current beta is good enough for most informal work. 3.3.0 will be out in a month.
--
Terry Jan Reedy
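For anyone trying this in Python 3, a slightly longer sketch of the same idea follows. The extra calls (`hex`, `chr`, `unicodedata.name`) are standard library functions and are not part of the post above.

```python
import unicodedata

a = "\u2026"                 # same as a = '…'
print(ord(a))                # code point as an integer
print(hex(ord(a)))           # same value in hexadecimal
print(unicodedata.name(a))   # official Unicode character name
print(chr(8230) == a)        # chr() is the inverse of ord()
```

Since a Python 3 str is already a sequence of code points, no decoding step is needed before calling ord().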
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-08-17 10:49 -0700 |
| Message-ID | <a6c030b2-25da-47a2-97b5-1e349394d762@googlegroups.com> |
| In reply to | #27215 |
On Friday, 17 August 2012 01:59:31 UTC+2, Terry Reedy wrote:
> a = '…'
> print(ord(a))
> >>>
> 8230
> Most things with unicode are easier in 3.x, and some are even better in
> 3.3. The current beta is good enough for most informal work. 3.3.0 will
> be out in a month.
>
> --
> Terry Jan Reedy
Slightly off topic.
The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
is one of these characters existing in the cp1252, mac-roman
coding schemes and not in iso-8859-1 (latin-1) and obviously
not in ascii. It causes Py3.3 to work a few 100% slower
than Py<3.3 versions due to the flexible string representation
(ascii/latin-1/ucs-2/ucs-4) (I found cases up to 1000%).
>>> '…'.encode('cp1252')
b'\x85'
>>> '…'.encode('mac-roman')
b'\xc9'
>>> '…'.encode('iso-8859-1') # latin-1
Traceback (most recent call last):
File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2026'
in position 0: ordinal not in range(256)
If one could neglect this (typographically important) glyph, what
to say about the characters of the European scripts (languages)
present in cp1252 or in mac-roman but not in latin-1 (eg. the
French script/language)?
Very nice. Python 2 was built for ascii user, now Python 3 is
*optimized* for, let say, ascii user!
The future is bright for Python. French users are better
served with Apple or MS products, simply because these
corporates know you can not write French with iso-8859-1.
PS When "TeX" moved from the ascii encoding to iso-8859-1
and the so called Cork encoding, "they" know this and provided
all the complementary packages to circumvent this. It was
in 199? (Python was not even born).
Ditto for the foundries (Adobe, Linotype, ...)
jmf
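The claim that cp1252 contains characters that latin-1 lacks can be checked with a short Python 3 sketch. The function name `cp1252_only_chars` is hypothetical, and the scan is limited to bytes 0x80 through 0x9F on the assumption that this is where the two encodings differ.

```python
def cp1252_only_chars():
    """Return characters that cp1252 places in 0x80-0x9F and latin-1 cannot encode."""
    found = []
    for byte in range(0x80, 0xA0):
        try:
            ch = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            continue  # a few byte values are unassigned in Python's cp1252 codec
        try:
            ch.encode("latin-1")
        except UnicodeEncodeError:
            found.append(ch)
    return found

chars = cp1252_only_chars()
print("".join(chars))
```

The result includes the ellipsis as well as the French ligatures œ and Œ, which is the case the post refers to. It says nothing about speed; that part of the claim would need separate timing measurements.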
| From | Jerry Hill <malaclypse2@gmail.com> |
|---|---|
| Date | 2012-08-17 14:21 -0400 |
| Message-ID | <mailman.3422.1345227697.4697.python-list@python.org> |
| In reply to | #27248 |
On Fri, Aug 17, 2012 at 1:49 PM, <wxjmfauth@gmail.com> wrote:
> The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
> is one of these characters existing in the cp1252, mac-roman
> coding schemes and not in iso-8859-1 (latin-1) and obviously
> not in ascii. It causes Py3.3 to work a few 100% slower
> than Py<3.3 versions due to the flexible string representation
> (ascii/latin-1/ucs-2/ucs-4) (I found cases up to 1000%).
>
>>>> '…'.encode('cp1252')
> b'\x85'
>>>> '…'.encode('mac-roman')
> b'\xc9'
>>>> '…'.encode('iso-8859-1') # latin-1
> Traceback (most recent call last):
> File "<eta last command>", line 1, in <module>
> UnicodeEncodeError: 'latin-1' codec can't encode character '\u2026'
> in position 0: ordinal not in range(256)
>
> If one could neglect this (typographically important) glyph, what
> to say about the characters of the European scripts (languages)
> present in cp1252 or in mac-roman but not in latin-1 (eg. the
> French script/language)?
So... python should change the longstanding definition of the latin-1
character set? This isn't some sort of python limitation, it's just
the reality of legacy encodings that actually exist in the real world.
> Very nice. Python 2 was built for ascii user, now Python 3 is
> *optimized* for, let say, ascii user!
>
> The future is bright for Python. French users are better
> served with Apple or MS products, simply because these
> corporates know you can not write French with iso-8859-1.
>
> PS When "TeX" moved from the ascii encoding to iso-8859-1
> and the so called Cork encoding, "they" know this and provided
> all the complementary packages to circumvent this. It was
> in 199? (Python was not even born).
>
> Ditto for the foundries (Adobe, Linotype, ...)
I don't understand what any of this has to do with Python. Just
output your text in UTF-8 like any civilized person in the 21st
century, and none of that is a problem at all. Python makes that easy.
It also makes it easy to interoperate with older encodings if you
have to.
--
Jerry
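As a rough Python 3 illustration of that point (UTF-8 for output, legacy encodings only when required), consider the sketch below. The sample text and the `errors="replace"` policy are assumptions chosen for the example.

```python
text = "Un \u0153uf\u2026"  # "Un œuf…": contains œ (U+0153) and … (U+2026)

utf8 = text.encode("utf-8")     # UTF-8 handles both characters
legacy = text.encode("cp1252")  # works too: cp1252 has both characters

# latin-1 lacks both characters; errors="replace" substitutes "?" instead of raising.
lossy = text.encode("latin-1", errors="replace")

print(utf8, legacy, lossy)
```

Decoding `utf8` or `legacy` with the matching codec restores the original text, while the latin-1 version has lost information for good.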
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-08-17 11:45 -0700 |
| Message-ID | <mailman.3423.1345229106.4697.python-list@python.org> |
| In reply to | #27254 |
On Friday, 17 August 2012 20:21:34 UTC+2, Jerry Hill wrote:
> On Fri, Aug 17, 2012 at 1:49 PM, <wxjmfauth@gmail.com> wrote:
> > The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
> > is one of these characters existing in the cp1252, mac-roman
> > coding schemes and not in iso-8859-1 (latin-1) and obviously
> > not in ascii. It causes Py3.3 to work a few 100% slower
> > than Py<3.3 versions due to the flexible string representation
> > (ascii/latin-1/ucs-2/ucs-4) (I found cases up to 1000%).
> >
> > >>> '…'.encode('cp1252')
> > b'\x85'
> > >>> '…'.encode('mac-roman')
> > b'\xc9'
> > >>> '…'.encode('iso-8859-1') # latin-1
> > Traceback (most recent call last):
> > File "<eta last command>", line 1, in <module>
> > UnicodeEncodeError: 'latin-1' codec can't encode character '\u2026'
> > in position 0: ordinal not in range(256)
> >
> > If one could neglect this (typographically important) glyph, what
> > to say about the characters of the European scripts (languages)
> > present in cp1252 or in mac-roman but not in latin-1 (eg. the
> > French script/language)?
>
> So... python should change the longstanding definition of the latin-1
> character set? This isn't some sort of python limitation, it's just
> the reality of legacy encodings that actually exist in the real world.
>
> > Very nice. Python 2 was built for ascii user, now Python 3 is
> > *optimized* for, let say, ascii user!
> >
> > The future is bright for Python. French users are better
> > served with Apple or MS products, simply because these
> > corporates know you can not write French with iso-8859-1.
> >
> > PS When "TeX" moved from the ascii encoding to iso-8859-1
> > and the so called Cork encoding, "they" know this and provided
> > all the complementary packages to circumvent this. It was
> > in 199? (Python was not even born).
> >
> > Ditto for the foundries (Adobe, Linotype, ...)
>
> I don't understand what any of this has to do with Python. Just
> output your text in UTF-8 like any civilized person in the 21st
> century, and none of that is a problem at all. Python makes that easy.
> It also makes it easy to interoperate with older encodings if you
> have to.
>
Sorry, you missed the point.
My comment had nothing to do with the code source coding,
the coding of a Python "string" in the code source or with
the display of a Python3 <str>.
I wrote about the *internal* Python "coding", the
way Python keeps "strings" in memory. See PEP 393.
jmf
| From | Dave Angel <d@davea.name> |
|---|---|
| Date | 2012-08-17 16:55 -0400 |
| Message-ID | <mailman.3431.1345236951.4697.python-list@python.org> |
| In reply to | #27257 |
On 08/17/2012 02:45 PM, wxjmfauth@gmail.com wrote:
> On Friday, 17 August 2012 20:21:34 UTC+2, Jerry Hill wrote:
>> <SNIP>
>>
>> I don't understand what any of this has to do with Python. Just
>> output your text in UTF-8 like any civilized person in the 21st
>> century, and none of that is a problem at all. Python makes that easy.
>> It also makes it easy to interoperate with older encodings if you
>> have to.
>>
> Sorry, you missed the point.
>
> My comment had nothing to do with the code source coding,
> the coding of a Python "string" in the code source or with
> the display of a Python3 <str>.
> I wrote about the *internal* Python "coding", the
> way Python keeps "strings" in memory. See PEP 393.
>
> jmf
The internal coding described in PEP 393 has nothing to do with latin-1
encoding. So what IS your point? Make it clearly, without all the snide
side-comments.
--
DaveA
| From | Dave Angel <d@davea.name> |
|---|---|
| Date | 2012-08-17 23:30 -0400 |
| Message-ID | <mailman.3440.1345260650.4697.python-list@python.org> |
| In reply to | #27257 |
On 08/17/2012 08:21 PM, Ian Kelly wrote:
> On Aug 17, 2012 2:58 PM, "Dave Angel" <d@davea.name> wrote:
>> The internal coding described in PEP 393 has nothing to do with latin-1
>> encoding.
> It certainly does. PEP 393 provides for Unicode strings to be represented
> internally as any of Latin-1, UCS-2, or UCS-4, whichever is smallest and
> sufficient to contain the data. I understand the complaint to be that while
> the change is great for strings that happen to fit in Latin-1, it is less
> efficient than previous versions for strings that do not.

That's not the way I interpreted the PEP 393. It takes a pure unicode
string, finds the largest code point in that string, and chooses 1, 2 or
4 bytes for every character, based on how many bits it'd take for that
largest code point. Further I read it to mean that only 00 bytes would
be dropped in the process, no other bytes would be changed. I take it as
a coincidence that it happens to match latin-1; that's the way Unicode
happened historically, and is not Python's fault. Am I reading it wrong?

I also figure this is going to be more space efficient than Python 3.2
for any string which had a max code point of 65535 or less (in Windows),
or 4 billion or less (in real systems). So unless French has code points
over 64k, I can't figure that anything is lost.

I have no idea about the times involved, so I wanted a more specific
complaint.

> I don't know how much merit there is to this claim. It would seem to me
> that even in non-western locales, most strings are likely to be Latin-1 or
> even ASCII, e.g. class and attribute and function names.

The jmfauth rant I was responding to was saying that French isn't
efficiently encoded, and that performance of some vague operations was
somehow reduced by several fold. I was just trying to get him to be
more specific.

--
DaveA
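The rule described in the post above (pick 1, 2 or 4 bytes per character from the largest code point) can be written as a small model. This is only a sketch of the rule as stated in the thread, not CPython's implementation; the function name `pep393_width` is made up for illustration.

```python
def pep393_width(s: str) -> int:
    """Bytes per character implied by the largest code point in s (model only)."""
    top = max(map(ord, s), default=0)
    if top < 0x100:       # fits in one byte: ASCII and the Latin-1 range
        return 1
    if top < 0x10000:     # fits in two bytes: the rest of the BMP
        return 2
    return 4              # anything beyond the BMP


for sample in ("abc", "été", "ab…", "ab\U0001F600"):
    print(repr(sample), pep393_width(sample))
```

Under this model a single '…' (U+2026) is enough to move a whole string from one byte to two bytes per character, which is the case the complaint is about.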
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-08-18 04:10 +0000 |
| Message-ID | <502f15b5$0$29978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #27279 |
On Fri, 17 Aug 2012 23:30:22 -0400, Dave Angel wrote:

> On 08/17/2012 08:21 PM, Ian Kelly wrote:
>> On Aug 17, 2012 2:58 PM, "Dave Angel" <d@davea.name> wrote:
>>> The internal coding described in PEP 393 has nothing to do with
>>> latin-1 encoding.
>> It certainly does. PEP 393 provides for Unicode strings to be
>> represented internally as any of Latin-1, UCS-2, or UCS-4, whichever is
>> smallest and sufficient to contain the data.

Unicode strings are not represented as Latin-1 internally. Latin-1 is a
byte encoding, not a unicode internal format. Perhaps you mean to say
that they are represented as a single byte format?

>> I understand the complaint
>> to be that while the change is great for strings that happen to fit in
>> Latin-1, it is less efficient than previous versions for strings that
>> do not.
>
> That's not the way I interpreted the PEP 393. It takes a pure unicode
> string, finds the largest code point in that string, and chooses 1, 2 or
> 4 bytes for every character, based on how many bits it'd take for that
> largest code point.

That's how I interpret it too.

> Further I read it to mean that only 00 bytes would
> be dropped in the process, no other bytes would be changed.

Just to clarify, you aren't talking about the \0 character, but only about
extraneous "padding" 00 bytes.

> I also figure this is going to be more space efficient than Python 3.2
> for any string which had a max code point of 65535 or less (in Windows),
> or 4 billion or less (in real systems). So unless French has code points
> over 64k, I can't figure that anything is lost.

I think that on narrow builds, it won't make terribly much difference.
The big savings are for wide builds.

--
Steven
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-08-18 09:18 -0600 |
| Message-ID | <mailman.3452.1345303152.4697.python-list@python.org> |
| In reply to | #27281 |
(Resending this to the list because I previously sent it only to
Steven by mistake. Also showing off a case where top-posting is
reasonable, since this bit requires no context. :-)

On Sat, Aug 18, 2012 at 1:41 AM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
>
> On Aug 17, 2012 10:17 PM, "Steven D'Aprano"
> <steve+comp.lang.python@pearwood.info> wrote:
>>
>> Unicode strings are not represented as Latin-1 internally. Latin-1 is a
>> byte encoding, not a unicode internal format. Perhaps you mean to say
>> that they are represented as a single byte format?
>
> They are represented as a single-byte format that happens to be equivalent
> to Latin-1, because Latin-1 is a proper subset of Unicode; every character
> representable in Latin-1 has a byte value equal to its Unicode codepoint.
> This talk of whether it's a byte encoding or a 1-byte Unicode representation
> is then just semantics. Even the PEP refers to the 1-byte representation as
> Latin-1.
>
>>
>> >> I understand the complaint
>> >> to be that while the change is great for strings that happen to fit in
>> >> Latin-1, it is less efficient than previous versions for strings that
>> >> do not.
>> >
>> > That's not the way I interpreted the PEP 393. It takes a pure unicode
>> > string, finds the largest code point in that string, and chooses 1, 2 or
>> > 4 bytes for every character, based on how many bits it'd take for that
>> > largest code point.
>>
>> That's how I interpret it too.
>
> I don't see how this is any different from what I described. Using all 4
> bytes of the code point, you get UCS-4. Truncating to 2 bytes, you get
> UCS-2. Truncating to 1 byte, you get Latin-1.
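The claim that every Latin-1 byte value equals its Unicode code point is easy to check from the interpreter. A minimal sketch, assuming Python 3 and its standard `latin-1` and `cp1252` codecs; the variable names are illustrative only.

```python
# Decode all 256 byte values as Latin-1 and compare each resulting
# character's code point with the byte it came from.
all_bytes = bytes(range(256))
latin1_text = all_bytes.decode("latin-1")
same = all(ord(ch) == b for ch, b in zip(latin1_text, all_bytes))
print(same)

# cp1252 does not have this property: byte 0x85 decodes to U+2026,
# the horizontal ellipsis discussed in this thread.
ellipsis = b"\x85".decode("cp1252")
print(hex(ord(ellipsis)))
```

This is why a 1-byte representation of code points below 256 is byte-for-byte the same as Latin-1, while cp1252 text needs a real decoding step.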
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-08-18 03:59 +0000 |
| Message-ID | <502f1333$0$29978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #27257 |
On Fri, 17 Aug 2012 11:45:02 -0700, wxjmfauth wrote:

> On Friday, 17 August 2012 20:21:34 UTC+2, Jerry Hill wrote:
>> On Fri, Aug 17, 2012 at 1:49 PM, <wxjmfauth@gmail.com> wrote:
>>
>> > The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
>> > is one of these characters existing in the cp1252, mac-roman
>> > coding schemes and not in iso-8859-1 (latin-1) and obviously
>> > not in ascii. It causes Py3.3 to work a few 100% slower
>> > than Py<3.3 versions due to the flexible string representation
>> > (ascii/latin-1/ucs-2/ucs-4) (I found cases up to 1000%).

[...]

> Sorry, you missed the point.
>
> My comment had nothing to do with the code source coding, the coding of
> a Python "string" in the code source or with the display of a Python3
> <str>.
> I wrote about the *internal* Python "coding", the way Python keeps
> "strings" in memory. See PEP 393.

The PEP does not support your claim that flexible string storage is 100%
to 1000% slower. It claims 1% - 30% slowdown, with a saving of up to 60%
of the memory used for strings.

I don't really understand what message you are trying to give here. Are
you saying that PEP 393 is a good thing or a bad thing?

In Python 1.x, there was no support for Unicode at all. You could only
work with pure byte strings. Support for non-ascii characters like … ∞ é
ñ £ π Ж ش was purely by accident -- if your terminal happened to be set
to an encoding that supported a character, and you happened to use the
appropriate byte value, you might see the character you wanted.

In Python 2.2, Python gained support for Unicode. You could now guarantee
support for any Unicode character in the Basic Multilingual Plane (BMP)
by writing your strings using the u"..." style. In Python 3, you no
longer need the leading u, all strings are unicode. But there is a
problem: if your Python interpreter is a "narrow build", it *only*
supports Unicode characters in the BMP. When Python is a "wide build",
compiled with support for the additional character planes, then strings
take much more memory, even if they are in the BMP, or are simple ASCII
strings.

PEP 393 fixes this problem and gets rid of the distinction between narrow
and wide builds. From Python 3.3 onwards, all Python compilers will have
the same support for unicode, rather than most being BMP-only. Each
individual string's internal storage will use only as many
bytes-per-character as needed to store the largest character in the
string. This will save a lot of memory for those using mostly ASCII or
Latin-1 but a few multibyte characters.

While the increased complexity causes a small slowdown, the increased
functionality makes it well worthwhile.

--
Steven
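The end of the narrow/wide distinction can be observed directly. A small sketch, assuming Python 3.3 or later; U+1F600 is just an arbitrary code point outside the BMP chosen for illustration.

```python
import sys

astral = "\U0001F600"   # an arbitrary code point outside the BMP

# From 3.3 onwards every build covers the full Unicode range;
# narrow builds of earlier versions reported 0xFFFF here.
print(hex(sys.maxunicode))

# One character, one element; a narrow build would have reported 2
# because it stored the character as a surrogate pair.
print(len(astral))
print(hex(ord(astral)))
```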
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-08-17 10:49 -0700 |
| Message-ID | <mailman.3421.1345226504.4697.python-list@python.org> |
| In reply to | #27215 |
On Friday, 17 August 2012 01:59:31 UTC+2, Terry Reedy wrote:
> a = '…'
> print(ord(a))
> >>>
> 8230
>
> Most things with unicode are easier in 3.x, and some are even better in
> 3.3. The current beta is good enough for most informal work. 3.3.0 will
> be out in a month.
>
> --
> Terry Jan Reedy
Slightly off topic.
The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
is one of these characters existing in the cp1252, mac-roman
coding schemes and not in iso-8859-1 (latin-1) and obviously
not in ascii. It causes Py3.3 to work a few 100% slower
than Py<3.3 versions due to the flexible string representation
(ascii/latin-1/ucs-2/ucs-4) (I found cases up to 1000%).
>>> '…'.encode('cp1252')
b'\x85'
>>> '…'.encode('mac-roman')
b'\xc9'
>>> '…'.encode('iso-8859-1') # latin-1
Traceback (most recent call last):
File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2026'
in position 0: ordinal not in range(256)
If one could neglect this (typographically important) glyph, what
to say about the characters of the European scripts (languages)
present in cp1252 or in mac-roman but not in latin-1 (eg. the
French script/language)?
Very nice. Python 2 was built for ascii users, now Python 3 is
*optimized* for, let's say, ascii users!
The future is bright for Python. French users are better
served with Apple or MS products, simply because these
corporations know you cannot write French with iso-8859-1.
PS When "TeX" moved from the ascii encoding to iso-8859-1
and the so called Cork encoding, "they" know this and provided
all the complementary packages to circumvent this. It was
in 199? (Python was not even born).
Ditto for the foundries (Adobe, Linotype, ...)
jmf
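Which of these characters fit which legacy encoding can be tabulated with the standard codecs. This is a sketch only; the helper name `encodable` and the choice of sample characters are assumptions for illustration, and it says nothing about speed.

```python
def encodable(ch: str, codec: str) -> bool:
    """Return True if ch can be encoded with the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# A few characters used in French text, plus the euro sign and the ellipsis.
for ch in "éçœŒŸ€…":
    row = {codec: encodable(ch, codec)
           for codec in ("latin-1", "cp1252", "mac-roman")}
    print(ch, hex(ord(ch)), row)
```

Characters such as 'é' and 'ç' are below U+0100 and so stay in the 1-byte form in Python 3.3; 'œ', '€' and '…' are not in Latin-1 and have code points above U+00FF, so a string containing any of them is stored with two bytes per character.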
| From | Alister <alister.ware@ntlworld.com> |
|---|---|
| Date | 2012-08-17 06:30 +0000 |
| Message-ID | <lylXr.960568$gC5.364193@fx10.am4> |
| In reply to | #27204 |
On Thu, 16 Aug 2012 15:09:47 -0700, Charles Jensen wrote:

> Everyone knows that the python command
>
> ord(u'…')
>
> will output the number 8230 which is the unicode character for the
> horizontal ellipsis.
>
> How would I use ord() to find the unicode value of a string stored in a
> variable?
>
> So the following 2 lines of code will give me the ascii value of the
> variable a. How do I specify ord to give me the unicode value of a?
>
> a = '…'
> ord(a)

The same way you did in your original example, by defining the string as
unicode:

a = u'…'
ord(a)

--
Keep on keepin' on.
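For readers on Python 3, where every str is already Unicode, the same lookup works on a plain variable without the u prefix. A minimal sketch using only the standard library:

```python
import unicodedata

a = "…"                       # in Python 3, str is already Unicode
code_point = ord(a)           # ord() takes any one-character string

print(code_point)             # 8230
print(hex(code_point))        # 0x2026
print(unicodedata.name(a))    # HORIZONTAL ELLIPSIS
print(chr(code_point) == a)   # True: chr() is the inverse of ord()
```

In Python 2 the variable has to hold a unicode object (`u'…'`), otherwise `ord()` sees the encoded bytes and fails on anything longer than one byte.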
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-08-18 01:09 -0700 |
| Message-ID | <308df2af-abe7-4043-b199-0a39f440e0ab@googlegroups.com> |
| In reply to | #27204 |
>>> sys.version
'3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]'
>>> timeit.timeit("('ab…' * 1000).replace('…', '……')")
37.32762490493721
timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
0.8158757139801764
>>> sys.version
'3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:02:36) [MSC v.1600 32 bit
(Intel)]'
>>> imeit.timeit("('ab…' * 1000).replace('…', '……')")
61.919225272152346
>>> timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
1.2918679017971044
timeit.timeit("('ab…' * 10).replace('…', '€…')")
1.2484133226156757
* I intuitively and empirically noticed, this happens for
cp1252 or mac-roman characters and not characters which are
elements of the latin-1 coding scheme.
* Bad luck, such characters are usual characters in French scripts
(and in some other European language).
* I do not recall the extreme cases I found. Believe me, when
I'm speaking about a few 100%, I do not lie.
My take on the subject.
This is a typical Python disease. Do not solve a problem, but
find a way, a workaround, which is expected to solve a problem
and which finally solves nothing. As far as I know, to break
the "BMP limit", the tools are here. They are called utf-8 or
ucs-4/utf-32.
One day, I came across a very, very old mail message, dating from
the time of the introduction of the unicode type in Python 2.
If I recall correctly it was from Victor Stinner. He wrote
something like this "Let's go with ucs-4, and the problems
are solved for ever". He was so right.
I have been following the dev-list for years, and my feeling is that
there is always a latent and permanent conflict between
"ascii users" and "non ascii users" (see the unicode
literal reintroduction).
Please, do not get me wrong. As a non-computer scientist,
I'm very happy with Python. But when I try to take a more distant
view, I become more and more sceptical.
PS Py3.3b2 is still crashing, silently exiting, with
cp65001.
jmf
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-08-18 12:27 +0000 |
| Message-ID | <502f8a2a$0$29978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #27288 |
On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote:
>>>> sys.version
> '3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]'
>>>> timeit.timeit("('ab…' * 1000).replace('…', '……')")
> 37.32762490493721
> timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
> 0.8158757139801764
>
>>>> sys.version
> '3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:02:36) [MSC v.1600 32
> bit (Intel)]'
>>>> imeit.timeit("('ab…' * 1000).replace('…', '……')")
> 61.919225272152346
"imeit"?
It is hard to take your results seriously when you have so obviously
edited your timing results, not just copied and pasted them.
Here are my results, on my laptop running Debian Linux. First, testing on
Python 3.2:
steve@runes:~$ python3.2 -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 50.2 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', '……')"
10000 loops, best of 3: 45.3 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', 'x…')"
10000 loops, best of 3: 51.3 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', 'œ…')"
10000 loops, best of 3: 47.6 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', '€…')"
10000 loops, best of 3: 45.9 usec per loop
steve@runes:~$ python3.2 -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 57.5 usec per loop
steve@runes:~$ python3.2 -m timeit "('XYZ' * 1000).replace('Y', 'πЖ')"
10000 loops, best of 3: 49.7 usec per loop
As you can see, the timing results are all consistently around 50
microseconds per loop, regardless of which characters I use, whether they
are in Latin-1 or not. The differences between one test and another are
not meaningful.
Now I do them again using Python 3.3:
steve@runes:~$ python3.3 -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 64.3 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', '……')"
10000 loops, best of 3: 67.8 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', 'x…')"
10000 loops, best of 3: 66 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', 'œ…')"
10000 loops, best of 3: 67.6 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', '€…')"
10000 loops, best of 3: 68.3 usec per loop
steve@runes:~$ python3.3 -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 67.9 usec per loop
steve@runes:~$ python3.3 -m timeit "('XYZ' * 1000).replace('Y', 'πЖ')"
10000 loops, best of 3: 66.9 usec per loop
The results are all consistently around 67 microseconds. So Python's
string handling is about 30% slower in the examples shown here.
If you can consistently replicate a 100% to 1000% slowdown in string
handling, please report it as a performance bug:
http://bugs.python.org/
Don't forget to report your operating system.
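The same comparison can be scripted so that one file can be run unchanged under each interpreter. This is a sketch, not the commands used above: the labels, the loop count and the use of `timeit.repeat` are choices made here, and the numbers will differ by machine and Python version.

```python
import sys
import timeit

# Statements taken from the tests above; the labels are arbitrary.
cases = {
    "ascii": "('abc' * 1000).replace('c', 'de')",
    "ellipsis": "('ab…' * 1000).replace('…', '……')",
    "oe": "('ab…' * 1000).replace('…', 'œ…')",
    "euro": "('ab…' * 1000).replace('…', '€…')",
}

LOOPS = 2000
results = {}
for label, stmt in cases.items():
    # Best of 3 runs, converted to microseconds per loop.
    best = min(timeit.repeat(stmt, repeat=3, number=LOOPS))
    results[label] = best / LOOPS * 1e6

print(sys.version)
for label, usec in results.items():
    print(f"{label:10s} {usec:8.1f} usec per loop")
```

Taking the minimum of several repeats, as `python -m timeit` does, reduces noise from other processes; printing `sys.version` alongside makes pasted results easier to compare.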
> My take on the subject.
>
> This is a typical Python disease. Do not solve a problem, but find a
> way, a workaround, which is expected to solve a problem and which
> finally solves nothing. As far as I know, to break the "BMP limit", the
> tools are here. They are called utf-8 or ucs-4/utf-32.
The problem with UCS-4 is that every character requires four bytes.
Every. Single. One.
So under UCS-4, the pure-ascii string "hello world" takes 44 bytes plus
the object overhead. Under UCS-2, it takes half that space: 22 bytes, but
of course UCS-2 can only represent characters in the BMP. A pure ASCII
string would only take 11 bytes, but we're not going back to pure ASCII.
(There is an extension to UCS-2, UTF-16, which encodes non-BMP characters
using two code points. This is fragile and doesn't work very well,
because string-handling methods can break the surrogate pairs apart,
leaving you with invalid unicode string. Not good.)
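The byte counts quoted above, and the way UTF-16 splits a non-BMP character in two, can be reproduced with the standard codecs. A sketch only: the `-le` codec variants are used so that no byte order mark is counted, and U+1F600 is an arbitrary non-BMP example.

```python
s = "hello world"
print(len(s.encode("utf-32-le")))   # 44: four bytes per character
print(len(s.encode("utf-16-le")))   # 22: two bytes per character
print(len(s.encode("ascii")))       # 11: one byte per character

astral = "\U0001F600"               # one character outside the BMP
units = astral.encode("utf-16-le")
print(len(units) // 2)              # 2: stored as two 16-bit code units

# The two halves are surrogates; neither is a character on its own.
high = int.from_bytes(units[:2], "little")
low = int.from_bytes(units[2:], "little")
print(hex(high), hex(low))
```

Code that slices such a string by 16-bit units can cut between the two halves, which is the fragility described above.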
The difference between 44 bytes and 22 bytes for one little string is not
very important, but when you double the memory required for every single
string it becomes huge. Remember that every class, function and method
has a name, which is a string; every attribute and variable has a name,
all strings; functions and classes have doc strings, all strings. Strings
are used everywhere in Python, and doubling the memory needed by Python
means that it will perform worse.
With PEP 393, each Python string will be stored in the most efficient
format possible:
- if it only contains ASCII characters, it will be stored using 1 byte
per character;
- if it only contains characters in the BMP, it will be stored using
UCS-2 (2 bytes per character);
- if it contains non-BMP characters, the string will be stored using
UCS-4 (4 bytes per character).
--
Steven
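The per-character storage can be measured rather than inferred. The sketch below assumes CPython 3.3 or later (other implementations may store strings differently) and uses a made-up helper, `measured_width`, that compares the sizes of two strings of the same kind so that the fixed object overhead cancels out. Note that per PEP 393 the 1-byte form covers every code point up to U+00FF, not only ASCII.

```python
import sys


def measured_width(ch: str, n: int = 100) -> int:
    """Bytes per character CPython uses for a string made only of ch."""
    s = ch * n
    # Doubling the string adds n characters and no extra header.
    return (sys.getsizeof(s + s) - sys.getsizeof(s)) // n


for ch in ("a", "é", "…", "\U0001F600"):
    print(repr(ch), hex(ord(ch)), measured_width(ch))
```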
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-08-18 08:07 -0700 |
| Message-ID | <d575737d-c1e3-47db-9c7b-10fe0300cba7@googlegroups.com> |
| In reply to | #27291 |
On Saturday, 18 August 2012 14:27:23 UTC+2, Steven D'Aprano wrote:
> [...]
> The problem with UCS-4 is that every character requires four bytes.
> [...]

I'm aware of this (and all the blah blah blah you are
explaining). This is always the same song. Memory.

Let me ask. Is Python an "american" product for us-users
or is it a tool for everybody [*]?
Is there any reason why non ascii users are somehow penalized
compared to ascii users?

This flexible string representation is a regression (ascii users
or not).

I recognize in practice the real impact is for many users
close to zero (including me) but I have shown (I think) that
this flexible representation is, by design, not as optimal
as it is supposed to be. This is in my mind the relevant point.

[*] This is not even true, if we consider the €uro currency
symbol used all around the world (banking, accounting
applications).

jmf
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2012-08-18 16:25 +0100 |
| Message-ID | <mailman.3453.1345303500.4697.python-list@python.org> |
| In reply to | #27296 |
On 18/08/2012 16:07, wxjmfauth@gmail.com wrote:
> On Saturday, 18 August 2012 14:27:23 UTC+2, Steven D'Aprano wrote:
>> [...]
>> The problem with UCS-4 is that every character requires four bytes.
>> [...]
>
> I'm aware of this (and all the blah blah blah you are
> explaining). This is always the same song. Memory.
>
> Let me ask. Is Python an "american" product for us-users
> or is it a tool for everybody [*]?
> Is there any reason why non ascii users are somehow penalized
> compared to ascii users?
>
> This flexible string representation is a regression (ascii users
> or not).
>
> I recognize in practice the real impact is for many users
> close to zero (including me) but I have shown (I think) that
> this flexible representation is, by design, not as optimal
> as it is supposed to be. This is in my mind the relevant point.
>
> [*] This is not even true, if we consider the €uro currency
> symbol used all around the world (banking, accounting
> applications).
>
> jmf
>

Sorry but you've got me completely baffled. Could you please explain in
words of one syllable or less so I can attempt to grasp what the hell
you're on about?

--
Cheers.

Mark Lawrence.
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-08-19 01:36 +1000 |
| Message-ID | <mailman.3454.1345304165.4697.python-list@python.org> |
| In reply to | #27296 |
On Sun, Aug 19, 2012 at 1:07 AM, <wxjmfauth@gmail.com> wrote:
> I'm aware of this (and all the blah blah blah you are
> explaining). This is always the same song. Memory.
>
> Let me ask. Is Python an "american" product for us-users
> or is it a tool for everybody [*]?
> Is there any reason why non ascii users are somehow penalized
> compared to ascii users?

Regardless of your own native language, "len" is the name of a popular
Python function. And "dict" is a well-used class. Both those names are
representable in ASCII, even if every quoted string in your code
requires more bytes to store.

And memory usage has significance in many other areas, too. CPU cache
utilization turns a space saving into a time saving. That's why
structure packing still exists, even though member alignment has other
advantages.

You'd be amazed how many non-USA strings still fit inside seven bits,
too. Are you appending a space to something? Splitting on newlines?
You'll have lots of strings that are now going to be space-optimized.
Of course, the performance gains from shortening some of the strings
may be offset by costs when comparing one-byte and multi-byte strings,
but presumably that's all been gone into in great detail elsewhere.

ChrisA
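The point about names fitting in seven bits can be checked against the interpreter's own vocabulary. A small sketch, assuming Python 3.7 or later for `str.isascii()`; it covers only builtin names and keywords, not user code.

```python
import builtins
import keyword

# Names every program shares regardless of the author's language.
names = list(dir(builtins)) + list(keyword.kwlist)

all_ascii = all(name.isascii() for name in names)
print(len(names), all_ascii)

# Small helper strings produced by everyday operations are ASCII too.
pieces = "première ligne\nseconde ligne".split("\n")
separators = [" ", "\n", ", "]
print(pieces, all(sep.isascii() for sep in separators))
```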