| Started by | Charles Jensen <hopefullycharles@gmail.com> |
|---|---|
| First post | 2012-08-16 15:09 -0700 |
| Last post | 2012-08-20 17:20 -0400 |
| Articles | 20 on this page of 145 — 26 participants |
How do I display unicode value stored in a string variable using ord() Charles Jensen <hopefullycharles@gmail.com> - 2012-08-16 15:09 -0700
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-17 08:20 +1000
Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-16 18:47 -0400
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-16 19:59 -0400
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 10:49 -0700
Re: How do I display unicode value stored in a string variable using ord() Jerry Hill <malaclypse2@gmail.com> - 2012-08-17 14:21 -0400
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 11:45 -0700
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 11:45 -0700
Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-17 16:55 -0400
Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-17 23:30 -0400
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 04:10 +0000
Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-18 09:18 -0600
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 03:59 +0000
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 10:49 -0700
Re: How do I display unicode value stored in a string variable using ord() Alister <alister.ware@ntlworld.com> - 2012-08-17 06:30 +0000
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 01:09 -0700
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 12:27 +0000
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 08:07 -0700
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 16:25 +0100
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 01:36 +1000
Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-18 09:51 -0600
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 09:38 -0700
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 02:57 +1000
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 18:28 +0100
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 11:05 -0700
Re: How do I display unicode value stored in a string variable using ord() MRAB <python@mrabarnett.plus.com> - 2012-08-18 19:34 +0100
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:35 +0000
New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord() Peter Otten <__peter__@web.de> - 2012-08-19 09:43 +0200
Re: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 08:56 +0000
Re: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 02:24 -0700
Re: New internal string format in 3.3 Peter Otten <__peter__@web.de> - 2012-08-19 11:37 +0200
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 03:19 -0700
Re: New internal string format in 3.3 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 13:33 +0000
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 03:19 -0700
Re: New internal string format in 3.3 Chris Angelico <rosuav@gmail.com> - 2012-08-19 20:26 +1000
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:14 -0700
Re: New internal string format in 3.3 Dave Angel <d@davea.name> - 2012-08-19 08:29 -0400
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:59 -0700
Re: New internal string format in 3.3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 14:46 +0100
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 07:09 -0700
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 07:09 -0700
Re: New internal string format in 3.3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 15:48 +0100
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 09:19 -0700
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 09:19 -0700
Re: New internal string format in 3.3 Terry Reedy <tjreedy@udel.edu> - 2012-08-19 13:48 -0400
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 10:51 -0700
Re: New internal string format in 3.3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 19:09 +0100
Re: New internal string format in 3.3 Chris Angelico <rosuav@gmail.com> - 2012-08-20 07:50 +1000
Re: New internal string format in 3.3 Michael Torrie <torriem@gmail.com> - 2012-08-19 23:38 -0600
Re: New internal string format in 3.3 Roy Smith <roy@panix.com> - 2012-08-20 09:17 -0400
Re: New internal string format in 3.3 Michael Torrie <torriem@gmail.com> - 2012-08-20 22:18 -0600
Re: New internal string format in 3.3 Roy Smith <roy@panix.com> - 2012-08-21 07:48 -0400
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 10:51 -0700
Re: New internal string format in 3.3 Terry Reedy <tjreedy@udel.edu> - 2012-08-19 13:56 -0400
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:59 -0700
Re: New internal string format in 3.3 Dave Angel <d@davea.name> - 2012-08-19 08:35 -0400
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:14 -0700
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:30 +0000
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 11:05 -0700
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-18 16:09 -0400
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-18 23:12 -0400
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 09:38 -0700
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:33 +0000
Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-19 11:50 -0600
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 11:20 -0700
Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-19 12:31 -0600
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 12:23 -0700
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 20:16 +0000
Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-19 12:46 -0600
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 17:59 +0000
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 11:30 -0700
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 20:45 +0100
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:13 +0000
Re: How do I display unicode value stored in a string variable using ord() rusi <rustompmody@gmail.com> - 2012-08-18 11:40 -0700
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 20:50 +0100
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 13:22 -0700
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 22:37 +0100
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 11:26 -0700
Re: How do I display unicode value stored in a string variable using ord() MRAB <python@mrabarnett.plus.com> - 2012-08-18 19:59 +0100
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 07:17 +0000
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 10:46 +1000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 19:11 -0700
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 12:19 +1000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 19:35 -0700
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 13:01 +1000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 20:10 -0700
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 13:31 +1000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 22:58 -0700
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 08:01 +0000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 01:11 -0700
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 18:24 +1000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 01:44 -0700
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 01:54 -0700
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 11:46 +0100
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 12:31 -0400
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 10:51 +0000
Re: How do I display unicode value stored in a string variable using ord() Neil Hodgson <nhodgson@iinet.net.au> - 2012-08-21 17:03 +1000
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:09 +0000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 01:04 -0700
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 13:25 +0000
Re: How do I display unicode value stored in a string variable using ord() DJC <djc@news.invalid> - 2012-08-19 17:32 +0200
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 13:34 -0400
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 10:48 -0700
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 11:11 -0700
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 19:50 +0100
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 17:59 -0400
Re: How do I display unicode value stored in a string variable using ord() rusi <rustompmody@gmail.com> - 2012-08-19 23:13 -0700
Abuse of Big Oh notation [was Re: How do I display unicode value stored in a string variable using ord()] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 20:15 +0000
Re: Abuse of Big Oh notation Paul Rubin <no.email@nospam.invalid> - 2012-08-19 16:42 -0700
Re: Abuse of Big Oh notation Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-08-20 09:24 +0100
Re: Abuse of Big Oh notation Paul Rubin <no.email@nospam.invalid> - 2012-08-20 09:01 -0700
Re: Abuse of Big Oh notation Chris Angelico <rosuav@gmail.com> - 2012-08-21 02:09 +1000
Re: Abuse of Big Oh notation Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-20 11:12 -0600
Re: Abuse of Big Oh notation Paul Rubin <no.email@nospam.invalid> - 2012-08-20 12:29 -0700
Re: Abuse of Big Oh notation 88888 Dihedral <dihedral88888@googlemail.com> - 2012-08-20 15:16 -0700
Re: Abuse of Big Oh notation 88888 Dihedral <dihedral88888@googlemail.com> - 2012-08-20 15:20 -0700
Re: Abuse of Big Oh notation Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-21 09:53 +0000
Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-20 11:42 -0700
Re: Abuse of Big Oh notation Ned Deily <nad@acm.org> - 2012-08-20 18:19 -0700
Abuse of subject, was Re: Abuse of Big Oh notation Peter Otten <__peter__@web.de> - 2012-08-21 09:52 +0200
Re: Abuse of subject, was Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-21 10:16 -0700
Re: Abuse of subject, was Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-21 10:16 -0700
Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-20 11:42 -0700
Re: How do I display unicode value stored in a string variable using ord() Hans Mulder <hansmu@xs4all.nl> - 2012-08-22 20:53 +0200
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-20 08:42 +1000
Re: How do I display unicode value stored in a string variable using ord() Roy Smith <roy@panix.com> - 2012-08-19 19:24 -0400
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-20 04:21 +0000
Re: How do I display unicode value stored in a string variable using ord() Roy Smith <roy@panix.com> - 2012-08-20 00:44 -0400
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-20 05:56 +0000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 23:24 -0700
Re: How do I display unicode value stored in a string variable using ord() Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-08-20 12:58 -0400
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 20:35 -0400
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-20 14:07 +1000
Re: How do I display unicode value stored in a string variable using ord() lipska the kat <lipskathekat@yahoo.co.uk> - 2012-08-19 11:13 +0100
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 20:19 +1000
Re: How do I display unicode value stored in a string variable using ord() lipska the kat <lipskathekat@yahoo.co.uk> - 2012-08-19 11:49 +0100
Re: How do I display unicode value stored in a string variable using ord() "Blind Anagram" <noname@nowhere.com> - 2012-08-19 18:03 +0100
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 10:33 -0700
Re: How do I display unicode value stored in a string variable using ord() "Blind Anagram" <noname@nowhere.com> - 2012-08-19 19:04 +0100
Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-19 14:05 -0400
Re: How do I display unicode value stored in a string variable usingord() "Blind Anagram" <noname@nowhere.com> - 2012-08-19 19:18 +0100
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 20:31 +0000
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 17:03 -0400
Re: How do I display unicode value stored in a string variable using ord() 88888 Dihedral <dihedral88888@googlemail.com> - 2012-08-19 17:32 -0700
Re: How do I display unicode value stored in a string variable using ord() Piet van Oostrum <piet@vanoostrum.org> - 2012-08-20 17:20 -0400
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2012-08-18 23:12 -0400 |
| Message-ID | <mailman.3482.1345345997.4697.python-list@python.org> |
| In reply to | #27310 |
On 8/18/2012 4:09 PM, Terry Reedy wrote:
> print(timeit("c in a", "c = '…'; a = 'a'*1000+c"))
> # .6 in 3.2.3, 1.2 in 3.3.0
>
> This does not make sense to me and I will ask about it.
I did ask on the pydev list, and paraphrased responses include:
1. 'My system gives opposite ratios.'
2. 'With a default of 1000000 repetitions in a loop, the reported times
are microseconds per operation and thus not practically significant.'
3. 'There is a stringbench.py with a large number of such micro benchmarks.'
I believe there are also whole-application benchmarks that try to mimic
real-world mixtures of operations.
People making improvements must consider performance on multiple systems
and multiple benchmarks. If someone wants to work on search speed, they
cannot just optimize that one operation on one system.
--
Terry Jan Reedy
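A minimal sketch of the micro-benchmark quoted above, for anyone who wants to repeat it on their own system. The search character U+20AC and the reduced repetition count are assumptions chosen for illustration, not values from the post, and the helper name `time_search` is hypothetical. Absolute timings will differ between machines and Python versions, which is the point of response 1.

```python
from timeit import timeit

# Setup mirrors the quoted snippet: a 1000-character ASCII string with one
# non-ASCII character appended. U+20AC is an arbitrary choice (assumption).
SETUP = "c = '\u20ac'; a = 'a' * 1000 + c"


def time_search(number=10_000):
    """Return the average time in seconds of one `c in a` search."""
    total = timeit("c in a", SETUP, number=number)
    return total / number


per_op = time_search()
print(f"{per_op * 1e6:.3f} microseconds per search")
```

Dividing by `number` makes the per-operation cost explicit; with the default of 1000000 repetitions the total in seconds reads directly as microseconds per operation, as response 2 notes.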
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-08-18 09:38 -0700 |
| Message-ID | <mailman.3459.1345307892.4697.python-list@python.org> |
| In reply to | #27304 |
Sorry guys, I'm not stupid (I think). I can open IDLE with Py 3.2 or
Py 3.3 and compare string manipulations. Py 3.3 is always slower. Period.

Now, the reason. I think it is due to the "flexible representation".

Deeper reason. The "boss" does not wish to hear about a (pure)
ucs-4/utf-32 "engine" (this has been discussed I do not know how many
times).

jmf
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-08-19 06:33 +0000 |
| Message-ID | <503088b7$0$29978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #27304 |
On Sat, 18 Aug 2012 09:51:37 -0600, Ian Kelly wrote about PEP 393:

> The change does not just benefit ASCII users. It primarily benefits
> anybody using a wide unicode build with strings mostly containing only
> BMP characters.

Just to be clear:

If you have many strings which are *mostly* BMP, but have one or two
non-BMP characters in *each* string, you will see no benefit.

But if you have many strings which are all BMP, and only a few strings
containing non-BMP characters, then you will see a big benefit.

> Even for narrow build users, there is the benefit that
> with approximately the same amount of memory usage in most cases, they
> no longer have to worry about non-BMP characters sneaking in and
> breaking their code.

Yes! +1000 on that.

> There is some additional benefit for Latin-1 users, but this has nothing
> to do with Python. If Python is going to have the option of a 1-byte
> representation (and as long as we have the flexible representation, I
> can see no reason not to),

The PEP explicitly states that it only uses a 1-byte format for ASCII
strings, not Latin-1:

"ASCII-only Unicode strings will again use only one byte per character"

and later:

"If the maximum character is less than 128, they use the PyASCIIObject
structure"

and:

"The data and utf8 pointers point to the same memory if the string uses
only ASCII characters (using only Latin-1 is not sufficient)."

> then it is going to be Latin-1 by definition,

Certainly not, either in fact or in principle. There are a large number
of 1-byte encodings, Latin-1 is hardly the only one.

> because that's what 1-byte Unicode (UCS-1, if you will) is. If you have
> an issue with that, take it up with the designers of Unicode.

The designers of Unicode have never created a standard "1-byte Unicode"
or UCS-1, as far as I can determine.

The Unicode standard refers to some multiple million code points, far too
many to fit in a single byte. There is some historical justification for
using "Unicode" to mean UCS-2, but with the standard being extended beyond
the BMP, that is no longer valid.

See http://www.cl.cam.ac.uk/~mgk25/unicode.html for more details.

I think what you are trying to say is that the Unicode designers
deliberately matched the Latin-1 standard for Unicode's first 256 code
points. That's not the same thing though: there is no Unicode standard
mapping to a single byte format.

--
Steven
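The closing point, that Unicode's first 256 code points match Latin-1, can be checked directly. This is only an illustrative sketch: the names `latin1_text`, `first_256` and `fits_latin1` are hypothetical, and it demonstrates the code point correspondence, not any claim about how an interpreter stores strings.

```python
# Decode every possible byte value as Latin-1 ...
latin1_text = bytes(range(256)).decode("latin-1")

# ... and build the string of the first 256 Unicode code points directly.
first_256 = "".join(chr(i) for i in range(256))

# The two strings are identical, so byte value N maps to code point N.
print(latin1_text == first_256)


def fits_latin1(text):
    """Return True if every character in `text` has a Latin-1 byte."""
    try:
        text.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True


print(fits_latin1("\u00e9"))  # U+00E9 is below 256
print(fits_latin1("\u20ac"))  # U+20AC is not
```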
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-08-19 11:50 -0600 |
| Message-ID | <mailman.3513.1345398650.4697.python-list@python.org> |
| In reply to | #27352 |
On Sun, Aug 19, 2012 at 12:33 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Sat, 18 Aug 2012 09:51:37 -0600, Ian Kelly wrote about PEP 393:
>> There is some additional benefit for Latin-1 users, but this has nothing
>> to do with Python. If Python is going to have the option of a 1-byte
>> representation (and as long as we have the flexible representation, I
>> can see no reason not to),
>
> The PEP explicitly states that it only uses a 1-byte format for ASCII
> strings, not Latin-1:
I think you misunderstand the PEP then, because that is empirically false.
Python 3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:23:35) [MSC
v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getsizeof(bytes(range(256)).decode('latin1'))
329
The constructed string contains all 256 Latin-1 characters, so if
Latin-1 strings must be stored in the 2-byte format, then the size
should be at least 512 bytes. It is not, so I think it must be using
the 1-byte encoding.
> "ASCII-only Unicode strings will again use only one byte per character"
This says nothing one way or the other about non-ASCII Latin-1 strings.
> "If the maximum character is less than 128, they use the PyASCIIObject
> structure"
Note that this only describes the structure of "compact" string
objects, which I have to admit I do not fully understand from the PEP.
The wording suggests that it only uses the PyASCIIObject structure,
not the derived structures. It then says that for compact ASCII
strings "the UTF-8 data, the UTF-8 length and the wstr length are the
same as the length of the ASCII data." But these fields are part of
the PyCompactUnicodeObject structure, not the base PyASCIIObject
structure, so they would not exist if only PyASCIIObject were used.
It would also imply that compact non-ASCII strings are stored
internally as UTF-8, which would be surprising.
> and:
>
> "The data and utf8 pointers point to the same memory if the string uses
> only ASCII characters (using only Latin-1 is not sufficient)."
This says that if the data are ASCII, then the 1-byte representation
and the utf8 pointer will share the same memory. It does not imply
that the 1-byte representation is not used for Latin-1, only that it
cannot also share memory with the utf8 pointer.
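The `sys.getsizeof` experiment above can be extended to estimate the storage width per character for different kinds of string. This is a sketch under the assumption of CPython 3.3 or later with PEP 393 strings; `bytes_per_char` is a hypothetical helper, and the sample characters are my own choices. Differencing two sizes cancels the fixed object header, which varies between the internal structures.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate bytes stored per character for strings made only of `ch`.

    Subtracting the size of an n-character string from that of a
    2n-character string removes the fixed per-object overhead.
    """
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n


samples = {
    "ASCII U+0061": "a",
    "Latin-1 U+00E9": "\u00e9",
    "BMP U+20AC": "\u20ac",
    "non-BMP U+1F600": "\U0001F600",
}

for label, ch in samples.items():
    print(label, bytes_per_char(ch))
```

On a PEP 393 build the Latin-1 sample should report the same width as the ASCII one, which is consistent with the 329-byte result for the 256-character string shown above.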
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-08-19 11:20 -0700 |
| Message-ID | <7xobm6u4kk.fsf@ruckus.brouhaha.com> |
| In reply to | #27407 |
Ian Kelly <ian.g.kelly@gmail.com> writes:
> >>> sys.getsizeof(bytes(range(256)).decode('latin1'))
> 329
Please try:
print (type(bytes(range(256)).decode('latin1')))
to make sure that what comes back is actually a unicode string rather
than a byte string.
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-08-19 12:31 -0600 |
| Message-ID | <mailman.3520.1345401102.4697.python-list@python.org> |
| In reply to | #27418 |
On Sun, Aug 19, 2012 at 12:20 PM, Paul Rubin <no.email@nospam.invalid> wrote:
> Ian Kelly <ian.g.kelly@gmail.com> writes:
>> >>> sys.getsizeof(bytes(range(256)).decode('latin1'))
>> 329
>
> Please try:
>
> print (type(bytes(range(256)).decode('latin1')))
>
> to make sure that what comes back is actually a unicode string rather
> than a byte string.
As I understand it, the decode method never returns a byte string in
Python 3, but if you insist:
>>> print (type(bytes(range(256)).decode('latin1')))
<class 'str'>
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-08-19 12:23 -0700 |
| Message-ID | <7xsjbiele3.fsf@ruckus.brouhaha.com> |
| In reply to | #27419 |
Ian Kelly <ian.g.kelly@gmail.com> writes:
> >>> print (type(bytes(range(256)).decode('latin1')))
> <class 'str'>
Thanks.
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-08-19 20:16 +0000 |
| Message-ID | <503149bb$0$29978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #27407 |
On Sun, 19 Aug 2012 11:50:12 -0600, Ian Kelly wrote:

> On Sun, Aug 19, 2012 at 12:33 AM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
[...]
>> The PEP explicitly states that it only uses a 1-byte format for ASCII
>> strings, not Latin-1:
>
> I think you misunderstand the PEP then, because that is empirically
> false.

Yes I did misunderstand. Thank you for the clarification.

--
Steven
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-08-19 12:46 -0600 |
| Message-ID | <mailman.3521.1345402019.4697.python-list@python.org> |
| In reply to | #27352 |
On Sun, Aug 19, 2012 at 11:50 AM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> Note that this only describes the structure of "compact" string
> objects, which I have to admit I do not fully understand from the PEP.
> The wording suggests that it only uses the PyASCIIObject structure,
> not the derived structures. It then says that for compact ASCII
> strings "the UTF-8 data, the UTF-8 length and the wstr length are the
> same as the length of the ASCII data." But these fields are part of
> the PyCompactUnicodeObject structure, not the base PyASCIIObject
> structure, so they would not exist if only PyASCIIObject were used.
> It would also imply that compact non-ASCII strings are stored
> internally as UTF-8, which would be surprising.

Oh, now I get it. I had missed the part where it says "character data
immediately follow the base structure". And the bit about the "UTF-8
data, the UTF-8 length and the wstr length" is not describing the
contents of those fields, but rather where the data can be alternatively
found since the fields don't exist.
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-08-18 17:59 +0000 |
| Message-ID | <502fd7f6$0$29978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #27296 |
On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:

> On Saturday, 18 August 2012 14:27:23 UTC+2, Steven D'Aprano wrote:
>> [...]
>> The problem with UCS-4 is that every character requires four bytes.
>> [...]
>
> I'm aware of this (and all the blah blah blah you are explaining). This
> always the same song. Memory.

Exactly. The reason it is always the same song is because it is an
important song.

> Let me ask. Is Python an 'american" product for us-users or is it a tool
> for everybody [*]?

It is a product for everyone, which is exactly why PEP 393 is so
important. PEP 393 means that users who have only a few non-BMP
characters don't have to pay the cost of UCS-4 for every single string in
their application, only for the ones that actually require it. PEP 393
means that using Unicode strings is now cheaper for everybody.

You seem to be arguing that the way forward is not to make Unicode
cheaper for everyone, but to make ASCII strings more expensive so that
everyone suffers equally. I reject that idea.

> Is there any reason why non ascii users are somehow penalized compared
> to ascii users?

Of course there is a reason.

If you want to represent 1114111 different characters in a string, as
Unicode supports, you can't use a single byte per character, or even two
bytes. That is a fact of basic mathematics. Supporting 1114111 characters
must be more expensive than supporting 128 of them.

But why should you carry the cost of 4-bytes per character just because
someday you *might* need a non-BMP character?

> This flexible string representation is a regression (ascii users or
> not).

No it is not. It is a great step forward to more efficient Unicode.

And it means that now Python can correctly deal with non-BMP characters
without the nonsense of UTF-16 surrogates:

steve@runes:~$ python3.3 -c "print(len(chr(1114000)))" # Right!
1
steve@runes:~$ python3.2 -c "print(len(chr(1114000)))" # Wrong!
2

without doubling the storage of every string.

This is an important step towards making the full range of Unicode
available more widely.

> I recognize in practice the real impact is for many users closed to zero

Then what's the problem?

> (including me) but I have shown (I think) that this flexible
> representation is, by design, not as optimal as it is supposed to be.

You have not shown any real problem at all.

You have shown untrustworthy, edited timing results that don't match what
other people are reporting.

Even if your timing results are genuine, you haven't shown that they make
any difference for real code that does useful work.

--
Steven
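The `len(chr(1114000))` comparison above can be reproduced inside a single interpreter by looking at the UTF-16 encoding of the same character. A minimal sketch, assuming Python 3.3 or later; the variable names are illustrative. The "2" reported by the 3.2 narrow build corresponds to the two UTF-16 code units (a surrogate pair) counted here.

```python
import sys

ch = chr(1114000)  # the code point used in the post, U+10FF90

# From 3.3 on, a string's length counts code points, so this is 1.
code_points = len(ch)

# In UTF-16 the same character needs a surrogate pair: two 16-bit units.
utf16_units = len(ch.encode("utf-16-le")) // 2

print(hex(ord(ch)), code_points, utf16_units)

# The highest code point Python accepts is U+10FFFF (1114111 in decimal).
print(sys.maxunicode == 0x10FFFF)
```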
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-08-18 11:30 -0700 |
| Message-ID | <d09842f1-78b5-4c0b-8c4e-b8523a53c289@googlegroups.com> |
| In reply to | #27319 |
On Saturday, 18 August 2012 19:59:18 UTC+2, Steven D'Aprano wrote:
> On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
>> On Saturday, 18 August 2012 14:27:23 UTC+2, Steven D'Aprano wrote:
>>> [...]
>>> The problem with UCS-4 is that every character requires four bytes.
>>> [...]
>>
>> I'm aware of this (and all the blah blah blah you are explaining). This
>> always the same song. Memory.
>
> Exactly. The reason it is always the same song is because it is an
> important song.

No offense here. But this is an *american* answer.

The same story as the coding of text files, where "utf-8 == ascii"
and the rest of the world doesn't count.

jmf
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2012-08-18 20:45 +0100 |
| Message-ID | <mailman.3470.1345319111.4697.python-list@python.org> |
| In reply to | #27323 |
On 18/08/2012 19:30, wxjmfauth@gmail.com wrote:
> On Saturday, 18 August 2012 19:59:18 UTC+2, Steven D'Aprano wrote:
>> On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
>>
>>> On Saturday, 18 August 2012 14:27:23 UTC+2, Steven D'Aprano wrote:
>>>> [...]
>>>> The problem with UCS-4 is that every character requires four bytes.
>>>> [...]
>>>
>>> I'm aware of this (and all the blah blah blah you are explaining). This
>>> is always the same song. Memory.
>>
>> Exactly. The reason it is always the same song is because it is an
>> important song.
>
> No offense here. But this is an *american* answer.
>
> The same story as the coding of text files, where "utf-8 == ascii"
> and the rest of the world doesn't count.
>
> jmf

Thinking about it I entirely agree with you. Steven D'Aprano strikes me as typically American, in the same way that I'm typically Brazilian :)

--
Cheers.

Mark Lawrence.
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-08-19 06:13 +0000 |
| Message-ID | <50308409$0$29978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #27323 |
On Sat, 18 Aug 2012 11:30:19 -0700, wxjmfauth wrote:

>>> I'm aware of this (and all the blah blah blah you are explaining).
>>> This is always the same song. Memory.
>>
>> Exactly. The reason it is always the same song is because it is an
>> important song.
>
> No offense here. But this is an *american* answer.

I am not American.

I am not aware that computers outside of the USA, and Australia, have unlimited amounts of memory. You must be very lucky.

> The same story as the coding of text files, where "utf-8 == ascii" and
> the rest of the world doesn't count.

UTF-8 is not ASCII.

--
Steven
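The closing remark can be illustrated with a small sketch. The sample strings are arbitrary choices for this example: UTF-8 produces the same bytes as ASCII only for text that is already pure ASCII, and it encodes everything else with multi-byte sequences that the ASCII codec cannot produce at all.

```python
ascii_text = "hello"
accented = "caf\u00e9"  # "café", with one non-ASCII character

# For pure ASCII text the two encodings yield identical bytes.
same_for_ascii = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# UTF-8 can encode the accented word; the e-acute takes two bytes.
utf8_bytes = accented.encode("utf-8")

# The ASCII codec cannot encode it at all.
try:
    accented.encode("ascii")
    ascii_can_encode = True
except UnicodeEncodeError:
    ascii_can_encode = False

print(same_for_ascii, len(accented), len(utf8_bytes), ascii_can_encode)
```

So ASCII is a subset of UTF-8, not the same thing: four characters become five bytes here, and the ASCII codec rejects the string outright.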
| From | rusi <rustompmody@gmail.com> |
|---|---|
| Date | 2012-08-18 11:40 -0700 |
| Message-ID | <e19b8f04-05f0-43d2-8983-30622513fab3@j9g2000pbg.googlegroups.com> |
| In reply to | #27319 |
On Aug 18, 10:59 pm, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
>> Is there any reason why non ascii users are somehow penalized compared
>> to ascii users?
>
> Of course there is a reason.
>
> If you want to represent 1114111 different characters in a string, as
> Unicode supports, you can't use a single byte per character, or even two
> bytes. That is a fact of basic mathematics. Supporting 1114111 characters
> must be more expensive than supporting 128 of them.
>
> But why should you carry the cost of 4-bytes per character just because
> someday you *might* need a non-BMP character?

I am reminded of: http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605

Original above does not open for me but here's a copy that does:

http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2012-08-18 20:50 +0100 |
| Message-ID | <mailman.3471.1345319708.4697.python-list@python.org> |
| In reply to | #27325 |
On 18/08/2012 19:40, rusi wrote:
> On Aug 18, 10:59 pm, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
>> On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
>>> Is there any reason why non ascii users are somehow penalized compared
>>> to ascii users?
>>
>> Of course there is a reason.
>>
>> If you want to represent 1114111 different characters in a string, as
>> Unicode supports, you can't use a single byte per character, or even two
>> bytes. That is a fact of basic mathematics. Supporting 1114111 characters
>> must be more expensive than supporting 128 of them.
>>
>> But why should you carry the cost of 4-bytes per character just because
>> someday you *might* need a non-BMP character?
>
> I am reminded of: http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605
>
> Original above does not open for me but here's a copy that does:
>
> http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html

ROFLMAO doesn't adequately sum up how much I laughed.

--
Cheers.

Mark Lawrence.
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-08-18 13:22 -0700 |
| Message-ID | <3e235732-39e4-4877-a860-466e433cde5e@googlegroups.com> |
| In reply to | #27325 |
On Saturday, 18 August 2012 20:40:23 UTC+2, rusi wrote:
> On Aug 18, 10:59 pm, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
>> On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
>>> Is there any reason why non ascii users are somehow penalized compared
>>> to ascii users?
>>
>> Of course there is a reason.
>>
>> If you want to represent 1114111 different characters in a string, as
>> Unicode supports, you can't use a single byte per character, or even two
>> bytes. That is a fact of basic mathematics. Supporting 1114111 characters
>> must be more expensive than supporting 128 of them.
>>
>> But why should you carry the cost of 4-bytes per character just because
>> someday you *might* need a non-BMP character?
>
> I am reminded of: http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605
>
> Original above does not open for me but here's a copy that does:
>
> http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html

I think it's time to leave the discussion and to go to bed.

You can take the problem the way you wish, Python 3.3 is "slower" than Python 3.2.

If you see the present status as an optimisation, I'm considering this as a regression.

I'm pretty sure a pure ucs-4/utf-32 can only be, by nature, the correct solution.

To be extreme, tools using pure utf-16 or utf-32 are, at least, considering all the citizens on this planet in the same way.

jmf
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2012-08-18 22:37 +0100 |
| Message-ID | <mailman.3475.1345325786.4697.python-list@python.org> |
| In reply to | #27331 |
On 18/08/2012 21:22, wxjmfauth@gmail.com wrote:
> On Saturday, 18 August 2012 20:40:23 UTC+2, rusi wrote:
>> On Aug 18, 10:59 pm, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
>>> On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
>>>> Is there any reason why non ascii users are somehow penalized compared
>>>> to ascii users?
>>>
>>> Of course there is a reason.
>>>
>>> If you want to represent 1114111 different characters in a string, as
>>> Unicode supports, you can't use a single byte per character, or even two
>>> bytes. That is a fact of basic mathematics. Supporting 1114111 characters
>>> must be more expensive than supporting 128 of them.
>>>
>>> But why should you carry the cost of 4-bytes per character just because
>>> someday you *might* need a non-BMP character?
>>
>> I am reminded of: http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605
>>
>> Original above does not open for me but here's a copy that does:
>>
>> http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html
>
> I think it's time to leave the discussion and to go to bed.

In plain English, duck out cos I'm losing.

> You can take the problem the way you wish, Python 3.3 is "slower"
> than Python 3.2.

I'll ask for the second time. Provide proof that is acceptable to everybody and not just yourself.

> If you see the present status as an optimisation, I'm considering
> this as a regression.

Considering does not equate to proof. Where are the figures which back up your claim?

> I'm pretty sure a pure ucs-4/utf-32 can only be, by nature,
> the correct solution.

I look forward to seeing your patch on the bug tracker. If and only if you can find something that needs patching, which from the course of this thread I think is highly unlikely.

> To be extreme, tools using pure utf-16 or utf-32 are, at least,
> considering all the citizens on this planet in the same way.
>
> jmf

--
Cheers.

Mark Lawrence.
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-08-18 11:26 -0700 |
| Message-ID | <7xehn4vyya.fsf@ruckus.brouhaha.com> |
| In reply to | #27291 |
Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
> (There is an extension to UCS-2, UTF-16, which encodes non-BMP characters
> using two code points. This is fragile and doesn't work very well,
> because string-handling methods can break the surrogate pairs apart,
> leaving you with invalid unicode string. Not good.)
...
> With PEP 393, each Python string will be stored in the most efficient
> format possible:

Can you explain the issue of "breaking surrogate pairs apart" a little more? Switching between encodings based on the string contents seems silly at first glance. Strings are immutable so I don't understand why not use UTF-8 or UTF-16 for everything. UTF-8 is more efficient in Latin-based alphabets and UTF-16 may be more efficient for some other languages. I think even UCS-4 doesn't completely fix the surrogate pair issue if it means the only thing I can think of.
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2012-08-18 19:59 +0100 |
| Message-ID | <mailman.3469.1345316373.4697.python-list@python.org> |
| In reply to | #27322 |
On 18/08/2012 19:26, Paul Rubin wrote:
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
>> (There is an extension to UCS-2, UTF-16, which encodes non-BMP characters
>> using two code points. This is fragile and doesn't work very well,
>> because string-handling methods can break the surrogate pairs apart,
>> leaving you with invalid unicode string. Not good.)
> ...
>> With PEP 393, each Python string will be stored in the most efficient
>> format possible:
>
> Can you explain the issue of "breaking surrogate pairs apart" a little
> more? Switching between encodings based on the string contents seems
> silly at first glance. Strings are immutable so I don't understand why
> not use UTF-8 or UTF-16 for everything. UTF-8 is more efficient in
> Latin-based alphabets and UTF-16 may be more efficient for some other
> languages. I think even UCS-4 doesn't completely fix the surrogate pair
> issue if it means the only thing I can think of.

On a narrow build, codepoints outside the BMP are stored as a surrogate pair (2 codepoints). On a wide build, all codepoints can be represented without the need for surrogate pairs.

The problem with strings containing surrogate pairs is that you could inadvertently slice the string in the middle of the surrogate pair.
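The slicing hazard described above can be simulated on a modern Python. This is a sketch only: Python 3.3 and later no longer store surrogate pairs internally, so the narrow-build layout is imitated here with explicit UTF-16 code units. The helpers `utf16_units` and `units_to_str` are hypothetical names invented for this illustration.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def units_to_str(units):
    """Decode a list of UTF-16 code units back into a str."""
    data = struct.pack("<%dH" % len(units), *units)
    return data.decode("utf-16-le")


text = "a" + chr(0x1F600) + "b"   # three characters, one outside the BMP
units = utf16_units(text)          # four code units, as on a narrow build

# A naive slice after the second unit lands inside the surrogate pair.
broken = units[:2]

try:
    units_to_str(broken)
    slice_is_valid = True
except UnicodeDecodeError:
    slice_is_valid = False

print(len(text), len(units), [hex(u) for u in units], slice_is_valid)
```

The slice keeps the high surrogate but drops its partner, so the result is no longer valid UTF-16 and cannot be decoded.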
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-08-19 07:17 +0000 |
| Message-ID | <503092f6$0$29978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #27326 |
On Sat, 18 Aug 2012 19:59:32 +0100, MRAB wrote:

> The problem with strings containing surrogate pairs is that you could
> inadvertently slice the string in the middle of the surrogate pair.

That's the *least* of the problems with surrogate pairs. That would be easy to fix: check the point of the slice, and back up or forward if you're on a surrogate pair. But that's not good enough, because the surrogates could be anywhere in the string. You have to touch every single character in order to know how many there are.

The problem with surrogate pairs is that they make basic string operations O(N) instead of O(1).

--
Steven
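A minimal sketch of why indexing becomes O(N) once surrogate pairs are allowed. It assumes the string is held as a plain list of UTF-16 code units; `code_point_at` is a hypothetical helper for this illustration, not how any Python build actually implements indexing.

```python
def code_point_at(units, index):
    """Return the code point at position index in a list of UTF-16 units.

    Every unit before the target has to be inspected, because any of
    them might be the first half of a surrogate pair.
    """
    position = 0   # offset into the list of code units
    seen = 0       # number of whole code points passed so far
    while position < len(units):
        unit = units[position]
        is_pair = (
            0xD800 <= unit <= 0xDBFF
            and position + 1 < len(units)
            and 0xDC00 <= units[position + 1] <= 0xDFFF
        )
        if is_pair:
            low = units[position + 1]
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            step = 2
        else:
            code_point = unit
            step = 1
        if seen == index:
            return code_point
        seen += 1
        position += step
    raise IndexError("code point index out of range")


# "a", U+1F600, "b" as UTF-16 code units.
units = [0x61, 0xD83D, 0xDE00, 0x62]

naive = units[2]                    # direct O(1) access hits a lone surrogate
correct = code_point_at(units, 2)   # scanning finds the real third character

print(hex(naive), hex(correct))
```

Direct indexing into the units is constant time but returns half of a pair, while the correct answer needs a scan from the start of the string. A fixed-width representation avoids the scan entirely.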