Groups > comp.lang.python > #27843 > unrolled thread
| Started by | Antoine Pitrou <solipsis@pitrou.net> |
|---|---|
| First post | 2012-08-25 00:24 +0000 |
| Last post | 2012-08-25 07:23 -0400 |
| Articles | 20 on this page of 83 — 18 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: Flexible string representation, unicode, typography, ... Antoine Pitrou <solipsis@pitrou.net> - 2012-08-25 00:24 +0000
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 00:27 -0700
Re: Flexible string representation, unicode, typography, ... Ben Finney <ben+python@benfinney.id.au> - 2012-08-25 17:54 +1000
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 00:27 -0700
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-25 09:58 +0100
Re: Flexible string representation, unicode, typography, ... Frank Millman <frank@chagford.com> - 2012-08-25 11:46 +0200
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 08:47 -0700
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 08:47 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-25 16:26 -0600
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 23:59 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 09:50 -0600
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 23:59 -0700
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 11:49 +0000
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 09:40 -0600
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 20:13 +0000
Re: Flexible string representation, unicode, typography, ... Dan Sommers <dan@tombstonezero.net> - 2012-08-26 13:45 -0700
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 12:16 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-27 14:14 -0600
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 13:37 -0700
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:38 -0700
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:38 -0700
Re: Flexible string representation, unicode, typography, ... Neil Hodgson <nhodgson@iinet.net.au> - 2012-08-28 09:54 +1000
Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-29 13:59 +1000
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-28 22:15 -0600
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-29 08:05 +0000
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:40 -0700
Re: Flexible string representation, unicode, typography, ... Dave Angel <d@davea.name> - 2012-08-29 08:01 -0400
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 08:43 -0700
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-30 06:55 +0000
Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-30 18:59 +1000
Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-08-30 07:02 -0400
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-30 16:00 +0000
Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-08-30 16:44 -0400
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-31 12:32 +0000
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-31 09:13 -0600
Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-08-31 08:43 -0400
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-31 14:54 +0000
Re: Flexible string representation, unicode, typography, ... Antoine Pitrou <solipsis@pitrou.net> - 2012-08-30 15:01 +0000
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 00:36 -0700
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 09:58 +0100
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-02 03:06 -0600
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 11:58 -0700
Re: Flexible string representation, unicode, typography, ... Michael Torrie <torriem@gmail.com> - 2012-09-02 13:45 -0600
Re: Flexible string representation, unicode, typography, ... Dave Angel <d@davea.name> - 2012-09-02 16:07 -0400
Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-02 16:38 -0400
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-03 01:42 +0000
Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:26 +0300
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-04 00:53 +0000
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 11:58 -0700
Re: Flexible string representation, unicode, typography, ... Peter Otten <__peter__@web.de> - 2012-09-02 11:52 +0200
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 11:36 +0100
Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-02 15:00 +0300
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 22:39 -0700
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-03 07:11 +0100
Re: Flexible string representation, unicode, typography, ... Peter Otten <__peter__@web.de> - 2012-09-03 08:15 +0200
Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-03 04:38 -0400
Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:56 +0300
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 22:39 -0700
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 13:23 +0100
Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-09-02 08:35 -0400
Re: Flexible string representation, unicode, typography, ... Ramchandra Apte <maniandram01@gmail.com> - 2012-09-02 06:48 -0700
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 15:46 +0100
Re: Flexible string representation, unicode, typography, ... Ramchandra Apte <maniandram01@gmail.com> - 2012-09-02 06:48 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-03 12:33 -0600
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 00:36 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-30 10:27 -0600
Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-02 23:38 +0300
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-03 01:54 +0000
Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-02 22:33 -0400
Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-09-03 11:24 -0400
Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:41 +0300
Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 00:45 +0300
Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-30 01:54 +1000
Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-29 22:34 +1000
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:40 -0700
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 12:16 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 15:42 -0600
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 23:31 +0000
Re: Flexible string representation, unicode, typography, ... Paul Rubin <no.email@nospam.invalid> - 2012-08-26 17:47 -0700
Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-25 21:04 +1000
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-25 12:05 +0100
Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-25 21:19 +1000
Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-08-25 07:23 -0400
Page 4 of 5 — ← Prev page 1 2 3 [4] 5 Next page →
| From | Ramchandra Apte <maniandram01@gmail.com> |
|---|---|
| Date | 2012-09-02 06:48 -0700 |
| Message-ID | <5c453ede-33dd-4b7f-aa53-9424224ec6c7@googlegroups.com> |
| In reply to | #28268 |
On Sunday, 2 September 2012 17:53:16 UTC+5:30, Mark Lawrence wrote:
> On 02/09/2012 13:00, Serhiy Storchaka wrote:
> > On 02.09.12 12:52, Peter Otten wrote:
> >> Ian Kelly wrote:
> >>> Rewriting the example to use locale.strcoll instead:
> >>>
> >>> >>> sorted(li, key=functools.cmp_to_key(locale.strcoll))
> >>
> >> There is also locale.strxfrm() which you can use directly:
> >>
> >> sorted(li, key=locale.strxfrm)
> >
> > Hmm, and with locale.strxfrm Python 3.3 20% slower than 3.2.
>
> That's it then I'm giving up with Python. In future I'll be writing
> everything in machine code to ensure that I get the fastest possible run
> times.
>
> --
> Cheers.
>
> Mark Lawrence.

please make it *heavily optimized* machine code
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2012-09-02 15:46 +0100 |
| Message-ID | <mailman.90.1346597079.27098.python-list@python.org> |
| In reply to | #28272 |
On 02/09/2012 14:48, Ramchandra Apte wrote:
> please make it *heavily optimized* machine code

Goes without saying. First thing I'll concentrate on is removing superfluous newlines sent by crappy mail clients or similar.

--
Cheers.

Mark Lawrence.
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-09-03 12:33 -0600 |
| Message-ID | <mailman.153.1346697242.27098.python-list@python.org> |
| In reply to | #28245 |
On Sun, Sep 2, 2012 at 6:00 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
> On 02.09.12 12:52, Peter Otten wrote:
>> Ian Kelly wrote:
>>> Rewriting the example to use locale.strcoll instead:
>>>
>>> >>> sorted(li, key=functools.cmp_to_key(locale.strcoll))
>>
>> There is also locale.strxfrm() which you can use directly:
>>
>> sorted(li, key=locale.strxfrm)
>
> Hmm, and with locale.strxfrm Python 3.3 20% slower than 3.2.

Doh! In Python 3.3, strcoll and strxfrm are the same speed, so I guess that the actual optimization I'm seeing here is that in Python 3.3, cmp_to_key(strcoll) has been optimized to return strxfrm.
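For readers following along, the two spellings discussed above can be compared side by side. This is only a sketch: the word list is made up, and it assumes the portable "C" locale with ASCII-only words so that the result is the same everywhere. A real French sort would need a locale such as fr_FR.UTF-8 to be installed and selected.

```python
import functools
import locale

# Assumption: the "C" locale, which every platform provides. It sorts by
# code point, so it is NOT a linguistically correct French collation.
locale.setlocale(locale.LC_COLLATE, "C")

# Hypothetical ASCII-only sample data.
words = ["noir", "noduleux", "noeud", "nocturne"]

# Pairwise comparison function wrapped into a key function.
by_strcoll = sorted(words, key=functools.cmp_to_key(locale.strcoll))

# Each word is transformed once into a string that compares in locale order.
by_strxfrm = sorted(words, key=locale.strxfrm)

print(by_strcoll)
print(by_strxfrm)
```

Both calls should produce the same ordering; they differ only in how often the locale machinery is invoked, which is what the timing remarks above are about.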
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-09-02 00:36 -0700 |
| Message-ID | <mailman.63.1346571419.27098.python-list@python.org> |
| In reply to | #28126 |
On Thursday 30 August 2012 17:01:50 UTC+2, Antoine Pitrou wrote:
> I honestly suggest you shut up until you have a clue.

Sorry Antoine, I do not have the knowledge to dive into the Python code, but I know what a character is.

The coding of characters is a domain per se, independent of the OS and of computer languages.

Before spending time implementing a new algorithm, maybe it is better to ask whether there is something better than the current schemes.

I still remember my thoughts when I read the PEP 393 discussion: "this is not logical", "they do not understand typography", "atomic character ???", ...

Real world examples.

>>> import libfrancais
>>> li = ['noël', 'noir', 'nœud', 'noduleux', \
...       'noétique', 'noèse', 'noirâtre']
>>> r = libfrancais.sortfr(li)
>>> r
['noduleux', 'noël', 'noèse', 'noétique', 'nœud', 'noir', 'noirâtre']

(cf "Le Petit Robert")

or

The *letters* satisfying the requirements of the "Imprimerie nationale".

jmf
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-08-30 10:27 -0600 |
| Message-ID | <mailman.3976.1346344057.4697.python-list@python.org> |
| In reply to | #28092 |
On Thu, Aug 30, 2012 at 2:51 AM, <wxjmfauth@gmail.com> wrote:
> But as soon as you introduce artificially a "latin-1"
> bottleneck, all this machinery just become useless.

How is this a bottleneck? If you removed the Latin-1 encoding altogether and limited the flexible representation to just UCS-2 / UCS-4, I doubt very much that you would see any significant speed gains. The flexibility is the part that makes string creation slower, not the Latin-1 option in particular.

> This flexible representation is working absurdly.
> It optimizes the characters you are not using (in one
> sense), it defaults to a non optimized form for the
> characters you wish to use.

I'm sure that if you wanted to you could patch Python to use Latin-9 instead. Just be prepared for it to be slower than UCS-2, since it would mean having to encode the code points rather than merely truncating them.

> Pick up a random text and see the probability this
> text match the most optimized case 1 char / 1 byte,
> practically never.

Pick up a random text and see that this text matches the next most optimized case, 1 char / 2 bytes: practically always.

> If a user will use exclusively latin-1, she/he is better
> served by using a dedicated tool for "latin-1"

Speaking as a user who almost exclusively uses Latin-1, I strongly disagree. What you're describing is Python 2.x. The user is almost always better served by not having to worry about the full extent of the character set their program might use. That's why we moved to Unicode strings in Python 3 in the first place.

> If a user will comfortably work with Unicode, she/he is
> better served by using one of this tools which is using
> properly one of the available Unicode schemes.
>
> In a funny way, this is what Python was doing and it
> performs better!

Seriously, please show us just one *real world* benchmark in which Python 3.3 performs demonstrably worse than Python 3.2. All you've shown so far is this one microbenchmark of string creation that is utterly irrelevant to actual programs.
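The "encode rather than merely truncate" remark above can be illustrated with the standard codecs. This is a sketch of the idea only, not of CPython internals; the helper name is_plain_truncation is made up for the illustration. In Latin-1 every byte value equals the code point, while in Latin-9 (ISO 8859-15) a few characters such as the euro sign need a table lookup.

```python
def is_plain_truncation(ch: str, codec: str) -> bool:
    """Return True if the codec stores ch as one byte equal to its code point."""
    encoded = ch.encode(codec)
    return len(encoded) == 1 and encoded[0] == ord(ch)


# Latin-1: the byte value is the code point itself (U+00E9 -> 0xE9).
print(is_plain_truncation("é", "latin-1"))

# Latin-9: the euro sign is U+20AC but is stored as byte 0xA4,
# so a mapping step is needed in both directions.
print(is_plain_truncation("€", "iso8859-15"))
print(hex(ord("€")), "€".encode("iso8859-15"))
```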
| From | Serhiy Storchaka <storchaka@gmail.com> |
|---|---|
| Date | 2012-09-02 23:38 +0300 |
| Message-ID | <mailman.115.1346618346.27098.python-list@python.org> |
| In reply to | #28092 |
On 30.08.12 09:55, Steven D'Aprano wrote:
> And Python's solution uses those: UCS-2, UCS-4, and UTF-8.

I see that this misconception widely spread. In fact Python 3.3 uses four kinds of ready strings.

* ASCII. All codes <= U+007F.
* UCS1. All codes <= U+00FF, at least one code > U+007F.
* UCS2. All codes <= U+FFFF, at least one code > U+00FF.
* UCS4. All codes <= U+0010FFFF, at least one code > U+FFFF.

Indexing is O(0) for any string.

Also the string can optionally cache UTF-8 and wchar_t* representation.
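The four kinds listed above depend only on the highest code point in the string. A minimal pure-Python sketch of that rule follows; the function name storage_kind is hypothetical and this mirrors the description in the post, not the actual C code.

```python
def storage_kind(s: str) -> str:
    """Name the kind of string the rule above would select for s."""
    # The empty string has no code points; treat it as ASCII.
    highest = max(map(ord, s), default=0)
    if highest <= 0x7F:
        return "ASCII"
    if highest <= 0xFF:
        return "UCS1"
    if highest <= 0xFFFF:
        return "UCS2"
    return "UCS4"


for word in ["noir", "noël", "nœud", "\U0001F600"]:
    print(repr(word), storage_kind(word))
```

Note that a single character outside the current range moves the whole string to the wider kind, which is why 'nœud' lands in UCS2 while 'noël' stays in UCS1.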
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-09-03 01:54 +0000 |
| Message-ID | <50440de2$0$29967$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #28317 |
On Sun, 02 Sep 2012 23:38:49 +0300, Serhiy Storchaka wrote:
> On 30.08.12 09:55, Steven D'Aprano wrote:
>> And Python's solution uses those: UCS-2, UCS-4, and UTF-8.
>
> I see that this misconception widely spread.

I am not familiar enough with the C implementation to tell what Python 3.3 actually does, and the PEP assumes a fair amount of familiarity with the CPython source. So I welcome corrections.

> In fact Python 3.3 uses four kinds of ready strings.
>
> * ASCII. All codes <= U+007F.
> * UCS1. All codes <= U+00FF, at least one code > U+007F.
> * UCS2. All codes <= U+FFFF, at least one code > U+00FF.
> * UCS4. All codes <= U+0010FFFF, at least one code > U+FFFF.

Where UCS1 is equivalent to Latin-1, correct?

UCS2 is what Python 3.2 narrow builds uses for all strings, including codes > U+FFFF using surrogate pairs.

UCS4 is what Python 3.2 wide builds uses for all strings.

This means that Python 3.3 will no longer have surrogate pairs.

Am I right?

> Indexing is O(0) for any string.

I think you mean O(1) for constant-time lookups.

> Also the string can optionally cache UTF-8 and wchar_t* representation.

Right, that's the bit that wasn't clear -- the UTF-8 data is a cache, not the canonical representation.

--
Steven
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2012-09-02 22:33 -0400 |
| Message-ID | <mailman.124.1346639651.27098.python-list@python.org> |
| In reply to | #28333 |
On 9/2/2012 9:54 PM, Steven D'Aprano wrote:
> On Sun, 02 Sep 2012 23:38:49 +0300, Serhiy Storchaka wrote:
>
>> On 30.08.12 09:55, Steven D'Aprano wrote:
>>> And Python's solution uses those: UCS-2, UCS-4, and UTF-8.
>>
>> I see that this misconception widely spread.
>
> I am not familiar enough with the C implementation to tell what Python
> 3.3 actually does, and the PEP assumes a fair amount of familiarity with
> the CPython source. So I welcome corrections.
>
>> In fact Python 3.3 uses four kinds of ready strings.
>>
>> * ASCII. All codes <= U+007F.
>> * UCS1. All codes <= U+00FF, at least one code > U+007F.
>> * UCS2. All codes <= U+FFFF, at least one code > U+00FF.
>> * UCS4. All codes <= U+0010FFFF, at least one code > U+FFFF.
>
> Where UCS1 is equivalent to Latin-1, correct?
>
> UCS2 is what Python 3.2 narrow builds uses for all strings, including
> codes > U+FFFF using surrogate pairs.
>
> UCS4 is what Python 3.2 wide builds uses for all strings.
>
> This means that Python 3.3 will no longer have surrogate pairs.

Basically, yes. I believe CPython will only use surrogate code points if one requests errors='surrogateescape' on decoding or explicitly puts them in a literal (\unnnn or \Ummmmmmmm). The consequences fall under the 'consenting adults' policy.

--
Terry Jan Reedy
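The two ways of ending up with surrogate code points mentioned above can be tried directly. The sketch below uses made-up sample bytes and a small hypothetical helper, can_encode_strictly; it assumes Python 3.3 or later, where a code point above U+FFFF is a single item in the string.

```python
# A byte that is not valid UTF-8 survives decoding as a lone surrogate
# when the 'surrogateescape' error handler is requested.
raw = b"caf\xff"
text = raw.decode("utf-8", errors="surrogateescape")
restored = text.encode("utf-8", errors="surrogateescape")

# A code point above U+FFFF is one item, not a surrogate pair.
astral = "\U0001F600"


def can_encode_strictly(s: str) -> bool:
    """Return True if s can be encoded to UTF-8 with the default strict handler."""
    try:
        s.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


print(ascii(text), restored == raw)
print(len(astral))
# A surrogate written explicitly in a literal is accepted as a string,
# but strict encoding refuses it.
print(can_encode_strictly("\ud800"))
```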
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2012-09-03 11:24 -0400 |
| Message-ID | <roy-4C0CCA.11245603092012@news.panix.com> |
| In reply to | #28333 |
In article <50440de2$0$29967$c3e8da3$5496439d@news.astraweb.com>, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> > Indexing is O(0) for any string.
>
> I think you mean O(1) for constant-time lookups.

Why settle for constant-time, when you can have zero-time instead :-)
| From | Serhiy Storchaka <storchaka@gmail.com> |
|---|---|
| Date | 2012-09-03 18:41 +0300 |
| Message-ID | <mailman.149.1346686927.27098.python-list@python.org> |
| In reply to | #28333 |
On 03.09.12 04:54, Steven D'Aprano wrote:
> This means that Python 3.3 will no longer have surrogate pairs.
>
> Am I right?

As Terry said, basically, yes. Python 3.3 does not need surrogate pairs, but does not prevent their creation. You can create a surrogate code (U+D800..U+DFFF) intentionally (as you can create a single accent modifier or other senseless lone charcode), but it is less likely that you will get them unintentionally.
| From | Serhiy Storchaka <storchaka@gmail.com> |
|---|---|
| Date | 2012-09-03 00:45 +0300 |
| Message-ID | <mailman.117.1346622353.27098.python-list@python.org> |
| In reply to | #28092 |
On 02.09.12 23:38, Serhiy Storchaka wrote:
> Indexing is O(0) for any string.

Typo. O(1)
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-08-30 01:54 +1000 |
| Message-ID | <mailman.3939.1346255667.4697.python-list@python.org> |
| In reply to | #28059 |
On Thu, Aug 30, 2012 at 1:43 AM, <wxjmfauth@gmail.com> wrote:
> If "Python" has found a new way to cover the set
> of the Unicode characters, why not proposing it
> to the Unicode consortium?

Python's open source. If some other language wants to borrow the idea, they can look at the code, or alternatively, just read PEP 393 and implement something similar. It's a free world.

By the way, can you please trim the quoted text in your replies? It's rather lengthy.

ChrisA
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-08-29 22:34 +1000 |
| Message-ID | <mailman.3930.1346243680.4697.python-list@python.org> |
| In reply to | #28055 |
On Wed, Aug 29, 2012 at 9:40 PM, <wxjmfauth@gmail.com> wrote:
> For a given coding scheme, all code points/characters are
> equivalent. Expecting to handle a sub-range in a coding
> scheme without shaking that coding scheme is impossible.

Not all codepoints are equally likely. That's the whole point behind variable-length encodings like Huffman compression (eg deflation as used in zip/gzip), UTF-8, quoted-printable, and Morse code. They handle a sub-range efficiently and the rest of the range less efficiently.

> If a coding scheme does not give satisfaction, the only
> valid solution is to create a new coding scheme, cp1252,
> mac-roman, EBCDIC, ... or the interesting "TeX" case, where
> the "internal" coding depends on the fonts!

http://xkcd.com/927/

> This "Flexible String Representation" fails. Not only
> it is unable to stick with a coding scheme, it is
> a mixing of coding schemes, the worst of all possible
> implementations.

I propose, then, that we abolish files. Who *knows* how many different things might be represented in a file! We need a single coding scheme that can handle everything, without changing representation. This ridiculous state of affairs must not go on; the same representation can be used for bitmapped images or raw audio data!

ChrisA
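The point about variable-length encodings treating sub-ranges differently is easy to see with UTF-8 itself. A small sketch, with sample characters chosen only for illustration and a hypothetical helper name:

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 needs for a single code point."""
    return len(ch.encode("utf-8"))


# One sample per UTF-8 length class: ASCII, Latin-1 range,
# rest of the Basic Multilingual Plane, and beyond U+FFFF.
for ch in ["a", "é", "€", "\U0001F600"]:
    print("U+%04X" % ord(ch), utf8_length(ch))
```

The common low range costs one byte per character and the rare high range costs four, which is the trade-off described above.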
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-08-29 04:40 -0700 |
| Message-ID | <mailman.3927.1346240457.4697.python-list@python.org> |
| In reply to | #28044 |
On Wednesday 29 August 2012 06:16:05 UTC+2, Ian wrote:
> On Tue, Aug 28, 2012 at 8:42 PM, rusi <rustompmody@gmail.com> wrote:
> > In summary:
> > 1. The problem is not on jmf's computer
> > 2. It is not windows-only
> > 3. It is not directly related to latin-1 encodable or not
> >
> > The only question which is not yet clear is this:
> > Given a typical string operation that is complexity O(n), in more
> > detail it is going to be O(a + bn)
> > If only a is worse going 3.2 to 3.3, it may be a small issue.
> > If b is worse by even a tiny amount, it is likely to be a significant
> > regression for some use-cases.
>
> As has been pointed out repeatedly already, this is a microbenchmark.
> jmf is focusing on one particular area (string construction) where
> Python 3.3 happens to be slower than Python 3.2, ignoring the fact
> that real code usually does lots of things other than building
> strings, many of which are slower to begin with. In the real-world
> benchmarks that I've seen, 3.3 is as fast as or faster than 3.2.
> Here's a much more realistic benchmark that nonetheless still focuses
> on strings: word counting.
>
> Source: http://pastebin.com/RDeDsgPd
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc" "wc.wc('unilang8.htm')"
> 1000 loops, best of 3: 310 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc" "wc.wc('unilang8.htm')"
> 1000 loops, best of 3: 302 usec per loop
>
> "unilang8.htm" is an arbitrary UTF-8 document containing a broad swath
> of Unicode characters that I pulled off the web. Even though this
> program is still mostly string processing, Python 3.3 wins. Of
> course, that's not really a very good test -- since it reads the file
> on every pass, it probably spends more time in I/O than it does in
> actual processing. Let's try it again with prepared string data:
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_str(t)"
> 10000 loops, best of 3: 87.3 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_str(t)"
> 10000 loops, best of 3: 84.6 usec per loop
>
> Nope, 3.3 still wins. And just for the sake of my own curiosity, I
> decided to try it again using str.split() instead of a StringIO.
> Since str.split() creates more strings, I expect Python 3.2 might
> actually win this time.
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_split(t)"
> 10000 loops, best of 3: 88 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_split(t)"
> 10000 loops, best of 3: 76.5 usec per loop
>
> Interestingly, although Python 3.2 performs the splits in about the
> same time as the StringIO operation, Python 3.3 is significantly
> *faster* using str.split(), at least on this data set.
>
> > So doing some arm-chair thinking (I dont know the code and difficulty
> > involved):
> >
> > Clearly there are 3 string-engines in the python 3 world:
> > - 3.2 narrow
> > - 3.2 wide
> > - 3.3 (flexible)
> >
> > How difficult would it be to giving the choice of string engine as a
> > command-line flag?
> > This would avoid the nuisance of having two binaries -- narrow and
> > wide.
>
> Quite difficult. Even if we avoid having two or three separate
> binaries, we would still have separate binary representations of the
> string structs. It makes the maintainability of the software go down
> instead of up.
>
> > And it would give the python programmer a choice of efficiency
> > profiles.
>
> So instead of having just one test for my Unicode-handling code, I'll
> now have to run that same test *three times* -- once for each possible
> string engine option. Choice isn't always a good thing.

Forget Python and all these benchmarks. The problem
is on an other level. Coding schemes, typography,
usage of characters, ...
For a given coding scheme, all code points/characters are
equivalent. Expecting to handle a sub-range in a coding
scheme without shaking that coding scheme is impossible.
If a coding scheme does not give satisfaction, the only
valid solution is to create a new coding scheme, cp1252,
mac-roman, EBCDIC, ... or the interesting "TeX" case, where
the "internal" coding depends on the fonts!
Unicode (utf***), as just one another coding scheme, does
not escape to this rule.
This "Flexible String Representation" fails. Not only
it is unable to stick with a coding scheme, it is
a mixing of coding schemes, the worst of all possible
implementations.
jmf
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-08-27 12:16 -0700 |
| Message-ID | <mailman.3882.1346094990.4697.python-list@python.org> |
| In reply to | #27947 |
On Sunday 26 August 2012 22:45:09 UTC+2, Dan Sommers wrote:
> On 2012-08-26 at 20:13:21 +0000,
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
>
> > I note that not all 32-bit ints are valid code points. I suppose I can
> > see sense in having rune be a 32-bit integer value limited to those
> > valid code points. (But, dammit, why not call it a code point?) But if
> > rune is merely an alias for int32, why not just call it int32?
>
> Having a "code point" type is a good idea. If nothing else, human code
> readers can tell that you're doing something with characters rather than
> something with integers. If your language provides any sort of type
> safety, then you get that, too.
>
> Calling your code points int32 is a bad idea for the same reason that it
> turned out to be a bad idea to call all my old ASCII characters int8.
> Or all my pointers int<n> (or unsigned int<n>), for n in 16, 20, 24, 32,
> 36, 48, or 64 (or I'm sure other values of n that I never had the pain
> or pleasure of using).

And this is precisely the concept of rune, a real int which is a name for a Unicode code point.

Go "has" the integers int32 and int64. A rune ensures the usage of int32. "Text libs" use runes. Go has only bytes and runes.

If you do not like the word "perfection", this mechanism has at least an ideal simplicity (with probably a lot of positive consequences).

rune -> int32 -> utf32 -> unicode code points.

- Why int32 and not uint32? No idea, I tried to find an answer without asking.
- I find the name "rune" elegant. "char" would have been too confusing.

End. This is supposed to be a Python forum.

jmf
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-08-26 15:42 -0600 |
| Message-ID | <mailman.3855.1346017353.4697.python-list@python.org> |
| In reply to | #27946 |
On Sun, Aug 26, 2012 at 2:13 PM, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> On Sun, 26 Aug 2012 09:40:13 -0600, Ian Kelly wrote:
>
>> I think the documentation for those functions is simply badly worded.
>> The "width in bytes" it returns is not the width of the rune (which as
>> jmf notes is simply an alias for int32 that stores a single code point).
>
> Is this documented somewhere?

http://golang.org/ref/spec#Numeric_types
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-08-26 23:31 +0000 |
| Message-ID | <503ab1d9$0$1555$c3e8da3$76491128@news.astraweb.com> |
| In reply to | #27949 |
On Sun, 26 Aug 2012 15:42:00 -0600, Ian Kelly wrote:
> On Sun, Aug 26, 2012 at 2:13 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> On Sun, 26 Aug 2012 09:40:13 -0600, Ian Kelly wrote:
>>
>>> I think the documentation for those functions is simply badly worded.
>>> The "width in bytes" it returns is not the width of the rune (which as
>>> jmf notes is simply an alias for int32 that stores a single code
>>> point).
>>
>> Is this documented somewhere?
>
> http://golang.org/ref/spec#Numeric_types

Thanks.

Well that's just plain nuts.

--
Steven
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-08-26 17:47 -0700 |
| Message-ID | <7x6285kvp5.fsf@ruckus.brouhaha.com> |
| In reply to | #27955 |
Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
>> http://golang.org/ref/spec#Numeric_types
> Thanks.
> Well that's just plain nuts.

I'm not sure how Rust handles Unicode, but overall I think it is more clueful than Go while having sort of comparable goals. See: http://rust-lang.org .
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-08-25 21:04 +1000 |
| Message-ID | <mailman.3796.1345892653.4697.python-list@python.org> |
| In reply to | #27854 |
On Sat, Aug 25, 2012 at 7:46 PM, Frank Millman <frank@chagford.com> wrote:
> Therefore, I think he is saying that he would have preferred that python
> standardise on 4-byte characters, on the grounds that the saving in memory
> does not justify the performance overhead.

If that's indeed the argument, then at least it's something to argue. What gets difficult is when people complain about the expansion from a 2-byte narrow build to the current 1/2/4-byte representation, which will indeed use more memory if there are a small number of >0xFFFF codepoints. But there's a correctness difference there.

ChrisA
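The 1/2/4-byte widths mentioned above can be observed from Python, with caveats. The sketch below assumes CPython 3.3 or later and relies on sys.getsizeof, whose exact numbers are an implementation detail; the helper name bytes_per_char is made up. It measures how much a string grows when 1000 more copies of the same character are added.

```python
import sys


def bytes_per_char(ch: str) -> int:
    """Estimate storage per code point for strings made only of ch.

    Assumption: CPython 3.3+, where a string is stored with a fixed
    width chosen from its highest code point.
    """
    shorter = ch * 1000
    longer = ch * 2000
    # The fixed object header cancels out in the difference.
    return (sys.getsizeof(longer) - sys.getsizeof(shorter)) // 1000


for label, ch in [("ASCII", "a"), ("Latin-1", "é"),
                  ("BMP", "€"), ("astral", "\U0001F600")]:
    print(label, bytes_per_char(ch))
```

On such an interpreter, a string containing even one code point above U+FFFF is stored at 4 bytes per character throughout, which is the memory cost being weighed against correctness in the post above.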