Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #27204 > unrolled thread
| Started by | Charles Jensen <hopefullycharles@gmail.com> |
|---|---|
| First post | 2012-08-16 15:09 -0700 |
| Last post | 2012-08-20 17:20 -0400 |
| Articles | 20 on this page of 145 — 26 participants |
Back to article view | Back to comp.lang.python
How do I display unicode value stored in a string variable using ord() Charles Jensen <hopefullycharles@gmail.com> - 2012-08-16 15:09 -0700
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-17 08:20 +1000
Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-16 18:47 -0400
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-16 19:59 -0400
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 10:49 -0700
Re: How do I display unicode value stored in a string variable using ord() Jerry Hill <malaclypse2@gmail.com> - 2012-08-17 14:21 -0400
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 11:45 -0700
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 11:45 -0700
Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-17 16:55 -0400
Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-17 23:30 -0400
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 04:10 +0000
Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-18 09:18 -0600
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 03:59 +0000
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 10:49 -0700
Re: How do I display unicode value stored in a string variable using ord() Alister <alister.ware@ntlworld.com> - 2012-08-17 06:30 +0000
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 01:09 -0700
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 12:27 +0000
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 08:07 -0700
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 16:25 +0100
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 01:36 +1000
Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-18 09:51 -0600
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 09:38 -0700
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 02:57 +1000
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 18:28 +0100
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 11:05 -0700
Re: How do I display unicode value stored in a string variable using ord() MRAB <python@mrabarnett.plus.com> - 2012-08-18 19:34 +0100
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:35 +0000
New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord() Peter Otten <__peter__@web.de> - 2012-08-19 09:43 +0200
Re: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 08:56 +0000
Re: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 02:24 -0700
Re: New internal string format in 3.3 Peter Otten <__peter__@web.de> - 2012-08-19 11:37 +0200
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 03:19 -0700
Re: New internal string format in 3.3 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 13:33 +0000
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 03:19 -0700
Re: New internal string format in 3.3 Chris Angelico <rosuav@gmail.com> - 2012-08-19 20:26 +1000
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:14 -0700
Re: New internal string format in 3.3 Dave Angel <d@davea.name> - 2012-08-19 08:29 -0400
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:59 -0700
Re: New internal string format in 3.3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 14:46 +0100
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 07:09 -0700
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 07:09 -0700
Re: New internal string format in 3.3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 15:48 +0100
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 09:19 -0700
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 09:19 -0700
Re: New internal string format in 3.3 Terry Reedy <tjreedy@udel.edu> - 2012-08-19 13:48 -0400
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 10:51 -0700
Re: New internal string format in 3.3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 19:09 +0100
Re: New internal string format in 3.3 Chris Angelico <rosuav@gmail.com> - 2012-08-20 07:50 +1000
Re: New internal string format in 3.3 Michael Torrie <torriem@gmail.com> - 2012-08-19 23:38 -0600
Re: New internal string format in 3.3 Roy Smith <roy@panix.com> - 2012-08-20 09:17 -0400
Re: New internal string format in 3.3 Michael Torrie <torriem@gmail.com> - 2012-08-20 22:18 -0600
Re: New internal string format in 3.3 Roy Smith <roy@panix.com> - 2012-08-21 07:48 -0400
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 10:51 -0700
Re: New internal string format in 3.3 Terry Reedy <tjreedy@udel.edu> - 2012-08-19 13:56 -0400
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:59 -0700
Re: New internal string format in 3.3 Dave Angel <d@davea.name> - 2012-08-19 08:35 -0400
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:14 -0700
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:30 +0000
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 11:05 -0700
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-18 16:09 -0400
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-18 23:12 -0400
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 09:38 -0700
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:33 +0000
Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-19 11:50 -0600
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 11:20 -0700
Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-19 12:31 -0600
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 12:23 -0700
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 20:16 +0000
Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-19 12:46 -0600
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 17:59 +0000
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 11:30 -0700
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 20:45 +0100
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:13 +0000
Re: How do I display unicode value stored in a string variable using ord() rusi <rustompmody@gmail.com> - 2012-08-18 11:40 -0700
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 20:50 +0100
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 13:22 -0700
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 22:37 +0100
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 11:26 -0700
Re: How do I display unicode value stored in a string variable using ord() MRAB <python@mrabarnett.plus.com> - 2012-08-18 19:59 +0100
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 07:17 +0000
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 10:46 +1000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 19:11 -0700
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 12:19 +1000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 19:35 -0700
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 13:01 +1000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 20:10 -0700
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 13:31 +1000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 22:58 -0700
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 08:01 +0000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 01:11 -0700
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 18:24 +1000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 01:44 -0700
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 01:54 -0700
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 11:46 +0100
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 12:31 -0400
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 10:51 +0000
Re: How do I display unicode value stored in a string variable using ord() Neil Hodgson <nhodgson@iinet.net.au> - 2012-08-21 17:03 +1000
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:09 +0000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 01:04 -0700
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 13:25 +0000
Re: How do I display unicode value stored in a string variable using ord() DJC <djc@news.invalid> - 2012-08-19 17:32 +0200
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 13:34 -0400
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 10:48 -0700
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 11:11 -0700
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 19:50 +0100
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 17:59 -0400
Re: How do I display unicode value stored in a string variable using ord() rusi <rustompmody@gmail.com> - 2012-08-19 23:13 -0700
Abuse of Big Oh notation [was Re: How do I display unicode value stored in a string variable using ord()] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 20:15 +0000
Re: Abuse of Big Oh notation Paul Rubin <no.email@nospam.invalid> - 2012-08-19 16:42 -0700
Re: Abuse of Big Oh notation Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-08-20 09:24 +0100
Re: Abuse of Big Oh notation Paul Rubin <no.email@nospam.invalid> - 2012-08-20 09:01 -0700
Re: Abuse of Big Oh notation Chris Angelico <rosuav@gmail.com> - 2012-08-21 02:09 +1000
Re: Abuse of Big Oh notation Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-20 11:12 -0600
Re: Abuse of Big Oh notation Paul Rubin <no.email@nospam.invalid> - 2012-08-20 12:29 -0700
Re: Abuse of Big Oh notation 88888 Dihedral <dihedral88888@googlemail.com> - 2012-08-20 15:16 -0700
Re: Abuse of Big Oh notation 88888 Dihedral <dihedral88888@googlemail.com> - 2012-08-20 15:20 -0700
Re: Abuse of Big Oh notation Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-21 09:53 +0000
Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-20 11:42 -0700
Re: Abuse of Big Oh notation Ned Deily <nad@acm.org> - 2012-08-20 18:19 -0700
Abuse of subject, was Re: Abuse of Big Oh notation Peter Otten <__peter__@web.de> - 2012-08-21 09:52 +0200
Re: Abuse of subject, was Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-21 10:16 -0700
Re: Abuse of subject, was Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-21 10:16 -0700
Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-20 11:42 -0700
Re: How do I display unicode value stored in a string variable using ord() Hans Mulder <hansmu@xs4all.nl> - 2012-08-22 20:53 +0200
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-20 08:42 +1000
Re: How do I display unicode value stored in a string variable using ord() Roy Smith <roy@panix.com> - 2012-08-19 19:24 -0400
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-20 04:21 +0000
Re: How do I display unicode value stored in a string variable using ord() Roy Smith <roy@panix.com> - 2012-08-20 00:44 -0400
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-20 05:56 +0000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 23:24 -0700
Re: How do I display unicode value stored in a string variable using ord() Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-08-20 12:58 -0400
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 20:35 -0400
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-20 14:07 +1000
Re: How do I display unicode value stored in a string variable using ord() lipska the kat <lipskathekat@yahoo.co.uk> - 2012-08-19 11:13 +0100
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 20:19 +1000
Re: How do I display unicode value stored in a string variable using ord() lipska the kat <lipskathekat@yahoo.co.uk> - 2012-08-19 11:49 +0100
Re: How do I display unicode value stored in a string variable using ord() "Blind Anagram" <noname@nowhere.com> - 2012-08-19 18:03 +0100
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 10:33 -0700
Re: How do I display unicode value stored in a string variable using ord() "Blind Anagram" <noname@nowhere.com> - 2012-08-19 19:04 +0100
Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-19 14:05 -0400
Re: How do I display unicode value stored in a string variable usingord() "Blind Anagram" <noname@nowhere.com> - 2012-08-19 19:18 +0100
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 20:31 +0000
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 17:03 -0400
Re: How do I display unicode value stored in a string variable using ord() 88888 Dihedral <dihedral88888@googlemail.com> - 2012-08-19 17:32 -0700
Re: How do I display unicode value stored in a string variable using ord() Piet van Oostrum <piet@vanoostrum.org> - 2012-08-20 17:20 -0400
Page 6 of 8 — ← Prev page 1 2 3 4 5 [6] 7 8 Next page →
| From | DJC <djc@news.invalid> |
|---|---|
| Date | 2012-08-19 17:32 +0200 |
| Message-ID | <k0r0tq$36l$1@dont-email.me> |
| In reply to | #27389 |
On 19/08/12 15:25, Steven D'Aprano wrote: > Not necessarily. Presumably you're scanning each page into a single > string. Then only the pages containing a supplementary plane char will be > bloated, which is likely to be rare. Especially since I don't expect your > OCR application would recognise many non-BMP characters -- what does > U+110F3, "SORA SOMPENG DIGIT THREE", look like? If the OCR software > doesn't recognise it, you can't get it in your output. (If you do, the > OCR software has a nasty bug.) > > Anyway, in my ignorant opinion the proper fix here is to tell the OCR > software not to bother trying to recognise Imperial Aramaic, Domino > Tiles, Phaistos Disc symbols, or Egyptian Hieroglyphs if you aren't > expecting them in your source material. Not only will the scanning go > faster, but you'll get fewer wrong characters. Consider the automated recognition of a CAPTCHA. As the chars have to be entered by the user on a keyboard, only the most basic charset can be used, so the problem of which chars are possible is quite limited.
[toc] | [prev] | [next] | [standalone]
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2012-08-19 13:34 -0400 |
| Message-ID | <mailman.3511.1345397678.4697.python-list@python.org> |
| In reply to | #27360 |
On 8/19/2012 4:04 AM, Paul Rubin wrote: > Meanwhile, an example of the 393 approach failing: I am completely baffled by this, as this example is one where the 393 approach potentially wins. > I was involved in a > project that dealt with terabytes of OCR data of mostly English text. > So the chars were mostly ascii, 3.3 stores ascii pages 1 byte/char rather than 2 or 4. > but there would be occasional non-ascii > chars including supplementary plane characters, either because of > special symbols that were really in the text, or the typical OCR > confusion emitting those symbols due to printing imprecision. I doubt that there are really any non-bmp chars. As Steven said, reject such false identifications. > That's a natural for UTF-8 3.3 would convert to utf-8 for storage on disk. > but the PEP-393 approach would bloat up the memory > requirements by a factor of 4. 3.2- wide builds would *always* use 4 bytes/char. Is not occasionally better than always? > py> s = chr(0xFFFF + 1) > py> a, b = s > > That looks like Python 3.2 is buggy and that sample should just throw an > error. s is a one-character string and should not be unpackable. That looks like a 3.2- narrow build. Such which treat unicode strings as sequences of code units rather than sequences of codepoints. Not an implementation bug, but compromise design that goes back about a decade to when unicode was added to Python. At that time, there were only a few defined non-BMP chars and their usage was extremely rare. There are now more extended chars than BMP chars and usage will become more common even in English text. Pre 3.3, there are really 2 sub-versions of every Python version: a narrow build and a wide build version, with not very well documented different behaviors for any string with extended chars. That is and would have become an increasing problem as extended chars are increasingly used. If you want to say that what was once a practical compromise has become a design bug, I would not argue. In any case, 3.3 fixes that split and returns Python to being one cross-platform language. > I realize the folks who designed and implemented PEP 393 are very smart > cookies and considered stuff carefully, while I'm just an internet user > posting an immediate impression of something I hadn't seen before (I > still use Python 2.6), but I still have to ask: if the 393 approach > makes sense, why don't other languages do it? Python has often copied or borrowed, with adjustments. This time it is the first. We will see how it goes, but it has been tested for nearly a year already. > Ropes of UTF-8 segments seems like the most obvious approach and I > wonder if it was considered. By that I mean pick some implementation > constant k (say k=128) and represent the string as a UTF-8 encoded byte > array, accompanied by a vector n//k pointers into the byte array, where > n is the number of codepoints in the string. Then you can reach any > offset analogously to reading a random byte on a disk, by seeking to the > appropriate block, and then reading the block and getting the char you > want within it. Random access is then O(1) though the constant is > higher than it would be with fixed width encoding. I would call it O(k), where k is a selectable constant. Slowing access by a factor of 100 is hardly acceptable to me. For strings less than k, access is O(len). I believe slicing would require re-indexing. As 393 was near adoption, I proposed a scheme using utf-16 (narrow builds) with a supplementary index of extended chars when there are any. That makes access O(1) if there are none and O(log(k)), where k is the number of extended chars in the string, if there are some. -- Terry Jan Reedy
[toc] | [prev] | [next] | [standalone]
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-08-19 10:48 -0700 |
| Message-ID | <7xfw7ilqnd.fsf@ruckus.brouhaha.com> |
| In reply to | #27403 |
Terry Reedy <tjreedy@udel.edu> writes: >> Meanwhile, an example of the 393 approach failing: > I am completely baffled by this, as this example is one where the 393 > approach potentially wins. What? The 393 approach is supposed to avoid memory bloat and that does the opposite. >> I was involved in a project that dealt with terabytes of OCR data of >> mostly English text. So the chars were mostly ascii, > 3.3 stores ascii pages 1 byte/char rather than 2 or 4. But they are not ascii pages, they are (as stated) MOSTLY ascii. E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses a much more memory-expensive encoding than UTF-8. > I doubt that there are really any non-bmp chars. You may be right about this. I thought about it some more after posting and I'm not certain that there were supplemental characters. > As Steven said, reject such false identifications. Reject them how? >> That's a natural for UTF-8 > 3.3 would convert to utf-8 for storage on disk. They are already in utf-8 on disk though that doesn't matter since they are also compressed. >> but the PEP-393 approach would bloat up the memory >> requirements by a factor of 4. > 3.2- wide builds would *always* use 4 bytes/char. Is not occasionally > better than always? The bloat is in comparison with utf-8, in that example. > That looks like a 3.2- narrow build. Such which treat unicode strings > as sequences of code units rather than sequences of codepoints. Not an > implementation bug, but compromise design that goes back about a > decade to when unicode was added to Python. I thought the whole point of Python 3's disruptive incompatibility with Python 2 was to clean up past mistakes and compromises, of which unicode headaches was near the top of the list. So I'm surprised they seem to repeated a mistake there. > I would call it O(k), where k is a selectable constant. Slowing access > by a factor of 100 is hardly acceptable to me. If k is constant then O(k) is the same as O(1). That is how O notation works. I wouldn't believe the 100x figure without seeing it measured in real-world applications.
[toc] | [prev] | [next] | [standalone]
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-08-19 11:11 -0700 |
| Message-ID | <28f35cee-3e55-43af-afc8-1ded199c53d9@googlegroups.com> |
| In reply to | #27405 |
Le dimanche 19 août 2012 19:48:06 UTC+2, Paul Rubin a écrit :
>
>
> But they are not ascii pages, they are (as stated) MOSTLY ascii.
>
> E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses
>
> a much more memory-expensive encoding than UTF-8.
>
>
Imagine an us banking application, everything in ascii,
except ... the € currency symbole, code point 0x20ac.
Well, it seems some software producers know what they
are doing.
>>> '€'.encode('cp1252')
b'\x80'
>>> '€'.encode('mac-roman')
b'\xdb'
>>> '€'.encode('iso-8859-1')
Traceback (most recent call last):
File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac'
in position 0: ordinal not in range(256)
jmf
[toc] | [prev] | [next] | [standalone]
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2012-08-19 19:50 +0100 |
| Message-ID | <mailman.3522.1345402115.4697.python-list@python.org> |
| In reply to | #27416 |
On 19/08/2012 19:11, wxjmfauth@gmail.com wrote:
> Le dimanche 19 août 2012 19:48:06 UTC+2, Paul Rubin a écrit :
>>
>>
>> But they are not ascii pages, they are (as stated) MOSTLY ascii.
>>
>> E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses
>>
>> a much more memory-expensive encoding than UTF-8.
>>
>>
>
> Imagine an us banking application, everything in ascii,
> except ... the € currency symbole, code point 0x20ac.
>
> Well, it seems some software producers know what they
> are doing.
>
>>>> '€'.encode('cp1252')
> b'\x80'
>>>> '€'.encode('mac-roman')
> b'\xdb'
>>>> '€'.encode('iso-8859-1')
> Traceback (most recent call last):
> File "<eta last command>", line 1, in <module>
> UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac'
> in position 0: ordinal not in range(256)
>
> jmf
>
Well that's it then, the world stock markets will all collapse tonight
when the news leaks out that those stupid Americans haven't yet realised
that much of Europe (with at least one very noticeable and sensible
exception :) uses Euros. I'd better sell all my stock holdings fast.
--
Cheers.
Mark Lawrence.
[toc] | [prev] | [next] | [standalone]
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2012-08-19 17:59 -0400 |
| Message-ID | <mailman.3529.1345413613.4697.python-list@python.org> |
| In reply to | #27416 |
On 8/19/2012 2:11 PM, wxjmfauth@gmail.com wrote:
> Well, it seems some software producers know what they
> are doing.
>
>>>> '€'.encode('cp1252')
> b'\x80'
>>>> '€'.encode('mac-roman')
> b'\xdb'
>>>> '€'.encode('iso-8859-1')
> Traceback (most recent call last):
> File "<eta last command>", line 1, in <module>
> UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac'
> in position 0: ordinal not in range(256)
Yes, Python lets you choose your byte encoding from those and a hundred
others. I believe all the codecs are now tested in both directions. It
was not an easy task.
As to the examples: Latin-1 dates to 1985 and before and the 1988
version was published as a standard in 1992.
https://en.wikipedia.org/wiki/Latin-1
"The name euro was officially adopted on 16 December 1995."
https://en.wikipedia.org/wiki/Euro
No wonder Latin-1 does not contain the Euro sign. International
standards organizations standards are relatively fixed. (The unicode
consortium will not even correct misspelled character names.) Instead,
new standards with a new number are adopted.
For better or worse, private mappings are more flexible. In its Mac
mapping Apple "replaced the generic currency sign ¤ with the euro sign
€". (See Latin-1 reference.) Great if you use Euros, not so great if you
were using the previous sign for something else.
Microsoft changed an unneeded code to the Euro for Windows cp-1252.
https://en.wikipedia.org/wiki/Windows-1252
"It is very common to mislabel Windows-1252 text with the charset label
ISO-8859-1. A common result was that all the quotes and apostrophes
(produced by "smart quotes" in Microsoft software) were replaced with
question marks or boxes on non-Windows operating systems, making text
difficult to read. Most modern web browsers and e-mail clients treat the
MIME charset ISO-8859-1 as Windows-1252 in order to accommodate such
mislabeling. This is now standard behavior in the draft HTML 5
specification, which requires that documents advertised as ISO-8859-1
actually be parsed with the Windows-1252 encoding.[1]"
Lots of fun. Too bad Microsoft won't push utf-8 so we can all
communicate text with much less chance of ambiguity.
--
Terry Jan Reedy
[toc] | [prev] | [next] | [standalone]
| From | rusi <rustompmody@gmail.com> |
|---|---|
| Date | 2012-08-19 23:13 -0700 |
| Message-ID | <7e85ae06-9340-4677-bde8-a7bd4aec8a0b@j2g2000pbg.googlegroups.com> |
| In reply to | #27416 |
On Aug 19, 11:11 pm, wxjmfa...@gmail.com wrote:
> Le dimanche 19 août 2012 19:48:06 UTC+2, Paul Rubin a écrit :
>
>
>
> > But they are not ascii pages, they are (as stated) MOSTLY ascii.
>
> > E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses
>
> > a much more memory-expensive encoding than UTF-8.
>
>
> Well, it seems some software producers know what they
> are doing.
>
> >>> '€'.encode('cp1252')
> b'\x80'
> >>> '€'.encode('mac-roman')
> b'\xdb'
> >>> '€'.encode('iso-8859-1')
>
> Traceback (most recent call last):
> File "<eta last command>", line 1, in <module>
> UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac'
> in position 0: ordinal not in range(256)
<facetious>
You want the Euro-sign in iso-8859-1??
I object. I want the rupee sign ( ₹ ) http://en.wikipedia.org/wiki/Indian_rupee_sign
And while we are at it, why not move it (both?) into ASCII?
</facetious>
The problem(s) are:
1. We dont really understand what you are objecting to.
2. Utf-8 like Huffman coding is a prefix code
http://en.wikipedia.org/wiki/Prefix_code#Prefix_codes_in_use_today
Like Huffman coding, it compresses based on a statistical argument.
3. Unlike Huffman coding the statistics is very political: "Is the
Euro more important or Chinese ideograms?" depends on whom you ask
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-08-19 20:15 +0000 |
| Subject | Abuse of Big Oh notation [was Re: How do I display unicode value stored in a string variable using ord()] |
| Message-ID | <50314968$0$29978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #27405 |
On Sun, 19 Aug 2012 10:48:06 -0700, Paul Rubin wrote: > Terry Reedy <tjreedy@udel.edu> writes: >> I would call it O(k), where k is a selectable constant. Slowing access >> by a factor of 100 is hardly acceptable to me. > > If k is constant then O(k) is the same as O(1). That is how O notation > works. You might as well say that if N is constant, O(N**2) is constant too and just like magic you have now made Bubble Sort a constant-time sort function! That's not how it works. Of course *if* k is constant, O(k) is constant too, but k is not constant. In context we are talking about string indexing and slicing. There is no value of k, say, k = 2, for which you can say "People will sometimes ask for string[2] but never ask for string[3]". That is absurd. Since k can vary from 0 to N-1, we can say that the average string index lookup is k = (N-1)//2 which clearly depends on N. -- Steven
[toc] | [prev] | [next] | [standalone]
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-08-19 16:42 -0700 |
| Subject | Re: Abuse of Big Oh notation |
| Message-ID | <7xwr0ua1pw.fsf@ruckus.brouhaha.com> |
| In reply to | #27426 |
Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes: > Of course *if* k is constant, O(k) is constant too, but k is not > constant. In context we are talking about string indexing and slicing. > There is no value of k, say, k = 2, for which you can say "People will > sometimes ask for string[2] but never ask for string[3]". That is absurd. The context was parsing, e.g. recognizing a token like "a" or "foo" in a human-written chunk of text. Occasionally it might be "sesquipidalian" or some even worse outlier, but one can reasonably put a fixed and relatively small upper bound on the expected value of k. That makes the amortized complexity O(1), I think.
[toc] | [prev] | [next] | [standalone]
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2012-08-20 09:24 +0100 |
| Subject | Re: Abuse of Big Oh notation |
| Message-ID | <mailman.3547.1345451405.4697.python-list@python.org> |
| In reply to | #27442 |
On Sun, 19 Aug 2012 16:42:03 -0700, Paul Rubin <no.email@nospam.invalid> wrote: > Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes: > > Of course *if* k is constant, O(k) is constant too, but k is not > > constant. In context we are talking about string indexing and slicing. > > There is no value of k, say, k = 2, for which you can say "People will > > sometimes ask for string[2] but never ask for string[3]". That is absurd. > The context was parsing, e.g. recognizing a token like "a" or "foo" in a > human-written chunk of text. Occasionally it might be "sesquipidalian" > or some even worse outlier, but one can reasonably put a fixed and > relatively small upper bound on the expected value of k. That makes the > amortized complexity O(1), I think. No it doen't. It is still O(k). The point of big O notation is to understand the asymptotic behaviour of one variable as it becomes large because of changes in other variables. If k is small then you can often guess that O(k) is small. To say that an operation is O(k), however, is a statement about what happens when k is big (and is not refuted by saying that k is typically not big). Oscar
[toc] | [prev] | [next] | [standalone]
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-08-20 09:01 -0700 |
| Subject | Re: Abuse of Big Oh notation |
| Message-ID | <7xr4r1pn72.fsf@ruckus.brouhaha.com> |
| In reply to | #27477 |
Oscar Benjamin <oscar.j.benjamin@gmail.com> writes: > No it doen't. It is still O(k). The point of big O notation is to > understand the asymptotic behaviour of one variable as it becomes > large because of changes in other variables. Actually, two separate problems got mixed together late at night. In neither case is k an independent variable that ranges over all possible values. In both cases it is selected or observed by measurement (i.e. it is a dependent variable determined by something that is itself not independent). 1) Access in a rope: here, k is basically determined by the pointer size of the computer, which in CPython (the implementation we're discussing) the pointer size is 4 or 8 bytes (constants) in all instances AFAIK. k should be a big enough that the pointer and allocation overhead is small compared to bloating the strings with UCS-2 or UCS-4, and small enough to not add much scan time. It seems realistic to say k<=128 for this (several times smaller is probably fine). 128 is of course a constant and not a variable. We are not concerned about hypothetical computers with billion bit pointers. 2) Parsing tokens: here, k is drawn from a fixed probability distribution such that its expectation value is again a small constant (this is an assertion about the data looks like in typical parsing applications in the real world, not what is mathematically possible). The average-case complexity (I said "amortized" earlier but should have said "average-case") is proportional to that constant, which is to say, O(1). Of course there is still more wiggle room about what is "typical". Analogy: how big a box is required to hold a pair of shoes? In a purely theoretical sense we might say O(S) where S is the shoe size, treating shoe size as an arbitrary independent variable. But in the real world, shoe size is controlled by the size of human feet, which is bounded by a constant for biological reasons. You don't have to consider shoes the size of Jupiter. So it is O(1).
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-08-21 02:09 +1000 |
| Subject | Re: Abuse of Big Oh notation |
| Message-ID | <mailman.3563.1345478961.4697.python-list@python.org> |
| In reply to | #27504 |
On Tue, Aug 21, 2012 at 2:01 AM, Paul Rubin <no.email@nospam.invalid> wrote: > Analogy: how big a box is required to hold a pair of shoes? In a purely > theoretical sense we might say O(S) where S is the shoe size, treating > shoe size as an arbitrary independent variable. But in the real world, > shoe size is controlled by the size of human feet, which is bounded by a > constant for biological reasons. You don't have to consider shoes the > size of Jupiter. So it is O(1). By that argument, everything is amortized O(1), because there's a limit on every variable. You can't possibly be working with a data set greater than the total sum of storage space in the entire world. That still doesn't mean that bubble sort and heap sort are equivalently efficient. ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-08-20 11:12 -0600 |
| Subject | Re: Abuse of Big Oh notation |
| Message-ID | <mailman.3567.1345482779.4697.python-list@python.org> |
| In reply to | #27504 |
On Mon, Aug 20, 2012 at 10:09 AM, Chris Angelico <rosuav@gmail.com> wrote: > On Tue, Aug 21, 2012 at 2:01 AM, Paul Rubin <no.email@nospam.invalid> wrote: >> Analogy: how big a box is required to hold a pair of shoes? In a purely >> theoretical sense we might say O(S) where S is the shoe size, treating >> shoe size as an arbitrary independent variable. But in the real world, >> shoe size is controlled by the size of human feet, which is bounded by a >> constant for biological reasons. You don't have to consider shoes the >> size of Jupiter. So it is O(1). > > By that argument, everything is amortized O(1), because there's a > limit on every variable. You can't possibly be working with a data set > greater than the total sum of storage space in the entire world. That > still doesn't mean that bubble sort and heap sort are equivalently > efficient. The difference between the two is that the former is bounded by a constant that is fundamental to the algorithm at hand, whereas the latter is bounded only by available resources. By any practical consideration the latter must be considered a variable, but the former need not be. Paul discusses above the asymptotic growth of a variable as O(S) where S is shoe size, but really this measure makes no sense in the first place. The question that this notation seeks to answer is, "What is the big-Oh behaviour of this variable as shoe size increases indefinitely (i.e. to infinity)?" and the answer in this case is "Shoe size does not increase indefinitely"; the question is invalid. A more logical question might be, "How much material do I need to construct N shoes of size S?" The answer to this question would presumably be some constant factor of N * S**2, which is O(N * S**2). Although N can be assumed to vary freely (up to nonsensical quantities like the mass of the entire universe), S is clearly bounded by the constraints of actual shoes, so we can safely treat S as a constant and call it O(N).
[toc] | [prev] | [next] | [standalone]
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-08-20 12:29 -0700 |
| Subject | Re: Abuse of Big Oh notation |
| Message-ID | <7xfw7hl5vb.fsf@ruckus.brouhaha.com> |
| In reply to | #27510 |
Ian Kelly <ian.g.kelly@gmail.com> writes: > The difference between the two is that the former is bounded by a > constant that is fundamental to the algorithm at hand,... S is > clearly bounded by the constraints of actual shoes, so we can safely > treat S as a constant and call it O(N). Thanks, that is a good explain of what I was trying to get at. One quibble is in the parsing example, the constant is inherent in the distribution of input data one expects to normally encounter, rather than in the algorithm itself. It's sort of like saying dictionary access (based on hashing) is O(1), based on the variety of input data that is normally used. There is such a thing as pathological (e.g. malicious) input data that causes a lot of hash collisions, making O(n) access in the size of the dictionary. I suppose abnormal input should also be taken into account in examples involving parsing if one parses potentially hostile data.
[toc] | [prev] | [next] | [standalone]
| From | 88888 Dihedral <dihedral88888@googlemail.com> |
|---|---|
| Date | 2012-08-20 15:16 -0700 |
| Subject | Re: Abuse of Big Oh notation |
| Message-ID | <464d6549-7fc5-4405-92ec-285bc2afdb7b@googlegroups.com> |
| In reply to | #27527 |
Paul Rubin於 2012年8月21日星期二UTC+8上午3時29分12秒寫道: > Ian Kelly <ian.g.kelly@gmail.com> writes: > > > The difference between the two is that the former is bounded by a > > > constant that is fundamental to the algorithm at hand,... S is > > > clearly bounded by the constraints of actual shoes, so we can safely > > > treat S as a constant and call it O(N). > > > > Thanks, that is a good explain of what I was trying to get at. One > > quibble is in the parsing example, the constant is inherent in the > > distribution of input data one expects to normally encounter, rather > > than in the algorithm itself. It's sort of like saying dictionary > > access (based on hashing) is O(1), based on the variety of input data > > that is normally used. There is such a thing as pathological > > (e.g. malicious) input data that causes a lot of hash collisions, making > > O(n) access in the size of the dictionary. > > > > I suppose abnormal input should also be taken into account in examples > > involving parsing if one parses potentially hostile data. OK, the hash key function has to be seeded with some randomization in the begining of a hash table from some data independent factors. Also for those items hashed to the same key with a length >=16 I think an insertion sort is good in soring a sorted list of items hashed to the same key of a length 16 to 256 or even larger which indicates the hash function should be changed from time to time in the occasions of resizing the hash table when the table is under a lot operations of item inserstions and deletions which are much greater than the number of item searches in the same period.
[toc] | [prev] | [next] | [standalone]
| From | 88888 Dihedral <dihedral88888@googlemail.com> |
|---|---|
| Date | 2012-08-20 15:20 -0700 |
| Subject | Re: Abuse of Big Oh notation |
| Message-ID | <e8631b7a-776c-4680-a3c5-9fe731748ddc@googlegroups.com> |
| In reply to | #27527 |
Paul Rubin於 2012年8月21日星期二UTC+8上午3時29分12秒寫道: > Ian Kelly <ian.g.kelly@gmail.com> writes: > > > The difference between the two is that the former is bounded by a > > > constant that is fundamental to the algorithm at hand,... S is > > > clearly bounded by the constraints of actual shoes, so we can safely > > > treat S as a constant and call it O(N). > > > > Thanks, that is a good explain of what I was trying to get at. One > > quibble is in the parsing example, the constant is inherent in the > > distribution of input data one expects to normally encounter, rather > > than in the algorithm itself. It's sort of like saying dictionary > > access (based on hashing) is O(1), based on the variety of input data > > that is normally used. There is such a thing as pathological > > (e.g. malicious) input data that causes a lot of hash collisions, making > > O(n) access in the size of the dictionary. > > > > I suppose abnormal input should also be taken into account in examples > > involving parsing if one parses potentially hostile data. OK, the hash key function has to be seeded with some randomization in the creation of a hash table from some data independent factors. Also for those items hashed to the same key with a length >=16 I think an insertion sort and a heap sort are good in storing a sorted list of items hashed to the same key of a length 16 to 256 or even larger which indicates the hash function should be changed from time to time in the occasions of resizing the hash table when the table is under a lot operations of item insertions and deletions which are operated much frequently than the number of item searches in the same period.
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-08-21 09:53 +0000 |
| Subject | Re: Abuse of Big Oh notation |
| Message-ID | <50335aa8$0$1645$c3e8da3$76491128@news.astraweb.com> |
| In reply to | #27510 |
On Mon, 20 Aug 2012 11:12:22 -0600, Ian Kelly wrote:
> On Mon, Aug 20, 2012 at 10:09 AM, Chris Angelico <rosuav@gmail.com>
> wrote:
>> On Tue, Aug 21, 2012 at 2:01 AM, Paul Rubin <no.email@nospam.invalid>
>> wrote:
>>> Analogy: how big a box is required to hold a pair of shoes? In a
>>> purely theoretical sense we might say O(S) where S is the shoe size,
>>> treating shoe size as an arbitrary independent variable. But in the
>>> real world, shoe size is controlled by the size of human feet, which
>>> is bounded by a constant for biological reasons. You don't have to
>>> consider shoes the size of Jupiter. So it is O(1).
>>
>> By that argument, everything is amortized O(1), because there's a limit
>> on every variable. You can't possibly be working with a data set
>> greater than the total sum of storage space in the entire world. That
>> still doesn't mean that bubble sort and heap sort are equivalently
>> efficient.
>
> The difference between the two is that the former is bounded by a
> constant that is fundamental to the algorithm at hand, whereas the
> latter is bounded only by available resources. By any practical
> consideration the latter must be considered a variable, but the former
> need not be.
>
> Paul discusses above the asymptotic growth of a variable as O(S) where S
> is shoe size, but really this measure makes no sense in the first place.
> The question that this notation seeks to answer is, "What is the big-Oh
> behaviour of this variable as shoe size increases indefinitely (i.e. to
> infinity)?" and the answer in this case is "Shoe size does not increase
> indefinitely"; the question is invalid.
>
> A more logical question might be, "How much material do I need to
> construct N shoes of size S?" The answer to this question would
> presumably be some constant factor of N * S**2, which is O(N * S**2).
> Although N can be assumed to vary freely (up to nonsensical quantities
> like the mass of the entire universe), S is clearly bounded by the
> constraints of actual shoes, so we can safely treat S as a constant and
> call it O(N).
Why the hell are we talking about shoe sizes?
Let's not lose sight of the actual question in hand: how expensive is it
to fetch a character at some arbitrary index in a string?
With fixed-width characters, average time is O(1), best-case time is O(1)
and worst-case time is O(1).
With non-fixed width characters, average time and worst-case time are
both O(N), and best-case ("give me the character at index 0") is O(1).
But that's a degenerate case that gives us absolutely no insight into the
cost of the operation. It's like the observation that "sorting a list of
one item takes constant time, regardless of the algorithm". Um, yes, it
does. So what?
Invented constraints like "people only care about indexes really close to
zero, or to the end of the string", effectively confuse the best possible
case for the average case. In real life, people do care about random
access to strings and the average case is more important than the best
case. That's why strings are typically implemented as arrays of fixed-
width characters rather than linked lists, and Haskell (which doesn't)
explicitly warns you that they don't and that your string-handling code
is likely to be slow.
Of course, you can swap space and complexity for time, e.g. ropes, or use
cached pointers to character locations in the hope that indexes will be
clustered rather than scattered randomly. These more complex
implementations typically use as much memory than, and performance is
often no better than, a dead simple implementation that uses a simple
array of fixed width characters. Most harmful, it means that a simple
indexing operation may be fast on average, and occasionally degenerate to
slow. The only thing worse than "This program is slow" is "This program
is *occasionally* slow, and I can't predict when".
Not surprising, almost all programming languages in common usage use
arrays of fixed-width characters as their native string type. Python 3.3
will now do the same thing, with the memory optimization that the fixed-
width will be per string and selected from 1, 2, or 4 bytes according to
the minimum width needed, rather than choosing between 2 and 4 bytes when
you build the Python compiler.
This is a big positive step forward with a pretty simple algorithm and
will strongly help encourage the use of Unicode by taking much of the
sting out of the perennial complaints that "Unicode wastes space". Not so
wasteful any more.
--
Steven
[toc] | [prev] | [next] | [standalone]
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-08-20 11:42 -0700 |
| Subject | Re: Abuse of Big Oh notation |
| Message-ID | <f0878bc4-539b-4570-a138-ea413e6d9fbb@googlegroups.com> |
| In reply to | #27477 |
By chance and luckily, first attempt.
IDLE, Windows 7.0 Pro 32, Pentium Dual Core 2.6, RAM 2 Go
Py 3.2.3
>>> timeit.repeat("('€'*100+'€'*100).replace('€', 'œ')")
[1.6939567134893707, 1.672874290786993, 1.6761219212298073]
Py 3.3.0b2
>>> timeit.repeat("('€'*100+'€'*100).replace('€', 'œ')")
[7.924470733910917, 7.8554985620787345, 7.878623849091914]
Console
c:\python32\python -m timeit "('€'*100+'€'*100).replace('€'
, 'œ')"
1000000 loops, best of 3: 1.48 usec per loop
c:\python33\python -m timeit "('€'*100+'€'*100).replace('€'
, 'œ')"
100000 loops, best of 3: 7.62 usec per loop
Note
The used characters are not members of the latin-1 coding
scheme (btw an *unusable* coding).
They are however charaters in cp1252 and mac-roman.
jmf
[toc] | [prev] | [next] | [standalone]
| From | Ned Deily <nad@acm.org> |
|---|---|
| Date | 2012-08-20 18:19 -0700 |
| Subject | Re: Abuse of Big Oh notation |
| Message-ID | <mailman.3586.1345511985.4697.python-list@python.org> |
| In reply to | #27524 |
In article <f0878bc4-539b-4570-a138-ea413e6d9fbb@googlegroups.com>, wxjmfauth@gmail.com wrote: > Note > The used characters are not members of the latin-1 coding > scheme (btw an *unusable* coding). > They are however charaters in cp1252 and mac-roman. mac-roman is an obsolete encoding that was used in MacOS 9 and MacOS Classic systems of previous decades. Except in a very small and shrinking number of legacy applications, it isn't used in modern systems and hasn't been widely used for a long time. MacOS X systems generally use standard Unicode encodings, usually UTF-8, for external representations. Forget about mac-roman; it is not relevant. -- Ned Deily, nad@acm.org
[toc] | [prev] | [next] | [standalone]
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2012-08-21 09:52 +0200 |
| Subject | Abuse of subject, was Re: Abuse of Big Oh notation |
| Message-ID | <mailman.3594.1345535550.4697.python-list@python.org> |
| In reply to | #27524 |
wxjmfauth@gmail.com wrote:
> By chance and luckily, first attempt.
> c:\python32\python -m timeit "('€'*100+'€'*100).replace('€'
> , 'œ')"
> 1000000 loops, best of 3: 1.48 usec per loop
> c:\python33\python -m timeit "('€'*100+'€'*100).replace('€'
> , 'œ')"
> 100000 loops, best of 3: 7.62 usec per loop
OK, that is roughly factor 5. Let's see what I get:
$ python3.2 -m timeit '("€"*100+"€"*100).replace("€", "œ")'
100000 loops, best of 3: 1.8 usec per loop
$ python3.3 -m timeit '("€"*100+"€"*100).replace("€", "œ")'
10000 loops, best of 3: 9.11 usec per loop
That is factor 5, too. So I can replicate your measurement on an AMD64 Linux
system with self-built 3.3 versus system 3.2.
> Note
> The used characters are not members of the latin-1 coding
> scheme (btw an *unusable* coding).
> They are however charaters in cp1252 and mac-roman.
You seem to imply that the slowdown is connected to the inability of latin-1
to encode "œ" and "€" (to take the examples relevant to the above
microbench). So let's repeat with latin-1 characters:
$ python3.2 -m timeit '("ä"*100+"ä"*100).replace("ä", "ß")'
100000 loops, best of 3: 1.76 usec per loop
$ python3.3 -m timeit '("ä"*100+"ä"*100).replace("ä", "ß")'
10000 loops, best of 3: 10.3 usec per loop
Hm, the slowdown is even a tad bigger. So we can safely dismiss your theory
that an unfortunate choice of the 8 bit encoding is causing it. Do you
agree?
[toc] | [prev] | [next] | [standalone]
Page 6 of 8 — ← Prev page 1 2 3 4 5 [6] 7 8 Next page →
Back to top | Article view | comp.lang.python
csiph-web