Groups > comp.lang.python > #27204 > unrolled thread
| Started by | Charles Jensen <hopefullycharles@gmail.com> |
|---|---|
| First post | 2012-08-16 15:09 -0700 |
| Last post | 2012-08-20 17:20 -0400 |
| Articles | 20 on this page of 145 — 26 participants |
How do I display unicode value stored in a string variable using ord() Charles Jensen <hopefullycharles@gmail.com> - 2012-08-16 15:09 -0700
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-17 08:20 +1000
Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-16 18:47 -0400
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-16 19:59 -0400
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 10:49 -0700
Re: How do I display unicode value stored in a string variable using ord() Jerry Hill <malaclypse2@gmail.com> - 2012-08-17 14:21 -0400
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 11:45 -0700
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 11:45 -0700
Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-17 16:55 -0400
Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-17 23:30 -0400
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 04:10 +0000
Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-18 09:18 -0600
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 03:59 +0000
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 10:49 -0700
Re: How do I display unicode value stored in a string variable using ord() Alister <alister.ware@ntlworld.com> - 2012-08-17 06:30 +0000
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 01:09 -0700
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 12:27 +0000
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 08:07 -0700
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 16:25 +0100
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 01:36 +1000
Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-18 09:51 -0600
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 09:38 -0700
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 02:57 +1000
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 18:28 +0100
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 11:05 -0700
Re: How do I display unicode value stored in a string variable using ord() MRAB <python@mrabarnett.plus.com> - 2012-08-18 19:34 +0100
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:35 +0000
New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord() Peter Otten <__peter__@web.de> - 2012-08-19 09:43 +0200
Re: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 08:56 +0000
Re: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 02:24 -0700
Re: New internal string format in 3.3 Peter Otten <__peter__@web.de> - 2012-08-19 11:37 +0200
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 03:19 -0700
Re: New internal string format in 3.3 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 13:33 +0000
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 03:19 -0700
Re: New internal string format in 3.3 Chris Angelico <rosuav@gmail.com> - 2012-08-19 20:26 +1000
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:14 -0700
Re: New internal string format in 3.3 Dave Angel <d@davea.name> - 2012-08-19 08:29 -0400
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:59 -0700
Re: New internal string format in 3.3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 14:46 +0100
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 07:09 -0700
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 07:09 -0700
Re: New internal string format in 3.3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 15:48 +0100
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 09:19 -0700
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 09:19 -0700
Re: New internal string format in 3.3 Terry Reedy <tjreedy@udel.edu> - 2012-08-19 13:48 -0400
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 10:51 -0700
Re: New internal string format in 3.3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 19:09 +0100
Re: New internal string format in 3.3 Chris Angelico <rosuav@gmail.com> - 2012-08-20 07:50 +1000
Re: New internal string format in 3.3 Michael Torrie <torriem@gmail.com> - 2012-08-19 23:38 -0600
Re: New internal string format in 3.3 Roy Smith <roy@panix.com> - 2012-08-20 09:17 -0400
Re: New internal string format in 3.3 Michael Torrie <torriem@gmail.com> - 2012-08-20 22:18 -0600
Re: New internal string format in 3.3 Roy Smith <roy@panix.com> - 2012-08-21 07:48 -0400
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 10:51 -0700
Re: New internal string format in 3.3 Terry Reedy <tjreedy@udel.edu> - 2012-08-19 13:56 -0400
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:59 -0700
Re: New internal string format in 3.3 Dave Angel <d@davea.name> - 2012-08-19 08:35 -0400
Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:14 -0700
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:30 +0000
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 11:05 -0700
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-18 16:09 -0400
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-18 23:12 -0400
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 09:38 -0700
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:33 +0000
Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-19 11:50 -0600
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 11:20 -0700
Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-19 12:31 -0600
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 12:23 -0700
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 20:16 +0000
Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-19 12:46 -0600
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 17:59 +0000
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 11:30 -0700
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 20:45 +0100
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:13 +0000
Re: How do I display unicode value stored in a string variable using ord() rusi <rustompmody@gmail.com> - 2012-08-18 11:40 -0700
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 20:50 +0100
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 13:22 -0700
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 22:37 +0100
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 11:26 -0700
Re: How do I display unicode value stored in a string variable using ord() MRAB <python@mrabarnett.plus.com> - 2012-08-18 19:59 +0100
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 07:17 +0000
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 10:46 +1000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 19:11 -0700
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 12:19 +1000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 19:35 -0700
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 13:01 +1000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 20:10 -0700
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 13:31 +1000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 22:58 -0700
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 08:01 +0000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 01:11 -0700
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 18:24 +1000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 01:44 -0700
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 01:54 -0700
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 11:46 +0100
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 12:31 -0400
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 10:51 +0000
Re: How do I display unicode value stored in a string variable using ord() Neil Hodgson <nhodgson@iinet.net.au> - 2012-08-21 17:03 +1000
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:09 +0000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 01:04 -0700
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 13:25 +0000
Re: How do I display unicode value stored in a string variable using ord() DJC <djc@news.invalid> - 2012-08-19 17:32 +0200
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 13:34 -0400
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 10:48 -0700
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 11:11 -0700
Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 19:50 +0100
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 17:59 -0400
Re: How do I display unicode value stored in a string variable using ord() rusi <rustompmody@gmail.com> - 2012-08-19 23:13 -0700
Abuse of Big Oh notation [was Re: How do I display unicode value stored in a string variable using ord()] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 20:15 +0000
Re: Abuse of Big Oh notation Paul Rubin <no.email@nospam.invalid> - 2012-08-19 16:42 -0700
Re: Abuse of Big Oh notation Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-08-20 09:24 +0100
Re: Abuse of Big Oh notation Paul Rubin <no.email@nospam.invalid> - 2012-08-20 09:01 -0700
Re: Abuse of Big Oh notation Chris Angelico <rosuav@gmail.com> - 2012-08-21 02:09 +1000
Re: Abuse of Big Oh notation Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-20 11:12 -0600
Re: Abuse of Big Oh notation Paul Rubin <no.email@nospam.invalid> - 2012-08-20 12:29 -0700
Re: Abuse of Big Oh notation 88888 Dihedral <dihedral88888@googlemail.com> - 2012-08-20 15:16 -0700
Re: Abuse of Big Oh notation 88888 Dihedral <dihedral88888@googlemail.com> - 2012-08-20 15:20 -0700
Re: Abuse of Big Oh notation Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-21 09:53 +0000
Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-20 11:42 -0700
Re: Abuse of Big Oh notation Ned Deily <nad@acm.org> - 2012-08-20 18:19 -0700
Abuse of subject, was Re: Abuse of Big Oh notation Peter Otten <__peter__@web.de> - 2012-08-21 09:52 +0200
Re: Abuse of subject, was Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-21 10:16 -0700
Re: Abuse of subject, was Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-21 10:16 -0700
Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-20 11:42 -0700
Re: How do I display unicode value stored in a string variable using ord() Hans Mulder <hansmu@xs4all.nl> - 2012-08-22 20:53 +0200
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-20 08:42 +1000
Re: How do I display unicode value stored in a string variable using ord() Roy Smith <roy@panix.com> - 2012-08-19 19:24 -0400
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-20 04:21 +0000
Re: How do I display unicode value stored in a string variable using ord() Roy Smith <roy@panix.com> - 2012-08-20 00:44 -0400
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-20 05:56 +0000
Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 23:24 -0700
Re: How do I display unicode value stored in a string variable using ord() Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-08-20 12:58 -0400
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 20:35 -0400
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-20 14:07 +1000
Re: How do I display unicode value stored in a string variable using ord() lipska the kat <lipskathekat@yahoo.co.uk> - 2012-08-19 11:13 +0100
Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 20:19 +1000
Re: How do I display unicode value stored in a string variable using ord() lipska the kat <lipskathekat@yahoo.co.uk> - 2012-08-19 11:49 +0100
Re: How do I display unicode value stored in a string variable using ord() "Blind Anagram" <noname@nowhere.com> - 2012-08-19 18:03 +0100
Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 10:33 -0700
Re: How do I display unicode value stored in a string variable using ord() "Blind Anagram" <noname@nowhere.com> - 2012-08-19 19:04 +0100
Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-19 14:05 -0400
Re: How do I display unicode value stored in a string variable usingord() "Blind Anagram" <noname@nowhere.com> - 2012-08-19 19:18 +0100
Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 20:31 +0000
Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 17:03 -0400
Re: How do I display unicode value stored in a string variable using ord() 88888 Dihedral <dihedral88888@googlemail.com> - 2012-08-19 17:32 -0700
Re: How do I display unicode value stored in a string variable using ord() Piet van Oostrum <piet@vanoostrum.org> - 2012-08-20 17:20 -0400
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-08-19 10:46 +1000 |
| Message-ID | <mailman.3477.1345337181.4697.python-list@python.org> |
| In reply to | #27322 |
On Sun, Aug 19, 2012 at 4:26 AM, Paul Rubin <no.email@nospam.invalid> wrote:
> Can you explain the issue of "breaking surrogate pairs apart" a little
> more? Switching between encodings based on the string contents seems
> silly at first glance. Strings are immutable so I don't understand why
> not use UTF-8 or UTF-16 for everything. UTF-8 is more efficient in
> Latin-based alphabets and UTF-16 may be more efficient for some other
> languages. I think even UCS-4 doesn't completely fix the surrogate pair
> issue if it means the only thing I can think of.
UTF-8 is highly inefficient for indexing. Given a buffer of (say) a few thousand bytes, how do you locate the 273rd character? You have to scan from the beginning. The same applies when surrogate pairs are used to represent single characters, unless the representation leaks and a surrogate is indexed as two - which is where the breaking-apart happens.
ChrisA
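As an illustration of the scan described above, here is a minimal sketch of locating the n-th code point in a UTF-8 buffer. The function name `utf8_offset_of_char` and the sample text are made up for this example, and the code assumes the buffer holds valid UTF-8.

```python
def utf8_offset_of_char(buf: bytes, n: int) -> int:
    """Return the byte offset at which the n-th code point (0-based) starts.

    Assumes buf is valid UTF-8. The loop has to look at every byte before
    the target, so the cost grows with n.
    """
    count = 0
    for offset, byte in enumerate(buf):
        # Bytes of the form 0b10xxxxxx are continuation bytes; any other
        # byte starts a new code point.
        if byte & 0xC0 != 0x80:
            if count == n:
                return offset
            count += 1
    raise IndexError("buffer has fewer than n + 1 code points")


text = "naïve café, 日本語, 𝄞 clef"
data = text.encode("utf-8")
offset = utf8_offset_of_char(data, 12)   # code point 12 is '日'
print(offset, data[offset:offset + 3].decode("utf-8"))
```

With a fixed-width encoding the same lookup would be a single multiplication; here the byte offset can only be found by counting lead bytes from the start.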
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-08-18 19:11 -0700 |
| Message-ID | <7xfw7j3a1x.fsf@ruckus.brouhaha.com> |
| In reply to | #27336 |
Chris Angelico <rosuav@gmail.com> writes:
> UTF-8 is highly inefficient for indexing. Given a buffer of (say) a
> few thousand bytes, how do you locate the 273rd character?
How often do you need to do that, as opposed to traversing the string by iteration? Anyway, you could use a rope-like implementation, or an index structure over the string.
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-08-19 12:19 +1000 |
| Message-ID | <mailman.3479.1345342743.4697.python-list@python.org> |
| In reply to | #27337 |
On Sun, Aug 19, 2012 at 12:11 PM, Paul Rubin <no.email@nospam.invalid> wrote:
> Chris Angelico <rosuav@gmail.com> writes:
>> UTF-8 is highly inefficient for indexing. Given a buffer of (say) a
>> few thousand bytes, how do you locate the 273rd character?
>
> How often do you need to do that, as opposed to traversing the string by
> iteration? Anyway, you could use a rope-like implementation, or an
> index structure over the string.
Well, imagine if Python strings were stored in UTF-8. How would you slice it?
>>> "asdfqwer"[4:]
'qwer'
That's a not uncommon operation when parsing strings or manipulating data. You'd need to completely rework your algorithms to maintain a position somewhere.
ChrisA
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-08-18 19:35 -0700 |
| Message-ID | <7xtxvzehhb.fsf@ruckus.brouhaha.com> |
| In reply to | #27338 |
Chris Angelico <rosuav@gmail.com> writes:
> >>> "asdfqwer"[4:]
> 'qwer'
>
> That's a not uncommon operation when parsing strings or manipulating
> data. You'd need to completely rework your algorithms to maintain a
> position somewhere.
Scanning 4 characters (or a few dozen, say) to peel off a token in parsing a UTF-8 string is no big deal. It gets more expensive if you want to index far more deeply into the string. I'm asking how often that is done in real code. Obviously one can concoct hypothetical examples that would suffer.
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-08-19 13:01 +1000 |
| Message-ID | <mailman.3481.1345345309.4697.python-list@python.org> |
| In reply to | #27340 |
On Sun, Aug 19, 2012 at 12:35 PM, Paul Rubin <no.email@nospam.invalid> wrote:
> Chris Angelico <rosuav@gmail.com> writes:
>> >>> "asdfqwer"[4:]
>> 'qwer'
>>
>> That's a not uncommon operation when parsing strings or manipulating
>> data. You'd need to completely rework your algorithms to maintain a
>> position somewhere.
>
> Scanning 4 characters (or a few dozen, say) to peel off a token in
> parsing a UTF-8 string is no big deal. It gets more expensive if you
> want to index far more deeply into the string. I'm asking how often
> that is done in real code. Obviously one can concoct hypothetical
> examples that would suffer.
Sure, four characters isn't a big deal to step through. But it still makes indexing and slicing operations O(N) instead of O(1), plus you'd have to zark the whole string up to where you want to work. It'd be workable, but you'd have to redo your algorithms significantly; I don't have a Python example of parsing a huge string, but I've done it in other languages, and when I can depend on indexing being a cheap operation, I'll happily do exactly that.
ChrisA
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-08-18 20:10 -0700 |
| Message-ID | <7x7gsv4lw4.fsf@ruckus.brouhaha.com> |
| In reply to | #27342 |
Chris Angelico <rosuav@gmail.com> writes:
> Sure, four characters isn't a big deal to step through. But it still
> makes indexing and slicing operations O(N) instead of O(1), plus you'd
> have to zark the whole string up to where you want to work.
I know some systems chop the strings into blocks of (say) a few hundred chars, so you can immediately get to the correct block, then scan into the block to get to the desired char offset.
> I don't have a Python example of parsing a huge string, but I've done
> it in other languages, and when I can depend on indexing being a cheap
> operation, I'll happily do exactly that.
I'd be interested to know what the context was, where you parsed a big unicode string in a way that required random access to the nth character in the string.
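The block scheme mentioned above might look roughly like the sketch below. The class `IndexedUTF8`, its method names, and the block size of 256 are all assumptions made for illustration; this is not how any particular system implements it, and it assumes Python 3.3+ so that `len(text)` counts code points.

```python
class IndexedUTF8:
    """UTF-8 bytes plus a sparse table of byte offsets, one per block."""

    def __init__(self, text: str, block: int = 256):
        self.data = text.encode("utf-8")
        self.block = block            # code points per index entry (arbitrary)
        self.length = len(text)       # number of code points
        # offsets[k] is the byte offset of code point number k * block.
        self.offsets = []
        count = 0
        for pos, byte in enumerate(self.data):
            if byte & 0xC0 != 0x80:   # lead byte: a new code point starts here
                if count % block == 0:
                    self.offsets.append(pos)
                count += 1

    def byte_offset(self, index: int) -> int:
        """Byte offset of code point `index`; index == length means the end."""
        if not 0 <= index <= self.length:
            raise IndexError(index)
        if index == self.length:
            return len(self.data)
        pos = self.offsets[index // self.block]   # jump straight to the block
        remaining = index % self.block            # then scan inside the block
        while remaining:
            pos += 1
            if self.data[pos] & 0xC0 != 0x80:
                remaining -= 1
        return pos

    def slice(self, start: int, stop: int) -> str:
        lo, hi = self.byte_offset(start), self.byte_offset(stop)
        return self.data[lo:hi].decode("utf-8")


sample = "aé日𝄞" * 200
indexed = IndexedUTF8(sample, block=64)
print(indexed.slice(397, 403))
```

Lookup cost is bounded by the block size rather than the string length, at the price of one extra integer per block and a more complex object, which is the trade-off debated in the rest of the thread.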
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-08-19 13:31 +1000 |
| Message-ID | <mailman.3483.1345347084.4697.python-list@python.org> |
| In reply to | #27343 |
On Sun, Aug 19, 2012 at 1:10 PM, Paul Rubin <no.email@nospam.invalid> wrote:
> Chris Angelico <rosuav@gmail.com> writes:
>> I don't have a Python example of parsing a huge string, but I've done
>> it in other languages, and when I can depend on indexing being a cheap
>> operation, I'll happily do exactly that.
>
> I'd be interested to know what the context was, where you parsed
> a big unicode string in a way that required random access to
> the nth character in the string.
It's something I've done in C/C++ fairly often. Take one big fat buffer, slice it and dice it as you get the information you want out of it. I'll retain and/or calculate indices (when I'm not using pointers, but that's a different kettle of fish). Generally, I'm working with pure ASCII, but port those same algorithms to Python and you'll easily be able to read in a file in some known encoding and manipulate it as Unicode.
It's not so much 'random access to the nth character' as an efficient way of jumping forward. For instance, if I know that the next thing is a literal string of n characters (that I don't care about), I want to skip over that and keep parsing. The Adobe Message Format is particularly noteworthy in this, but it's a stupid format and I don't recommend people spend too much time reading up on it (unless you like that sensation of your brain trying to escape through your ear).
ChrisA
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-08-18 22:58 -0700 |
| Message-ID | <7xfw7jv2x5.fsf@ruckus.brouhaha.com> |
| In reply to | #27345 |
Chris Angelico <rosuav@gmail.com> writes:
> Generally, I'm working with pure ASCII, but port those same algorithms
> to Python and you'll easily be able to read in a file in some known
> encoding and manipulate it as Unicode.
If it's pure ASCII, you can use the bytes or bytearray type.
> It's not so much 'random access to the nth character' as an efficient
> way of jumping forward. For instance, if I know that the next thing is
> a literal string of n characters (that I don't care about), I want to
> skip over that and keep parsing.
I don't understand how this is supposed to work. You're going to read a large unicode text file (let's say it's UTF-8) into a single big string? So the runtime library has to scan the encoded contents to find the highest numbered codepoint (let's say it's mostly ascii but has a few characters outside the BMP), expand it all (in this case) to UCS-4 giving 4x memory bloat and requiring decoding all the UTF-8 regardless, and now we should worry about the efficiency of skipping n characters?
Since you have to decode the n characters regardless, I'd think this skipping part should only be an issue if you have to do it a lot of times.
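The width selection described above (find the highest code point, then pick a per-character width for the whole string) can be sketched in a few lines. The function name `pep393_width` is hypothetical; CPython does this in C during string construction, and the sketch ignores object headers and assumes Python 3.3+.

```python
def pep393_width(text: str) -> int:
    """Bytes per code point that a PEP 393 style representation would use."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        return 1    # every character fits in Latin-1
    if highest < 0x10000:
        return 2    # every character is in the Basic Multilingual Plane
    return 4        # at least one character is in a supplementary plane


mostly_ascii = "x" * 1000 + "𝄞"   # one non-BMP character in 1001
width = pep393_width(mostly_ascii)
print(width, "bytes per character,", width * len(mostly_ascii), "bytes of payload")
```

This shows the case raised in the post: a single character outside the BMP widens every character in that string to four bytes, while strings without such characters are unaffected.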
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-08-19 08:01 +0000 |
| Message-ID | <50309d69$0$29978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #27340 |
On Sat, 18 Aug 2012 19:35:44 -0700, Paul Rubin wrote:
> Scanning 4 characters (or a few dozen, say) to peel off a token in
> parsing a UTF-8 string is no big deal. It gets more expensive if you
> want to index far more deeply into the string. I'm asking how often
> that is done in real code.
It happens all the time.
Let's say you've got a bunch of text, and you use a regex to scan through
it looking for a match. Let's ignore the regular expression engine, since
it has to look at every character anyway. But you've done your search and
found your matching text and now want everything *after* it. That's not
exactly an unusual use-case.
    mo = re.search(pattern, text)
    if mo:
        start, end = mo.span()
        result = text[end:]
Easy-peasy, right? But behind the scenes, you have a problem: how does
Python know where text[end:] starts? With fixed-size characters, that's
O(1): Python just moves forward end*width bytes into the string. Nice and
fast.
With variable-sized characters, Python has to start from the beginning
again, and inspect each byte or pair of bytes. This turns the slice
operation into O(N) and the combined op (search + slice) into O(N**2),
and that starts getting *horrible*.
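To make the contrast concrete, here is a small sketch that performs the same "everything after the match" slice on a fixed-width buffer (UTF-32) and on a variable-width one (UTF-8). The sample text and pattern are invented for the example, and the byte buffers only stand in for an internal string representation; they are not what CPython stores.

```python
import re

text = "Größe: 42 µm; 日本語のテキスト"
mo = re.search(r"\d+ µm; ", text)
end = mo.end()                       # index in code points

# Fixed width: the byte offset is a single multiplication.
utf32 = text.encode("utf-32-le")     # 4 bytes per code point, no BOM
tail_fixed = utf32[end * 4:].decode("utf-32-le")

# Variable width: walk from the start, counting code points.
utf8 = text.encode("utf-8")
pos = count = 0
while count < end:
    pos += 1                         # step past the lead byte
    while pos < len(utf8) and utf8[pos] & 0xC0 == 0x80:
        pos += 1                     # skip continuation bytes
    count += 1
tail_variable = utf8[pos:].decode("utf-8")

print(tail_fixed == tail_variable == text[end:])
```

Both paths give the same result; the difference is that the second has to touch every byte before `end` just to find where the slice starts.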
As always, "everything is fast for small enough N", but you *really*
don't want O(N**2) operations when dealing with large amounts of data.
Insisting that the regex functions only ever return offsets to valid
character boundaries doesn't help you, because the string slice method
cannot know where the indexes came from.
I suppose you could have a "fast slice" and a "slow slice" method, but
really, that sucks, and besides all that does is pass responsibility for
tracking character boundaries to the developer instead of the language,
and you know damn well that they will get it wrong and their code will
silently do the wrong thing and they'll say that Python sucks and we
never used to have this problem back in the good old days with ASCII. Boo
sucks to that.
UCS-4 is an option, since that's fixed-width. But it's also bulky. For
typical users, you end up wasting memory. That is the complaint driving
PEP 393 -- memory is cheap, but it's not so cheap that you can afford to
multiply your string memory by four just in case somebody someday gives
you a character in one of the supplementary planes.
If you have oodles of memory and small data sets, then UCS-4 is probably
all you'll ever need. I hear that the club for people who have all the
memory they'll ever need is holding their annual general meeting in a
phone-booth this year.
You could say "Screw the full Unicode standard, who needs more than 64K
different characters anyway?" Well apart from Asians, and historians, and
a bunch of other people. If you can control your data and make sure no
non-BMP characters are used, UCS-2 is fine -- except Python doesn't
actually use that.
You could do what Python 3.2 narrow builds do: use UTF-16 and leave it up
to the individual programmer to track character boundaries, and we know
how well that works. Luckily the supplementary planes are only rarely
used, and people who need them tend to buy more memory and use wide
builds. People who only need a few non-BMP characters in a narrow build
generally just cross their fingers and hope for the best.
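The boundary problem with UTF-16 can be shown without a narrow build by looking at the 16-bit code units directly. This is only an emulation, assuming Python 3.3+; the helper `utf16_units` is a name made up for the example.

```python
import struct


def utf16_units(text: str) -> list:
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


s = "a𝄞b"                 # U+1D11E MUSICAL SYMBOL G CLEF is outside the BMP
units = utf16_units(s)

print(len(s))              # code points
print(len(units))          # 16-bit code units
print([hex(u) for u in units])

# Indexing by code unit, as a narrow build does, can land inside the pair:
# units[1] is a high surrogate and units[2] a low surrogate, and neither is
# a character on its own.
```

A slice such as `units[:2]` ends between the two surrogates, which is the "breaking surrogate pairs apart" discussed earlier in the thread.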
You could add a whole lot more heavyweight infrastructure to strings,
turn them into souped-up ropes-on-steroids. All those extra indexes mean
that you don't save any memory. Because the objects are so much bigger
and more complex, your CPU cache goes to the dogs and your code still
runs slow.
Which leaves us right back where we started, PEP 393.
> Obviously one can concoct hypothetical examples that would suffer.
If you think "slicing at arbitrary indexes" is a hypothetical example, I
don't know what to say.
--
Steven
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-08-19 01:11 -0700 |
| Message-ID | <7x4nnzmhbn.fsf@ruckus.brouhaha.com> |
| In reply to | #27359 |
Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
> result = text[end:]
If end is not near the end of the original string, then this is O(N) even with a fixed-width representation, because of the char copying.
If it is near the end, then by knowing where the string data area ends, I think it should be possible to scan backwards from the end, recognizing which bytes can be the beginning of code points and counting off the appropriate number. This is O(1) if "near the end" means "within a constant".
> You could say "Screw the full Unicode standard, who needs more than 64K
No, if you're claiming the language supports Unicode, it should be the whole standard.
> You could do what Python 3.2 narrow builds do: use UTF-16 and leave it
> up to the individual programmer to track character boundaries,
I'm surprised the Python 3 implementers even considered that approach, much less went ahead with it. It's obviously wrong.
> You could add a whole lot more heavyweight infrastructure to strings,
> turn them into souped-up ropes-on-steroids.
I'm not persuaded that PEP 393 isn't even worse.
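The backward scan suggested above relies on UTF-8 continuation bytes being recognizable on their own. A minimal sketch, assuming valid UTF-8 and using a hypothetical helper name `utf8_tail_offset`:

```python
def utf8_tail_offset(buf: bytes, k: int) -> int:
    """Return the byte offset at which the last k code points of buf begin.

    Assumes buf is valid UTF-8. The work is proportional to k, not to
    len(buf), because the scan starts at the end of the buffer.
    """
    pos = len(buf)
    while k > 0:
        if pos == 0:
            raise IndexError("buffer has fewer than k code points")
        pos -= 1
        # Step back over continuation bytes (0b10xxxxxx) to the lead byte.
        while pos > 0 and buf[pos] & 0xC0 == 0x80:
            pos -= 1
        k -= 1
    return pos


s = "héllo 日本語𝄞"
data = s.encode("utf-8")
start = utf8_tail_offset(data, 4)
print(data[start:].decode("utf-8"))   # the last four code points
```

This only helps when the slice is expressed relative to the end, and it needs the end of the data to be known up front.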
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-08-19 18:24 +1000 |
| Message-ID | <mailman.3487.1345364700.4697.python-list@python.org> |
| In reply to | #27361 |
On Sun, Aug 19, 2012 at 6:11 PM, Paul Rubin <no.email@nospam.invalid> wrote:
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
>> result = text[end:]
>
> if end not near the end of the original string, then this is O(N)
> even with fixed-width representation, because of the char copying.
>
> if it is near the end, by knowing where the string data area
> ends, I think it should be possible to scan backwards from
> the end, recognizing what bytes can be the beginning of code points and
> counting off the appropriate number. This is O(1) if "near the end"
> means "within a constant".
Only if you know exactly where the end is (which requires storing and
maintaining a character length - this may already be happening, I
don't know). But that approach means you need to have code for both
ways (forward search or reverse), and of course it relies on your
encoding being reverse-scannable in this way (as UTF-8 is, but not
all).
And of course, taking the *entire* rest of the string isn't the only
thing you do. What if you want to take the next six characters after
that index? That would be constant time with a fixed-width storage
format.
ChrisA
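For readers who want to see the backwards scan concretely, here is a minimal sketch. It assumes the bytes are valid UTF-8 and that the byte length is known. The function name is made up for illustration and is not part of any Python API. It relies only on the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx.

```python
def utf8_offset_from_end(data, k):
    """Return the byte offset at which the last k code points of data begin.

    Assumes data is valid UTF-8. The work done is proportional to k,
    not to len(data).
    """
    if k <= 0:
        return len(data)
    pos = len(data)
    remaining = k
    while remaining > 0:
        if pos == 0:
            raise IndexError("fewer than k code points in data")
        pos -= 1
        # Skip back over continuation bytes (0b10xxxxxx) to the lead byte.
        while pos > 0 and (data[pos] & 0xC0) == 0x80:
            pos -= 1
        remaining -= 1
    return pos


text = "na\u00efve caf\u00e9 \U0001d11e!"
data = text.encode("utf-8")
tail = data[utf8_offset_from_end(data, 3):].decode("utf-8")
print(tail == text[-3:])
```

This only covers slicing relative to the end. Taking six characters from an arbitrary index in the middle still needs a scan from one end or the other.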
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-08-19 01:44 -0700 |
| Message-ID | <7xy5lb9soz.fsf@ruckus.brouhaha.com> |
| In reply to | #27362 |
Chris Angelico <rosuav@gmail.com> writes:
> And of course, taking the *entire* rest of the string isn't the only
> thing you do. What if you want to take the next six characters after
> that index? That would be constant time with a fixed-width storage
> format.
How often is this an issue in practice?
I wonder how other languages deal with this. The examples I can think
of are poor role models:
1. C/C++ - unicode impaired, other than a wchar type
2. Java - bogus UCS-2-like(?) representation for historical reasons
Also has some modified UTF-8 for reasons that made no sense and
that I don't remember
3. Haskell - basic string type is a linked list of code points.
"hello" is five list nodes. New Data.Text library (much more
efficient) uses something like ropes, I think, with UTF-16 underneath.
4. Erlang - I think like Haskell. Efficiently handles byte blocks.
5. Perl 6 -- ???
6. Ruby - ??? (but probably quite slow like the rest of Ruby)
7. Objective C -- ???
8, 9 ... (any other important ones?)
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-08-19 01:54 -0700 |
| Message-ID | <bb45c0f1-4042-4653-b791-c216031a4d71@googlegroups.com> |
| In reply to | #27363 |
About the examples contested by Steven:
eg: timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
And it is good enough to show the problem. Period. The
rest (you have to do this, you should not do this, why
are you using these characters - amazing and stupid
question -) does not count.
The real problem is elsewhere. *Americans* do not want
a character to occupy 4 bytes in *their* memory. The rest
of the world does not count.
The same thing happens with the utf-8 coding scheme.
Technically, it is fine. But after n years of usage,
one should recognize it just became an ascii2. Especially
for those who understand nothing in that field and are
not even aware that characters are "coded". I'm the first
to think, this is legitimate.
Memory or "ability to treat all text in the same and equal
way"?
End note. This kind of discussion is not specific to
Python, it always happens when there is some kind of
conflict between ascii and non ascii users.
Have a nice day.
jmf
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2012-08-19 11:46 +0100 |
| Message-ID | <mailman.3494.1345373121.4697.python-list@python.org> |
| In reply to | #27365 |
On 19/08/2012 09:54, wxjmfauth@gmail.com wrote:
> About the examples contested by Steven:
>
> eg: timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
>
>
> And it is good enough to show the problem. Period. The
> rest (you have to do this, you should not do this, why
> are you using these characters - amazing and stupid
> question -) does not count.
>
> The real problem is elsewhere. *Americans* do not want
> a character to occupy 4 bytes in *their* memory. The rest
> of the world does not count.
>
> The same thing happens with the utf-8 coding scheme.
> Technically, it is fine. But after n years of usage,
> one should recognize it just became an ascii2. Especially
> for those who understand nothing in that field and are
> not even aware that characters are "coded". I'm the first
> to think, this is legitimate.
>
> Memory or "ability to treat all text in the same and equal
> way"?
>
> End note. This kind of discussion is not specific to
> Python, it always happens when there is some kind of
> conflict between ascii and non ascii users.
>
> Have a nice day.
>
> jmf
>
Roughly translated. "I've been shot to pieces and having seen Monty
Python and the Holy Grail I know what to do. Run away, run away"
--
Cheers.
Mark Lawrence.
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2012-08-19 12:31 -0400 |
| Message-ID | <mailman.3508.1345393941.4697.python-list@python.org> |
| In reply to | #27365 |
On 8/19/2012 4:54 AM, wxjmfauth@gmail.com wrote:
> About the examples contested by Steven:
> eg: timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
> And it is good enough to show the problem. Period.
Repeating a false claim over and over does not make it true. Two people
on pydev claim that 3.3 is *faster* on their systems (one unspecified,
one OSX10.8).
--
Terry Jan Reedy
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-08-19 10:51 +0000 |
| Message-ID | <5030c52d$0$29978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #27361 |
On Sun, 19 Aug 2012 01:11:56 -0700, Paul Rubin wrote:
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
>> result = text[end:]
>
> if end not near the end of the original string, then this is O(N) even
> with fixed-width representation, because of the char copying.
Technically, yes. But it's a straight copy of a chunk of memory, which
means it's fast: your OS and hardware try to make straight memory
copies as fast as possible. Big-Oh analysis frequently glosses over
implementation details like that.
Of course, that assumption gets shaky when you start talking about extra
large blocks, and it falls apart completely when your OS starts paging
memory to disk.
But if it helps to avoid irrelevant technical details, change it to
text[end:end+10] or something.
> if it is near the end, by knowing where the string data area ends, I
> think it should be possible to scan backwards from the end, recognizing
> what bytes can be the beginning of code points and counting off the
> appropriate number. This is O(1) if "near the end" means "within a
> constant".
You know, I think you are misusing Big-Oh analysis here. It really
wouldn't be helpful for me to say "Bubble Sort is O(1) if you only sort
lists with a single item". Well, yes, that is absolutely true, but that's
a special case that doesn't give you any insight into why using Bubble
Sort as your general purpose sort routine is a terrible idea.
Using variable-sized strings like UTF-8 and UTF-16 for in-memory
representations is a terrible idea because you can't assume that people
will only ever want to index the first or last character. On average,
you need to scan half the string, one character at a time. In Big-Oh, we
can ignore the factor of 1/2 and just say we scan the string, O(N).
That's why languages tend to use fixed character arrays for strings.
Haskell is an exception, using linked lists which require traversing the
string to jump to an index.
The Data.Text manual even warns:
[quote]
If you think of a Text value as an array of Char values (which it is
not), you run the risk of writing inefficient code.
An idiom that is common in some languages is to find the numeric offset
of a character or substring, then use that number to split or trim the
searched string. With a Text value, this approach would require two O(n)
operations: one to perform the search, and one to operate from wherever
the search ended.
[end quote]
http://hackage.haskell.org/packages/archive/text/0.11.2.2/doc/html/Data-Text.html
--
Steven
| From | Neil Hodgson <nhodgson@iinet.net.au> |
|---|---|
| Date | 2012-08-21 17:03 +1000 |
| Message-ID | <3bOdnbu1sNbdrq7NnZ2dnUVZ_vWdnZ2d@westnet.com.au> |
| In reply to | #27379 |
Steven D'Aprano:
> Using variable-sized strings like UTF-8 and UTF-16 for in-memory
> representations is a terrible idea because you can't assume that people
> will only ever want to index the first or last character. On average,
> you need to scan half the string, one character at a time. In Big-Oh, we
> can ignore the factor of 1/2 and just say we scan the string, O(N).
In the majority of cases you can remove excessive scanning by
caching the most recent index->offset result. If the next index request
is nearer the cached index than to the beginning then iterate from that
offset. This converts many operations from quadratic to linear. Locality
of reference is common and can often be reasonably exploited.
However, exposing the variable length nature of UTF-8 allows the
application to choose efficient techniques for more cases.
Neil
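A rough sketch of the caching idea, for illustration only: the class and attribute names below are invented, the input is assumed to be valid UTF-8, and the `steps` counter exists purely to make the cost visible. A lookup starts from the cached (index, offset) pair whenever that is nearer to the requested index than the start of the string is.

```python
def _is_continuation(byte):
    # UTF-8 continuation bytes have the form 0b10xxxxxx.
    return (byte & 0xC0) == 0x80


class CachedUtf8:
    """UTF-8 bytes plus a one-entry cache of the last index->offset result."""

    def __init__(self, text):
        self.data = text.encode("utf-8")
        self.length = len(text)
        self._index = 0    # code point index of the cached position
        self._offset = 0   # byte offset of that code point
        self.steps = 0     # code points stepped over (illustration only)

    def _seek(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        # Use the cache only if it is nearer to the target than the start.
        if abs(index - self._index) < index:
            i, off = self._index, self._offset
        else:
            i, off = 0, 0
        data = self.data
        while i < index:                      # walk forwards
            off += 1
            while off < len(data) and _is_continuation(data[off]):
                off += 1
            i += 1
            self.steps += 1
        while i > index:                      # walk backwards
            off -= 1
            while _is_continuation(data[off]):
                off -= 1
            i -= 1
            self.steps += 1
        self._index, self._offset = i, off
        return off

    def __getitem__(self, index):
        start = self._seek(index)
        end = start + 1
        while end < len(self.data) and _is_continuation(self.data[end]):
            end += 1
        return self.data[start:end].decode("utf-8")


text = "a\u00e9\U0001d11ez" * 50
s = CachedUtf8(text)
chars = [s[i] for i in range(len(text))]
print(chars == list(text), s.steps)
```

With sequential access each lookup steps over at most one code point, so a full pass is linear rather than quadratic. A genuinely random access pattern gets no help from the cache.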
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-08-19 06:09 +0000 |
| Message-ID | <5030832d$0$29978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #27322 |
This is a long post. If you don't feel like reading an essay, skip to the
very bottom and read my last few paragraphs, starting with "To recap".
On Sat, 18 Aug 2012 11:26:21 -0700, Paul Rubin wrote:
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
>> (There is an extension to UCS-2, UTF-16, which encodes non-BMP
>> characters using two code points. This is fragile and doesn't work very
>> well, because string-handling methods can break the surrogate pairs
>> apart, leaving you with invalid unicode string. Not good.)
> ...
>> With PEP 393, each Python string will be stored in the most efficient
>> format possible:
>
> Can you explain the issue of "breaking surrogate pairs apart" a little
> more? Switching between encodings based on the string contents seems
> silly at first glance.
Forget encodings! We're not talking about encodings. Encodings are used
for converting text as bytes for transmission over the wire or storage on
disk. PEP 393 talks about the internal representation of text within
Python, the C-level data structure.
In 3.2, that data structure depends on a compile-time switch. In a
"narrow build", text is stored using two-bytes per character, so the
string "len" (as in the name of the built-in function) will be stored as
006c 0065 006e
(or possibly 6c00 6500 6e00, depending on whether your system is
LittleEndian or BigEndian), plus object-overhead, which I shall ignore.
Since most identifiers are ASCII, that's already using twice as much
memory as needed. This standard data structure is called UCS-2, and it
only handles characters in the Basic Multilingual Plane, the BMP (roughly
the first 64000 Unicode code points). I'll come back to that.
In a "wide build", text is stored as four-bytes per character, so "len"
is stored as either:
0000006c 00000065 0000006e
6c000000 65000000 6e000000
Now memory is cheap, but it's not *that* cheap, and no matter how much
memory you have, you can always use more.
This system is called UCS-4, and it can handle the entire Unicode
character set, for now and forever. (If we ever need more than four-bytes
worth of characters, it won't be called Unicode.)
Remember I said that UCS-2 can only handle the 64K characters
[technically: code points] in the Basic Multilingual Plane? There's an
extension to UCS-2 called UTF-16 which extends it to the entire Unicode
range. Yes, that's the same name as the UTF-16 encoding, because it's
more or less the same system.
UTF-16 says "let's represent characters in the BMP by two bytes, but
characters outside the BMP by four bytes." There's a neat trick to this:
the BMP doesn't use the entire two-byte range, so there are some byte
pairs which are illegal in UCS-2 -- they don't correspond to *any*
character. UTF-16 used those byte pairs to signal "this is half a
character, you need to look at the next pair for the rest of the
character".
Nifty hey? These pairs-of-pseudocharacters are called "surrogate pairs".
Except this comes at a big cost: you can no longer tell how long a string
is by counting the number of bytes, which is fast, because sometimes four
bytes is two characters and sometimes it's one and you can't tell which
it will be until you actually inspect all four bytes.
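To make the surrogate trick concrete, here is a small sketch of the arithmetic. The two helper names are made up for illustration. The constants 0xD800, 0xDC00 and 0x10000 are the ones the UTF-16 scheme defines: the code point minus 0x10000 is a 20-bit number, and each surrogate carries ten of those bits.

```python
import struct


def to_surrogates(code_point):
    """Split a non-BMP code point into its (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000           # a 20-bit value
    high = 0xD800 + (offset >> 10)          # top ten bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom ten bits
    return high, low


def from_surrogates(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x10000)
print(hex(high), hex(low))

# Cross-check against Python's own UTF-16 codec.
encoded = chr(0x1D11E).encode("utf-16-be")
print(struct.unpack(">HH", encoded) == to_surrogates(0x1D11E))
```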
Copying sub-strings now becomes either slow, or buggy. Say you want to
grab the 5th character in a string. The fast way using UCS-2 is to
simply grab bytes 8 and 9 (remember characters are pairs of bytes and we
start counting at zero) and you're done. Fast and safe if you're willing
to give up the non-BMP characters.
It's also fast and safe if you use UCS-4, but then everything takes twice
as much space, so you probably end up spending so much time copying null
bytes that you're probably slower anyway. Especially when your OS starts
paging memory like mad.
But in UTF-16, indexing can be fast or safe but not both. Maybe bytes 8
and 9 are half of a surrogate pair, and you've now split the pair and
ended up with an invalid string. That's what Python 3.2 does, it fails to
handle surrogate pairs properly:
py> s = chr(0xFFFF + 1)
py> a, b = s
py> a
'\ud800'
py> b
'\udc00'
I've just split a single valid Unicode character into two invalid
characters. Python 3.2 will (probably) mindlessly process those two
non-characters, and the only sign I have that I did something wrong is
that my data is now junk.
Since any character can be a surrogate pair, you have to scan every pair
of bytes in order to index a string, or work out its length, or copy a
substring. It's not enough to just check if the last pair is a surrogate.
When you don't, you have bugs like this from Python 3.2:
py> s = "01234" + chr(0xFFFF + 1) + "6789"
py> s[9] == '9'
False
py> s[9], len(s)
('8', 11)
Which is now fixed in Python 3.3.
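If you only have Python 3.3 or later to hand, you can still see where the narrow-build numbers come from by looking at the UTF-16 code units yourself. This is a sketch, not a reproduction of the 3.2 internals: `utf16_units` is a made-up helper that encodes the string and unpacks it into 16-bit units, which is roughly what a narrow build indexed.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


s = "01234" + chr(0xFFFF + 1) + "6789"
units = utf16_units(s)

# Code points (what 3.3+ counts) versus code units (what a narrow build saw).
print(len(s), len(units))
# Index 9 under each view.
print(s[9], chr(units[9]))
```

The two views disagree on both the length and on what lives at index 9, because the single non-BMP character occupies two code units.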
So variable-width data structures like UTF-8 or UTF-16 are crap for the
internal representation of strings -- they are either fast or correct but
cannot be both.
But UCS-2 is sub-optimal, because it can only handle the BMP, and UCS-4
is too because ASCII-only strings like identifiers end up being four
times as big as they need to be. 1-byte schemes like Latin-1 are
unspeakable because they only handle 256 characters, fewer if you don't
count the C0 and C1 control codes.
PEP 393 to the rescue! What if you could encode pure-ASCII strings like
"len" using one byte per character, and BMP strings using two bytes per
character (UCS-2), and fall back to four bytes (UCS-4) only when you
really need it?
The benefits are:
* Americans and English-Canadians and Australians and other barbarians of
that ilk who only use ASCII save a heap of memory;
* people who mostly use BMP characters only pay the cost of four bytes
per character for strings that actually *need* four bytes per
character;
* people who use lots of non-BMP characters are no worse off.
The costs are:
* string routines need to be smarter -- they have to handle three
different data structures (one-byte ASCII/Latin-1, UCS-2, UCS-4) instead
of just one;
* there's a certain amount of overhead when creating a string -- you have
to work out which in-memory format to use, and that's not necessarily
trivial, but at least it's a once-off cost when you create the string;
* people who misunderstand what's going on get all upset over micro-
benchmarks.
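The "work out which in-memory format to use" step can be sketched in a few lines. This is an illustration of the idea only, not CPython's actual C code, and `storage_width` is a made-up name. One detail worth stating as a fact rather than an assumption: PEP 393 uses the one-byte form for every code point up to U+00FF, not just for ASCII.

```python
def storage_width(text):
    """Return the bytes per character (1, 2 or 4) needed for text.

    The width is set by the widest code point in the whole string,
    which is why finding it costs one pass at creation time.
    """
    widest = max(map(ord, text), default=0)
    if widest <= 0xFF:
        return 1      # ASCII and the rest of Latin-1
    if widest <= 0xFFFF:
        return 2      # anything else in the BMP
    return 4          # at least one supplementary-plane character


for sample in ("len", "\u03a9mega", "a\U00010000"):
    print(ascii(sample), storage_width(sample))
```

Because strings are immutable, the width never has to be recomputed after the string is built.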
> Strings are immutable so I don't understand why
> not use UTF-8 or UTF-16 for everything. UTF-8 is more efficient in
> Latin-based alphabets and UTF-16 may be more efficient for some other
> languages. I think even UCS-4 doesn't completely fix the surrogate pair
> issue if it means the only thing I can think of.
To recap:
* Variable-byte formats like UTF-8 and UTF-16 mean that basic string
operations are not O(1) but are O(N). That means they are slow, or buggy,
pick one.
* Fixed width UCS-2 doesn't handle the full Unicode range, only the BMP.
That's better than it sounds: the BMP supports most character sets, but
not all. Still, there are people who need the supplementary planes, and
UCS-2 lets them down.
* Fixed width UCS-4 does handle the full Unicode range, without
surrogates, but at the cost of using 2-4 times more string memory for the
vast majority of users.
* PEP 393 doesn't use variable-width characters, but variable-width
strings. Instead of choosing between 1, 2 and 4 bytes per character, it
chooses *per string*. This keeps basic string operations O(1) instead of
O(N), saves memory where possible, while still supporting the full
Unicode range without a compile-time option.
--
Steven
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-08-19 01:04 -0700 |
| Message-ID | <7x8vdbmho6.fsf@ruckus.brouhaha.com> |
| In reply to | #27349 |
Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
> This is a long post. If you don't feel like reading an essay, skip to the
> very bottom and read my last few paragraphs, starting with "To recap".
I'm very flattered that you took the trouble to write that excellent
exposition of different Unicode encodings in response to my post. I can
only hope some readers will benefit from it. I regret that I wasn't
more clear about the perspective I posted from, i.e. that I'm already
familiar with how those encodings work.
After reading all of it, I still have the same skepticism on the main
point as before, but I think I see what the issue in contention is, and
some differences in perspective. First of all, you wrote:
> This standard data structure is called UCS-2 ... There's an extension
> to UCS-2 called UTF-16
My own understanding is UCS-2 simply shouldn't be used any more.
Unicode was historically supposed to be a 16-bit character set, but that
turned out to not be enough, so the supplementary planes were added.
UCS-2 thus became obsolete and UTF-16 superseded it in 1996. UTF-16 in
turn is rather clumsy and the later UTF-8 is better in a lot of ways,
but both of these are at least capable of encoding all the character
codes.
On to the main issue:
> * Variable-byte formats like UTF-8 and UTF-16 mean that basic string
> operations are not O(1) but are O(N). That means they are slow, or buggy,
> pick one.
This I don't see. What are the basic string operations?
* Examine the first character, or first few characters ("few" = "usually
bounded by a small constant") such as to parse a token from an input
stream. This is O(1) with either encoding.
* Slice off the first N characters. This is O(N) with either encoding
if it involves copying the chars. I guess you could share references
into the same string, but if the slice reference persists while the
big reference is released, you end up not freeing the memory until
later than you really should.
* Concatenate two strings. O(N) either way.
* Find length of string. O(1) either way since you'd store it in
the string header when you build the string in the first place.
Building the string has to have been an O(N) operation in either
representation.
And finally:
* Access the nth char in the string for some large random n, or maybe
get a small slice from some random place in a big string. This is
where fixed-width representation is O(1) while variable-width is O(N).
What I'm not convinced of, is that the last thing happens all that
often.
Meanwhile, an example of the 393 approach failing: I was involved in a
project that dealt with terabytes of OCR data of mostly English text.
So the chars were mostly ascii, but there would be occasional non-ascii
chars including supplementary plane characters, either because of
special symbols that were really in the text, or the typical OCR
confusion emitting those symbols due to printing imprecision. That's a
natural for UTF-8 but the PEP-393 approach would bloat up the memory
requirements by a factor of 4.
py> s = chr(0xFFFF + 1)
py> a, b = s
That looks like Python 3.2 is buggy and that sample should just throw an
error. s is a one-character string and should not be unpackable.
I realize the folks who designed and implemented PEP 393 are very smart
cookies and considered stuff carefully, while I'm just an internet user
posting an immediate impression of something I hadn't seen before (I
still use Python 2.6), but I still have to ask: if the 393 approach
makes sense, why don't other languages do it?
Ropes of UTF-8 segments seems like the most obvious approach and I
wonder if it was considered. By that I mean pick some implementation
constant k (say k=128) and represent the string as a UTF-8 encoded byte
array, accompanied by a vector of n//k pointers into the byte array, where
n is the number of codepoints in the string. Then you can reach any
offset analogously to reading a random byte on a disk, by seeking to the
appropriate block, and then reading the block and getting the char you
want within it. Random access is then O(1) though the constant is
higher than it would be with fixed width encoding.
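A rough sketch of that block-index idea, to make the proposal concrete. The class name is invented, the input is assumed to be valid UTF-8, and this version stores an offset for block 0 as well, so it keeps one entry per started block rather than exactly n//k. A lookup jumps to the start of the right block and then steps over at most k-1 code points.

```python
class IndexedUtf8:
    """UTF-8 bytes plus the byte offset of every k-th code point."""

    def __init__(self, text, k=128):
        self.k = k
        self.data = text.encode("utf-8")
        self.length = len(text)
        self.block_offsets = []   # offsets of code points 0, k, 2k, ...
        count = 0
        for offset, byte in enumerate(self.data):
            if (byte & 0xC0) != 0x80:         # a lead byte starts a code point
                if count % k == 0:
                    self.block_offsets.append(offset)
                count += 1

    def __getitem__(self, n):
        if not 0 <= n < self.length:
            raise IndexError(n)
        data = self.data
        offset = self.block_offsets[n // self.k]   # seek to the block
        for _ in range(n % self.k):                # scan within the block
            offset += 1
            while (data[offset] & 0xC0) == 0x80:
                offset += 1
        end = offset + 1
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end += 1
        return data[offset:end].decode("utf-8")


text = "h\u00e9llo \U0001d11e w\u00f6rld " * 40
s = IndexedUtf8(text, k=16)
print(all(s[i] == text[i] for i in range(len(text))))
print(len(s.block_offsets))
```

The index costs one machine word per k characters, and the scan inside a block is bounded by k, which is where the "O(1) with a higher constant" claim comes from.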
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-08-19 13:25 +0000 |
| Message-ID | <5030e939$0$29978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #27360 |
On Sun, 19 Aug 2012 01:04:25 -0700, Paul Rubin wrote:
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
>> This standard data structure is called UCS-2 ... There's an extension
>> to UCS-2 called UTF-16
>
> My own understanding is UCS-2 simply shouldn't be used any more.
Pretty much. But UTF-16 with lax support for surrogates (that is,
surrogates are included but treated as two characters) is essentially
UCS-2 with the restriction against surrogates lifted. That's what Python
currently does, and Javascript.
http://mathiasbynens.be/notes/javascript-encoding
The reality is that support for the Unicode supplementary planes is
pretty poor. Even when applications support it, most fonts don't have
glyphs for the characters. Anything which makes handling of Unicode
supplementary characters better is a step forward.
>> * Variable-byte formats like UTF-8 and UTF-16 mean that basic string
>> operations are not O(1) but are O(N). That means they are slow, or
>> buggy, pick one.
>
> This I don't see. What are the basic string operations?
The ones I'm specifically referring to are indexing and copying
substrings. There may be others.
> * Examine the first character, or first few characters ("few" = "usually
> bounded by a small constant") such as to parse a token from an input
> stream. This is O(1) with either encoding.
That's actually O(K), for K = "a few", whatever "a few" means. But we
know that anything is fast for small enough N (or K in this case).
> * Slice off the first N characters. This is O(N) with either encoding
> if it involves copying the chars. I guess you could share references
> into the same string, but if the slice reference persists while the
> big reference is released, you end up not freeing the memory until
> later than you really should.
As a first approximation, memory copying is assumed to be free, or at
least constant time. That's not strictly true, but Big Oh analysis is
looking at algorithmic complexity. It's not a substitute for actual
benchmarks.
> Meanwhile, an example of the 393 approach failing: I was involved in a
> project that dealt with terabytes of OCR data of mostly English text.
I assume that this wasn't one giant multi-terabyte string.
> So
> the chars were mostly ascii, but there would be occasional non-ascii
> chars including supplementary plane characters, either because of
> special symbols that were really in the text, or the typical OCR
> confusion emitting those symbols due to printing imprecision. That's a
> natural for UTF-8 but the PEP-393 approach would bloat up the memory
> requirements by a factor of 4.
Not necessarily. Presumably you're scanning each page into a single
string. Then only the pages containing a supplementary plane char will be
bloated, which is likely to be rare. Especially since I don't expect your
OCR application would recognise many non-BMP characters -- what does
U+110F3, "SORA SOMPENG DIGIT THREE", look like? If the OCR software
doesn't recognise it, you can't get it in your output. (If you do, the
OCR software has a nasty bug.)
Anyway, in my ignorant opinion the proper fix here is to tell the OCR
software not to bother trying to recognise Imperial Aramaic, Domino
Tiles, Phaistos Disc symbols, or Egyptian Hieroglyphs if you aren't
expecting them in your source material. Not only will the scanning go
faster, but you'll get fewer wrong characters.
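Some back-of-the-envelope arithmetic for the per-page argument. Everything here is hypothetical: 100 made-up pages of ASCII text, one of which contains a single supplementary-plane character, and only the character payload is counted (object overhead is ignored). The helper names are invented for the sketch.

```python
def pep393_payload(text):
    """Bytes of character data if one width (1, 2 or 4) is chosen per string."""
    widest = max(map(ord, text), default=0)
    width = 1 if widest <= 0xFF else 2 if widest <= 0xFFFF else 4
    return width * len(text)


def utf8_payload(text):
    """Bytes of character data if the string is held as UTF-8."""
    return len(text.encode("utf-8"))


# 100 hypothetical OCR pages of ASCII; only the last has one non-BMP char.
pages = ["plain ascii text " * 200 for _ in range(100)]
pages[-1] += "\U000110F3"   # SORA SOMPENG DIGIT THREE

utf8_total = sum(utf8_payload(p) for p in pages)
per_page_total = sum(pep393_payload(p) for p in pages)
one_string_total = pep393_payload("".join(pages))
print(utf8_total, per_page_total, one_string_total)
```

Under these assumptions the factor-of-4 bloat only appears if everything is joined into one string. With one string per page, only the affected page is widened.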
[...]
> I realize the folks who designed and implemented PEP 393 are very smart
> cookies and considered stuff carefully, while I'm just an internet user
> posting an immediate impression of something I hadn't seen before (I
> still use Python 2.6), but I still have to ask: if the 393 approach
> makes sense, why don't other languages do it?
There has to be a first time for everything.
> Ropes of UTF-8 segments seems like the most obvious approach and I
> wonder if it was considered.
Ropes have been considered and rejected because while they are
asymptotically fast, in common cases the added complexity actually makes
them slower. Especially for immutable strings where you aren't inserting
into the middle of a string.
http://mail.python.org/pipermail/python-dev/2000-February/002321.html
PyPy has revisited ropes and uses, or at least used, ropes as their
native string data structure. But that's ropes of *bytes*, not UTF-8.
http://morepypy.blogspot.com.au/2007/11/ropes-branch-merged.html
--
Steven