
How do I display unicode value stored in a string variable using ord()

Started by	Charles Jensen <hopefullycharles@gmail.com>
First post	2012-08-16 15:09 -0700
Last post	2012-08-20 17:20 -0400
Articles	20 on this page of 145 — 26 participants


  How do I display unicode value stored in a string variable using ord() Charles Jensen <hopefullycharles@gmail.com> - 2012-08-16 15:09 -0700
    Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-17 08:20 +1000
    Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-16 18:47 -0400
    Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-16 19:59 -0400
      Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 10:49 -0700
        Re: How do I display unicode value stored in a string variable using ord() Jerry Hill <malaclypse2@gmail.com> - 2012-08-17 14:21 -0400
          Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 11:45 -0700
            Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-17 16:55 -0400
            Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-17 23:30 -0400
              Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 04:10 +0000
                Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-18 09:18 -0600
            Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 03:59 +0000
      Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 10:49 -0700
    Re: How do I display unicode value stored in a string variable using ord() Alister <alister.ware@ntlworld.com> - 2012-08-17 06:30 +0000
    Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 01:09 -0700
      Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 12:27 +0000
        Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 08:07 -0700
          Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 16:25 +0100
          Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 01:36 +1000
          Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-18 09:51 -0600
            Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 09:38 -0700
              Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 02:57 +1000
              Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 18:28 +0100
                Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 11:05 -0700
                  Re: How do I display unicode value stored in a string variable using ord() MRAB <python@mrabarnett.plus.com> - 2012-08-18 19:34 +0100
                    Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:35 +0000
                      New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord() Peter Otten <__peter__@web.de> - 2012-08-19 09:43 +0200
                        Re: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 08:56 +0000
                          Re: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 02:24 -0700
                          Re: New internal string format in 3.3 Peter Otten <__peter__@web.de> - 2012-08-19 11:37 +0200
                            Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 03:19 -0700
                              Re: New internal string format in 3.3 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 13:33 +0000
                            Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 03:19 -0700
                              Re: New internal string format in 3.3 Chris Angelico <rosuav@gmail.com> - 2012-08-19 20:26 +1000
                                Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:14 -0700
                                  Re: New internal string format in 3.3 Dave Angel <d@davea.name> - 2012-08-19 08:29 -0400
                                    Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:59 -0700
                                      Re: New internal string format in 3.3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 14:46 +0100
                                        Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 07:09 -0700
                                          Re: New internal string format in 3.3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 15:48 +0100
                                            Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 09:19 -0700
                                          Re: New internal string format in 3.3 Terry Reedy <tjreedy@udel.edu> - 2012-08-19 13:48 -0400
                                            Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 10:51 -0700
                                              Re: New internal string format in 3.3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 19:09 +0100
                                              Re: New internal string format in 3.3 Chris Angelico <rosuav@gmail.com> - 2012-08-20 07:50 +1000
                                              Re: New internal string format in 3.3 Michael Torrie <torriem@gmail.com> - 2012-08-19 23:38 -0600
                                                Re: New internal string format in 3.3 Roy Smith <roy@panix.com> - 2012-08-20 09:17 -0400
                                                  Re: New internal string format in 3.3 Michael Torrie <torriem@gmail.com> - 2012-08-20 22:18 -0600
                                                    Re: New internal string format in 3.3 Roy Smith <roy@panix.com> - 2012-08-21 07:48 -0400
                                            Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 10:51 -0700
                                      Re: New internal string format in 3.3 Terry Reedy <tjreedy@udel.edu> - 2012-08-19 13:56 -0400
                                    Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:59 -0700
                                  Re: New internal string format in 3.3 Dave Angel <d@davea.name> - 2012-08-19 08:35 -0400
                                Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:14 -0700
                  Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:30 +0000
                Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 11:05 -0700
              Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-18 16:09 -0400
              Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-18 23:12 -0400
            Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 09:38 -0700
            Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:33 +0000
              Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-19 11:50 -0600
                Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 11:20 -0700
                  Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-19 12:31 -0600
                    Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 12:23 -0700
                Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 20:16 +0000
              Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-19 12:46 -0600
          Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 17:59 +0000
            Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 11:30 -0700
              Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 20:45 +0100
              Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:13 +0000
            Re: How do I display unicode value stored in a string variable using ord() rusi <rustompmody@gmail.com> - 2012-08-18 11:40 -0700
              Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 20:50 +0100
              Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 13:22 -0700
                Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 22:37 +0100
        Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 11:26 -0700
          Re: How do I display unicode value stored in a string variable using ord() MRAB <python@mrabarnett.plus.com> - 2012-08-18 19:59 +0100
            Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 07:17 +0000
          Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 10:46 +1000
            Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 19:11 -0700
              Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 12:19 +1000
                Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 19:35 -0700
                  Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 13:01 +1000
                    Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 20:10 -0700
                      Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 13:31 +1000
                        Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 22:58 -0700
                  Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 08:01 +0000
                    Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 01:11 -0700
                      Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 18:24 +1000
                        Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 01:44 -0700
                          Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 01:54 -0700
                            Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 11:46 +0100
                            Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 12:31 -0400
                      Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 10:51 +0000
                        Re: How do I display unicode value stored in a string variable using ord() Neil Hodgson <nhodgson@iinet.net.au> - 2012-08-21 17:03 +1000
          Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:09 +0000
            Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 01:04 -0700
              Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 13:25 +0000
                Re: How do I display unicode value stored in a string variable using ord() DJC <djc@news.invalid> - 2012-08-19 17:32 +0200
              Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 13:34 -0400
                Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 10:48 -0700
                  Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 11:11 -0700
                    Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 19:50 +0100
                    Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 17:59 -0400
                    Re: How do I display unicode value stored in a string variable using ord() rusi <rustompmody@gmail.com> - 2012-08-19 23:13 -0700
                  Abuse of Big Oh notation [was Re: How do I display unicode value stored in a string variable using ord()] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 20:15 +0000
                    Re: Abuse of Big Oh notation Paul Rubin <no.email@nospam.invalid> - 2012-08-19 16:42 -0700
                      Re: Abuse of Big Oh notation Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-08-20 09:24 +0100
                        Re: Abuse of Big Oh notation Paul Rubin <no.email@nospam.invalid> - 2012-08-20 09:01 -0700
                          Re: Abuse of Big Oh notation Chris Angelico <rosuav@gmail.com> - 2012-08-21 02:09 +1000
                          Re: Abuse of Big Oh notation Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-20 11:12 -0600
                            Re: Abuse of Big Oh notation Paul Rubin <no.email@nospam.invalid> - 2012-08-20 12:29 -0700
                              Re: Abuse of Big Oh notation 88888 Dihedral <dihedral88888@googlemail.com> - 2012-08-20 15:16 -0700
                              Re: Abuse of Big Oh notation 88888 Dihedral <dihedral88888@googlemail.com> - 2012-08-20 15:20 -0700
                            Re: Abuse of Big Oh notation Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-21 09:53 +0000
                        Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-20 11:42 -0700
                          Re: Abuse of Big Oh notation Ned Deily <nad@acm.org> - 2012-08-20 18:19 -0700
                          Abuse of subject, was Re: Abuse of Big Oh notation Peter Otten <__peter__@web.de> - 2012-08-21 09:52 +0200
                            Re: Abuse of subject, was Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-21 10:16 -0700
                        Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-20 11:42 -0700
                  Re: How do I display unicode value stored in a string variable using ord() Hans Mulder <hansmu@xs4all.nl> - 2012-08-22 20:53 +0200
              Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-20 08:42 +1000
                Re: How do I display unicode value stored in a string variable using ord() Roy Smith <roy@panix.com> - 2012-08-19 19:24 -0400
                  Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-20 04:21 +0000
                    Re: How do I display unicode value stored in a string variable using ord() Roy Smith <roy@panix.com> - 2012-08-20 00:44 -0400
                      Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-20 05:56 +0000
                        Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 23:24 -0700
                    Re: How do I display unicode value stored in a string variable using ord() Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-08-20 12:58 -0400
              Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 20:35 -0400
              Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-20 14:07 +1000
            Re: How do I display unicode value stored in a string variable using ord() lipska the kat <lipskathekat@yahoo.co.uk> - 2012-08-19 11:13 +0100
              Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 20:19 +1000
                Re: How do I display unicode value stored in a string variable using ord() lipska the kat <lipskathekat@yahoo.co.uk> - 2012-08-19 11:49 +0100
        Re: How do I display unicode value stored in a string variable using ord() "Blind Anagram" <noname@nowhere.com> - 2012-08-19 18:03 +0100
          Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 10:33 -0700
            Re: How do I display unicode value stored in a string variable using ord() "Blind Anagram" <noname@nowhere.com> - 2012-08-19 19:04 +0100
          Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-19 14:05 -0400
            Re: How do I display unicode value stored in a string variable usingord() "Blind Anagram" <noname@nowhere.com> - 2012-08-19 19:18 +0100
          Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 20:31 +0000
          Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 17:03 -0400
          Re: How do I display unicode value stored in a string variable using ord() 88888 Dihedral <dihedral88888@googlemail.com> - 2012-08-19 17:32 -0700
          Re: How do I display unicode value stored in a string variable using ord() Piet van Oostrum <piet@vanoostrum.org> - 2012-08-20 17:20 -0400


#27336

From	Chris Angelico <rosuav@gmail.com>
Date	2012-08-19 10:46 +1000
Message-ID	<mailman.3477.1345337181.4697.python-list@python.org>
In reply to	#27322

On Sun, Aug 19, 2012 at 4:26 AM, Paul Rubin <no.email@nospam.invalid> wrote:
> Can you explain the issue of "breaking surrogate pairs apart" a little
> more?  Switching between encodings based on the string contents seems
> silly at first glance.  Strings are immutable so I don't understand why
> not use UTF-8 or UTF-16 for everything.  UTF-8 is more efficient in
> Latin-based alphabets and UTF-16 may be more efficient for some other
> languages.  I think even UCS-4 doesn't completely fix the surrogate pair
> issue if it means the only thing I can think of.

UTF-8 is highly inefficient for indexing. Given a buffer of (say) a
few thousand bytes, how do you locate the 273rd character? You have to
scan from the beginning. The same applies when surrogate pairs are
used to represent single characters, unless the representation leaks
and a surrogate is indexed as two - which is where the breaking-apart
happens.
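
A minimal sketch of the scan described above, assuming the buffer holds valid UTF-8. The function name utf8_offset_of_char and the sample text are made up for illustration; this is not how any Python implementation stores strings.

```python
def utf8_offset_of_char(buf: bytes, n: int) -> int:
    """Return the byte offset where character number n (0-based) starts."""
    seen = 0
    for offset, byte in enumerate(buf):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts a character.
        if byte & 0xC0 != 0x80:
            if seen == n:
                return offset
            seen += 1
    raise IndexError("character index out of range")


sample = "naïve café €uro"
data = sample.encode("utf-8")

# Character 4 is 'e'; the two-byte 'ï' before it pushes its byte offset to 5.
offset = utf8_offset_of_char(data, 4)
tail = data[offset:].decode("utf-8")
```

The loop touches every byte before the target, so the cost grows with the index rather than staying constant.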

ChrisA


#27337

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-08-18 19:11 -0700
Message-ID	<7xfw7j3a1x.fsf@ruckus.brouhaha.com>
In reply to	#27336

Chris Angelico <rosuav@gmail.com> writes:
> UTF-8 is highly inefficient for indexing. Given a buffer of (say) a
> few thousand bytes, how do you locate the 273rd character? 

How often do you need to do that, as opposed to traversing the string by
iteration?  Anyway, you could use a rope-like implementation, or an
index structure over the string.


#27338

From	Chris Angelico <rosuav@gmail.com>
Date	2012-08-19 12:19 +1000
Message-ID	<mailman.3479.1345342743.4697.python-list@python.org>
In reply to	#27337

On Sun, Aug 19, 2012 at 12:11 PM, Paul Rubin <no.email@nospam.invalid> wrote:
> Chris Angelico <rosuav@gmail.com> writes:
>> UTF-8 is highly inefficient for indexing. Given a buffer of (say) a
>> few thousand bytes, how do you locate the 273rd character?
>
> How often do you need to do that, as opposed to traversing the string by
> iteration?  Anyway, you could use a rope-like implementation, or an
> index structure over the string.

Well, imagine if Python strings were stored in UTF-8. How would you slice it?

>>> "asdfqwer"[4:]
'qwer'

That's a not uncommon operation when parsing strings or manipulating
data. You'd need to completely rework your algorithms to maintain a
position somewhere.

ChrisA


#27340

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-08-18 19:35 -0700
Message-ID	<7xtxvzehhb.fsf@ruckus.brouhaha.com>
In reply to	#27338

Chris Angelico <rosuav@gmail.com> writes:
>>>> "asdfqwer"[4:]
> 'qwer'
>
> That's a not uncommon operation when parsing strings or manipulating
> data. You'd need to completely rework your algorithms to maintain a
> position somewhere.

Scanning 4 characters (or a few dozen, say) to peel off a token in
parsing a UTF-8 string is no big deal.  It gets more expensive if you
want to index far more deeply into the string.  I'm asking how often
that is done in real code.  Obviously one can concoct hypothetical
examples that would suffer.


#27342

From	Chris Angelico <rosuav@gmail.com>
Date	2012-08-19 13:01 +1000
Message-ID	<mailman.3481.1345345309.4697.python-list@python.org>
In reply to	#27340

On Sun, Aug 19, 2012 at 12:35 PM, Paul Rubin <no.email@nospam.invalid> wrote:
> Chris Angelico <rosuav@gmail.com> writes:
>>>>> "asdfqwer"[4:]
>> 'qwer'
>>
>> That's a not uncommon operation when parsing strings or manipulating
>> data. You'd need to completely rework your algorithms to maintain a
>> position somewhere.
>
> Scanning 4 characters (or a few dozen, say) to peel off a token in
> parsing a UTF-8 string is no big deal.  It gets more expensive if you
> want to index far more deeply into the string.  I'm asking how often
> that is done in real code.  Obviously one can concoct hypothetical
> examples that would suffer.

Sure, four characters isn't a big deal to step through. But it still
makes indexing and slicing operations O(N) instead of O(1), plus you'd
have to zark the whole string up to where you want to work. It'd be
workable, but you'd have to redo your algorithms significantly; I
don't have a Python example of parsing a huge string, but I've done it
in other languages, and when I can depend on indexing being a cheap
operation, I'll happily do exactly that.

ChrisA


#27343

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-08-18 20:10 -0700
Message-ID	<7x7gsv4lw4.fsf@ruckus.brouhaha.com>
In reply to	#27342

Chris Angelico <rosuav@gmail.com> writes:
> Sure, four characters isn't a big deal to step through. But it still
> makes indexing and slicing operations O(N) instead of O(1), plus you'd
> have to zark the whole string up to where you want to work.

I know some systems chop the strings into blocks of (say) a few
hundred chars, so you can immediately get to the correct
block, then scan into the block to get to the desired char offset.
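
One way such a block scheme might look, as a rough sketch only. The class name BlockIndexedUtf8, the default block size of 256 and the method names are assumptions for illustration, not taken from any particular system.

```python
class BlockIndexedUtf8:
    """UTF-8 bytes plus a table of byte offsets, one per block of characters."""

    def __init__(self, text: str, block_size: int = 256):
        self.data = text.encode("utf-8")
        self.block_size = block_size
        # Byte offset of character 0, block_size, 2 * block_size, ...
        self.block_offsets = []
        chars_seen = 0
        for offset, byte in enumerate(self.data):
            if byte & 0xC0 != 0x80:  # lead byte: a new character starts here
                if chars_seen % block_size == 0:
                    self.block_offsets.append(offset)
                chars_seen += 1
        self.length = chars_seen

    def byte_offset(self, index: int) -> int:
        """Byte offset of character `index`; scans at most one block."""
        if not 0 <= index <= self.length:
            raise IndexError("character index out of range")
        if index == self.length:
            return len(self.data)
        block, remaining = divmod(index, self.block_size)
        offset = self.block_offsets[block]  # jump straight to the block
        while remaining:
            offset += 1
            if self.data[offset] & 0xC0 != 0x80:
                remaining -= 1
        return offset

    def slice_from(self, index: int) -> str:
        """Equivalent of text[index:] on the original string."""
        return self.data[self.byte_offset(index):].decode("utf-8")
```

The scan is bounded by the block size, at the price of one stored offset per block and a pass over the data to build the table.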

> I don't have a Python example of parsing a huge string, but I've done
> it in other languages, and when I can depend on indexing being a cheap
> operation, I'll happily do exactly that.

I'd be interested to know what the context was, where you parsed
a big unicode string in a way that required random access to
the nth character in the string.


#27345

From	Chris Angelico <rosuav@gmail.com>
Date	2012-08-19 13:31 +1000
Message-ID	<mailman.3483.1345347084.4697.python-list@python.org>
In reply to	#27343

On Sun, Aug 19, 2012 at 1:10 PM, Paul Rubin <no.email@nospam.invalid> wrote:
> Chris Angelico <rosuav@gmail.com> writes:
>> I don't have a Python example of parsing a huge string, but I've done
>> it in other languages, and when I can depend on indexing being a cheap
>> operation, I'll happily do exactly that.
>
> I'd be interested to know what the context was, where you parsed
> a big unicode string in a way that required random access to
> the nth character in the string.

It's something I've done in C/C++ fairly often. Take one big fat
buffer, slice it and dice it as you get the information you want out
of it. I'll retain and/or calculate indices (when I'm not using
pointers, but that's a different kettle of fish). Generally, I'm
working with pure ASCII, but port those same algorithms to Python and
you'll easily be able to read in a file in some known encoding and
manipulate it as Unicode.

It's not so much 'random access to the nth character' as an efficient
way of jumping forward. For instance, if I know that the next thing is
a literal string of n characters (that I don't care about), I want to
skip over that and keep parsing. The Adobe Message Format is
particularly noteworthy in this, but it's a stupid format and I don't
recommend people spend too much time reading up on it (unless you like
that sensation of your brain trying to escape through your ear).

ChrisA


#27347

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-08-18 22:58 -0700
Message-ID	<7xfw7jv2x5.fsf@ruckus.brouhaha.com>
In reply to	#27345

Chris Angelico <rosuav@gmail.com> writes:
> Generally, I'm working with pure ASCII, but port those same algorithms
> to Python and you'll easily be able to read in a file in some known
> encoding and manipulate it as Unicode.

If it's pure ASCII, you can use the bytes or bytearray type.  

> It's not so much 'random access to the nth character' as an efficient
> way of jumping forward. For instance, if I know that the next thing is
> a literal string of n characters (that I don't care about), I want to
> skip over that and keep parsing.

I don't understand how this is supposed to work.  You're going to read a
large unicode text file (let's say it's UTF-8) into a single big string?
So the runtime library has to scan the encoded contents to find the
highest numbered codepoint (let's say it's mostly ascii but has a few
characters outside the BMP), expand it all (in this case) to UCS-4
giving 4x memory bloat and requiring decoding all the UTF-8 regardless,
and now we should worry about the efficiency of skipping n characters?

Since you have to decode the n characters regardless, I'd think this
skipping part should only be an issue if you have to do it a lot of
times.
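
The width selection described above can be sketched roughly as follows. The function name narrowest_width is hypothetical, and the thresholds follow the 1/2/4-byte scheme of PEP 393 as I understand it; the real work happens in C inside the interpreter.

```python
def narrowest_width(text: str) -> int:
    """Bytes per character needed to store every code point at a fixed width."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        return 1   # fits in Latin-1
    if highest < 0x10000:
        return 2   # fits in the Basic Multilingual Plane
    return 4       # at least one supplementary-plane character


mostly_ascii = "plain text " * 1000 + "\U0001F600"
width = narrowest_width(mostly_ascii)
```

A single character outside the BMP in otherwise ASCII text forces the widest representation for the whole string, which is the 4x case raised here.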


#27359

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-08-19 08:01 +0000
Message-ID	<50309d69$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to	#27340

On Sat, 18 Aug 2012 19:35:44 -0700, Paul Rubin wrote:

> Scanning 4 characters (or a few dozen, say) to peel off a token in
> parsing a UTF-8 string is no big deal.  It gets more expensive if you
> want to index far more deeply into the string.  I'm asking how often
> that is done in real code.

It happens all the time.

Let's say you've got a bunch of text, and you use a regex to scan through 
it looking for a match. Let's ignore the regular expression engine, since 
it has to look at every character anyway. But you've done your search and 
found your matching text and now want everything *after* it. That's not 
exactly an unusual use-case.

import re

mo = re.search(pattern, text)  # pattern and text are defined elsewhere
if mo:
    start, end = mo.span()
    result = text[end:]  # everything after the match

Easy-peasy, right? But behind the scenes, you have a problem: how does 
Python know where text[end:] starts? With fixed-size characters, that's 
O(1): Python just moves forward end*width bytes into the string. Nice and 
fast.

With variable-sized characters, Python has to start from the beginning 
again, and inspect each byte or pair of bytes. This turns the slice 
operation into O(N), and a loop that repeats the search + slice over the 
text into O(N**2), and that starts getting *horrible*.
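
To make the fixed-width case concrete, here is a small sketch that uses a UTF-32 byte buffer as a stand-in for a fixed-width internal representation. WIDTH, slice_from_fixed and the sample text are assumptions for illustration, and it assumes Python 3.3 or later, where str indexes count code points.

```python
import re

WIDTH = 4  # bytes per character in UTF-32


def slice_from_fixed(buf: bytes, end: int) -> bytes:
    """Everything after character `end`: one multiplication, no scanning."""
    return buf[end * WIDTH:]


text = "αβγ: δεζ \U0001F600"
buf = text.encode("utf-32-le")

mo = re.search(":", text)
end = mo.end()
rest = slice_from_fixed(buf, end).decode("utf-32-le")
```

Finding the start of the slice costs the same whether the match ends at character 4 or character 4 million.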

As always, "everything is fast for small enough N", but you *really* 
don't want O(N**2) operations when dealing with large amounts of data.

Insisting that the regex functions only ever return offsets to valid 
character boundaries doesn't help you, because the string slice method 
cannot know where the indexes came from.

I suppose you could have a "fast slice" and a "slow slice" method, but 
really, that sucks, and besides all that does is pass responsibility for 
tracking character boundaries to the developer instead of the language, 
and you know damn well that they will get it wrong and their code will 
silently do the wrong thing and they'll say that Python sucks and we 
never used to have this problem back in the good old days with ASCII. Boo 
sucks to that.

UCS-4 is an option, since that's fixed-width. But it's also bulky. For 
typical users, you end up wasting memory. That is the complaint driving 
PEP 393 -- memory is cheap, but it's not so cheap that you can afford to 
multiply your string memory by four just in case somebody someday gives 
you a character in one of the supplementary planes.

If you have oodles of memory and small data sets, then UCS-4 is probably 
all you'll ever need. I hear that the club for people who have all the 
memory they'll ever need is holding their annual general meeting in a 
phone-booth this year.

You could say "Screw the full Unicode standard, who needs more than 64K 
different characters anyway?" Well apart from Asians, and historians, and 
a bunch of other people. If you can control your data and make sure no 
non-BMP characters are used, UCS-2 is fine -- except Python doesn't 
actually use that.

You could do what Python 3.2 narrow builds do: use UTF-16 and leave it up 
to the individual programmer to track character boundaries, and we know 
how well that works. Luckily the supplementary planes are only rarely 
used, and people who need them tend to buy more memory and use wide 
builds. People who only need a few non-BMP characters in a narrow build 
generally just cross their fingers and hope for the best.

You could add a whole lot more heavyweight infrastructure to strings, 
turn them into souped-up ropes-on-steroids. All those extra indexes mean 
that you don't save any memory. Because the objects are so much bigger 
and more complex, your CPU cache goes to the dogs and your code still 
runs slow.

Which leaves us right back where we started, PEP 393.

> Obviously one can concoct hypothetical examples that would suffer.

If you think "slicing at arbitrary indexes" is a hypothetical example, I 
don't know what to say.

-- 
Steven


#27361

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-08-19 01:11 -0700
Message-ID	<7x4nnzmhbn.fsf@ruckus.brouhaha.com>
In reply to	#27359

Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
>     result = text[end:]

if end not near the end of the original string, then this is O(N)
even with fixed-width representation, because of the char copying.

if it is near the end, by knowing where the string data area
ends, I think it should be possible to scan backwards from
the end, recognizing what bytes can be the beginning of code points and
counting off the appropriate number.  This is O(1) if "near the end"
means "within a constant".
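
A rough sketch of the backward scan described above, assuming the string is a valid UTF-8 byte buffer whose end is known. The function name is hypothetical.

```python
def utf8_offset_from_end(data: bytes, count: int) -> int:
    """Byte offset at which the last `count` code points of `data` begin.

    Assumes `data` is valid UTF-8. Works backwards from the end, so the
    cost depends on `count`, not on the total length of the buffer.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    offset = len(data)
    while count > 0:
        if offset == 0:
            raise IndexError("fewer code points than requested")
        offset -= 1
        # Step back over continuation bytes (0b10xxxxxx) until we
        # reach the byte that starts this code point.
        while offset > 0 and (data[offset] & 0xC0) == 0x80:
            offset -= 1
        count -= 1
    return offset


data = "naïve café".encode("utf-8")
print(data[utf8_offset_from_end(data, 4):].decode("utf-8"))
```

This only works because UTF-8 lead bytes and continuation bytes can be told apart by looking at a single byte.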

> You could say "Screw the full Unicode standard, who needs more than 64K 

No if you're claiming the language supports unicode it should be
the whole standard.

> You could do what Python 3.2 narrow builds do: use UTF-16 and leave it
> up to the individual programmer to track character boundaries,

I'm surprised the Python 3 implementers even considered that approach
much less went ahead with it.  It's obviously wrong.

> You could add a whole lot more heavyweight infrastructure to strings,
> turn them into souped-up ropes-on-steroids.

I'm not persuaded that PEP 393 isn't even worse.


#27362

From	Chris Angelico <rosuav@gmail.com>
Date	2012-08-19 18:24 +1000
Message-ID	<mailman.3487.1345364700.4697.python-list@python.org>
In reply to	#27361

On Sun, Aug 19, 2012 at 6:11 PM, Paul Rubin <no.email@nospam.invalid> wrote:
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
>>     result = text[end:]
>
> if end not near the end of the original string, then this is O(N)
> even with fixed-width representation, because of the char copying.
>
> if it is near the end, by knowing where the string data area
> ends, I think it should be possible to scan backwards from
> the end, recognizing what bytes can be the beginning of code points and
> counting off the appropriate number.  This is O(1) if "near the end"
> means "within a constant".

Only if you know exactly where the end is (which requires storing and
maintaining a character length - this may already be happening, I
don't know). But that approach means you need to have code for both
ways (forward search or reverse), and of course it relies on your
encoding being reverse-scannable in this way (as UTF-8 is, but not
all).

And of course, taking the *entire* rest of the string isn't the only
thing you do. What if you want to take the next six characters after
that index? That would be constant time with a fixed-width storage
format.

ChrisA


#27363

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-08-19 01:44 -0700
Message-ID	<7xy5lb9soz.fsf@ruckus.brouhaha.com>
In reply to	#27362

Chris Angelico <rosuav@gmail.com> writes:
> And of course, taking the *entire* rest of the string isn't the only
> thing you do. What if you want to take the next six characters after
> that index? That would be constant time with a fixed-width storage
> format.

How often is this an issue in practice?

I wonder how other languages deal with this.  The examples I can think
of are poor role models:

1. C/C++ - unicode impaired, other than a wchar type

2. Java - bogus UCS-2-like(?) representation for historical reasons
   Also has some modified UTF-8 for reasons that made no sense and
   that I don't remember

3. Haskell - basic string type is a linked list of code points.
   "hello" is five list nodes.  New Data.Text library (much more
    efficient) uses something like ropes, I think, with UTF-16 underneath.

4. Erlang - I think like Haskell.  Efficiently handles byte blocks.

5. Perl 6 -- ???

6. Ruby - ??? (but probably quite slow like the rest of Ruby)

7. Objective C -- ???

8, 9 ...  (any other important ones?)


#27365

From	wxjmfauth@gmail.com
Date	2012-08-19 01:54 -0700
Message-ID	<bb45c0f1-4042-4653-b791-c216031a4d71@googlegroups.com>
In reply to	#27363

About the examples contested by Steven:

eg: timeit.timeit("('ab…' * 10).replace('…', 'œ…')")


And it is good enough to show the problem. Period. The
rest (you have to do this, you should not do this, why
are you using these characters - amazing and stupid
question -) does not count.

The real problem is elsewhere. *Americans* do not wish
a character occupies 4 bytes in *their* memory. The rest
of the world does not count.

The same thing happens with the utf-8 coding scheme.
Technically, it is fine. But after n years of usage,
one should recognize it just became an ascii2. Especially
for those who understand nothing in that field and are 
not even aware, characters are "coded". I'm the first 
to think, this is legitimate.

Memory or "ability to treat all text in the same and equal
way"?

End note. This kind of discussion is not specific to
Python, it always happens when there is some kind of
conflict between ascii and non ascii users.

Have a nice day.

jmf


#27377

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2012-08-19 11:46 +0100
Message-ID	<mailman.3494.1345373121.4697.python-list@python.org>
In reply to	#27365

On 19/08/2012 09:54, wxjmfauth@gmail.com wrote:
> About the examples contested by Steven:
>
> eg: timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
>
>
> And it is good enough to show the problem. Period. The
> rest (you have to do this, you should not do this, why
> are you using these characters - amazing and stupid
> question -) does not count.
>
> The real problem is elsewhere. *Americans* do not wish
> a character occupies 4 bytes in *their* memory. The rest
> of the world does not count.
>
> The same thing happens with the utf-8 coding scheme.
> Technically, it is fine. But after n years of usage,
> one should recognize it just became an ascii2. Especially
> for those who understand nothing in that field and are
> not even aware, characters are "coded". I'm the first
> to think, this is legitimate.
>
> Memory or "ability to treat all text in the same and equal
> way"?
>
> End note. This kind of discussion is not specific to
> Python, it always happens when there is some kind of
> conflict between ascii and non ascii users.
>
> Have a nice day.
>
> jmf
>

Roughly translated.  "I've been shot to pieces and having seen Monty 
Python and the Holy Grail I know what to do.  Run away, run away"

-- 
Cheers.

Mark Lawrence.


#27399

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-08-19 12:31 -0400
Message-ID	<mailman.3508.1345393941.4697.python-list@python.org>
In reply to	#27365

On 8/19/2012 4:54 AM, wxjmfauth@gmail.com wrote:
> About the examples contested by Steven:
> eg: timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
> And it is good enough to show the problem. Period.

Repeating a false claim over and over does not make it true. Two people 
on pydev claim that 3.3 is *faster* on their systems (one unspecified, 
one OSX10.8).

-- 
Terry Jan Reedy


#27379

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-08-19 10:51 +0000
Message-ID	<5030c52d$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to	#27361

On Sun, 19 Aug 2012 01:11:56 -0700, Paul Rubin wrote:

> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
>>     result = text[end:]
> 
> if end not near the end of the original string, then this is O(N) even
> with fixed-width representation, because of the char copying.

Technically, yes. But it's a straight copy of a chunk of memory, which 
means it's fast: your OS and hardware tries to make straight memory 
copies as fast as possible. Big-Oh analysis frequently glosses over 
implementation details like that.

Of course, that assumption gets shaky when you start talking about extra 
large blocks, and it falls apart completely when your OS starts paging 
memory to disk.

But if it helps to avoid irrelevant technical details, change it to 
text[end:end+10] or something.

> if it is near the end, by knowing where the string data area ends, I
> think it should be possible to scan backwards from the end, recognizing
> what bytes can be the beginning of code points and counting off the
> appropriate number.  This is O(1) if "near the end" means "within a
> constant".

You know, I think you are misusing Big-Oh analysis here. It really 
wouldn't be helpful for me to say "Bubble Sort is O(1) if you only sort 
lists with a single item". Well, yes, that is absolutely true, but that's 
a special case that doesn't give you any insight into why using Bubble 
Sort as your general purpose sort routine is a terrible idea.

Using variable-sized strings like UTF-8 and UTF-16 for in-memory 
representations is a terrible idea because you can't assume that people 
will only ever want to index the first or last character. On average, 
you need to scan half the string, one character at a time. In Big-Oh, we 
can ignore the factor of 1/2 and just say we scan the string, O(N).

That's why languages tend to use fixed character arrays for strings. 
Haskell is an exception, using linked lists which require traversing the 
string to jump to an index. The manual even warns:

[quote]
If you think of a Text value as an array of Char values (which it is 
not), you run the risk of writing inefficient code.

An idiom that is common in some languages is to find the numeric offset 
of a character or substring, then use that number to split or trim the 
searched string. With a Text value, this approach would require two O(n) 
operations: one to perform the search, and one to operate from wherever 
the search ended. 
[end quote]

http://hackage.haskell.org/packages/archive/text/0.11.2.2/doc/html/Data-Text.html

-- 
Steven


#27553

From	Neil Hodgson <nhodgson@iinet.net.au>
Date	2012-08-21 17:03 +1000
Message-ID	<3bOdnbu1sNbdrq7NnZ2dnUVZ_vWdnZ2d@westnet.com.au>
In reply to	#27379

Steven D'Aprano:

> Using variable-sized strings like UTF-8 and UTF-16 for in-memory
> representations is a terrible idea because you can't assume that people
> will only ever want to index the first or last character. On average,
> you need to scan half the string, one character at a time. In Big-Oh, we
> can ignore the factor of 1/2 and just say we scan the string, O(N).

    In the majority of cases you can remove excessive scanning by 
caching the most recent index->offset result. If the next index request 
is nearer the cached index than to the beginning then iterate from that 
offset. This converts many operations from quadratic to linear. Locality 
of reference is common and can often be reasonably exploited.

    However, exposing the variable length nature of UTF-8 allows the 
application to choose efficient techniques for more cases.
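
One way the caching idea might look, as a sketch only. The class name and its fields are invented here, the buffer is assumed to be valid UTF-8, and only the single most recent lookup is remembered.

```python
class CachedUtf8Index:
    """Hypothetical wrapper that remembers the last index -> offset lookup."""

    def __init__(self, data: bytes):
        self.data = data
        self._index = 0   # code point index of the cached position
        self._offset = 0  # byte offset of the cached position

    def offset_of(self, index: int) -> int:
        if index < 0:
            raise IndexError(index)
        data = self.data
        i, off = self._index, self._offset
        # Start from the beginning if that is nearer than the cache.
        if abs(index - i) > index:
            i, off = 0, 0
        while i < index:  # walk forward one code point at a time
            if off >= len(data):
                raise IndexError(index)
            off += 1
            while off < len(data) and (data[off] & 0xC0) == 0x80:
                off += 1
            i += 1
        while i > index:  # walk backward one code point at a time
            off -= 1
            while (data[off] & 0xC0) == 0x80:
                off -= 1
            i -= 1
        self._index, self._offset = i, off
        return off


text = "naïve café"
idx = CachedUtf8Index(text.encode("utf-8"))
offsets = [idx.offset_of(i) for i in (0, 3, 4, 9, 8, 10, 2)]
print(offsets)
```

Sequential or nearby requests cost only the distance from the previous one; a request far from both the cache and the start still pays for a long scan.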

    Neil


#27349

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-08-19 06:09 +0000
Message-ID	<5030832d$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to	#27322

This is a long post. If you don't feel like reading an essay, skip to the 
very bottom and read my last few paragraphs, starting with "To recap".

On Sat, 18 Aug 2012 11:26:21 -0700, Paul Rubin wrote:

> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
>> (There is an extension to UCS-2, UTF-16, which encodes non-BMP
>> characters using two code points. This is fragile and doesn't work very
>> well, because string-handling methods can break the surrogate pairs
>> apart, leaving you with invalid unicode string. Not good.)
> ...
>> With PEP 393, each Python string will be stored in the most efficient
>> format possible:
> 
> Can you explain the issue of "breaking surrogate pairs apart" a little
> more?  Switching between encodings based on the string contents seems
> silly at first glance.  

Forget encodings! We're not talking about encodings. Encodings are used 
for converting text as bytes for transmission over the wire or storage on 
disk. PEP 393 talks about the internal representation of text within 
Python, the C-level data structure.

In 3.2, that data structure depends on a compile-time switch. In a 
"narrow build", text is stored using two-bytes per character, so the 
string "len" (as in the name of the built-in function) will be stored as 

006c 0065 006e

(or possibly 6c00 6500 6e00, depending on whether your system is 
LittleEndian or BigEndian), plus object-overhead, which I shall ignore.

Since most identifiers are ASCII, that's already using twice as much 
memory as needed. This standard data structure is called UCS-2, and it 
only handles characters in the Basic Multilingual Plane, the BMP (roughly 
the first 64000 Unicode code points). I'll come back to that.

In a "wide build", text is stored as four-bytes per character, so "len" 
is stored as either:

0000006c 00000065 0000006e
6c000000 65000000 6e000000

Now memory is cheap, but it's not *that* cheap, and no matter how much 
memory you have, you can always use more.

This system is called UCS-4, and it can handle the entire Unicode 
character set, for now and forever. (If we ever need more than four bytes' 
worth of characters, it won't be called Unicode.)

Remember I said that UCS-2 can only handle the 64K characters 
[technically: code points] in the Basic Multilingual Plane? There's an 
extension to UCS-2 called UTF-16 which extends it to the entire Unicode 
range. Yes, that's the same name as the UTF-16 encoding, because it's 
more or less the same system.

UTF-16 says "let's represent characters in the BMP by two bytes, but 
characters outside the BMP by four bytes." There's a neat trick to this: 
the BMP doesn't use the entire two-byte range, so there are some byte 
pairs which are illegal in UCS-2 -- they don't correspond to *any* 
character. UTF-16 used those byte pairs to signal "this is half a 
character, you need to look at the next pair for the rest of the 
character".

Nifty hey? These pairs-of-pseudocharacters are called "surrogate pairs".

Except this comes at a big cost: you can no longer tell how long a string 
is by counting the number of bytes, which is fast, because sometimes four 
bytes is two characters and sometimes it's one and you can't tell which 
it will be until you actually inspect all four bytes.
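
For the curious, the arithmetic behind a surrogate pair can be sketched as follows. The function names are made up for this illustration; the constants are the standard UTF-16 ones.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000    # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits go in the low surrogate
    return high, low


def is_surrogate(unit: int) -> bool:
    """True if a 16-bit unit is half of a pair rather than a whole character."""
    return 0xD800 <= unit <= 0xDFFF


high, low = to_surrogates(0xFFFF + 1)
print(hex(high), hex(low))
```

The range 0xD800 to 0xDFFF is the set of two-byte values that the BMP leaves unused for characters, which is what makes the trick possible.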

Copying sub-strings now becomes either slow, or buggy. Say you want to 
grab the 10th character in a string. The fast way using UCS-2 is to 
simply grab bytes 18 and 19 (remember characters are pairs of bytes and we 
start counting at zero) and you're done. Fast and safe if you're willing 
to give up the non-BMP characters.

It's also fast and safe if you use UCS-4, but then everything takes twice 
as much space, so you probably end up spending so much time copying null 
bytes that you're probably slower anyway. Especially when your OS starts 
paging memory like mad.

But in UTF-16, indexing can be fast or safe but not both. Maybe bytes 18 
and 19 are half of a surrogate pair, and you've now split the pair and 
ended up with an invalid string. That's what Python 3.2 does, it fails to 
handle surrogate pairs properly:

py> s = chr(0xFFFF + 1)
py> a, b = s
py> a
'\ud800'
py> b
'\udc00'

I've just split a single valid Unicode character into two invalid 
characters. Python 3.2 will (probably) mindlessly process those two 
non-characters, and the only sign I have that I did something wrong is 
that my data is now junk.

Since any character can be a surrogate pair, you have to scan every pair 
of bytes in order to index a string, or work out its length, or copy a 
substring. It's not enough to just check if the last pair is a surrogate. 

When you don't, you have bugs like this from Python 3.2:

py> s = "01234" + chr(0xFFFF + 1) + "6789"
py> s[9] == '9'
False
py> s[9], len(s)
('8', 11)

Which is now fixed in Python 3.3.
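
The same string can be checked on a newer interpreter. This sketch assumes Python 3.3 or later; the variable names are just for illustration.

```python
# Assumes Python 3.3 or later, where strings are indexed by code point.
s = "01234" + chr(0xFFFF + 1) + "6789"

code_points = len(s)                           # what len() reports
utf16_units = len(s.encode("utf-16-le")) // 2  # 16-bit units a narrow build would hold

print(code_points, utf16_units)
print(s[9] == "9")
```

The gap between the two counts is exactly the extra surrogate that threw off the indexing in the 3.2 narrow build.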

So variable-width data structures like UTF-8 or UTF-16 are crap for the 
internal representation of strings -- they are either fast or correct but 
cannot be both.

But UCS-2 is sub-optimal, because it can only handle the BMP, and UCS-4 
is too because ASCII-only strings like identifiers end up being four 
times as big as they need to be. 1-byte schemes like Latin-1 are 
unspeakable because they only handle 256 characters, fewer if you don't 
count the C0 and C1 control codes.

PEP 393 to the rescue! What if you could encode pure-ASCII strings like 
"len" using one byte per character, and BMP strings using two bytes per 
character (UCS-2), and fall back to four bytes (UCS-4) only when you 
really need it?

The benefits are:

* Americans and English-Canadians and Australians and other barbarians of 
that ilk who only use ASCII save a heap of memory;

* people who mostly use BMP characters only pay the cost of four bytes 
per character for strings that actually *need* four bytes per 
character;

* people who use lots of non-BMP characters are no worse off.

The costs are:

* string routines need to be smarter -- they have to handle three 
different data structures (ASCII, UCS-2, UCS-4) instead of just one;

* there's a certain amount of overhead when creating a string -- you have 
to work out which in-memory format to use, and that's not necessarily 
trivial, but at least it's a once-off cost when you create the string;

* people who misunderstand what's going on get all upset over micro-
benchmarks.

> Strings are immutable so I don't understand why
> not use UTF-8 or UTF-16 for everything.  UTF-8 is more efficient in
> Latin-based alphabets and UTF-16 may be more efficient for some other
> languages.  I think even UCS-4 doesn't completely fix the surrogate pair
> issue if it means the only thing I can think of.

To recap:

* Variable-byte formats like UTF-8 and UTF-16 mean that basic string 
operations are not O(1) but are O(N). That means they are slow, or buggy, 
pick one.

* Fixed width UCS-2 doesn't handle the full Unicode range, only the BMP. 
That's better than it sounds: the BMP supports most character sets, but 
not all. Still, there are people who need the supplementary planes, and 
UCS-2 lets them down.

* Fixed width UCS-4 does handle the full Unicode range, without 
surrogates, but at the cost of using 2-4 times more string memory for the 
vast majority of users.

* PEP 393 doesn't use variable-width characters, but variable-width 
strings. Instead of choosing between 1, 2 and 4 bytes per character, it 
chooses *per string*. This keeps basic string operations O(1) instead of 
O(N), saves memory where possible, while still supporting the full 
Unicode range without a compile-time option.
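
A small sketch for observing the per-string width, assuming CPython 3.3 or later. It leans on sys.getsizeof, whose results are an implementation detail and not a language guarantee, and the helper name is invented. In CPython the one-byte form covers code points up to U+00FF, not only ASCII.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Storage used per character in a string made only of `ch`.

    Subtracts the sizes of two strings that differ by one character,
    so the fixed object header cancels out. CPython-specific.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


for ch in ("a", "\u00e9", "\u20ac", "\U00010000"):
    print("U+%04X" % ord(ch), bytes_per_char(ch))
```

The width is decided by the widest character in each string, so a single supplementary-plane character widens that one string and no others.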

-- 
Steven


#27360

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-08-19 01:04 -0700
Message-ID	<7x8vdbmho6.fsf@ruckus.brouhaha.com>
In reply to	#27349

Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
> This is a long post. If you don't feel like reading an essay, skip to the 
> very bottom and read my last few paragraphs, starting with "To recap".

I'm very flattered that you took the trouble to write that excellent
exposition of different Unicode encodings in response to my post.  I can
only hope some readers will benefit from it.  I regret that I wasn't
more clear about the perspective I posted from, i.e. that I'm already
familiar with how those encodings work.

After reading all of it, I still have the same skepticism on the main
point as before, but I think I see what the issue in contention is, and
some differences in perspective.  First of all, you wrote:

> This standard data structure is called UCS-2 ... There's an extension
> to UCS-2 called UTF-16

My own understanding is UCS-2 simply shouldn't be used any more.
Unicode was historically supposed to be a 16-bit character set, but that
turned out to not be enough, so the supplementary planes were added.
UCS-2 thus became obsolete and UTF-16 superseded it in 1996.  UTF-16 in
turn is rather clumsy and the later UTF-8 is better in a lot of ways,
but both of these are at least capable of encoding all the character
codes.

On to the main issue:

> * Variable-byte formats like UTF-8 and UTF-16 mean that basic string 
> operations are not O(1) but are O(N). That means they are slow, or buggy, 
> pick one.

This I don't see.  What are the basic string operations?

* Examine the first character, or first few characters ("few" = "usually
  bounded by a small constant") such as to parse a token from an input
  stream.  This is O(1) with either encoding.

* Slice off the first N characters.  This is O(N) with either encoding
  if it involves copying the chars.  I guess you could share references
  into the same string, but if the slice reference persists while the
  big reference is released, you end up not freeing the memory until
  later than you really should.

* Concatenate two strings.  O(N) either way.

* Find length of string.  O(1) either way since you'd store it in
  the string header when you build the string in the first place.
  Building the string has to have been an O(N) operation in either
  representation.

And finally:

* Access the nth char in the string for some large random n, or maybe
  get a small slice from some random place in a big string.  This is
  where fixed-width representation is O(1) while variable-width is O(N).

What I'm not convinced of, is that the last thing happens all that
often.

Meanwhile, an example of the 393 approach failing: I was involved in a
project that dealt with terabytes of OCR data of mostly English text.
So the chars were mostly ascii, but there would be occasional non-ascii
chars including supplementary plane characters, either because of
special symbols that were really in the text, or the typical OCR
confusion emitting those symbols due to printing imprecision.  That's a
natural for UTF-8 but the PEP-393 approach would bloat up the memory
requirements by a factor of 4.

    py> s = chr(0xFFFF + 1)
    py> a, b = s

That looks like Python 3.2 is buggy and that sample should just throw an
error.  s is a one-character string and should not be unpackable.

I realize the folks who designed and implemented PEP 393 are very smart
cookies and considered stuff carefully, while I'm just an internet user
posting an immediate impression of something I hadn't seen before (I
still use Python 2.6), but I still have to ask: if the 393 approach
makes sense, why don't other languages do it?

Ropes of UTF-8 segments seems like the most obvious approach and I
wonder if it was considered.  By that I mean pick some implementation
constant k (say k=128) and represent the string as a UTF-8 encoded byte
array, accompanied by a vector of n//k pointers into the byte array, where
n is the number of codepoints in the string.  Then you can reach any
offset analogously to reading a random byte on a disk, by seeking to the
appropriate block, and then reading the block and getting the char you
want within it.  Random access is then O(1) though the constant is
higher than it would be with fixed width encoding.
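
A rough sketch of that block-index scheme, to make the proposal concrete. The class name and layout are assumptions made for illustration; this is not PEP 393 and not code from any existing implementation. It stores one offset per started block (including offset 0), so slightly more than n//k entries.

```python
class BlockIndexedUtf8:
    """UTF-8 bytes plus the byte offset of every k-th code point (sketch)."""

    def __init__(self, text: str, k: int = 128):
        self.k = k
        self.data = text.encode("utf-8")
        self.length = len(text)  # number of code points
        self.block_offsets = []  # byte offset of code points 0, k, 2k, ...
        offset = 0
        for i, ch in enumerate(text):
            if i % k == 0:
                self.block_offsets.append(offset)
            offset += len(ch.encode("utf-8"))

    def __getitem__(self, index: int) -> str:
        if not 0 <= index < self.length:
            raise IndexError(index)
        data = self.data
        # Seek to the start of the block, then scan at most k-1 code points.
        offset = self.block_offsets[index // self.k]
        for _ in range(index % self.k):
            offset += 1
            while (data[offset] & 0xC0) == 0x80:
                offset += 1
        # Find where this code point ends.
        end = offset + 1
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end += 1
        return data[offset:end].decode("utf-8")


text = "a\u00e9\u20ac\U00010000" * 50
indexed = BlockIndexedUtf8(text, k=8)
print(indexed[3] == text[3], indexed[199] == text[199])
```

Lookup cost is bounded by k, so it is constant in the length of the string, at the price of the extra offset table and a per-access scan that fixed-width storage avoids.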


#27389

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-08-19 13:25 +0000
Message-ID	<5030e939$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to	#27360

On Sun, 19 Aug 2012 01:04:25 -0700, Paul Rubin wrote:

> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:

>> This standard data structure is called UCS-2 ... There's an extension
>> to UCS-2 called UTF-16
> 
> My own understanding is UCS-2 simply shouldn't be used any more. 

Pretty much. But UTF-16 with lax support for surrogates (that is, 
surrogates are included but treated as two characters) is essentially 
UCS-2 with the restriction against surrogates lifted. That's what Python 
currently does, and Javascript.

http://mathiasbynens.be/notes/javascript-encoding

The reality is that support for the Unicode supplementary planes is 
pretty poor. Even when applications support it, most fonts don't have 
glyphs for the characters. Anything which makes handling of Unicode 
supplementary characters better is a step forward.

>> * Variable-byte formats like UTF-8 and UTF-16 mean that basic string
>> operations are not O(1) but are O(N). That means they are slow, or
>> buggy, pick one.
> 
> This I don't see.  What are the basic string operations?

The ones I'm specifically referring to are indexing and copying 
substrings. There may be others.

> * Examine the first character, or first few characters ("few" = "usually
>   bounded by a small constant") such as to parse a token from an input
>   stream.  This is O(1) with either encoding.

That's actually O(K), for K = "a few", whatever "a few" means. But we 
know that anything is fast for small enough N (or K in this case).

> * Slice off the first N characters.  This is O(N) with either encoding
>   if it involves copying the chars.  I guess you could share references
>   into the same string, but if the slice reference persists while the
>   big reference is released, you end up not freeing the memory until
>   later than you really should.

As a first approximation, memory copying is assumed to be free, or at 
least constant time. That's not strictly true, but Big Oh analysis is 
looking at algorithmic complexity. It's not a substitute for actual 
benchmarks.

> Meanwhile, an example of the 393 approach failing: I was involved in a
> project that dealt with terabytes of OCR data of mostly English text.

I assume that this wasn't one giant multi-terabyte string.

> So
> the chars were mostly ascii, but there would be occasional non-ascii
> chars including supplementary plane characters, either because of
> special symbols that were really in the text, or the typical OCR
> confusion emitting those symbols due to printing imprecision.  That's a
> natural for UTF-8 but the PEP-393 approach would bloat up the memory
> requirements by a factor of 4.

Not necessarily. Presumably you're scanning each page into a single 
string. Then only the pages containing a supplementary plane char will be 
bloated, which is likely to be rare. Especially since I don't expect your 
OCR application would recognise many non-BMP characters -- what does 
U+110F3, "SORA SOMPENG DIGIT THREE", look like? If the OCR software 
doesn't recognise it, you can't get it in your output. (If you do, the 
OCR software has a nasty bug.)

Anyway, in my ignorant opinion the proper fix here is to tell the OCR 
software not to bother trying to recognise Imperial Aramaic, Domino 
Tiles, Phaistos Disc symbols, or Egyptian Hieroglyphs if you aren't 
expecting them in your source material. Not only will the scanning go 
faster, but you'll get fewer wrong characters.

[...]
> I realize the folks who designed and implemented PEP 393 are very smart
> cookies and considered stuff carefully, while I'm just an internet user
> posting an immediate impression of something I hadn't seen before (I
> still use Python 2.6), but I still have to ask: if the 393 approach
> makes sense, why don't other languages do it?

There has to be a first time for everything.

> Ropes of UTF-8 segments seems like the most obvious approach and I
> wonder if it was considered.

Ropes have been considered and rejected because while they are 
asymptotically fast, in common cases the added complexity actually makes 
them slower. Especially for immutable strings where you aren't inserting 
into the middle of a string.

http://mail.python.org/pipermail/python-dev/2000-February/002321.html

PyPy has revisited ropes and uses, or at least used, ropes as their 
native string data structure. But that's ropes of *bytes*, not UTF-8.

http://morepypy.blogspot.com.au/2007/11/ropes-branch-merged.html

-- 
Steven
