Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #27204 > unrolled thread

How do I display unicode value stored in a string variable using ord()

Started byCharles Jensen <hopefullycharles@gmail.com>
First post2012-08-16 15:09 -0700
Last post2012-08-20 17:20 -0400
Articles 20 on this page of 145 — 26 participants

Back to article view | Back to comp.lang.python


Contents

  How do I display unicode value stored in a string variable using ord() Charles Jensen <hopefullycharles@gmail.com> - 2012-08-16 15:09 -0700
    Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-17 08:20 +1000
    Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-16 18:47 -0400
    Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-16 19:59 -0400
      Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 10:49 -0700
        Re: How do I display unicode value stored in a string variable using ord() Jerry Hill <malaclypse2@gmail.com> - 2012-08-17 14:21 -0400
          Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 11:45 -0700
          Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 11:45 -0700
            Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-17 16:55 -0400
            Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-17 23:30 -0400
              Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 04:10 +0000
                Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-18 09:18 -0600
            Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 03:59 +0000
      Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-17 10:49 -0700
    Re: How do I display unicode value stored in a string variable using ord() Alister <alister.ware@ntlworld.com> - 2012-08-17 06:30 +0000
    Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 01:09 -0700
      Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 12:27 +0000
        Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 08:07 -0700
          Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 16:25 +0100
          Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 01:36 +1000
          Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-18 09:51 -0600
            Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 09:38 -0700
              Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 02:57 +1000
              Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 18:28 +0100
                Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 11:05 -0700
                  Re: How do I display unicode value stored in a string variable using ord() MRAB <python@mrabarnett.plus.com> - 2012-08-18 19:34 +0100
                    Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:35 +0000
                      New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord() Peter Otten <__peter__@web.de> - 2012-08-19 09:43 +0200
                        Re: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 08:56 +0000
                          Re: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 02:24 -0700
                          Re: New internal string format in 3.3 Peter Otten <__peter__@web.de> - 2012-08-19 11:37 +0200
                            Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 03:19 -0700
                              Re: New internal string format in 3.3 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 13:33 +0000
                            Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 03:19 -0700
                              Re: New internal string format in 3.3 Chris Angelico <rosuav@gmail.com> - 2012-08-19 20:26 +1000
                                Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:14 -0700
                                  Re: New internal string format in 3.3 Dave Angel <d@davea.name> - 2012-08-19 08:29 -0400
                                    Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:59 -0700
                                      Re: New internal string format in 3.3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 14:46 +0100
                                        Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 07:09 -0700
                                        Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 07:09 -0700
                                          Re: New internal string format in 3.3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 15:48 +0100
                                            Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 09:19 -0700
                                            Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 09:19 -0700
                                          Re: New internal string format in 3.3 Terry Reedy <tjreedy@udel.edu> - 2012-08-19 13:48 -0400
                                            Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 10:51 -0700
                                              Re: New internal string format in 3.3 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 19:09 +0100
                                              Re: New internal string format in 3.3 Chris Angelico <rosuav@gmail.com> - 2012-08-20 07:50 +1000
                                              Re: New internal string format in 3.3 Michael Torrie <torriem@gmail.com> - 2012-08-19 23:38 -0600
                                                Re: New internal string format in 3.3 Roy Smith <roy@panix.com> - 2012-08-20 09:17 -0400
                                                  Re: New internal string format in 3.3 Michael Torrie <torriem@gmail.com> - 2012-08-20 22:18 -0600
                                                    Re: New internal string format in 3.3 Roy Smith <roy@panix.com> - 2012-08-21 07:48 -0400
                                            Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 10:51 -0700
                                      Re: New internal string format in 3.3 Terry Reedy <tjreedy@udel.edu> - 2012-08-19 13:56 -0400
                                    Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:59 -0700
                                  Re: New internal string format in 3.3 Dave Angel <d@davea.name> - 2012-08-19 08:35 -0400
                                Re: New internal string format in 3.3 wxjmfauth@gmail.com - 2012-08-19 05:14 -0700
                  Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:30 +0000
                Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 11:05 -0700
              Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-18 16:09 -0400
              Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-18 23:12 -0400
            Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 09:38 -0700
            Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:33 +0000
              Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-19 11:50 -0600
                Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 11:20 -0700
                  Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-19 12:31 -0600
                    Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 12:23 -0700
                Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 20:16 +0000
              Re: How do I display unicode value stored in a string variable using ord() Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-19 12:46 -0600
          Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 17:59 +0000
            Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 11:30 -0700
              Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 20:45 +0100
              Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:13 +0000
            Re: How do I display unicode value stored in a string variable using ord() rusi <rustompmody@gmail.com> - 2012-08-18 11:40 -0700
              Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 20:50 +0100
              Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-18 13:22 -0700
                Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 22:37 +0100
        Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 11:26 -0700
          Re: How do I display unicode value stored in a string variable using ord() MRAB <python@mrabarnett.plus.com> - 2012-08-18 19:59 +0100
            Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 07:17 +0000
          Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 10:46 +1000
            Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 19:11 -0700
              Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 12:19 +1000
                Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 19:35 -0700
                  Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 13:01 +1000
                    Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 20:10 -0700
                      Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 13:31 +1000
                        Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-18 22:58 -0700
                  Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 08:01 +0000
                    Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 01:11 -0700
                      Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 18:24 +1000
                        Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 01:44 -0700
                          Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 01:54 -0700
                            Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 11:46 +0100
                            Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 12:31 -0400
                      Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 10:51 +0000
                        Re: How do I display unicode value stored in a string variable using ord() Neil Hodgson <nhodgson@iinet.net.au> - 2012-08-21 17:03 +1000
          Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 06:09 +0000
            Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 01:04 -0700
              Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 13:25 +0000
                Re: How do I display unicode value stored in a string variable using ord() DJC <djc@news.invalid> - 2012-08-19 17:32 +0200
              Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 13:34 -0400
                Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 10:48 -0700
                  Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 11:11 -0700
                    Re: How do I display unicode value stored in a string variable using ord() Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-19 19:50 +0100
                    Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 17:59 -0400
                    Re: How do I display unicode value stored in a string variable using ord() rusi <rustompmody@gmail.com> - 2012-08-19 23:13 -0700
                  Abuse of Big Oh notation [was Re: How do I display unicode value stored in a string variable using ord()] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 20:15 +0000
                    Re: Abuse of Big Oh notation Paul Rubin <no.email@nospam.invalid> - 2012-08-19 16:42 -0700
                      Re: Abuse of Big Oh notation Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-08-20 09:24 +0100
                        Re: Abuse of Big Oh notation Paul Rubin <no.email@nospam.invalid> - 2012-08-20 09:01 -0700
                          Re: Abuse of Big Oh notation Chris Angelico <rosuav@gmail.com> - 2012-08-21 02:09 +1000
                          Re: Abuse of Big Oh notation Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-20 11:12 -0600
                            Re: Abuse of Big Oh notation Paul Rubin <no.email@nospam.invalid> - 2012-08-20 12:29 -0700
                              Re: Abuse of Big Oh notation 88888 Dihedral <dihedral88888@googlemail.com> - 2012-08-20 15:16 -0700
                              Re: Abuse of Big Oh notation 88888 Dihedral <dihedral88888@googlemail.com> - 2012-08-20 15:20 -0700
                            Re: Abuse of Big Oh notation Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-21 09:53 +0000
                        Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-20 11:42 -0700
                          Re: Abuse of Big Oh notation Ned Deily <nad@acm.org> - 2012-08-20 18:19 -0700
                          Abuse of subject, was Re: Abuse of Big Oh notation Peter Otten <__peter__@web.de> - 2012-08-21 09:52 +0200
                            Re: Abuse of subject, was Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-21 10:16 -0700
                            Re: Abuse of subject, was Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-21 10:16 -0700
                        Re: Abuse of Big Oh notation wxjmfauth@gmail.com - 2012-08-20 11:42 -0700
                  Re: How do I display unicode value stored in a string variable using ord() Hans Mulder <hansmu@xs4all.nl> - 2012-08-22 20:53 +0200
              Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-20 08:42 +1000
                Re: How do I display unicode value stored in a string variable using ord() Roy Smith <roy@panix.com> - 2012-08-19 19:24 -0400
                  Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-20 04:21 +0000
                    Re: How do I display unicode value stored in a string variable using ord() Roy Smith <roy@panix.com> - 2012-08-20 00:44 -0400
                      Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-20 05:56 +0000
                        Re: How do I display unicode value stored in a string variable using ord() Paul Rubin <no.email@nospam.invalid> - 2012-08-19 23:24 -0700
                    Re: How do I display unicode value stored in a string variable using ord() Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-08-20 12:58 -0400
              Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 20:35 -0400
              Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-20 14:07 +1000
            Re: How do I display unicode value stored in a string variable using ord() lipska the kat <lipskathekat@yahoo.co.uk> - 2012-08-19 11:13 +0100
              Re: How do I display unicode value stored in a string variable using ord() Chris Angelico <rosuav@gmail.com> - 2012-08-19 20:19 +1000
                Re: How do I display unicode value stored in a string variable using ord() lipska the kat <lipskathekat@yahoo.co.uk> - 2012-08-19 11:49 +0100
        Re: How do I display unicode value stored in a string variable using ord() "Blind Anagram" <noname@nowhere.com> - 2012-08-19 18:03 +0100
          Re: How do I display unicode value stored in a string variable using ord() wxjmfauth@gmail.com - 2012-08-19 10:33 -0700
            Re: How do I display unicode value stored in a string variable using ord() "Blind Anagram" <noname@nowhere.com> - 2012-08-19 19:04 +0100
          Re: How do I display unicode value stored in a string variable using ord() Dave Angel <d@davea.name> - 2012-08-19 14:05 -0400
            Re: How do I display unicode value stored in a string variable usingord() "Blind Anagram" <noname@nowhere.com> - 2012-08-19 19:18 +0100
          Re: How do I display unicode value stored in a string variable using ord() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-19 20:31 +0000
          Re: How do I display unicode value stored in a string variable using ord() Terry Reedy <tjreedy@udel.edu> - 2012-08-19 17:03 -0400
          Re: How do I display unicode value stored in a string variable using ord() 88888 Dihedral <dihedral88888@googlemail.com> - 2012-08-19 17:32 -0700
          Re: How do I display unicode value stored in a string variable using ord() Piet van Oostrum <piet@vanoostrum.org> - 2012-08-20 17:20 -0400

Page 6 of 8 — ← Prev page 1 2 3 4 5 [6] 7 8  Next page →


#27395

FromDJC <djc@news.invalid>
Date2012-08-19 17:32 +0200
Message-ID<k0r0tq$36l$1@dont-email.me>
In reply to#27389
On 19/08/12 15:25, Steven D'Aprano wrote:

> Not necessarily. Presumably you're scanning each page into a single
> string. Then only the pages containing a supplementary plane char will be
> bloated, which is likely to be rare. Especially since I don't expect your
> OCR application would recognise many non-BMP characters -- what does
> U+110F3, "SORA SOMPENG DIGIT THREE", look like? If the OCR software
> doesn't recognise it, you can't get it in your output. (If you do, the
> OCR software has a nasty bug.)
>
> Anyway, in my ignorant opinion the proper fix here is to tell the OCR
> software not to bother trying to recognise Imperial Aramaic, Domino
> Tiles, Phaistos Disc symbols, or Egyptian Hieroglyphs if you aren't
> expecting them in your source material. Not only will the scanning go
> faster, but you'll get fewer wrong characters.

Consider the automated recognition of a CAPTCHA. As the chars have to be 
entered by the user on a keyboard, only the most basic charset can be 
used, so the problem of which chars are possible is quite limited.

[toc] | [prev] | [next] | [standalone]


#27403

FromTerry Reedy <tjreedy@udel.edu>
Date2012-08-19 13:34 -0400
Message-ID<mailman.3511.1345397678.4697.python-list@python.org>
In reply to#27360
On 8/19/2012 4:04 AM, Paul Rubin wrote:


> Meanwhile, an example of the 393 approach failing:

I am completely baffled by this, as this example is one where the 393 
approach potentially wins.

> I was involved in a
> project that dealt with terabytes of OCR data of mostly English text.
> So the chars were mostly ascii,

3.3 stores ascii pages 1 byte/char rather than 2 or 4.

 > but there would be occasional non-ascii
> chars including supplementary plane characters, either because of
> special symbols that were really in the text, or the typical OCR
> confusion emitting those symbols due to printing imprecision.

I doubt that there are really any non-bmp chars. As Steven said, reject 
such false identifications.

 > That's a  natural for UTF-8

3.3 would convert to utf-8 for storage on disk.

> but the PEP-393 approach would bloat up the memory
> requirements by a factor of 4.

3.2- wide builds would *always* use 4 bytes/char. Is not occasionally 
better than always?

>      py> s = chr(0xFFFF + 1)
>      py> a, b = s
>
> That looks like Python 3.2 is buggy and that sample should just throw an
> error.  s is a one-character string and should not be unpackable.

That looks like a 3.2- narrow build. Such which treat unicode strings as 
sequences of code units rather than sequences of codepoints. Not an 
implementation bug, but compromise design that goes back about a decade 
to when unicode was added to Python. At that time, there were only a few 
defined non-BMP chars and their usage was extremely rare. There are now 
more extended chars than BMP chars and usage will become more common 
even in English text.

Pre 3.3, there are really 2 sub-versions of every Python version: a 
narrow build and a wide build version, with not very well documented 
different behaviors for any string with extended chars. That is and 
would have become an increasing problem as extended chars are 
increasingly used. If you want to say that what was once a practical 
compromise has become a design bug, I would not argue. In any case, 3.3 
fixes that split and returns Python to being one cross-platform language.

> I realize the folks who designed and implemented PEP 393 are very smart
> cookies and considered stuff carefully, while I'm just an internet user
> posting an immediate impression of something I hadn't seen before (I
> still use Python 2.6), but I still have to ask: if the 393 approach
> makes sense, why don't other languages do it?

Python has often copied or borrowed, with adjustments. This time it is 
the first. We will see how it goes, but it has been tested for nearly a 
year already.

> Ropes of UTF-8 segments seems like the most obvious approach and I
> wonder if it was considered.  By that I mean pick some implementation
> constant k (say k=128) and represent the string as a UTF-8 encoded byte
> array, accompanied by a vector n//k pointers into the byte array, where
> n is the number of codepoints in the string.  Then you can reach any
> offset analogously to reading a random byte on a disk, by seeking to the
> appropriate block, and then reading the block and getting the char you
> want within it.  Random access is then O(1) though the constant is
> higher than it would be with fixed width encoding.

I would call it O(k), where k is a selectable constant. Slowing access 
by a factor of 100 is hardly acceptable to me. For strings less than k, 
access is O(len). I believe slicing would require re-indexing.

As 393 was near adoption, I proposed a scheme using utf-16 (narrow 
builds) with a supplementary index of extended chars when there are any. 
That makes access O(1) if there are none and O(log(k)), where k is the 
number of extended chars in the string, if there are some.

-- 
Terry Jan Reedy

[toc] | [prev] | [next] | [standalone]


#27405

FromPaul Rubin <no.email@nospam.invalid>
Date2012-08-19 10:48 -0700
Message-ID<7xfw7ilqnd.fsf@ruckus.brouhaha.com>
In reply to#27403
Terry Reedy <tjreedy@udel.edu> writes:
>> Meanwhile, an example of the 393 approach failing:
> I am completely baffled by this, as this example is one where the 393
> approach potentially wins.

What?  The 393 approach is supposed to avoid memory bloat and that
does the opposite.

>> I was involved in a project that dealt with terabytes of OCR data of
>> mostly English text.  So the chars were mostly ascii,
> 3.3 stores ascii pages 1 byte/char rather than 2 or 4.

But they are not ascii pages, they are (as stated) MOSTLY ascii.
E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses
a much more memory-expensive encoding than UTF-8.

> I doubt that there are really any non-bmp chars.

You may be right about this.  I thought about it some more after
posting and I'm not certain that there were supplemental characters.

> As Steven said, reject such false identifications.

Reject them how?

>> That's a  natural for UTF-8
> 3.3 would convert to utf-8 for storage on disk.

They are already in utf-8 on disk though that doesn't matter since
they are also compressed.  

>> but the PEP-393 approach would bloat up the memory
>> requirements by a factor of 4.
> 3.2- wide builds would *always* use 4 bytes/char. Is not occasionally
> better than always?

The bloat is in comparison with utf-8, in that example.

> That looks like a 3.2- narrow build. Such which treat unicode strings
> as sequences of code units rather than sequences of codepoints. Not an
> implementation bug, but compromise design that goes back about a
> decade to when unicode was added to Python. 

I thought the whole point of Python 3's disruptive incompatibility with
Python 2 was to clean up past mistakes and compromises, of which unicode
headaches was near the top of the list.  So I'm surprised they seem to
repeated a mistake there.  

> I would call it O(k), where k is a selectable constant. Slowing access
> by a factor of 100 is hardly acceptable to me. 

If k is constant then O(k) is the same as O(1).  That is how O notation
works.  I wouldn't believe the 100x figure without seeing it measured in
real-world applications.

[toc] | [prev] | [next] | [standalone]


#27416

Fromwxjmfauth@gmail.com
Date2012-08-19 11:11 -0700
Message-ID<28f35cee-3e55-43af-afc8-1ded199c53d9@googlegroups.com>
In reply to#27405
Le dimanche 19 août 2012 19:48:06 UTC+2, Paul Rubin a écrit :
> 
> 
> But they are not ascii pages, they are (as stated) MOSTLY ascii.
> 
> E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses
> 
> a much more memory-expensive encoding than UTF-8.
> 
> 

Imagine an us banking application, everything in ascii,
except ... the € currency symbole, code point 0x20ac.

Well, it seems some software producers know what they
are doing.

>>> '€'.encode('cp1252')
b'\x80'
>>> '€'.encode('mac-roman')
b'\xdb'
>>> '€'.encode('iso-8859-1')
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' 
in position 0: ordinal not in range(256)

jmf

[toc] | [prev] | [next] | [standalone]


#27421

FromMark Lawrence <breamoreboy@yahoo.co.uk>
Date2012-08-19 19:50 +0100
Message-ID<mailman.3522.1345402115.4697.python-list@python.org>
In reply to#27416
On 19/08/2012 19:11, wxjmfauth@gmail.com wrote:
> Le dimanche 19 août 2012 19:48:06 UTC+2, Paul Rubin a écrit :
>>
>>
>> But they are not ascii pages, they are (as stated) MOSTLY ascii.
>>
>> E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses
>>
>> a much more memory-expensive encoding than UTF-8.
>>
>>
>
> Imagine an us banking application, everything in ascii,
> except ... the € currency symbole, code point 0x20ac.
>
> Well, it seems some software producers know what they
> are doing.
>
>>>> '€'.encode('cp1252')
> b'\x80'
>>>> '€'.encode('mac-roman')
> b'\xdb'
>>>> '€'.encode('iso-8859-1')
> Traceback (most recent call last):
>    File "<eta last command>", line 1, in <module>
> UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac'
> in position 0: ordinal not in range(256)
>
> jmf
>

Well that's it then, the world stock markets will all collapse tonight 
when the news leaks out that those stupid Americans haven't yet realised 
that much of Europe (with at least one very noticeable and sensible 
exception :) uses Euros.  I'd better sell all my stock holdings fast.

-- 
Cheers.

Mark Lawrence.

[toc] | [prev] | [next] | [standalone]


#27436

FromTerry Reedy <tjreedy@udel.edu>
Date2012-08-19 17:59 -0400
Message-ID<mailman.3529.1345413613.4697.python-list@python.org>
In reply to#27416
On 8/19/2012 2:11 PM, wxjmfauth@gmail.com wrote:

> Well, it seems some software producers know what they
> are doing.
>
>>>> '€'.encode('cp1252')
> b'\x80'
>>>> '€'.encode('mac-roman')
> b'\xdb'
>>>> '€'.encode('iso-8859-1')
> Traceback (most recent call last):
>    File "<eta last command>", line 1, in <module>
> UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac'
> in position 0: ordinal not in range(256)

Yes, Python lets you choose your byte encoding from those and a hundred 
others. I believe all the codecs are now tested in both directions. It 
was not an easy task.

As to the examples: Latin-1 dates to 1985 and before and the 1988 
version was published as a standard in 1992.
https://en.wikipedia.org/wiki/Latin-1
"The name euro was officially adopted on 16 December 1995."
https://en.wikipedia.org/wiki/Euro
No wonder Latin-1 does not contain the Euro sign. International 
standards organizations standards are relatively fixed. (The unicode 
consortium will not even correct misspelled character names.) Instead, 
new standards with a new number are adopted.

For better or worse, private mappings are more flexible. In its Mac 
mapping Apple "replaced the generic currency sign ¤ with the euro sign 
€". (See Latin-1 reference.) Great if you use Euros, not so great if you 
were using the previous sign for something else.

Microsoft changed an unneeded code to the Euro for Windows cp-1252.
https://en.wikipedia.org/wiki/Windows-1252
"It is very common to mislabel Windows-1252 text with the charset label 
ISO-8859-1. A common result was that all the quotes and apostrophes 
(produced by "smart quotes" in Microsoft software) were replaced with 
question marks or boxes on non-Windows operating systems, making text 
difficult to read. Most modern web browsers and e-mail clients treat the 
MIME charset ISO-8859-1 as Windows-1252 in order to accommodate such 
mislabeling. This is now standard behavior in the draft HTML 5 
specification, which requires that documents advertised as ISO-8859-1 
actually be parsed with the Windows-1252 encoding.[1]"

Lots of fun. Too bad Microsoft won't push utf-8 so we can all 
communicate text with much less chance of ambiguity.

-- 
Terry Jan Reedy

[toc] | [prev] | [next] | [standalone]


#27463

Fromrusi <rustompmody@gmail.com>
Date2012-08-19 23:13 -0700
Message-ID<7e85ae06-9340-4677-bde8-a7bd4aec8a0b@j2g2000pbg.googlegroups.com>
In reply to#27416
On Aug 19, 11:11 pm, wxjmfa...@gmail.com wrote:
> Le dimanche 19 août 2012 19:48:06 UTC+2, Paul Rubin a écrit :
>
>
>
> > But they are not ascii pages, they are (as stated) MOSTLY ascii.
>
> > E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses
>
> > a much more memory-expensive encoding than UTF-8.

>
>
> Well, it seems some software producers know what they
> are doing.
>
> >>> '€'.encode('cp1252')
> b'\x80'
> >>> '€'.encode('mac-roman')
> b'\xdb'
> >>> '€'.encode('iso-8859-1')
>
> Traceback (most recent call last):
>   File "<eta last command>", line 1, in <module>
> UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac'
> in position 0: ordinal not in range(256)

<facetious>
You want the Euro-sign in iso-8859-1??
I object. I want the rupee sign ( ₹ ) http://en.wikipedia.org/wiki/Indian_rupee_sign

And while we are at it, why not move it (both?) into ASCII?
</facetious>

The problem(s) are:
1. We dont really understand what you are objecting to.
2. Utf-8 like Huffman coding is a prefix code
http://en.wikipedia.org/wiki/Prefix_code#Prefix_codes_in_use_today
Like Huffman coding, it compresses based on a statistical argument.
3. Unlike Huffman coding the statistics is very political: "Is the
Euro more important or Chinese ideograms?" depends on whom you ask

[toc] | [prev] | [next] | [standalone]


#27426 — Abuse of Big Oh notation [was Re: How do I display unicode value stored in a string variable using ord()]

FromSteven D'Aprano <steve+comp.lang.python@pearwood.info>
Date2012-08-19 20:15 +0000
SubjectAbuse of Big Oh notation [was Re: How do I display unicode value stored in a string variable using ord()]
Message-ID<50314968$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to#27405
On Sun, 19 Aug 2012 10:48:06 -0700, Paul Rubin wrote:

> Terry Reedy <tjreedy@udel.edu> writes:

>> I would call it O(k), where k is a selectable constant. Slowing access
>> by a factor of 100 is hardly acceptable to me.
> 
> If k is constant then O(k) is the same as O(1).  That is how O notation
> works.

You might as well say that if N is constant, O(N**2) is constant too and 
just like magic you have now made Bubble Sort a constant-time sort 
function!

That's not how it works.

Of course *if* k is constant, O(k) is constant too, but k is not 
constant. In context we are talking about string indexing and slicing. 
There is no value of k, say, k = 2, for which you can say "People will 
sometimes ask for string[2] but never ask for string[3]". That is absurd.

Since k can vary from 0 to N-1, we can say that the average string index 
lookup is k = (N-1)//2 which clearly depends on N.


-- 
Steven

[toc] | [prev] | [next] | [standalone]


#27442 — Re: Abuse of Big Oh notation

FromPaul Rubin <no.email@nospam.invalid>
Date2012-08-19 16:42 -0700
SubjectRe: Abuse of Big Oh notation
Message-ID<7xwr0ua1pw.fsf@ruckus.brouhaha.com>
In reply to#27426
Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
> Of course *if* k is constant, O(k) is constant too, but k is not 
> constant. In context we are talking about string indexing and slicing. 
> There is no value of k, say, k = 2, for which you can say "People will 
> sometimes ask for string[2] but never ask for string[3]". That is absurd.

The context was parsing, e.g. recognizing a token like "a" or "foo" in a
human-written chunk of text.  Occasionally it might be "sesquipidalian"
or some even worse outlier, but one can reasonably put a fixed and
relatively small upper bound on the expected value of k.  That makes the
amortized complexity O(1), I think.

[toc] | [prev] | [next] | [standalone]


#27477 — Re: Abuse of Big Oh notation

FromOscar Benjamin <oscar.j.benjamin@gmail.com>
Date2012-08-20 09:24 +0100
SubjectRe: Abuse of Big Oh notation
Message-ID<mailman.3547.1345451405.4697.python-list@python.org>
In reply to#27442
On Sun, 19 Aug 2012 16:42:03 -0700, Paul Rubin 
<no.email@nospam.invalid> wrote:
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
> > Of course *if* k is constant, O(k) is constant too, but k is not 
> > constant. In context we are talking about string indexing and 
slicing. 
> > There is no value of k, say, k = 2, for which you can say "People 
will 
> > sometimes ask for string[2] but never ask for string[3]". That is 
absurd.


> The context was parsing, e.g. recognizing a token like "a" or "foo" 
in a
> human-written chunk of text.  Occasionally it might be 
"sesquipidalian"
> or some even worse outlier, but one can reasonably put a fixed and
> relatively small upper bound on the expected value of k.  That 
makes the
> amortized complexity O(1), I think.

No it doen't. It is still O(k). The point of big O notation is to 
understand the asymptotic behaviour of one variable as it becomes 
large because of changes in other variables. If k is small then you 
can often guess that O(k) is small. To say that an operation is O(k), 
however, is a statement about what happens when k is big (and is not 
refuted by saying that k is typically not big).

Oscar

[toc] | [prev] | [next] | [standalone]


#27504 — Re: Abuse of Big Oh notation

FromPaul Rubin <no.email@nospam.invalid>
Date2012-08-20 09:01 -0700
SubjectRe: Abuse of Big Oh notation
Message-ID<7xr4r1pn72.fsf@ruckus.brouhaha.com>
In reply to#27477
Oscar Benjamin <oscar.j.benjamin@gmail.com> writes:
> No it doen't. It is still O(k). The point of big O notation is to
> understand the asymptotic behaviour of one variable as it becomes
> large because of changes in other variables. 

Actually, two separate problems got mixed together late at night.  In
neither case is k an independent variable that ranges over all possible
values.  In both cases it is selected or observed by measurement (i.e.
it is a dependent variable determined by something that is itself not
independent).

1) Access in a rope: here, k is basically determined by the pointer size
of the computer, which in CPython (the implementation we're discussing)
the pointer size is 4 or 8 bytes (constants) in all instances AFAIK.  k
should be a big enough that the pointer and allocation overhead is small
compared to bloating the strings with UCS-2 or UCS-4, and small enough
to not add much scan time.  It seems realistic to say k<=128 for this
(several times smaller is probably fine).  128 is of course a constant
and not a variable.  We are not concerned about hypothetical computers
with billion bit pointers.

2) Parsing tokens: here, k is drawn from a fixed probability
distribution such that its expectation value is again a small constant
(this is an assertion about the data looks like in typical parsing
applications in the real world, not what is mathematically possible).
The average-case complexity (I said "amortized" earlier but should have
said "average-case") is proportional to that constant, which is to say,
O(1).  Of course there is still more wiggle room about what is
"typical".

Analogy: how big a box is required to hold a pair of shoes?  In a purely
theoretical sense we might say O(S) where S is the shoe size, treating
shoe size as an arbitrary independent variable.  But in the real world,
shoe size is controlled by the size of human feet, which is bounded by a
constant for biological reasons.  You don't have to consider shoes the
size of Jupiter.  So it is O(1).

[toc] | [prev] | [next] | [standalone]


#27505 — Re: Abuse of Big Oh notation

FromChris Angelico <rosuav@gmail.com>
Date2012-08-21 02:09 +1000
SubjectRe: Abuse of Big Oh notation
Message-ID<mailman.3563.1345478961.4697.python-list@python.org>
In reply to#27504
On Tue, Aug 21, 2012 at 2:01 AM, Paul Rubin <no.email@nospam.invalid> wrote:
> Analogy: how big a box is required to hold a pair of shoes?  In a purely
> theoretical sense we might say O(S) where S is the shoe size, treating
> shoe size as an arbitrary independent variable.  But in the real world,
> shoe size is controlled by the size of human feet, which is bounded by a
> constant for biological reasons.  You don't have to consider shoes the
> size of Jupiter.  So it is O(1).

By that argument, everything is amortized O(1), because there's a
limit on every variable. You can't possibly be working with a data set
greater than the total sum of storage space in the entire world. That
still doesn't mean that bubble sort and heap sort are equivalently
efficient.

ChrisA

[toc] | [prev] | [next] | [standalone]


#27510 — Re: Abuse of Big Oh notation

FromIan Kelly <ian.g.kelly@gmail.com>
Date2012-08-20 11:12 -0600
SubjectRe: Abuse of Big Oh notation
Message-ID<mailman.3567.1345482779.4697.python-list@python.org>
In reply to#27504
On Mon, Aug 20, 2012 at 10:09 AM, Chris Angelico <rosuav@gmail.com> wrote:
> On Tue, Aug 21, 2012 at 2:01 AM, Paul Rubin <no.email@nospam.invalid> wrote:
>> Analogy: how big a box is required to hold a pair of shoes?  In a purely
>> theoretical sense we might say O(S) where S is the shoe size, treating
>> shoe size as an arbitrary independent variable.  But in the real world,
>> shoe size is controlled by the size of human feet, which is bounded by a
>> constant for biological reasons.  You don't have to consider shoes the
>> size of Jupiter.  So it is O(1).
>
> By that argument, everything is amortized O(1), because there's a
> limit on every variable. You can't possibly be working with a data set
> greater than the total sum of storage space in the entire world. That
> still doesn't mean that bubble sort and heap sort are equivalently
> efficient.

The difference between the two is that the former is bounded by a
constant that is fundamental to the algorithm at hand, whereas the
latter is bounded only by available resources.  By any practical
consideration the latter must be considered a variable, but the former
need not be.

Paul discusses above the asymptotic growth of a variable as O(S) where
S is shoe size, but really this measure makes no sense in the first
place.  The question that this notation seeks to answer is, "What is
the big-Oh behaviour of this variable as shoe size increases
indefinitely (i.e. to infinity)?" and the answer in this case is "Shoe
size does not increase indefinitely"; the question is invalid.

A more logical question might be, "How much material do I need to
construct N shoes of size S?"  The answer to this question would
presumably be some constant factor of N * S**2, which is O(N * S**2).
Although N can be assumed to vary freely (up to nonsensical quantities
like the mass of the entire universe), S is clearly bounded by the
constraints of actual shoes, so we can safely treat S as a constant
and call it O(N).

[toc] | [prev] | [next] | [standalone]


#27527 — Re: Abuse of Big Oh notation

FromPaul Rubin <no.email@nospam.invalid>
Date2012-08-20 12:29 -0700
SubjectRe: Abuse of Big Oh notation
Message-ID<7xfw7hl5vb.fsf@ruckus.brouhaha.com>
In reply to#27510
Ian Kelly <ian.g.kelly@gmail.com> writes:
> The difference between the two is that the former is bounded by a
> constant that is fundamental to the algorithm at hand,...  S is
> clearly bounded by the constraints of actual shoes, so we can safely
> treat S as a constant and call it O(N).

Thanks, that is a good explain of what I was trying to get at.  One
quibble is in the parsing example, the constant is inherent in the
distribution of input data one expects to normally encounter, rather
than in the algorithm itself.  It's sort of like saying dictionary
access (based on hashing) is O(1), based on the variety of input data
that is normally used.  There is such a thing as pathological
(e.g. malicious) input data that causes a lot of hash collisions, making
O(n) access in the size of the dictionary.  

I suppose abnormal input should also be taken into account in examples
involving parsing if one parses potentially hostile data.

[toc] | [prev] | [next] | [standalone]


#27534 — Re: Abuse of Big Oh notation

From88888 Dihedral <dihedral88888@googlemail.com>
Date2012-08-20 15:16 -0700
SubjectRe: Abuse of Big Oh notation
Message-ID<464d6549-7fc5-4405-92ec-285bc2afdb7b@googlegroups.com>
In reply to#27527
Paul Rubin於 2012年8月21日星期二UTC+8上午3時29分12秒寫道:
> Ian Kelly <ian.g.kelly@gmail.com> writes:
> 
> > The difference between the two is that the former is bounded by a
> 
> > constant that is fundamental to the algorithm at hand,...  S is
> 
> > clearly bounded by the constraints of actual shoes, so we can safely
> 
> > treat S as a constant and call it O(N).
> 
> 
> 
> Thanks, that is a good explain of what I was trying to get at.  One
> 
> quibble is in the parsing example, the constant is inherent in the
> 
> distribution of input data one expects to normally encounter, rather
> 
> than in the algorithm itself.  It's sort of like saying dictionary
> 
> access (based on hashing) is O(1), based on the variety of input data
> 
> that is normally used.  There is such a thing as pathological
> 
> (e.g. malicious) input data that causes a lot of hash collisions, making
> 
> O(n) access in the size of the dictionary.  
> 
> 
> 
> I suppose abnormal input should also be taken into account in examples
> 
> involving parsing if one parses potentially hostile data.

OK, the hash key function has to be seeded with some randomization 
in the begining of a hash table from some data independent factors.

Also for those items hashed to the same key with a length >=16
I think an insertion sort is good in soring a sorted list of items hashed to
the same key of a length 16 to 256 or even larger which indicates 
the hash function should be changed from time to time in the 
occasions of resizing the hash table when the table is under a lot operations
of  item inserstions and deletions which are much greater than the number 
of item searches in the same period.
 

[toc] | [prev] | [next] | [standalone]


#27538 — Re: Abuse of Big Oh notation

From88888 Dihedral <dihedral88888@googlemail.com>
Date2012-08-20 15:20 -0700
SubjectRe: Abuse of Big Oh notation
Message-ID<e8631b7a-776c-4680-a3c5-9fe731748ddc@googlegroups.com>
In reply to#27527
Paul Rubin於 2012年8月21日星期二UTC+8上午3時29分12秒寫道:
> Ian Kelly <ian.g.kelly@gmail.com> writes:
> 
> > The difference between the two is that the former is bounded by a
> 
> > constant that is fundamental to the algorithm at hand,...  S is
> 
> > clearly bounded by the constraints of actual shoes, so we can safely
> 
> > treat S as a constant and call it O(N).
> 
> 
> 
> Thanks, that is a good explain of what I was trying to get at.  One
> 
> quibble is in the parsing example, the constant is inherent in the
> 
> distribution of input data one expects to normally encounter, rather
> 
> than in the algorithm itself.  It's sort of like saying dictionary
> 
> access (based on hashing) is O(1), based on the variety of input data
> 
> that is normally used.  There is such a thing as pathological
> 
> (e.g. malicious) input data that causes a lot of hash collisions, making
> 
> O(n) access in the size of the dictionary.  
> 
> 
> 
> I suppose abnormal input should also be taken into account in examples
> 
> involving parsing if one parses potentially hostile data.

OK, the hash key function has to be seeded with some randomization 
in the creation of a hash table from some data independent factors.

Also for those items hashed to the same key with a length >=16
I think an insertion sort and a heap sort are good in storing a sorted list of items hashed to the same key of a length 16 to 256 or even larger which indicates  the hash function should be changed from time to time in the
occasions of resizing the hash table when the table is under a lot operations
of  item insertions and deletions which  are operated  much frequently 
than the number of item searches in the same period.


 

[toc] | [prev] | [next] | [standalone]


#27562 — Re: Abuse of Big Oh notation

FromSteven D'Aprano <steve+comp.lang.python@pearwood.info>
Date2012-08-21 09:53 +0000
SubjectRe: Abuse of Big Oh notation
Message-ID<50335aa8$0$1645$c3e8da3$76491128@news.astraweb.com>
In reply to#27510
On Mon, 20 Aug 2012 11:12:22 -0600, Ian Kelly wrote:

> On Mon, Aug 20, 2012 at 10:09 AM, Chris Angelico <rosuav@gmail.com>
> wrote:
>> On Tue, Aug 21, 2012 at 2:01 AM, Paul Rubin <no.email@nospam.invalid>
>> wrote:
>>> Analogy: how big a box is required to hold a pair of shoes?  In a
>>> purely theoretical sense we might say O(S) where S is the shoe size,
>>> treating shoe size as an arbitrary independent variable.  But in the
>>> real world, shoe size is controlled by the size of human feet, which
>>> is bounded by a constant for biological reasons.  You don't have to
>>> consider shoes the size of Jupiter.  So it is O(1).
>>
>> By that argument, everything is amortized O(1), because there's a limit
>> on every variable. You can't possibly be working with a data set
>> greater than the total sum of storage space in the entire world. That
>> still doesn't mean that bubble sort and heap sort are equivalently
>> efficient.
> 
> The difference between the two is that the former is bounded by a
> constant that is fundamental to the algorithm at hand, whereas the
> latter is bounded only by available resources.  By any practical
> consideration the latter must be considered a variable, but the former
> need not be.
> 
> Paul discusses above the asymptotic growth of a variable as O(S) where S
> is shoe size, but really this measure makes no sense in the first place.
>  The question that this notation seeks to answer is, "What is the big-Oh
> behaviour of this variable as shoe size increases indefinitely (i.e. to
> infinity)?" and the answer in this case is "Shoe size does not increase
> indefinitely"; the question is invalid.
> 
> A more logical question might be, "How much material do I need to
> construct N shoes of size S?"  The answer to this question would
> presumably be some constant factor of N * S**2, which is O(N * S**2).
> Although N can be assumed to vary freely (up to nonsensical quantities
> like the mass of the entire universe), S is clearly bounded by the
> constraints of actual shoes, so we can safely treat S as a constant and
> call it O(N).

Why the hell are we talking about shoe sizes?

Let's not lose sight of the actual question in hand: how expensive is it 
to fetch a character at some arbitrary index in a string?

With fixed-width characters, average time is O(1), best-case time is O(1) 
and worst-case time is O(1).

With non-fixed width characters, average time and worst-case time are 
both O(N), and best-case ("give me the character at index 0") is O(1). 
But that's a degenerate case that gives us absolutely no insight into the 
cost of the operation. It's like the observation that "sorting a list of 
one item takes constant time, regardless of the algorithm". Um, yes, it 
does. So what?

Invented constraints like "people only care about indexes really close to 
zero, or to the end of the string", effectively confuse the best possible 
case for the average case. In real life, people do care about random 
access to strings and the average case is more important than the best 
case. That's why strings are typically implemented as arrays of fixed-
width characters rather than linked lists, and Haskell (which doesn't) 
explicitly warns you that they don't and that your string-handling code 
is likely to be slow.

Of course, you can swap space and complexity for time, e.g. ropes, or use 
cached pointers to character locations in the hope that indexes will be 
clustered rather than scattered randomly. These more complex 
implementations typically use as much memory than, and performance is 
often no better than, a dead simple implementation that uses a simple 
array of fixed width characters. Most harmful, it means that a simple 
indexing operation may be fast on average, and occasionally degenerate to 
slow. The only thing worse than "This program is slow" is "This program 
is *occasionally* slow, and I can't predict when".

Not surprising, almost all programming languages in common usage use 
arrays of fixed-width characters as their native string type. Python 3.3 
will now do the same thing, with the memory optimization that the fixed-
width will be per string and selected from 1, 2, or 4 bytes according to 
the minimum width needed, rather than choosing between 2 and 4 bytes when 
you build the Python compiler.

This is a big positive step forward with a pretty simple algorithm and 
will strongly help encourage the use of Unicode by taking much of the 
sting out of the perennial complaints that "Unicode wastes space". Not so 
wasteful any more.



-- 
Steven

[toc] | [prev] | [next] | [standalone]


#27524 — Re: Abuse of Big Oh notation

Fromwxjmfauth@gmail.com
Date2012-08-20 11:42 -0700
SubjectRe: Abuse of Big Oh notation
Message-ID<f0878bc4-539b-4570-a138-ea413e6d9fbb@googlegroups.com>
In reply to#27477
By chance and luckily, first attempt.

IDLE, Windows 7.0 Pro 32, Pentium Dual Core 2.6, RAM 2 Go

Py 3.2.3
>>> timeit.repeat("('€'*100+'€'*100).replace('€', 'œ')")
[1.6939567134893707, 1.672874290786993, 1.6761219212298073]

Py 3.3.0b2
>>> timeit.repeat("('€'*100+'€'*100).replace('€', 'œ')")
[7.924470733910917, 7.8554985620787345, 7.878623849091914]


Console

c:\python32\python -m timeit "('€'*100+'€'*100).replace('€'
, 'œ')"
1000000 loops, best of 3: 1.48 usec per loop
c:\python33\python -m timeit "('€'*100+'€'*100).replace('€'
, 'œ')"
100000 loops, best of 3: 7.62 usec per loop

Note
The used characters are not members of the latin-1 coding
scheme (btw an *unusable* coding).
They are however charaters in cp1252 and mac-roman.

jmf

[toc] | [prev] | [next] | [standalone]


#27542 — Re: Abuse of Big Oh notation

FromNed Deily <nad@acm.org>
Date2012-08-20 18:19 -0700
SubjectRe: Abuse of Big Oh notation
Message-ID<mailman.3586.1345511985.4697.python-list@python.org>
In reply to#27524
In article <f0878bc4-539b-4570-a138-ea413e6d9fbb@googlegroups.com>,
 wxjmfauth@gmail.com wrote:
> Note
> The used characters are not members of the latin-1 coding
> scheme (btw an *unusable* coding).
> They are however charaters in cp1252 and mac-roman.

mac-roman is an obsolete encoding that was used in MacOS 9 and MacOS 
Classic systems of previous decades. Except in a very small and 
shrinking number of legacy applications, it isn't used in modern systems 
and hasn't been widely used for a long time.  MacOS X systems generally 
use standard Unicode encodings, usually UTF-8, for external 
representations.  Forget about mac-roman; it is not relevant.

-- 
 Ned Deily,
 nad@acm.org

[toc] | [prev] | [next] | [standalone]


#27556 — Abuse of subject, was Re: Abuse of Big Oh notation

FromPeter Otten <__peter__@web.de>
Date2012-08-21 09:52 +0200
SubjectAbuse of subject, was Re: Abuse of Big Oh notation
Message-ID<mailman.3594.1345535550.4697.python-list@python.org>
In reply to#27524
wxjmfauth@gmail.com wrote:

> By chance and luckily, first attempt.
 
> c:\python32\python -m timeit "('€'*100+'€'*100).replace('€'
> , 'œ')"
> 1000000 loops, best of 3: 1.48 usec per loop
> c:\python33\python -m timeit "('€'*100+'€'*100).replace('€'
> , 'œ')"
> 100000 loops, best of 3: 7.62 usec per loop

OK, that is roughly factor 5. Let's see what I get:

$ python3.2 -m timeit '("€"*100+"€"*100).replace("€", "œ")'
100000 loops, best of 3: 1.8 usec per loop
$ python3.3 -m timeit '("€"*100+"€"*100).replace("€", "œ")'
10000 loops, best of 3: 9.11 usec per loop

That is factor 5, too. So I can replicate your measurement on an AMD64 Linux 
system with self-built 3.3 versus system 3.2.

> Note
> The used characters are not members of the latin-1 coding
> scheme (btw an *unusable* coding).
> They are however charaters in cp1252 and mac-roman.

You seem to imply that the slowdown is connected to the inability of latin-1 
to encode "œ" and "€" (to take the examples relevant to the above 
microbench). So let's repeat with latin-1 characters:

$ python3.2 -m timeit '("ä"*100+"ä"*100).replace("ä", "ß")'
100000 loops, best of 3: 1.76 usec per loop
$ python3.3 -m timeit '("ä"*100+"ä"*100).replace("ä", "ß")'
10000 loops, best of 3: 10.3 usec per loop

Hm, the slowdown is even a tad bigger. So we can safely dismiss your theory 
that an unfortunate choice of the 8 bit encoding is causing it. Do you 
agree?

[toc] | [prev] | [next] | [standalone]


Page 6 of 8 — ← Prev page 1 2 3 4 5 [6] 7 8  Next page →

Back to top | Article view | comp.lang.python


csiph-web