

Groups > comp.lang.python > #50503 > unrolled thread

RE Module Performance

Started by: Devyn Collier Johnson <devyncjohnson@gmail.com>
First post: 2013-07-11 19:44 -0400
Last post: 2013-07-18 13:17 -0700
Articles: 20 on this page of 136, 25 participants



Contents

  RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-11 19:44 -0400
    Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-12 02:23 -0700
      Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-12 19:27 +1000
      Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-12 10:39 +0100
      Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-12 19:40 +1000
      Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-12 06:45 -0400
      Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-12 16:59 +0100
      Re: RE Module Performance Peter Otten <__peter__@web.de> - 2013-07-12 18:15 +0200
      Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-13 02:21 +1000
      Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-12 13:58 -0400
        Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-13 05:37 +0000
          Re: RE Module Performance 88888 Dihedral <dihedral88888@gmail.com> - 2013-07-14 11:17 -0700
            Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-15 06:06 -0400
              Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-15 12:36 +0000
                Dihedral Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-15 08:52 -0400
                Re: Dihedral Joel Goldstick <joel.goldstick@gmail.com> - 2013-07-15 09:03 -0400
                Re: Dihedral Wayne Werner <wayne@waynewerner.com> - 2013-07-15 17:43 -0500
                Re: Dihedral Fábio Santos <fabiosantosart@gmail.com> - 2013-07-15 23:54 +0100
                Re: Dihedral Chris Angelico <rosuav@gmail.com> - 2013-07-16 08:59 +1000
                Re: Dihedral Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-16 16:06 +1000
                Re: Dihedral Stefan Behnel <stefan_ml@behnel.de> - 2013-07-24 20:08 +0200
                Re: Dihedral Chris Angelico <rosuav@gmail.com> - 2013-07-25 04:23 +1000
                Re: Dihedral Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-07-24 20:15 -0400
      Re: RE Module Performance Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-13 08:16 +1000
      Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-12 17:13 -0600
        Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-24 06:40 -0700
          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-24 23:48 +1000
          Re: RE Module Performance David Hutto <dwightdhutto@gmail.com> - 2013-07-24 10:17 -0400
          Re: RE Module Performance David Hutto <dwightdhutto@gmail.com> - 2013-07-24 10:19 -0400
          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 00:34 +1000
            Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 07:02 +0000
              Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 17:39 +1000
          Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-24 08:47 -0600
            Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-25 02:27 -0700
              Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 20:14 +1000
                Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-25 12:07 -0700
                  Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-26 05:18 +1000
                  RE: RE Module Performance "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2013-07-25 19:30 +0000
                  Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-25 21:06 -0600
          Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-24 09:00 -0600
            Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 05:56 +0000
          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 00:56 +1000
          Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-24 13:52 -0400
          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 04:15 +1000
            Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 07:15 +0000
              Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 17:58 +1000
                Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 09:22 +0000
                  Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 20:07 +1000
          Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-24 18:09 -0400
          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 08:19 +1000
          Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-24 16:59 -0600
          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 09:24 +1000
          Re: RE Module Performance Serhiy Storchaka <storchaka@gmail.com> - 2013-07-25 08:49 +0300
          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 15:58 +1000
          Re: RE Module Performance Jeremy Sanders <jeremy@jeremysanders.net> - 2013-07-25 14:36 +0100
            Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 15:26 +0000
              Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-26 01:36 +1000
                Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 17:18 +0000
                  Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-26 03:27 +1000
                  Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-25 15:45 -0500
                    Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-26 02:48 +0000
                      Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-25 21:20 -0600
                        Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 06:36 -0700
                        Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 08:46 -0700
                          Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-27 06:28 +0000
                        Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-27 03:37 +0000
                          Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-26 22:12 -0600
                            Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-27 05:04 +0000
                          Re: RE Module Performance Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-07-27 12:13 -0400
                    Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 06:19 -0700
                  Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-25 21:09 -0600
                    Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 06:21 -0700
                      Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-26 20:05 -0600
                        Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-27 11:21 -0700
                          Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-27 21:53 -0600
                            Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-28 11:13 -0700
                              Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-28 20:04 +0100
                                Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-28 12:30 -0700
                                  Re: RE Module Performance Lele Gaifax <lele@metapensiero.it> - 2013-07-28 22:45 +0200
                                  Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 22:01 +0200
                            Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-30 07:01 -0700
                              Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-30 16:38 +0200
                              Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-30 15:45 +0100
                              Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-30 17:13 +0100
                              Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-30 18:39 +0200
                              Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-30 18:14 +0100
                                Re: RE Module Performance Neil Hodgson <nhodgson@iinet.net.au> - 2013-07-31 13:09 +1000
                              Re: RE Module Performance Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-31 03:27 +1000
                              Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-30 18:40 +0100
                              Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-30 20:19 +0200
                                Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-30 12:09 -0700
                                  Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-30 21:04 +0100
                                  Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-30 21:54 -0600
                                  Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-31 05:45 +0000
                                    Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-31 08:17 +0100
                                    Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-31 13:15 -0700
                                      Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-31 21:41 +0100
                                  Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-31 10:11 +0200
                                    Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-31 01:32 -0700
                                      Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-31 10:59 +0200
                                      Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-31 08:44 -0600
                              Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-30 17:05 -0400
                              Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-30 21:30 -0600
                              Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-31 09:23 +0200
                              Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-31 08:27 -0600
                          Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 10:45 +0200
                          FSR and unicode compliance - was Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-28 09:52 -0600
                            Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-28 12:23 -0700
                              Re: FSR and unicode compliance - was Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-28 20:44 +0100
                              Re: FSR and unicode compliance - was Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 21:55 +0200
                              Re: FSR and unicode compliance - was Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-28 20:52 +0000
                                Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 04:43 -0700
                                  Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-29 12:57 +0100
                                    Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 05:56 -0700
                                    Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 07:20 -0700
                                      Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-29 15:49 +0100
                                        Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 09:31 -0700
                                  Re: FSR and unicode compliance - was Re: RE Module Performance Heiko Wundram <modelnine@modelnine.org> - 2013-07-29 14:06 +0200
                                  Re: FSR and unicode compliance - was Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-29 08:43 -0400
                          Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-28 18:03 +0100
                          Re: FSR and unicode compliance - was Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-28 13:36 -0400
                            Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 06:36 -0700
                          Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-28 19:03 +0100
                          Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-28 19:19 +0100
                          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-28 19:29 +0100
                          Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-28 15:06 -0400
                          Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-28 23:14 +0100
                          Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 20:51 +0200
                          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-29 00:07 +0100
                      Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-26 22:38 +0200
          Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-25 09:44 -0400
          Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-25 15:53 -0500
      Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-13 00:16 +0100
      Re: RE Module Performance Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-14 05:34 +1000
      Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-16 06:30 -0400
        Re: RE Module Performance 88888 Dihedral <dihedral88888@gmail.com> - 2013-07-18 13:17 -0700



#51272

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-07-26 02:48 +0000
Message-ID: <51f1e371$0$29971$c3e8da3$5496439d@news.astraweb.com>
In reply to: #51260
On Thu, 25 Jul 2013 15:45:38 -0500, Ian Kelly wrote:

> On Thu, Jul 25, 2013 at 12:18 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote:
>>
>>> On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano
>>> <steve+comp.lang.python@pearwood.info> wrote:
>>>> On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote:
>>>>> "To conserve memory, Emacs does not hold fixed-length 22-bit numbers
>>>>> that are codepoints of text characters within buffers and strings.
>>>>> Rather, Emacs uses a variable-length internal representation of
>>>>> characters, that stores each character as a sequence of 1 to 5 8-bit
>>>>> bytes, depending on the magnitude of its codepoint[1]. For example,
>>>>> any ASCII character takes up only 1 byte, a Latin-1 character takes
>>>>> up 2 bytes, etc. We call this representation of text multibyte.
>>>>
>>>> Well, you've just proven what Vim users have always suspected: Emacs
>>>> doesn't really exist.
>>>
>>> ... lolwut?
>>
>>
>> JMF has explained that it is impossible, impossible I say!, to write an
>> editor using a flexible string representation. Since Emacs uses such a
>> flexible string representation, Emacs is impossible, and therefore
>> Emacs doesn't exist.
>>
>> QED.
> 
> Except that the described representation used by Emacs is a variant of
> UTF-8, not an FSR.  It doesn't have three different possible encodings
> for the letter 'a' depending on what other characters happen to be in
> the string.
> 
> As I understand it, jmf would be perfectly happy if Python used UTF-8
> (or presumably the Emacs variant) as its internal string representation.


UTF-8 uses a flexible representation on a character-by-character basis. 
When parsing UTF-8, one needs to look at EVERY character to decide how 
many bytes you need to read. In Python 3, the flexible representation is 
on a string-by-string basis: once Python has looked at the string header, 
it can tell whether the *entire* string takes 1, 2 or 4 bytes per 
character, and the string is then fixed-width. You can't do that with 
UTF-8.

To put it in terms of pseudo-code:

# Python 3.3
def parse_string(astring):
    # Decision gets made once per string.
    if astring uses 1 byte:
        count = 1
    elif astring uses 2 bytes:
        count = 2
    else: 
        count = 4
    while not done:
        char = convert(next(count bytes))


# UTF-8
def parse_string(astring):
    while not done:
        b = next(1 byte)
        # Decision gets made for every single char
        if uses 1 byte:
            char = convert(b)
        elif uses 2 bytes:
            char = convert(b, next(1 byte))
        elif uses 3 bytes:
            char = convert(b, next(2 bytes))
        else:
            char = convert(b, next(3 bytes))
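
The two loops above can be turned into runnable Python. This is only a minimal sketch, assuming little-endian fixed-width storage and well-formed UTF-8 input with no validation. The names decode_fixed_width and decode_utf8 are illustrative and are not how CPython is actually implemented:

```python
def decode_fixed_width(data, width):
    """Decode little-endian fixed-width units; width is chosen once per string."""
    if width not in (1, 2, 4):
        raise ValueError("width must be 1, 2 or 4")
    return [
        int.from_bytes(data[i:i + width], "little")
        for i in range(0, len(data), width)
    ]


def decode_utf8(data):
    """Decode well-formed UTF-8; the width is decided again for every character."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:        # 0xxxxxxx: 1 byte
            length, value = 1, lead
        elif lead < 0xE0:      # 110xxxxx: 2 bytes
            length, value = 2, lead & 0x1F
        elif lead < 0xF0:      # 1110xxxx: 3 bytes
            length, value = 3, lead & 0x0F
        else:                  # 11110xxx: 4 bytes
            length, value = 4, lead & 0x07
        # Each continuation byte (10xxxxxx) contributes six more bits.
        for b in data[i + 1:i + length]:
            value = (value << 6) | (b & 0x3F)
        code_points.append(value)
        i += length
    return code_points


text = "a\u00e9\u20ac\U0001F600"
print(decode_utf8(text.encode("utf-8")))
print(decode_fixed_width(text.encode("utf-32-le"), 4))
```

Both calls should print the same list of code points. The difference is where the branch sits: outside the loop for the fixed-width case, inside it for UTF-8.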


So UTF-8 requires much more runtime overhead than Python 3.3, and Emacs' 
variation can in fact require more bytes per character than either. 
(UTF-8 and Python 3.3 can require up to four bytes, Emacs up to five.) 
I'm not surprised that JMF would prefer UTF-8 -- he is completely out of 
his depth, and is a fine example of the Dunning-Kruger effect in action. 
He is so sure he is right based on so little evidence.

One advantage of UTF-8 is that for some BMP characters, you can get away 
with only three bytes instead of four. For transmitting data over the 
wire, or storage on disk, that's potentially up to a 25% reduction in 
space, which is not to be sneezed at. (Although in practice it's usually 
much less than that, since the most common characters are encoded to 1 or 
2 bytes, not 4). But that comes at the cost of much more runtime 
overhead, which in my opinion makes UTF-8 a second-class string 
representation compared to fixed-width representations.
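
The size trade-off can be checked with the standard codecs. A small sketch, in which the helper name encoded_sizes and the sample characters are my own choices for illustration:

```python
def encoded_sizes(text):
    """Return the encoded length in bytes of text under three encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),   # -le avoids counting a BOM
        "utf-32": len(text.encode("utf-32-le")),
    }


for label, ch in [("ASCII", "a"), ("Latin-1", "\u00e9"),
                  ("BMP", "\u20ac"), ("astral", "\U0001F600")]:
    print(label, encoded_sizes(ch))
```

For the BMP sample (the euro sign) UTF-8 needs three bytes where a four-byte fixed-width encoding needs four, which is the 25% figure mentioned above.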



-- 
Steven



#51277

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2013-07-25 21:20 -0600
Message-ID: <mailman.5129.1374808894.3114.python-list@python.org>
In reply to: #51272
On Thu, Jul 25, 2013 at 8:48 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> UTF-8 uses a flexible representation on a character-by-character basis.
> When parsing UTF-8, one needs to look at EVERY character to decide how
> many bytes you need to read. In Python 3, the flexible representation is
> on a string-by-string basis: once Python has looked at the string header,
> it can tell whether the *entire* string takes 1, 2 or 4 bytes per
> character, and the string is then fixed-width. You can't do that with
> UTF-8.

UTF-8 does not use a flexible representation.  A codec that is
encoding a string in UTF-8 and examining a particular character does
not have any choice of how to encode that character; there is exactly
one sequence of bits that is the UTF-8 encoding for the character.
Further, for any given sequence of code points there is exactly one
sequence of bytes that is the UTF-8 encoding of those code points.  In
contrast, with the FSR there are as many as three different sequences
of bytes that encode a sequence of code points, with one of them (the
shortest) being canonical.  That's what makes it flexible.
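
The "three different sequences of bytes" point can be illustrated with a rough model of the three widths. This sketch packs a string with the latin-1, utf-16-le and utf-32-le codecs as stand-ins for the 1-, 2- and 4-byte forms. The FORMS table and fixed_width_forms are hypothetical names, and the sketch assumes the text contains no lone surrogates:

```python
# (bytes per code point, largest code point that fits, codec used to pack it)
FORMS = (
    (1, 0xFF, "latin-1"),
    (2, 0xFFFF, "utf-16-le"),
    (4, 0x10FFFF, "utf-32-le"),
)


def fixed_width_forms(text):
    """Return every fixed-width form able to hold text, keyed by width."""
    largest = max(map(ord, text), default=0)
    return {
        width: text.encode(codec)
        for width, limit, codec in FORMS
        if largest <= limit
    }


print(fixed_width_forms("a"))        # three candidate forms, shortest is canonical
print(fixed_width_forms("a\u20ac"))  # the euro sign rules out the 1-byte form
print("a".encode("utf-8"))           # UTF-8 offers exactly one byte sequence
```

In this model the letter 'a' occupies one, two or four bytes depending on its neighbours, whereas its UTF-8 encoding never changes.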

Anyway, my point was just that Emacs is not a counter-example to jmf's
claim about implementing text editors, because UTF-8 is not what he
(or anybody else) is referring to when speaking of the FSR or
"something like the FSR".



#51301

From: wxjmfauth@gmail.com
Date: 2013-07-26 06:36 -0700
Message-ID: <1ca6bb15-ce10-4a23-82fc-aa0af0f7ac97@googlegroups.com>
In reply to: #51277
On Friday, 26 July 2013 05:20:45 UTC+2, Ian wrote:
> On Thu, Jul 25, 2013 at 8:48 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
> > UTF-8 uses a flexible representation on a character-by-character basis.
> > When parsing UTF-8, one needs to look at EVERY character to decide how
> > many bytes you need to read. In Python 3, the flexible representation is
> > on a string-by-string basis: once Python has looked at the string header,
> > it can tell whether the *entire* string takes 1, 2 or 4 bytes per
> > character, and the string is then fixed-width. You can't do that with
> > UTF-8.
>
> UTF-8 does not use a flexible representation.  A codec that is
> encoding a string in UTF-8 and examining a particular character does
> not have any choice of how to encode that character; there is exactly
> one sequence of bits that is the UTF-8 encoding for the character.
> Further, for any given sequence of code points there is exactly one
> sequence of bytes that is the UTF-8 encoding of those code points.  In
> contrast, with the FSR there are as many as three different sequences
> of bytes that encode a sequence of code points, with one of them (the
> shortest) being canonical.  That's what makes it flexible.
>
> Anyway, my point was just that Emacs is not a counter-example to jmf's
> claim about implementing text editors, because UTF-8 is not what he
> (or anybody else) is referring to when speaking of the FSR or
> "something like the FSR".

--------


BTW, it is not necessary to use an endorsed Unicode coding
scheme (utf*); a string literal would have been possible,
but then one runs into memory issues.

All these UTFs follow the basic coding scheme.

I repeat:
A coding scheme works with a unique set of characters,
and its implementation works with a unique set of
encoded code points (the UTFs, in the case of Unicode).

And again, that is why we live today with all these coding
schemes. Or, to take the problem from the other side:
it is because one has to work with a unique set of
encoded code points that all these coding schemes had to
be created.

The UTFs were not created by newbies ;-)

jmf



#51311

From: wxjmfauth@gmail.com
Date: 2013-07-26 08:46 -0700
Message-ID: <d790ab57-2b96-4ae7-a86d-4229484115e1@googlegroups.com>
In reply to: #51277
On Friday, 26 July 2013 05:20:45 UTC+2, Ian wrote:
> On Thu, Jul 25, 2013 at 8:48 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
> > UTF-8 uses a flexible representation on a character-by-character basis.
> > When parsing UTF-8, one needs to look at EVERY character to decide how
> > many bytes you need to read. In Python 3, the flexible representation is
> > on a string-by-string basis: once Python has looked at the string header,
> > it can tell whether the *entire* string takes 1, 2 or 4 bytes per
> > character, and the string is then fixed-width. You can't do that with
> > UTF-8.
>
> UTF-8 does not use a flexible representation.  A codec that is
> encoding a string in UTF-8 and examining a particular character does
> not have any choice of how to encode that character; there is exactly
> one sequence of bits that is the UTF-8 encoding for the character.
> Further, for any given sequence of code points there is exactly one
> sequence of bytes that is the UTF-8 encoding of those code points.  In
> contrast, with the FSR there are as many as three different sequences
> of bytes that encode a sequence of code points, with one of them (the
> shortest) being canonical.  That's what makes it flexible.
>
> Anyway, my point was just that Emacs is not a counter-example to jmf's
> claim about implementing text editors, because UTF-8 is not what he
> (or anybody else) is referring to when speaking of the FSR or
> "something like the FSR".

-----

Let's be clear. I understand perfectly well what UTF-8 is,
and it is for that precise reason that I put the "editor"
on the table as an example.

This FSR is not *a* coding scheme. It is more of a composite
coding scheme. (And from there, all the problems.)

BTW, I'm pleased to read "sequence of bits" and not bytes.
Again, UTF transformers produce sequences of bits,
called Unicode Transformation Units, with lengths of
8/16/32 *bits*, hence the names utf8/16/32.
UCS transformers produce (produced) bytes, hence
the names ucs-2/4.

jmf



#51337

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-07-27 06:28 +0000
Message-ID: <51f368a8$0$29971$c3e8da3$5496439d@news.astraweb.com>
In reply to: #51311
On Fri, 26 Jul 2013 08:46:58 -0700, wxjmfauth wrote:

> BTW, I'm pleased to read "sequence of bits" and not bytes. Again, UTF
> transformers produce sequences of bits, called Unicode Transformation
> Units, with lengths of 8/16/32 *bits*, hence the names utf8/16/32.
> UCS transformers produce (produced) bytes, hence the names
> ucs-2/4.


Not only does your distinction between bits and bytes make no practical 
difference on nearly all hardware in common use today[1], but the Unicode 
Consortium disagrees with you, and defines UTF in terms of bytes:

"A Unicode transformation format (UTF) is an algorithmic mapping from 
every Unicode code point (except surrogate code points) to a unique byte 
sequence."

http://www.unicode.org/faq/utf_bom.html#gen2
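
Part of the disagreement is terminology. As far as I understand the Unicode Standard, its encoding forms are described in terms of code units of 8, 16 or 32 bits, while its encoding schemes are described in terms of bytes. On hardware with 8-bit bytes the two views are easy to relate. A small sketch, where code_unit_count is a hypothetical helper name:

```python
def code_unit_count(text, bits):
    """Count the code units of the given size (8, 16 or 32 bits) needed for text."""
    codec = {8: "utf-8", 16: "utf-16-le", 32: "utf-32-le"}[bits]
    bytes_per_unit = bits // 8
    return len(text.encode(codec)) // bytes_per_unit


for bits in (8, 16, 32):
    print(bits, code_unit_count("\U0001F600", bits))
```

For a single code point outside the BMP this should report four 8-bit units, two 16-bit units and one 32-bit unit, all of which come to four bytes.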




[1] There may still be some old supercomputers in use where a byte is 
more than 8 bits, but they're unlikely to support Unicode.

-- 
Steven



#51331

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-07-27 03:37 +0000
Message-ID: <51f3406f$0$29971$c3e8da3$5496439d@news.astraweb.com>
In reply to: #51277
On Thu, 25 Jul 2013 21:20:45 -0600, Ian Kelly wrote:

> On Thu, Jul 25, 2013 at 8:48 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> UTF-8 uses a flexible representation on a character-by-character basis.
>> When parsing UTF-8, one needs to look at EVERY character to decide how
>> many bytes you need to read. In Python 3, the flexible representation
>> is on a string-by-string basis: once Python has looked at the string
>> header, it can tell whether the *entire* string takes 1, 2 or 4 bytes
>> per character, and the string is then fixed-width. You can't do that
>> with UTF-8.
> 
> UTF-8 does not use a flexible representation.

I disagree, and so does Jeremy Sanders who first pointed out the 
similarity between Emacs' UTF-8 and Python's FSR. I'll quote from the 
Emacs documentation again:

"To conserve memory, Emacs does not hold fixed-length 22-bit numbers that
are codepoints of text characters within buffers and strings. Rather,
Emacs uses a variable-length internal representation of characters, that
stores each character as a sequence of 1 to 5 8-bit bytes, depending on
the magnitude of its codepoint. For example, any ASCII character takes
up only 1 byte, a Latin-1 character takes up 2 bytes, etc."

And the Python FSR:

"To conserve memory, Python does not hold fixed-length 21-bit numbers that
are codepoints of text characters within buffers and strings. Rather,
Python uses a variable-length internal representation of characters, that
stores each character as a sequence of 1, 2 or 4 8-bit bytes, depending on
the magnitude of the largest codepoint in the string. For example, any 
all-ASCII or all-Latin1 string takes up only 1 byte per character, an all-
BMP string takes up 2 bytes per character, etc."

See the similarity now? Both flexibly change the width used by code-
points, UTF-8 based on the code-point itself regardless of the rest of 
the string, Python based on the largest code-point in the string.
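
The per-string width can be observed from Python itself. This is a sketch that assumes CPython 3.3 or later, since sys.getsizeof reports implementation-specific sizes and other interpreters may behave differently. The helper name bytes_per_code_point is my own:

```python
import sys


def bytes_per_code_point(ch, n=1000):
    """Estimate storage per code point by comparing two string lengths.

    Subtracting the two sizes cancels the fixed object header.
    """
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n


for label, ch in [("ASCII", "a"), ("Latin-1", "\u00e9"),
                  ("BMP", "\u20ac"), ("astral", "\U0001F600")]:
    print(label, bytes_per_code_point(ch))

# One wide code point widens every character in the string.
print(sys.getsizeof("a" * 1000), sys.getsizeof("a" * 1000 + "\U0001F600"))
```

On CPython 3.3+ the loop is expected to report 1, 1, 2 and 4 bytes per code point, and the last line should show the string growing to roughly four times its size once a single astral character is appended.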


[...]
> Anyway, my point was just that Emacs is not a counter-example to jmf's
> claim about implementing text editors, because UTF-8 is not what he (or
> anybody else) is referring to when speaking of the FSR or "something
> like the FSR".

Whether JMF can see the similarities between different implementations of 
strings or not is beside the point: those similarities do exist. As do 
the differences, of course, but in this case the differences are in 
favour of Python's FSR. Even if your string is entirely Latin1, a UTF-8 
implementation *cannot know that*, and still has to walk the string byte-
by-byte checking whether the current code point requires 1, 2, 3, or 4 
bytes, while a FSR implementation can simply record the fact that the 
string is pure Latin1 at creation time, and then treat it as fixed-width 
from then on.
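
The practical consequence shows up when fetching the n-th code point. A minimal sketch, assuming well-formed input and little-endian fixed-width storage. The function names are illustrative only:

```python
def nth_code_point_fixed(data, width, n):
    """Constant time: jump straight to the n-th unit of a fixed-width buffer."""
    start = n * width
    if not 0 <= start < len(data):
        raise IndexError("code point index out of range")
    return int.from_bytes(data[start:start + width], "little")


def utf8_length(lead):
    """Length in bytes of the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4


def nth_code_point_utf8(data, n):
    """Linear time: every earlier character must be inspected and skipped."""
    i = 0
    for _ in range(n):
        i += utf8_length(data[i])
    length = utf8_length(data[i])
    return ord(data[i:i + length].decode("utf-8"))


text = "na\u00efve \u20ac\U0001F600!"
print(nth_code_point_fixed(text.encode("utf-32-le"), 4, 6) == ord(text[6]))
print(nth_code_point_utf8(text.encode("utf-8"), 6) == ord(text[6]))
```

Both lookups agree on the answer. Only the fixed-width one gets there without touching the preceding characters.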

JMF claims that FSR is "impossible" to use efficiently, and yet he 
supports encoding schemes which are *less* efficient. Go figure. He tells 
us he has no problem with any of the established UTF encodings, and yet 
the FSR internally uses UTF-16 and UTF-32. (Technically, it's UCS-2, not 
UTF-16, since there are no surrogate pairs. But the difference is 
insignificant.)

Having watched this issue from Day One when JMF first complained about 
it, I believe this is entirely about denying any benefit to ASCII users. 
Had Python implemented a system identical to the current FSR except that 
it added a fourth category, "all ASCII", which used an eight-byte 
encoding scheme (thus making ASCII strings twice as expensive as strings 
including code points from the Supplementary Multilingual Planes), JMF 
would be the scheme's number one champion.

I cannot see any other rational explanation for why JMF prefers broken, 
buggy Unicode implementations, or implementations which are equally 
expensive for all strings, over one which is demonstrably correct, 
demonstrably saves memory, and for realistic, non-contrived benchmarks, 
demonstrably faster, except that he wants to punish ASCII users more than 
he wants to support Unicode users.


-- 
Steven



#51334

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2013-07-26 22:12 -0600
Message-ID: <mailman.5161.1374898818.3114.python-list@python.org>
In reply to: #51331
On Fri, Jul 26, 2013 at 9:37 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> See the similarity now? Both flexibly change the width used by code-
> points, UTF-8 based on the code-point itself regardless of the rest of
> the string, Python based on the largest code-point in the string.

No, I think we're just using the word "flexible" differently.  In my
view, simply being variable-width does not make an encoding "flexible"
in the sense of the FSR.  But I'm not going to keep repeating myself
in order to argue about it.

> Having watched this issue from Day One when JMF first complained about
> it, I believe this is entirely about denying any benefit to ASCII users.
> Had Python implemented a system identical to the current FSR except that
> it added a fourth category, "all ASCII", which used an eight-byte
> encoding scheme (thus making ASCII strings twice as expensive as strings
> including code points from the Supplementary Multilingual Planes), JMF
> would be the scheme's number one champion.

I agree.  In fact I made a similar observation back in December:

http://mail.python.org/pipermail/python-list/2012-December/636942.html



#51335

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-07-27 05:04 +0000
Message-ID: <51f354c3$0$29971$c3e8da3$5496439d@news.astraweb.com>
In reply to: #51334

On Fri, 26 Jul 2013 22:12:36 -0600, Ian Kelly wrote:

> On Fri, Jul 26, 2013 at 9:37 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> See the similarity now? Both flexibly change the width used by code-
>> points, UTF-8 based on the code-point itself regardless of the rest of
>> the string, Python based on the largest code-point in the string.
> 
> No, I think we're just using the word "flexible" differently.  In my
> view, simply being variable-width does not make an encoding "flexible"
> in the sense of the FSR.  But I'm not going to keep repeating myself in
> order to argue about it.

But I paid for the full half hour!

http://en.wikipedia.org/wiki/The_Argument_Sketch


-- 
Steven



#51366

From: Dennis Lee Bieber <wlfraed@ix.netcom.com>
Date: 2013-07-27 12:13 -0400
Message-ID: <mailman.5179.1374941632.3114.python-list@python.org>
In reply to: #51331

On 27 Jul 2013 03:37:20 GMT, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> declaimed the following:


>I disagree, and so does Jeremy Sanders who first pointed out the 
>similarity between Emacs' UTF-8 and Python's FSR. I'll quote from the 
>Emacs documentation again:
>
>"To conserve memory, Emacs does not hold fixed-length 22-bit numbers that
>are codepoints of text characters within buffers and strings. Rather,
>Emacs uses a variable-length internal representation of characters, that
>stores each character as a sequence of 1 to 5 8-bit bytes, depending on
>the magnitude of its codepoint. For example, any ASCII character takes
>up only 1 byte, a Latin-1 character takes up 2 bytes, etc."
>
>And the Python FSR:
>
>"To conserve memory, Python does not hold fixed-length 21-bit numbers that
>are codepoints of text characters within buffers and strings. Rather,
>Python uses a variable-length internal representation of characters, that
>stores each character as a sequence of 1 to 4 8-bit bytes, depending on
>the magnitude of the largest codepoint in the string. For example, any 
>all-ASCII or all-Latin1 string takes up only 1 byte per character, an all-
>BMP string takes up 2 bytes per character, etc."
>

	As I read those: Python states "any all-ASCII or all-Latin1 string
takes up only 1 byte per character", etc. I.e., the width used for the entire
STRING is the smallest one that can encode every character in the string.

	The Emacs statement doesn't specify a "string"; it implies, in "any
ASCII character takes up only 1 byte, a Latin-1 character takes up 2 bytes,
etc.", that a string can contain characters of mixed lengths.
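
	A minimal sketch of that distinction, assuming Python 3. The function
names are made up for illustration, and fsr_width() only mirrors the rule
quoted above; it does not inspect the interpreter's actual storage.

```python
def utf8_widths(text):
    """Bytes UTF-8 spends on each code point, decided character by character."""
    return [len(ch.encode("utf-8")) for ch in text]


def fsr_width(text):
    """Bytes per code point under the quoted FSR rule: a single width for the
    whole string, chosen from the largest code point it contains."""
    largest = max(map(ord, text), default=0)
    if largest < 0x100:        # all-ASCII or all-Latin-1
        return 1
    if largest < 0x10000:      # all-BMP
        return 2
    return 4                   # at least one supplementary-plane code point


sample = "a\u00fc\u20ac\U0001d11e"   # 'a', 'ü', '€', and a musical G clef
print(utf8_widths(sample))           # [1, 2, 3, 4]  (mixed within one string)
print(fsr_width(sample))             # 4             (same for every character)
print(fsr_width("a\u00fc"))          # 1
```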

-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
    wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/



#51299

From: wxjmfauth@gmail.com
Date: 2013-07-26 06:19 -0700
Message-ID: <606b75ca-e1eb-4a69-a23d-6f0372004114@googlegroups.com>
In reply to: #51260

On Thursday, 25 July 2013 22:45:38 UTC+2, Ian wrote:
> On Thu, Jul 25, 2013 at 12:18 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
> > On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote:
> >
> >> On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano
> >> <steve+comp.lang.python@pearwood.info> wrote:
> >>> On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote:
> >>>> "To conserve memory, Emacs does not hold fixed-length 22-bit numbers
> >>>> that are codepoints of text characters within buffers and strings.
> >>>> Rather, Emacs uses a variable-length internal representation of
> >>>> characters, that stores each character as a sequence of 1 to 5 8-bit
> >>>> bytes, depending on the magnitude of its codepoint[1]. For example,
> >>>> any ASCII character takes up only 1 byte, a Latin-1 character takes up
> >>>> 2 bytes, etc. We call this representation of text multibyte.
> >>>
> >>> Well, you've just proven what Vim users have always suspected: Emacs
> >>> doesn't really exist.
> >>
> >> ... lolwut?
> >
> > JMF has explained that it is impossible, impossible I say!, to write an
> > editor using a flexible string representation. Since Emacs uses such a
> > flexible string representation, Emacs is impossible, and therefore Emacs
> > doesn't exist.
> >
> > QED.
>
> Except that the described representation used by Emacs is a variant of
> UTF-8, not an FSR.  It doesn't have three different possible encodings
> for the letter 'a' depending on what other characters happen to be in
> the string.
>
> As I understand it, jmf would be perfectly happy if Python used UTF-8
> (or presumably the Emacs variant) as its internal string
> representation.

------

And Emacs is probably working smoothly.

Your comment summarized all this stuff very correctly and
very concisely.

utf8/16/32? I do not care. They are all working correctly,
smoothly and efficiently. In fact, these utf's are already
doing correctly what this FSR is doing in a wrong way.

My preference? utf32. Why? It is the simplest and
consequently the best-performing choice. I'm not a narrow-minded
ascii user. (I do not claim to belong to those who
are squaring the circle; I claim to
belong to those who know that squaring the circle
is not possible).

Note: text processing tools or tools that have to process
characters, and the tools to build these tools, are all
moving to utf32, if not already done. There are technical
reasons behind this, which go beyond
pure raw Unicode. They are however still 100% Unicode
compliant.

jmf



#51274

From: Michael Torrie <torriem@gmail.com>
Date: 2013-07-25 21:09 -0600
Message-ID: <mailman.5127.1374808181.3114.python-list@python.org>
In reply to: #51247

On 07/25/2013 11:18 AM, Steven D'Aprano wrote:
> JMF has explained that it is impossible, impossible I say!, to write an 
> editor using a flexible string representation. Since Emacs uses such a 
> flexible string representation, Emacs is impossible, and therefore Emacs 
> doesn't exist.

Now I'm even more confused.  He once pointed to Go as an example of how
unicode should be done in a language.  yet Go uses UTF-8 I think.

But I don't think UTF-8 is what JMF refers to as "flexible string
representation."  FSR does use 1,2 or 4 bytes per character, but each
character in the string uses the same width.  That's different from
UTF-8 or UTF-16, which is variable width per character.



#51300

From: wxjmfauth@gmail.com
Date: 2013-07-26 06:21 -0700
Message-ID: <8203e802-9dc5-44c5-9547-6e1947ee224b@googlegroups.com>
In reply to: #51274

On Friday, 26 July 2013 05:09:34 UTC+2, Michael Torrie wrote:
> On 07/25/2013 11:18 AM, Steven D'Aprano wrote:
> > JMF has explained that it is impossible, impossible I say!, to write an
> > editor using a flexible string representation. Since Emacs uses such a
> > flexible string representation, Emacs is impossible, and therefore Emacs
> > doesn't exist.
>
> Now I'm even more confused.  He once pointed to Go as an example of how
> unicode should be done in a language.  yet Go uses UTF-8 I think.
>
> But I don't think UTF-8 is what JMF refers to as "flexible string
> representation."  FSR does use 1,2 or 4 bytes per character, but each
> character in the string uses the same width.  That's different from
> UTF-8 or UTF-16, which is variable width per character.

-----

>>> sys.getsizeof('––') - sys.getsizeof('–')

I have already explained / commented this.

--------


Hint: To understand Unicode (and every coding scheme), you should
understand "utf". The how and the *why*.

jmf



#51328

From: Michael Torrie <torriem@gmail.com>
Date: 2013-07-26 20:05 -0600
Message-ID: <mailman.5160.1374890711.3114.python-list@python.org>
In reply to: #51300

On 07/26/2013 07:21 AM, wxjmfauth@gmail.com wrote:
>>>> sys.getsizeof('––') - sys.getsizeof('–')
> 
> I have already explained / commented this.

Maybe it got lost in translation, but I don't understand your point with
that.

> Hint: To understand Unicode (and every coding scheme), you should
> understand "utf". The how and the *why*.

Hmm, so if python used utf-8 internally to represent unicode strings
would not that punish *all* users (not just non-ascii users) since
searching a string for a certain character position requires an O(n)
operation?  UTF-32 I could see (and indeed that's essentially what FSR
uses when necessary does it not?), but not utf-8 or utf-16.
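
A minimal sketch of why that lookup is O(n), assuming valid UTF-8 input and
no auxiliary index; the function name is made up for illustration:

```python
def utf8_offset_of_index(data: bytes, index: int) -> int:
    """Return the byte offset of the code point at `index` in UTF-8 `data`.

    With nothing but the bytes, the only way to get there is to walk from
    the start and count lead bytes, so the cost grows with `index`.
    """
    if index < 0:
        raise IndexError(index)
    seen = 0
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:    # not a continuation byte: a code point starts here
            if seen == index:
                return offset
            seen += 1
    raise IndexError(index)


data = "a\u00fc\u20acz".encode("utf-8")   # encoded widths: 1, 2, 3, 1
print(utf8_offset_of_index(data, 3))       # 6
```

With a fixed width per string, the same lookup is just index * width.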



#51340

From: wxjmfauth@gmail.com
Date: 2013-07-27 11:21 -0700
Message-ID: <f4bb2528-930e-4c0a-820e-66f00ac2b5b6@googlegroups.com>
In reply to: #51328

On Saturday, 27 July 2013 04:05:03 UTC+2, Michael Torrie wrote:
> On 07/26/2013 07:21 AM, wxjmfauth@gmail.com wrote:
> >>>> sys.getsizeof('––') - sys.getsizeof('–')
> >
> > I have already explained / commented this.
>
> Maybe it got lost in translation, but I don't understand your point with
> that.
>
> > Hint: To understand Unicode (and every coding scheme), you should
> > understand "utf". The how and the *why*.
>
> Hmm, so if python used utf-8 internally to represent unicode strings
> would not that punish *all* users (not just non-ascii users) since
> searching a string for a certain character position requires an O(n)
> operation?  UTF-32 I could see (and indeed that's essentially what FSR
> uses when necessary does it not?), but not utf-8 or utf-16.

------

Did you read my previous link? Unicode Character Encoding Model.
Did you understand it?

Unicode only, no FSR. (I skip some points, but I still attempt to
be correct.)

Unicode is a four-step process.
[ {unique set of characters}  --> {unique set of code points, the
"labels"} -->  {unique set of encoded code points} ] --> implementation
(bytes)
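
A small sketch of that chain for one character, assuming Python 3 and its
standard codecs; the variable names are only for illustration:

```python
import struct

ch = "\u20ac"                # the character: EURO SIGN
code_point = ord(ch)         # its code point, the "label": 0x20AC

# Encoded code points (code units) in each encoding form.
utf8_units = list(ch.encode("utf-8"))                             # three 8-bit units
utf16_units = list(struct.unpack("<1H", ch.encode("utf-16-le")))  # one 16-bit unit
utf32_units = list(struct.unpack("<1I", ch.encode("utf-32-le")))  # one 32-bit unit

print(hex(code_point))                 # 0x20ac
print([hex(u) for u in utf8_units])    # ['0xe2', '0x82', '0xac']
print([hex(u) for u in utf16_units])   # ['0x20ac']
print([hex(u) for u in utf32_units])   # ['0x20ac']

# Byte order only matters at the last step, when 16/32-bit units become bytes.
print(ch.encode("utf-16-le").hex(), ch.encode("utf-16-be").hex())   # ac20 20ac
```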

First point to notice. "Pure unicode", [...], is different from
the "implementation". *This is a deliberate choice*.

The critical step is the path {unique set of characters} --->
{unique set of encoded code points}, done in such a way that
the implementation can "work comfortably" with this *unique* set
of encoded code points. Conceptually, the implementation works
with a unique set of "already prepared encoded code points".
This is a very critical step. To explain it in a dirty way:
in the above chain, this problem is "already" eliminated and
solved. It is like a byte/char coding scheme, where this step is
a no-op.

Now, and if you wish, this is a separate/different problem.
To create this unique set of encoded code points, "Unicode"
uses these "utf(s)". I repeat again, a confusing name, for the
process and the result of the process. (I neglect ucs).
What are these? Chunks of bits, groups of 8/16/32 bits, words.
It is up to the implementation to convert these sequences
of bits into bytes, ***if you wish to convert these into bytes!***.
Surprise! Why not put two of the 32-bit words in a 64-bit
"machine"? (see golang / rune / int32).

Back to utf. utfs are not only elements of a unique set of encoded
code points. They have an interesting feature. Each "utf chunk"
holds intrinsically the character (in fact the code point) it is
supposed to represent. In utf-32, the obvious case, it is just
the code point. In utf-8, that's the first chunk which helps and
utf-16 is a mixed case (utf-8 / utf-32). In other words, in an
implementation using bytes, for any pointer position it is always
possible to find the corresponding encoded code point and from this
the corresponding character without any "programmed" information. See
my editor example, how to find the char under the caret? In fact,
a silly example, how can the caret be positioned or moved, if
the underlying corresponding encoded code point can not be
discerned!

Next step, and another separate problem.
Why all these utf versions? It is always the
same story. Some prefer universality (utf-32) and
some prefer, well, some kind of conservatism. utf-8 is
more complicated, it demands more work and logically,
in an expected way, some performance regression.
utf-8 is more suited to producing bytes, utf16/32 to
internal processing. utf-8 had no choice but to lose the
indexing. And so on.
Fact: all these coding schemes work with a unique
set of encoded code points (surprise again, it's like a byte
string!). The loss of performance of utf-8 is very minimal
compared to the loss of performance one can get with
a multiple coding scheme. This kind of work has been done,
and if my information is correct, even by the creators
of utf-8. (There are sometimes good scientists).

There are plenty of advantages in using utf instead of
something else, and advantages in fields other than just
the pure coding.
utf-16/32 schemes have the advantage of ditching ascii
forever. The ascii concept no longer exists.

One should also understand that all this stuff was
not created from scratch. It was a balance between
existing technologies. MS stuck with the idea "no more
ascii, let's use ucs-2", and the *x world held back Unicode
adoption as much as possible. utf-8 is one of the compromises for
the adoption of Unicode. In retrospect, a not so good
compromise.

Computer scientists are funny scientists. They do love
to solve the problems they created themselves.

-----

Quickly. sys.getsizeof() in the light of what I explained.

1) As this FSR works with multiple encodings, it has to keep
track of the encoding. It puts it in the overhead of the str
class (overhead = real overhead + encoding). In such
an absurd way that a

>>> sys.getsizeof('€')
40

needs 14 bytes more than a

>>> sys.getsizeof('z')
26

You may vary the length of the str. The problem is
still here. Not bad for a coding scheme.

2) Take a look at this. Get rid of the overhead.

>>> sys.getsizeof('b'*1000000 + 'c')
1000026
>>> sys.getsizeof('b'*1000000 + '€')
2000040

What does it mean? It means that Python has to
reencode a str every time it is necessary because
it works with multiple codings.

This FSR is not even a copy of the utf-8.
>>> len(('b'*1000000 + '€').encode('utf-8'))
1000003

utf-8 or any (utf) never need and never spend their time
in reencoding.

3) Unicode compliance. We know in retrospect that latin-1
was a bad choice. Unusable for 17 European languages.
Believe it or not. 20 years of Unicode incubation is not
long enough to learn it. When discussing once with a French
Python core dev, one with commit access, he did not know one
can not use latin-1 for the French language! BTW, I proposed
to the French devs to test the FSR with the set of characters
recognized by the "Imprimerie Nationale", some kind of
legal French authority regarding characters and typography.
Never heard of it. Of course, I did it.


In short
FSR = bad performance + bad memory management + non unicode
compliance.

Good point. FSR, nice tool for those who wish to teach
Unicode. It is not every day, one has such an opportunity.

---------

I'm practically no longer programming or writing applications.
I have still been active, observing all this Unicode world for
more than a decade: languages (go, c#, Python, Ruby), text
processing systems (esp. Unicode TeX engines) and font technology.
Very, very interesting.


jmf



#51380

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2013-07-27 21:53 -0600
Message-ID: <mailman.5188.1374983652.3114.python-list@python.org>
In reply to: #51340

On Sat, Jul 27, 2013 at 12:21 PM,  <wxjmfauth@gmail.com> wrote:
> Back to utf. utfs are not only elements of a unique set of encoded
> code points. They have an interesting feature. Each "utf chunk"
> holds intrinsically the character (in fact the code point) it is
> supposed to represent. In utf-32, the obvious case, it is just
> the code point. In utf-8, that's the first chunk which helps and
> utf-16 is a mixed case (utf-8 / utf-32). In other words, in an
> implementation using bytes, for any pointer position it is always
> possible to find the corresponding encoded code point and from this
> the corresponding character without any "programmed" information. See
> my editor example, how to find the char under the caret? In fact,
> a silly example, how can the caret be positioned or moved, if
> the underlying corresponding encoded code point can not be
> discerned!

Yes, given a pointer location into a utf-8 or utf-16 string, it is
easy to determine the identity of the code point at that location.
But this is not often a useful operation, save for resynchronization
in the case that the string data is corrupted.  The caret of an editor
does not conceptually correspond to a pointer location, but to a
character index.  Given a particular character index (e.g. 127504), an
editor must be able to determine the identity and/or the memory
location of the character at that index, and for UTF-8 and UTF-16
without an auxiliary data structure that is a O(n) operation.
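
A minimal sketch of the cheap operation, finding the code point that covers
an arbitrary byte position, assuming valid UTF-8; the function name is made
up for illustration. It steps back at most three bytes regardless of the
string's length, which is why it works for resynchronization but does not
answer "which character is at index N?":

```python
def code_point_at_byte(data: bytes, pos: int) -> str:
    """Decode the code point whose encoding covers byte `pos` of valid UTF-8."""
    start = pos
    while data[start] & 0xC0 == 0x80:   # step back over continuation bytes
        start -= 1
    lead = data[start]
    if lead < 0x80:                     # 0xxxxxxx: single byte
        length = 1
    elif lead >> 5 == 0b110:            # 110xxxxx: two bytes
        length = 2
    elif lead >> 4 == 0b1110:           # 1110xxxx: three bytes
        length = 3
    else:                               # 11110xxx: four bytes
        length = 4
    return data[start:start + length].decode("utf-8")


data = "a\u20acz".encode("utf-8")     # b'a', three bytes for the euro sign, b'z'
print(code_point_at_byte(data, 2))    # € (byte 2 is in the middle of its encoding)
```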

> 2) Take a look at this. Get rid of the overhead.
>
>>>> sys.getsizeof('b'*1000000 + 'c')
> 1000026
>>>> sys.getsizeof('b'*1000000 + '€')
> 2000040
>
> What does it mean? It means that Python has to
> reencode a str every time it is necessary because
> it works with multiple codings.

Large strings in practical usage do not need to be resized like this
often.  Python 3.3 has been in production use for months now, and you
still have yet to produce any real-world application code that
demonstrates a performance regression.  If there is no real-world
regression, then there is no problem.

> 3) Unicode compliance. We know in retrospect that latin-1
> was a bad choice. Unusable for 17 European languages.
> Believe it or not. 20 years of Unicode incubation is not
> long enough to learn it. When discussing once with a French
> Python core dev, one with commit access, he did not know one
> can not use latin-1 for the French language!

Probably because for many French strings, one can.  As far as I am
aware, the only characters that are missing from Latin-1 are the Euro
sign (an unfortunate victim of history), the ligature œ (I have no
doubt that many users just type oe anyway), and the rare capital Ÿ
(the miniscule version is present in Latin-1).  All French strings
that are fortunate enough to be absent these characters can be
represented in Latin-1 and so will have a 1-byte width in the FSR.
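
A quick way to check a given string, assuming Python 3; fits_latin1 is a
made-up helper name and the sample words are only examples:

```python
def fits_latin1(text: str) -> bool:
    """True if every code point is below U+0100, i.e. encodable as Latin-1."""
    try:
        text.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True


for word in ["élève", "cœur", "5 €", "L'HAŸ-LES-ROSES", "l'haÿ-les-roses"]:
    print(word, fits_latin1(word))
# élève True
# cœur False
# 5 € False
# L'HAŸ-LES-ROSES False
# l'haÿ-les-roses True
```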



#51392

From: wxjmfauth@gmail.com
Date: 2013-07-28 11:13 -0700
Message-ID: <4117e08f-941a-42d5-87b6-09e66f8c7b60@googlegroups.com>
In reply to: #51380

On Sunday, 28 July 2013 05:53:22 UTC+2, Ian wrote:
> On Sat, Jul 27, 2013 at 12:21 PM,  <wxjmfauth@gmail.com> wrote:
> > Back to utf. utfs are not only elements of a unique set of encoded
> > code points. They have an interesting feature. Each "utf chunk"
> > holds intrinsically the character (in fact the code point) it is
> > supposed to represent. In utf-32, the obvious case, it is just
> > the code point. In utf-8, that's the first chunk which helps and
> > utf-16 is a mixed case (utf-8 / utf-32). In other words, in an
> > implementation using bytes, for any pointer position it is always
> > possible to find the corresponding encoded code point and from this
> > the corresponding character without any "programmed" information. See
> > my editor example, how to find the char under the caret? In fact,
> > a silly example, how can the caret be positioned or moved, if
> > the underlying corresponding encoded code point can not be
> > discerned!
>
> Yes, given a pointer location into a utf-8 or utf-16 string, it is
> easy to determine the identity of the code point at that location.
> But this is not often a useful operation, save for resynchronization
> in the case that the string data is corrupted.  The caret of an editor
> does not conceptually correspond to a pointer location, but to a
> character index.  Given a particular character index (e.g. 127504), an
> editor must be able to determine the identity and/or the memory
> location of the character at that index, and for UTF-8 and UTF-16
> without an auxiliary data structure that is a O(n) operation.
>
> > 2) Take a look at this. Get rid of the overhead.
> >
> >>>> sys.getsizeof('b'*1000000 + 'c')
> > 1000026
> >>>> sys.getsizeof('b'*1000000 + '€')
> > 2000040
> >
> > What does it mean? It means that Python has to
> > reencode a str every time it is necessary because
> > it works with multiple codings.
>
> Large strings in practical usage do not need to be resized like this
> often.  Python 3.3 has been in production use for months now, and you
> still have yet to produce any real-world application code that
> demonstrates a performance regression.  If there is no real-world
> regression, then there is no problem.
>
> > 3) Unicode compliance. We know in retrospect that latin-1
> > was a bad choice. Unusable for 17 European languages.
> > Believe it or not. 20 years of Unicode incubation is not
> > long enough to learn it. When discussing once with a French
> > Python core dev, one with commit access, he did not know one
> > can not use latin-1 for the French language!
>
> Probably because for many French strings, one can.  As far as I am
> aware, the only characters that are missing from Latin-1 are the Euro
> sign (an unfortunate victim of history), the ligature œ (I have no
> doubt that many users just type oe anyway), and the rare capital Ÿ
> (the miniscule version is present in Latin-1).  All French strings
> that are fortunate enough to be absent these characters can be
> represented in Latin-1 and so will have a 1-byte width in the FSR.

------

Latin-1? That's not even true.

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('ü')
38
>>> sys.getsizeof('aa')
27
>>> sys.getsizeof('aü')
39


jmf



#51397

From: MRAB <python@mrabarnett.plus.com>
Date: 2013-07-28 20:04 +0100
Message-ID: <mailman.5200.1375038295.3114.python-list@python.org>
In reply to: #51392

On 28/07/2013 19:13, wxjmfauth@gmail.com wrote:
> On Sunday, 28 July 2013 05:53:22 UTC+2, Ian wrote:
>> On Sat, Jul 27, 2013 at 12:21 PM,  <wxjmfauth@gmail.com> wrote:
>> > Back to utf. utfs are not only elements of a unique set of encoded
>> > code points. They have an interesting feature. Each "utf chunk"
>> > holds intrinsically the character (in fact the code point) it is
>> > supposed to represent. In utf-32, the obvious case, it is just
>> > the code point. In utf-8, that's the first chunk which helps and
>> > utf-16 is a mixed case (utf-8 / utf-32). In other words, in an
>> > implementation using bytes, for any pointer position it is always
>> > possible to find the corresponding encoded code point and from this
>> > the corresponding character without any "programmed" information. See
>> > my editor example, how to find the char under the caret? In fact,
>> > a silly example, how can the caret be positioned or moved, if
>> > the underlying corresponding encoded code point can not be
>> > discerned!
>>
>> Yes, given a pointer location into a utf-8 or utf-16 string, it is
>> easy to determine the identity of the code point at that location.
>> But this is not often a useful operation, save for resynchronization
>> in the case that the string data is corrupted.  The caret of an editor
>> does not conceptually correspond to a pointer location, but to a
>> character index.  Given a particular character index (e.g. 127504), an
>> editor must be able to determine the identity and/or the memory
>> location of the character at that index, and for UTF-8 and UTF-16
>> without an auxiliary data structure that is a O(n) operation.
>>
>> > 2) Take a look at this. Get rid of the overhead.
>> >
>> >>>> sys.getsizeof('b'*1000000 + 'c')
>> > 1000026
>> >>>> sys.getsizeof('b'*1000000 + '€')
>> > 2000040
>> >
>> > What does it mean? It means that Python has to
>> > reencode a str every time it is necessary because
>> > it works with multiple codings.
>>
>> Large strings in practical usage do not need to be resized like this
>> often.  Python 3.3 has been in production use for months now, and you
>> still have yet to produce any real-world application code that
>> demonstrates a performance regression.  If there is no real-world
>> regression, then there is no problem.
>>
>> > 3) Unicode compliance. We know in retrospect that latin-1
>> > was a bad choice. Unusable for 17 European languages.
>> > Believe it or not. 20 years of Unicode incubation is not
>> > long enough to learn it. When discussing once with a French
>> > Python core dev, one with commit access, he did not know one
>> > can not use latin-1 for the French language!
>>
>> Probably because for many French strings, one can.  As far as I am
>> aware, the only characters that are missing from Latin-1 are the Euro
>> sign (an unfortunate victim of history), the ligature œ (I have no
>> doubt that many users just type oe anyway), and the rare capital Ÿ
>> (the miniscule version is present in Latin-1).  All French strings
>> that are fortunate enough to be absent these characters can be
>> represented in Latin-1 and so will have a 1-byte width in the FSR.
>
> ------
>
> Latin-1? That's not even true.
>
>>>> sys.getsizeof('a')
> 26
>>>> sys.getsizeof('ü')
> 38
>>>> sys.getsizeof('aa')
> 27
>>>> sys.getsizeof('aü')
> 39
>

 >>> sys.getsizeof('aa') - sys.getsizeof('a')
1

One byte per codepoint.

 >>> sys.getsizeof('üü') - sys.getsizeof('ü')
1

Also one byte per codepoint.

 >>> sys.getsizeof('ü') - sys.getsizeof('a')
12

Clearly there's more going on here.

FSR is an optimisation. You'll always be able to find some
circumstances where an optimisation makes things worse, but what
matters is the overall result.
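
The same subtraction can be generalised. This is a sketch that assumes
CPython 3.3 or later (sys.getsizeof is implementation-specific, and other
interpreters may report something different); bytes_per_code_point is a
made-up helper name:

```python
import sys


def bytes_per_code_point(ch: str, n: int = 1000) -> float:
    """Marginal cost of one more copy of `ch`; the fixed header cancels out."""
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) / n


for ch in ["a", "\u00fc", "\u20ac", "\U0001d11e"]:
    print(hex(ord(ch)), bytes_per_code_point(ch))
# 0x61 1.0
# 0xfc 1.0
# 0x20ac 2.0
# 0x1d11e 4.0
```

As far as I know, the constant gap between 'a' and 'ü' is the larger object
header CPython uses for non-ASCII strings; it does not grow with the length.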



#51402

From: wxjmfauth@gmail.com
Date: 2013-07-28 12:30 -0700
Message-ID: <95b91473-b707-4288-860c-d02fda7af1ea@googlegroups.com>
In reply to: #51397

On Sunday, 28 July 2013 21:04:56 UTC+2, MRAB wrote:
> On 28/07/2013 19:13, wxjmfauth@gmail.com wrote:
> > On Sunday, 28 July 2013 05:53:22 UTC+2, Ian wrote:
> >> On Sat, Jul 27, 2013 at 12:21 PM,  <wxjmfauth@gmail.com> wrote:
> >> > Back to utf. utfs are not only elements of a unique set of encoded
> >> > code points. They have an interesting feature. Each "utf chunk"
> >> > holds intrinsically the character (in fact the code point) it is
> >> > supposed to represent. In utf-32, the obvious case, it is just
> >> > the code point. In utf-8, that's the first chunk which helps and
> >> > utf-16 is a mixed case (utf-8 / utf-32). In other words, in an
> >> > implementation using bytes, for any pointer position it is always
> >> > possible to find the corresponding encoded code point and from this
> >> > the corresponding character without any "programmed" information. See
> >> > my editor example, how to find the char under the caret? In fact,
> >> > a silly example, how can the caret be positioned or moved, if
> >> > the underlying corresponding encoded code point can not be
> >> > discerned!
> >>
> >> Yes, given a pointer location into a utf-8 or utf-16 string, it is
> >> easy to determine the identity of the code point at that location.
> >> But this is not often a useful operation, save for resynchronization
> >> in the case that the string data is corrupted.  The caret of an editor
> >> does not conceptually correspond to a pointer location, but to a
> >> character index.  Given a particular character index (e.g. 127504), an
> >> editor must be able to determine the identity and/or the memory
> >> location of the character at that index, and for UTF-8 and UTF-16
> >> without an auxiliary data structure that is a O(n) operation.
> >>
> >> > 2) Take a look at this. Get rid of the overhead.
> >> >
> >> >>>> sys.getsizeof('b'*1000000 + 'c')
> >> > 1000026
> >> >>>> sys.getsizeof('b'*1000000 + '€')
> >> > 2000040
> >> >
> >> > What does it mean? It means that Python has to
> >> > reencode a str every time it is necessary because
> >> > it works with multiple codings.
> >>
> >> Large strings in practical usage do not need to be resized like this
> >> often.  Python 3.3 has been in production use for months now, and you
> >> still have yet to produce any real-world application code that
> >> demonstrates a performance regression.  If there is no real-world
> >> regression, then there is no problem.
> >>
> >> > 3) Unicode compliance. We know in retrospect that latin-1
> >> > was a bad choice. Unusable for 17 European languages.
> >> > Believe it or not. 20 years of Unicode incubation is not
> >> > long enough to learn it. When discussing once with a French
> >> > Python core dev, one with commit access, he did not know one
> >> > can not use latin-1 for the French language!
> >>
> >> Probably because for many French strings, one can.  As far as I am
> >> aware, the only characters that are missing from Latin-1 are the Euro
> >> sign (an unfortunate victim of history), the ligature œ (I have no
> >> doubt that many users just type oe anyway), and the rare capital Ÿ
> >> (the miniscule version is present in Latin-1).  All French strings
> >> that are fortunate enough to be absent these characters can be
> >> represented in Latin-1 and so will have a 1-byte width in the FSR.
> >
> > ------
> >
> > Latin-1? That's not even true.
> >
> >>>> sys.getsizeof('a')
> > 26
> >>>> sys.getsizeof('ü')
> > 38
> >>>> sys.getsizeof('aa')
> > 27
> >>>> sys.getsizeof('aü')
> > 39
> >
>
>  >>> sys.getsizeof('aa') - sys.getsizeof('a')
> 1
>
> One byte per codepoint.
>
>  >>> sys.getsizeof('üü') - sys.getsizeof('ü')
> 1
>
> Also one byte per codepoint.
>
>  >>> sys.getsizeof('ü') - sys.getsizeof('a')
> 12
>
> Clearly there's more going on here.
>
> FSR is an optimisation. You'll always be able to find some
> circumstances where an optimisation makes things worse, but what
> matters is the overall result.


----

Yes, I know my examples are always wrong, never
real examples.

If I point to long strings, I should have pointed to short strings.
If I point to a short string (a char), it is not long enough.
Strings as dict keys? No, the problem is in the Python dict.
Performance? No, that's a memory issue.
Memory? No, it's a question of keeping performance.
I am using this char? No, you should not, it's not common.
The nabla operator in a TeX file? Who is so stupid as to use
that char?
Many times I'm just mimicking 'BDFL' examples, just
by replacing "his" ASCII chars with non-ASCII chars ;-)
And so on.

To be short, this is *never* the FSR, always something
else.

Suggestion. Start by solving all these "micro-benchmarks",
all the memory cases. It's a good start, no?


jmf



#51406

From: Lele Gaifax <lele@metapensiero.it>
Date: 2013-07-28 22:45 +0200
Message-ID: <mailman.5205.1375044339.3114.python-list@python.org>
In reply to: #51402
wxjmfauth@gmail.com writes:

> Suggestion. Start by solving all these "micro-benchmarks",
> all the memory cases. It's a good start, no?

Since you seem to be the only one who has this dramatic problem with such
micro-benchmarks, which BTW have nothing to do with "unicode compliance",
I'd suggest *you* find a better implementation and propose it to
the core devs.

An even better suggestion, with due respect, is to get a life and find
something more interesting to do, or at least better arguments :-)

ciao, lele.
-- 
nickname: Lele Gaifax | When I start living off what I thought yesterday,
real: Emanuele Gaifas | I will begin to fear those who copy me.
lele@metapensiero.it  |                 -- Fortunato Depero, 1929.



#51426

From: Antoon Pardon <antoon.pardon@rece.vub.ac.be>
Date: 2013-07-28 22:01 +0200
Message-ID: <mailman.5216.1375082822.3114.python-list@python.org>
In reply to: #51402
On 28-07-13 21:30, wxjmfauth@gmail.com wrote:

> To be short, this is *never* the FSR, always something
> else.
>
> Suggestion. Start by solving all these "micro-benchmarks",
> all the memory cases. It's a good start, no?
>

There is nothing to solve. Unicode doesn't force implementations
to use the same size of memory for strings of the same length.

So your pointing out examples of same-length strings that don't
use the same amount of memory doesn't point at anything that
must be solved.

-- 
Antoon Pardon
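
The point made above can be checked directly. What follows is only a
minimal sketch, assuming CPython 3.6 or later (the FSR itself dates from
3.3; sys.getsizeof is implementation-specific and may behave differently
or be unavailable elsewhere). The helper name bytes_per_code_point is
made up for this illustration. It subtracts the sizes of two freshly
built strings so that the fixed per-object header cancels out, leaving
only the per-code-point storage cost.

```python
import sys


def bytes_per_code_point(ch, n=1000):
    """Return the storage cost per code point for a string of repeated `ch`.

    Hypothetical helper for illustration; relies on CPython's sys.getsizeof.
    """
    short = ch * n          # fresh string of n code points
    longer = ch * (2 * n)   # fresh string of 2n code points
    # The fixed object header is the same for both and cancels in the difference.
    return (sys.getsizeof(longer) - sys.getsizeof(short)) // n


samples = [
    ("ASCII", "a"),
    ("Latin-1", "\u00fc"),       # ü
    ("BMP", "\u20ac"),           # € (not in Latin-1)
    ("astral", "\U0001F600"),    # a code point outside the BMP
]

for label, ch in samples:
    text = ch * 10
    # Every string has the same length; only the internal width differs.
    print(f"{label:8} len={len(text)} bytes/code point={bytes_per_code_point(ch)}")
```

On CPython all four strings report the same length, while the
per-code-point cost comes out as 1, 1, 2 and 4 bytes. The strings behave
identically as sequences of code points; only the storage differs, which
is what Unicode leaves to the implementation.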


