
RE Module Performance

Started by	Devyn Collier Johnson <devyncjohnson@gmail.com>
First post	2013-07-11 19:44 -0400
Last post	2013-07-18 13:17 -0700
Articles	20 on this page of 136 — 25 participants


  RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-11 19:44 -0400
    Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-12 02:23 -0700
      Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-12 19:27 +1000
      Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-12 10:39 +0100
      Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-12 19:40 +1000
      Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-12 06:45 -0400
      Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-12 16:59 +0100
      Re: RE Module Performance Peter Otten <__peter__@web.de> - 2013-07-12 18:15 +0200
      Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-13 02:21 +1000
      Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-12 13:58 -0400
        Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-13 05:37 +0000
          Re: RE Module Performance 88888 Dihedral <dihedral88888@gmail.com> - 2013-07-14 11:17 -0700
            Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-15 06:06 -0400
              Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-15 12:36 +0000
                Dihedral Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-15 08:52 -0400
                Re: Dihedral Joel Goldstick <joel.goldstick@gmail.com> - 2013-07-15 09:03 -0400
                Re: Dihedral Wayne Werner <wayne@waynewerner.com> - 2013-07-15 17:43 -0500
                Re: Dihedral Fábio Santos <fabiosantosart@gmail.com> - 2013-07-15 23:54 +0100
                Re: Dihedral Chris Angelico <rosuav@gmail.com> - 2013-07-16 08:59 +1000
                Re: Dihedral Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-16 16:06 +1000
                Re: Dihedral Stefan Behnel <stefan_ml@behnel.de> - 2013-07-24 20:08 +0200
                Re: Dihedral Chris Angelico <rosuav@gmail.com> - 2013-07-25 04:23 +1000
                Re: Dihedral Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-07-24 20:15 -0400
      Re: RE Module Performance Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-13 08:16 +1000
      Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-12 17:13 -0600
        Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-24 06:40 -0700
          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-24 23:48 +1000
          Re: RE Module Performance David Hutto <dwightdhutto@gmail.com> - 2013-07-24 10:17 -0400
          Re: RE Module Performance David Hutto <dwightdhutto@gmail.com> - 2013-07-24 10:19 -0400
          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 00:34 +1000
            Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 07:02 +0000
              Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 17:39 +1000
          Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-24 08:47 -0600
            Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-25 02:27 -0700
              Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 20:14 +1000
                Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-25 12:07 -0700
                  Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-26 05:18 +1000
                  RE: RE Module Performance "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2013-07-25 19:30 +0000
                  Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-25 21:06 -0600
          Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-24 09:00 -0600
            Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 05:56 +0000
          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 00:56 +1000
          Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-24 13:52 -0400
          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 04:15 +1000
            Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 07:15 +0000
              Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 17:58 +1000
                Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 09:22 +0000
                  Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 20:07 +1000
          Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-24 18:09 -0400
          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 08:19 +1000
          Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-24 16:59 -0600
          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 09:24 +1000
          Re: RE Module Performance Serhiy Storchaka <storchaka@gmail.com> - 2013-07-25 08:49 +0300
          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 15:58 +1000
          Re: RE Module Performance Jeremy Sanders <jeremy@jeremysanders.net> - 2013-07-25 14:36 +0100
            Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 15:26 +0000
              Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-26 01:36 +1000
                Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 17:18 +0000
                  Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-26 03:27 +1000
                  Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-25 15:45 -0500
                    Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-26 02:48 +0000
                      Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-25 21:20 -0600
                        Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 06:36 -0700
                        Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 08:46 -0700
                          Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-27 06:28 +0000
                        Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-27 03:37 +0000
                          Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-26 22:12 -0600
                            Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-27 05:04 +0000
                          Re: RE Module Performance Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-07-27 12:13 -0400
                    Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 06:19 -0700
                  Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-25 21:09 -0600
                    Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 06:21 -0700
                      Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-26 20:05 -0600
                        Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-27 11:21 -0700
                          Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-27 21:53 -0600
                            Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-28 11:13 -0700
                              Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-28 20:04 +0100
                                Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-28 12:30 -0700
                                  Re: RE Module Performance Lele Gaifax <lele@metapensiero.it> - 2013-07-28 22:45 +0200
                                  Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 22:01 +0200
                            Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-30 07:01 -0700
                              Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-30 16:38 +0200
                              Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-30 15:45 +0100
                              Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-30 17:13 +0100
                              Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-30 18:39 +0200
                              Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-30 18:14 +0100
                                Re: RE Module Performance Neil Hodgson <nhodgson@iinet.net.au> - 2013-07-31 13:09 +1000
                              Re: RE Module Performance Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-31 03:27 +1000
                              Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-30 18:40 +0100
                              Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-30 20:19 +0200
                                Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-30 12:09 -0700
                                  Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-30 21:04 +0100
                                  Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-30 21:54 -0600
                                  Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-31 05:45 +0000
                                    Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-31 08:17 +0100
                                    Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-31 13:15 -0700
                                      Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-31 21:41 +0100
                                  Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-31 10:11 +0200
                                    Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-31 01:32 -0700
                                      Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-31 10:59 +0200
                                      Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-31 08:44 -0600
                              Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-30 17:05 -0400
                              Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-30 21:30 -0600
                              Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-31 09:23 +0200
                              Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-31 08:27 -0600
                          Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 10:45 +0200
                          FSR and unicode compliance - was Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-28 09:52 -0600
                            Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-28 12:23 -0700
                              Re: FSR and unicode compliance - was Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-28 20:44 +0100
                              Re: FSR and unicode compliance - was Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 21:55 +0200
                              Re: FSR and unicode compliance - was Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-28 20:52 +0000
                                Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 04:43 -0700
                                  Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-29 12:57 +0100
                                    Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 05:56 -0700
                                    Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 07:20 -0700
                                      Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-29 15:49 +0100
                                        Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 09:31 -0700
                                  Re: FSR and unicode compliance - was Re: RE Module Performance Heiko Wundram <modelnine@modelnine.org> - 2013-07-29 14:06 +0200
                                  Re: FSR and unicode compliance - was Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-29 08:43 -0400
                          Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-28 18:03 +0100
                          Re: FSR and unicode compliance - was Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-28 13:36 -0400
                            Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 06:36 -0700
                          Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-28 19:03 +0100
                          Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-28 19:19 +0100
                          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-28 19:29 +0100
                          Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-28 15:06 -0400
                          Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-28 23:14 +0100
                          Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 20:51 +0200
                          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-29 00:07 +0100
                      Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-26 22:38 +0200
          Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-25 09:44 -0400
          Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-25 15:53 -0500
      Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-13 00:16 +0100
      Re: RE Module Performance Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-14 05:34 +1000
      Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-16 06:30 -0400
        Re: RE Module Performance 88888 Dihedral <dihedral88888@gmail.com> - 2013-07-18 13:17 -0700


#51190

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-07-25 05:56 +0000
Message-ID	<51f0be1e$0$29971$c3e8da3$5496439d@news.astraweb.com>
In reply to	#51139

On Wed, 24 Jul 2013 09:00:39 -0600, Michael Torrie wrote about JMF:

> His most recent argument that Python should use UTF as a representation
> is very strange to be honest.

He's not arguing for anything, he is just hating on anything that gives 
even the tiniest benefit to ASCII users. This isn't about Python 3.3
hurting non-ASCII users, because that is demonstrably untrue: they are
*better off* in Python 3.3. This is about denying even a tiny benefit to 
ASCII users.

In Python 3.3, non-ASCII users have these advantages compared to previous 
versions:

- strings will usually take less memory, and aside from trivial changes 
to the object header, they never take more memory than a wide build would 
use;

- consequently nearly all objects will take less memory (especially 
builtins and standard library objects, which are all ASCII), since 
objects contain dozens of internal strings (attribute and method names in 
__dict__, class name, etc.);

- consequently whole-application benchmarks show most applications will 
use significantly less memory, which leads to faster speeds;

- you cannot break surrogate pairs apart by accident, which you can do in 
narrow builds;

- in previous versions, code which works when run in a wide build may 
fail in a narrow build, but that is no longer an issue since the 
distinction between wide and narrow builds is gone;

- Latin1 users, which includes JMF himself, will likewise see memory 
savings, since Latin1 strings will take half the size of narrow builds 
and a quarter the size of wide builds.
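
A rough way to observe these per-character savings on CPython 3.3 or later is to measure how much a string grows as it gets longer. This is only a sketch: `sys.getsizeof` reports CPython-specific sizes, the fixed object header differs between versions, and the helper name `bytes_per_char` is made up for illustration.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate per-character storage by comparing two string lengths.

    Subtracting the two sizes cancels the fixed object header, whose
    exact size varies between CPython versions.
    """
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n

samples = {
    "ASCII": "a",
    "Latin-1": "\xe9",        # e with acute accent
    "BMP": "\u20ac",          # euro sign
    "astral": "\U0001F60A",   # emoji outside the BMP
}
widths = {name: bytes_per_char(ch) for name, ch in samples.items()}
for name, width in widths.items():
    print(name, width)
```

On a pre-3.3 wide build every one of these strings would have cost 4 bytes per character, which is the comparison the list above is making.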


The cost of all these benefits is a small overhead when creating a string 
in the first place, and some purely internal added complication to the 
string implementation.

I'm the first to argue against complication unless there is a 
corresponding benefit. This is a case where the benefit has proven itself 
doubly: Python 3.3's Unicode implementation is *more correct* than 
before, and it uses less memory to do so.

> The cons of UTF are apparent and widely
> known.  The main con is that UTF strings are O(n) for indexing a
> position within the string.

Not so for UTF-32.



-- 
Steven


#51141

From	Chris Angelico <rosuav@gmail.com>
Date	2013-07-25 00:56 +1000
Message-ID	<mailman.5043.1374678213.3114.python-list@python.org>
In reply to	#51131

On Thu, Jul 25, 2013 at 12:47 AM, Michael Torrie <torriem@gmail.com> wrote:
> On 07/24/2013 07:40 AM, wxjmfauth@gmail.com wrote:
>> Sorry, you are not understanding Unicode. What is a Unicode
>> Transformation Format (UTF), what is the goal of a UTF and
>> why it is important for an implementation to work with a UTF.
>
> Really?  Enlighten me.
>
> Personally, I would never use UTF as a representation *in memory* for a
> unicode string if it were up to me.  Why?  Because UTF characters are
> not uniform in byte width so accessing positions within the string is
> terribly slow and has to always be done by starting at the beginning of
> the string.  That's at minimum O(n) compared to FSR's O(1).  Surely you
> understand this.  Do you dispute this fact?

Take care here; UTF is a general term for Unicode Transformation Formats,
of which one (UTF-32) is fixed-width. Every other UTF-n is variable
width, though, so your point still stands. UTF-32 is the basis for
Python's FSR.
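
To make the fixed-width versus variable-width point concrete, here is a minimal sketch that indexes into raw encoded bytes. The function names are hypothetical, and it works on `bytes` objects rather than on any real string implementation: UTF-32 can jump straight to the offset, while UTF-8 has to be walked from the start.

```python
def codepoint_at_utf32(data, index):
    """O(1): every code point occupies exactly 4 bytes in UTF-32."""
    if not 0 <= index < len(data) // 4:
        raise IndexError(index)
    start = 4 * index
    return data[start:start + 4].decode("utf-32-le")

def codepoint_at_utf8(data, index):
    """O(n): walk from the start, counting lead bytes."""
    if index < 0:
        raise IndexError(index)
    seen = -1      # index of the code point we are currently inside
    start = None   # byte offset where the wanted code point begins
    for pos, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:   # not a continuation byte (0b10xxxxxx)
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(index)
    return data[start:].decode("utf-8")

text = "a\xe9\u20ac\U0001F60Az"   # 1-, 2-, 3-, 4- and 1-byte UTF-8 sequences
utf32 = text.encode("utf-32-le")
utf8 = text.encode("utf-8")
```

Both functions return the same character for the same index; only the amount of work differs.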

ChrisA


#51155

From	Terry Reedy <tjreedy@udel.edu>
Date	2013-07-24 13:52 -0400
Message-ID	<mailman.5056.1374688374.3114.python-list@python.org>
In reply to	#51131

On 7/24/2013 11:00 AM, Michael Torrie wrote:
> On 07/24/2013 08:34 AM, Chris Angelico wrote:
>> Frankly, Python's strings are a *terrible* internal representation
>> for an editor widget - not because of PEP 393, but simply because
>> they are immutable, and every keypress would result in a rebuilding
>> of the string. On the flip side, I could quite plausibly imagine
>> using a list of strings;

I used exactly this, a list of strings, for a Python-coded text-only 
mock editor to replace the tk Text widget in idle tests. It works fine 
for the purpose. For small test texts, the inefficiency of immutable 
strings is not relevant.
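
For readers who have not seen the idea, a list-of-strings buffer might look roughly like the following. This is a hypothetical sketch, not the actual idle test mock; the class and method names are invented. The point is that an edit rebuilds only the affected line, not the whole text.

```python
class LineBuffer:
    """Hypothetical text buffer stored as a list of lines (no newlines inside)."""

    def __init__(self, text=""):
        self.lines = text.split("\n")

    def insert(self, row, col, text):
        """Insert text at (row, col); only that one line string is rebuilt."""
        line = self.lines[row]
        pieces = (line[:col] + text + line[col:]).split("\n")
        self.lines[row:row + 1] = pieces

    def get(self):
        """Return the whole buffer as a single string."""
        return "\n".join(self.lines)

buf = LineBuffer("hello\nworld")
buf.insert(1, 5, "!\nagain")
print(buf.lines)
```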

Tk apparently uses a C-coded btree rather than a Python list. All 
details are hidden, unless one finds and reads the source ;-), but
it uses C arrays rather than Python strings.

>> In this usage, the FSR is beneficial, as it's possible to have
>> different strings at different widths.

For my purpose, the mock Text works the same in 2.7 and 3.3+.

> Maybe, but simply thinking logically, FSR and UCS-4 are equivalent in
> pros and cons,

They both have the pro that indexing is direct *and correct*. The cons 
are different.

> and the cons of using UCS-2 (the old narrow builds) are
> well known.  UCS-2 simply cannot represent all of unicode correctly.

Python's narrow builds, at least for several releases, were in between 
UCS-2 and UTF-16 in that they used surrogates to represent all unicodes
but did not correct indexing for the presence of astral chars. This is a 
nuisance for those who do use astral chars, such as emotes and CJK name 
chars, on an everyday basis.
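
The narrow-build behaviour can be imitated on a modern Python by looking at the UTF-16 code units directly. This is a sketch under that assumption, not a reproduction of the old build; the helper name `utf16_code_units` is made up.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

s = "a\U0001F60Ab"               # 'a', one astral emoji, 'b'
units = utf16_code_units(s)

print(len(s))                    # code points, as Python 3.3+ counts them
print(len(units))                # code units, as a narrow build counted them
print(hex(units[1]), hex(units[2]))   # the two halves of the surrogate pair
print(ord(s[2]) == units[2])     # index 2 no longer means the same thing
```

With uncorrected indexing, position 2 lands on the second half of the pair instead of on 'b', which is the nuisance described above.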

-- 
Terry Jan Reedy


#51159

From	Chris Angelico <rosuav@gmail.com>
Date	2013-07-25 04:15 +1000
Message-ID	<mailman.5059.1374689751.3114.python-list@python.org>
In reply to	#51131

On Thu, Jul 25, 2013 at 3:52 AM, Terry Reedy <tjreedy@udel.edu> wrote:
> On 7/24/2013 11:00 AM, Michael Torrie wrote:
>>
>> On 07/24/2013 08:34 AM, Chris Angelico wrote:
>>>
>>> Frankly, Python's strings are a *terrible* internal representation
>>> for an editor widget - not because of PEP 393, but simply because
>>> they are immutable, and every keypress would result in a rebuilding
>>> of the string. On the flip side, I could quite plausibly imagine
>>> using a list of strings;
>
>
> I used exactly this, a list of strings, for a Python-coded text-only mock
> editor to replace the tk Text widget in idle tests. It works fine for the
> purpose. For small test texts, the inefficiency of immutable strings is not
> relevant.
>
> Tk apparently uses a C-coded btree rather than a Python list. All details
> are hidden, unless one finds and reads the source ;-), but it uses C
> arrays rather than Python strings.
>
>
>>> In this usage, the FSR is beneficial, as it's possible to have
>>> different strings at different widths.
>
>
> For my purpose, the mock Text works the same in 2.7 and 3.3+.

Thanks for that report! And yes, it's going to behave exactly the same
way, because its underlying structure is an ordered list of ordered
lists of Unicode codepoints, ergo 3.3/PEP 393 is merely a question of
performance. But if you put your code onto a narrow build, you'll have
issues as seen below.

>> Maybe, but simply thinking logically, FSR and UCS-4 are equivalent in
>> pros and cons,
>
> They both have the pro that indexing is direct *and correct*. The cons are
> different.

They're close enough, though. It's simply a performance tradeoff - use
the memory all the time, or take a bit of overhead to give yourself
the option of using less memory. The difference is negligible compared
to...

>> and the cons of using UCS-2 (the old narrow builds) are
>> well known.  UCS-2 simply cannot represent all of unicode correctly.
>
> Python's narrow builds, at least for several releases, were in between UCS-2
> and UTF-16 in that they used surrogates to represent all unicodes but did
> not correct indexing for the presence of astral chars. This is a nuisance
> for those who do use astral chars, such as emotes and CJK name chars, on an
> everyday basis.

... this. If nobody had ever thought of doing a multi-format string
representation, I could well imagine the Python core devs debating
whether the cost of UTF-32 strings is worth the correctness and
consistency improvements... and most likely concluding that narrow
builds get abolished. And if any other language (eg ECMAScript)
decides to move from UTF-16 to UTF-32, I would wholeheartedly support
the move, even if it broke code to do so. To my mind, exposing UTF-16
surrogates to the application is a bug to be fixed, not a feature to
be maintained. But since we can get the best of both worlds with only
a small amount of overhead, I really don't see why anyone should be
objecting.

ChrisA


#51200

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-07-25 07:15 +0000
Message-ID	<51f0d0a0$0$29971$c3e8da3$5496439d@news.astraweb.com>
In reply to	#51159

On Thu, 25 Jul 2013 04:15:42 +1000, Chris Angelico wrote:

> If nobody had ever thought of doing a multi-format string
> representation, I could well imagine the Python core devs debating
> whether the cost of UTF-32 strings is worth the correctness and
> consistency improvements... and most likely concluding that narrow
> builds get abolished. And if any other language (eg ECMAScript) decides
> to move from UTF-16 to UTF-32, I would wholeheartedly support the move,
> even if it broke code to do so.

Unfortunately, so long as most language designers are European-centric, 
there is going to be a lot of push-back against any attempt to fix (say) 
Javascript, or Java just for the sake of "a bunch of dead languages" in 
the SMPs. Thank goodness for emoji. Wait til the young kids start 
complaining that their emoticons and emoji are broken in Javascript, and 
eventually it will get fixed. It may take a decade, for the young kids to 
grow up and take over Javascript from the old-codgers, but it will happen.

> To my mind, exposing UTF-16 surrogates
> to the application is a bug to be fixed, not a feature to be maintained.

This, times a thousand.

It is *possible* to have non-buggy string routines using UTF-16, but the 
implementation is a lot more complex than most language developers can be 
bothered with. I'm not aware of any language that uses UTF-16 internally 
that doesn't give wrong results for surrogate pairs.

-- 
Steven


#51203

From	Chris Angelico <rosuav@gmail.com>
Date	2013-07-25 17:58 +1000
Message-ID	<mailman.5084.1374739093.3114.python-list@python.org>
In reply to	#51200

On Thu, Jul 25, 2013 at 5:15 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Thu, 25 Jul 2013 04:15:42 +1000, Chris Angelico wrote:
>
>> If nobody had ever thought of doing a multi-format string
>> representation, I could well imagine the Python core devs debating
>> whether the cost of UTF-32 strings is worth the correctness and
>> consistency improvements... and most likely concluding that narrow
>> builds get abolished. And if any other language (eg ECMAScript) decides
>> to move from UTF-16 to UTF-32, I would wholeheartedly support the move,
>> even if it broke code to do so.
>
> Unfortunately, so long as most language designers are European-centric,
> there is going to be a lot of push-back against any attempt to fix (say)
> Javascript, or Java just for the sake of "a bunch of dead languages" in
> the SMPs. Thank goodness for emoji. Wait til the young kids start
> complaining that their emoticons and emoji are broken in Javascript, and
> eventually it will get fixed. It may take a decade, for the young kids to
> grow up and take over Javascript from the old-codgers, but it will happen.

I don't know that that'll happen like that. Emoticons aren't broken in
Javascript - you can use them just fine. You only start seeing
problems when you index into that string. People will start to wonder
why, for instance, a "500 character maximum" field deducts two from
the limit when an emoticon goes in. Example:

Type here:<br><textarea id=content oninput="showlimit(this)"></textarea>
<br>You have <span id=limit1>500</span> characters left (self.value.length).
<br>You have <span id=limit2>500</span> characters left (self.textLength).
<script>
function showlimit(self)
{
	document.getElementById("limit1").innerHTML=500-self.value.length;
	document.getElementById("limit2").innerHTML=500-self.textLength;
}
</script>

I've included an attribute documented here[1] as the "codepoint length
of the control's value", but in Chrome on Windows, it still counts
UTF-16 code units. However, I very much doubt that this will result in
language changes. People will just live with it. Chinese and Japanese
users will complain, perhaps, and the developers will write it off as
whinging, and just say "That's what the internet does". Maybe, if
you're really lucky, they'll acknowledge that "that's what JavaScript
does", but even then I doubt it'd result in language changes.
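
The same "deducts two" effect can be reproduced outside the browser. A minimal Python sketch, with made-up function names, assuming that a UTF-16 "length" is simply the number of 16-bit code units:

```python
def remaining_by_code_units(text, limit=500):
    """Count the way a UTF-16 length does: astral characters cost two."""
    return limit - len(text.encode("utf-16-le")) // 2

def remaining_by_code_points(text, limit=500):
    """Count code points, which is what len() reports on Python 3.3+."""
    return limit - len(text)

message = "hi \U0001F60A"   # three BMP characters plus one emoji
print(remaining_by_code_units(message))
print(remaining_by_code_points(message))
```

The two counters agree for any text that stays inside the BMP and drift apart by one for every astral character.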

>> To my mind, exposing UTF-16 surrogates
>> to the application is a bug to be fixed, not a feature to be maintained.
>
> This, times a thousand.
>
> It is *possible* to have non-buggy string routines using UTF-16, but the
> implementation is a lot more complex than most language developers can be
> bothered with. I'm not aware of any language that uses UTF-16 internally
> that doesn't give wrong results for surrogate pairs.

The problem isn't the underlying representation, the problem is what
gets exposed to the application. Once you've decided to expose
codepoints to the app (abstracting over your UTF-16 underlying
representation), the change to using UTF-32, or mimicking PEP 393, or
some other structure, is purely internal and an optimization. So I
doubt any language will use UTF-16 internally and UTF-32 to the app.
It'd be needlessly complex.

ChrisA

[1] https://developer.mozilla.org/en-US/docs/Web/API/HTMLTextAreaElement


#51208

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-07-25 09:22 +0000
Message-ID	<51f0ee48$0$29971$c3e8da3$5496439d@news.astraweb.com>
In reply to	#51203

On Thu, 25 Jul 2013 17:58:10 +1000, Chris Angelico wrote:

> On Thu, Jul 25, 2013 at 5:15 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> On Thu, 25 Jul 2013 04:15:42 +1000, Chris Angelico wrote:
>>
>>> If nobody had ever thought of doing a multi-format string
>>> representation, I could well imagine the Python core devs debating
>>> whether the cost of UTF-32 strings is worth the correctness and
>>> consistency improvements... and most likely concluding that narrow
>>> builds get abolished. And if any other language (eg ECMAScript)
>>> decides to move from UTF-16 to UTF-32, I would wholeheartedly support
>>> the move, even if it broke code to do so.
>>
>> Unfortunately, so long as most language designers are European-centric,
>> there is going to be a lot of push-back against any attempt to fix
>> (say) Javascript, or Java just for the sake of "a bunch of dead
>> languages" in the SMPs. Thank goodness for emoji. Wait til the young
>> kids start complaining that their emoticons and emoji are broken in
>> Javascript, and eventually it will get fixed. It may take a decade, for
>> the young kids to grow up and take over Javascript from the
>> old-codgers, but it will happen.
> 
> I don't know that that'll happen like that. Emoticons aren't broken in
> Javascript - you can use them just fine. You only start seeing problems
> when you index into that string. People will start to wonder why, for
> instance, a "500 character maximum" field deducts two from the limit
> when an emoticon goes in.

I get that. I meant *Javascript developers*, not end-users. The young 
kids today who become Javascript developers tomorrow will grow up in a 
world where they expect to be able to write band names like
"▼□■□■□■" (yes, really, I didn't make that one up) and have it just work.
Okay, all those characters are in the BMP, but emoji aren't, and I 
guarantee that even as we speak some new hipster band is trying to decide 
whether to name themselves "Smiling 😢" or "Crying 😊".

:-)

>> It is *possible* to have non-buggy string routines using UTF-16, but
>> the implementation is a lot more complex than most language developers
>> can be bothered with. I'm not aware of any language that uses UTF-16
>> internally that doesn't give wrong results for surrogate pairs.
> 
> The problem isn't the underlying representation, the problem is what
> gets exposed to the application. Once you've decided to expose
> codepoints to the app (abstracting over your UTF-16 underlying
> representation), the change to using UTF-32, or mimicking PEP 393, or
> some other structure, is purely internal and an optimization. So I doubt
> any language will use UTF-16 internally and UTF-32 to the app. It'd be
> needlessly complex.

To be honest, I don't understand what you are trying to say.

What I'm trying to say is that it is possible to use UTF-16 internally, 
but *not* assume that every code point (character) is represented by a 
single 2-byte unit. For example, the len() of a UTF-16 string should not 
be calculated by counting the number of bytes and dividing by two. You 
actually need to walk the string, inspecting each double-byte:

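# calculate length, validating surrogate pairs along the way
count = 0
inside_surrogate = False
for bb in buffer:  # get two bytes at a time
    if is_lower_surrogate(bb):  # first unit of a pair (0xD800-0xDBFF)
        if inside_surrogate:
            raise ValueError("missing upper surrogate")
        inside_surrogate = True
        continue
    if is_upper_surrogate(bb):  # second unit of a pair (0xDC00-0xDFFF)
        if inside_surrogate:
            count += 1
            inside_surrogate = False
            continue
        raise ValueError("missing lower surrogate")
    if inside_surrogate:
        # an ordinary unit arrived where the second half was expected
        raise ValueError("missing upper surrogate")
    count += 1
if inside_surrogate:
    raise ValueError("missing upper surrogate")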

Given immutable strings, you could validate the string once, on creation, 
and from then on assume they are well-formed:

# calculate length, assuming the string is well-formed:
count = 0
skip = False
for bb in buffer:  # get two bytes at a time
    if skip:
        # second half of a pair; the pair was counted at its first half
        skip = False
        continue
    if is_surrogate(bb):
        skip = True
    count += 1
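
A runnable counterpart to the two loops above, as a minimal sketch. The helper names (utf16_units, is_lead, is_trail, checked_len, fast_len) are made up for illustration, and the buffer is modelled as a plain list of 16-bit integers rather than raw bytes:

```python
def utf16_units(text):
    """Encode text and return its UTF-16 code units as a list of ints."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

def is_lead(unit):
    # First unit of a surrogate pair.
    return 0xD800 <= unit <= 0xDBFF

def is_trail(unit):
    # Second unit of a surrogate pair.
    return 0xDC00 <= unit <= 0xDFFF

def checked_len(units):
    """Count code points, raising ValueError on an unpaired surrogate."""
    count = 0
    pending_lead = False
    for unit in units:
        if pending_lead:
            if not is_trail(unit):
                raise ValueError("lead surrogate not followed by trail surrogate")
            pending_lead = False  # pair complete; it was counted at the lead
            continue
        if is_trail(unit):
            raise ValueError("trail surrogate without lead surrogate")
        if is_lead(unit):
            pending_lead = True
        count += 1
    if pending_lead:
        raise ValueError("buffer ends inside a surrogate pair")
    return count

def fast_len(units):
    """Count code points in a buffer already known to be well-formed."""
    return sum(1 for unit in units if not is_trail(unit))
```

For a well-formed buffer the two functions agree; only checked_len notices a stray surrogate.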

String operations such as slicing become much more complex once you can 
no longer assume a 1:1 relationship between code points and code units, 
whether they are 1, 2 or 4 bytes. Most (all?) language developers don't 
handle that complexity, and push responsibility for it back onto the 
coder using the language. 
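
To give a feel for the extra work slicing needs, here is a minimal sketch of slicing by code point over UTF-16 units with a plain linear scan. The names to_units, from_units and cp_slice are hypothetical, the input is assumed to be well-formed, and negative indexes are not handled:

```python
def to_units(text):
    """UTF-16 code units of text, as a list of ints."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

def from_units(units):
    """Inverse of to_units for well-formed input."""
    return b"".join(u.to_bytes(2, "little") for u in units).decode("utf-16-le")

def cp_slice(units, start, stop):
    """Return the units for code points start..stop-1 (non-negative indexes)."""
    out = []
    cp = 0  # index of the current code point
    i = 0   # index of the current code unit
    while i < len(units) and cp < stop:
        # A lead surrogate means this code point occupies two units.
        width = 2 if 0xD800 <= units[i] <= 0xDBFF else 1
        if cp >= start:
            out.extend(units[i:i + width])
        i += width
        cp += 1
    return out
```

Every slice costs a walk from the start of the buffer, which is exactly the overhead a fixed-width representation avoids.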

-- 
Steven


#51211

From	Chris Angelico <rosuav@gmail.com>
Date	2013-07-25 20:07 +1000
Message-ID	<mailman.5089.1374746869.3114.python-list@python.org>
In reply to	#51208

On Thu, Jul 25, 2013 at 7:22 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> What I'm trying to say is that it is possible to use UTF-16 internally,
> but *not* assume that every code point (character) is represented by a
> single 2-byte unit. For example, the len() of a UTF-16 string should not
> be calculated by counting the number of bytes and dividing by two. You
> actually need to walk the string, inspecting each double-byte

Anything's possible. But since underlying representations can be
changed fairly easily (relative term of course - it's a lot of work,
but it can be changed in a single release, no deprecation required or
anything), there's very little reason to continue using UTF-16
underneath. May as well switch to UTF-32 for convenience, or PEP 393
for convenience and efficiency, or maybe some other system that's
still mostly fixed-width.

ChrisA


#51170

From	Terry Reedy <tjreedy@udel.edu>
Date	2013-07-24 18:09 -0400
Message-ID	<mailman.5067.1374703769.3114.python-list@python.org>
In reply to	#51131

On 7/24/2013 2:15 PM, Chris Angelico wrote:
> On Thu, Jul 25, 2013 at 3:52 AM, Terry Reedy <tjreedy@udel.edu> wrote:

>> For my purpose, the mock Text works the same in 2.7 and 3.3+.
>
> Thanks for that report! And yes, it's going to behave exactly the same
> way, because its underlying structure is an ordered list of ordered
> lists of Unicode codepoints, ergo 3.3/PEP 393 is merely a question of
> performance. But if you put your code onto a narrow build, you'll have
> issues as seen below.

I carefully said 'For my purpose', which is to replace the tk Text 
widget. Up to 8.5, Tk's text is something like Python's narrow-build 
unicode.

If I put astral chars into the toy editor, then yes, it would not work on 
narrow builds, but would on 3.3+.

  ...

> If nobody had ever thought of doing a multi-format string
> representation, I could well imagine the Python core devs debating
> whether the cost of UTF-32 strings is worth the correctness and
> consistency improvements... and most likely concluding that narrow
> builds get abolished. And if any other language (eg ECMAScript)
> decides to move from UTF-16 to UTF-32, I would wholeheartedly support
> the move, even if it broke code to do so.

Making a UTF-16 implementation correct requires converting abstract 
'character' array indexes to concrete double byte array indexes. The 
simple O(n) method of scanning the string from the beginning for each 
index operation is too slow. When PEP 393 was being discussed, I devised 
a much faster way to do the conversion.

The key idea is to add an auxiliary array holding the abstract indexes of 
the astral chars in the abstract array. This is easily built when the 
string is created, or afterward with one linear scan (which is how I 
experimented with Python code). The length of that array is the number of 
surrogate pairs in the concrete 16-bit code unit array. Subtracting that 
number from the length of the concrete array gives the length of the 
abstract array.

Given a target index of a character in the abstract array, use the 
auxiliary array to determine k, the number of astral characters that 
precede the target character. That can be done with either an O(k) linear 
scan or an O(log k) binary search. Add k to the abstract index to get the 
corresponding index in the concrete array, since each preceding astral 
character occupies one extra 16-bit unit. When slicing a string with i0 
and i1, slice the auxiliary array with k0 and k1 and adjust the contained 
indexes downward to get the auxiliary array for the slice.
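
A rough Python sketch of the auxiliary-array idea, not the experimental code mentioned above. The function names are invented for illustration, the input is assumed to be well-formed UTF-16, and indexes are counted in 16-bit units, so each preceding astral character shifts the concrete index by one unit:

```python
from bisect import bisect_left

def encode_units(text):
    """UTF-16 code units of text, as a list of ints."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

def build_aux(units):
    """Abstract (code point) indexes of the astral characters, in order."""
    aux = []
    abstract = 0
    i = 0
    while i < len(units):
        if 0xD800 <= units[i] <= 0xDBFF:  # lead surrogate: two units
            aux.append(abstract)
            i += 2
        else:
            i += 1
        abstract += 1
    return aux

def concrete_index(aux, abstract_index):
    """Map a code point index to a code unit index via binary search."""
    k = bisect_left(aux, abstract_index)  # astral chars before the target
    return abstract_index + k

def char_at(units, aux, abstract_index):
    """Return the character at a code point index."""
    i = concrete_index(aux, abstract_index)
    unit = units[i]
    if 0xD800 <= unit <= 0xDBFF:
        trail = units[i + 1]
        return chr(0x10000 + ((unit - 0xD800) << 10) + (trail - 0xDC00))
    return chr(unit)
```

The abstract length is then len(units) - len(aux), and indexing costs a binary search over the astral characters only, so strings with no astral characters pay almost nothing.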

> To my mind, exposing UTF-16 surrogates to the application is a bug
> to be fixed, not a feature to be maintained.

It is definitely not a feature, but a proper UTF-16 implementation would 
not expose them except to codecs, just as with the PEP 393 
implementation. (In both cases, I am not counting sys.getsizeof() as 
'exposing to the application'.)

> But since we can get the best of both worlds with only
> a small amount of overhead, I really don't see why anyone should be
> objecting.

I presume you are referring to the PEP 393 1-2-4 byte implementation. 
Given how well it has been optimized, I think it was the right choice 
for Python. But a language that now uses UCS-2 or defective UTF-16 on all 
platforms might find the auxiliary array an easier fix.

-- 
Terry Jan Reedy


#51171

From	Chris Angelico <rosuav@gmail.com>
Date	2013-07-25 08:19 +1000
Message-ID	<mailman.5068.1374704365.3114.python-list@python.org>
In reply to	#51131

On Thu, Jul 25, 2013 at 8:09 AM, Terry Reedy <tjreedy@udel.edu> wrote:
> On 7/24/2013 2:15 PM, Chris Angelico wrote:
>> To my mind, exposing UTF-16 surrogates to the application is a bug
>> to be fixed, not a feature to be maintained.
>
> It is definitely not a feature, but a proper UTF-16 implementation would not
> expose them except to codecs, just as with the PEP 393 implementation. (In
> both cases, I am excluding the sys size function as 'exposing to the
> application'.)
>
>> But since we can get the best of both worlds with only
>> a small amount of overhead, I really don't see why anyone should be
>> objecting.
>
> I presume you are referring to the PEP 393 1-2-4 byte implementation. Given
> how well it has been optimized, I think it was the right choice for Python.
> But a language that now uses UCS-2 or defective UTF-16 on all platforms might
> find the auxiliary array an easier fix.
>

I'm referring here to objections like jmf's, and also to threads like this:

http://mozilla.6506.n7.nabble.com/Flexible-String-Representation-full-Unicode-for-ES6-td267585.html

According to the ECMAScript people, UTF-16 and exposing surrogates to
the application is a critical feature to be maintained. I disagree.
But it's not my language, so I'm stuck with it. (I ended up writing a
little wrapper function in C that detects unpaired surrogates, but
that still doesn't deal with the possibility that character indexing
can create a new character that was never there to start with.)
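
To illustrate that last point (this is not the C wrapper, just a small Python sketch with made-up variable names): take the first unit of one surrogate pair and the second unit of another, and the result decodes as a third character that appeared in neither string.

```python
def units(text):
    """Split text into its UTF-16 code units, each as a 2-byte chunk."""
    raw = text.encode("utf-16-le")
    return [raw[i:i + 2] for i in range(0, len(raw), 2)]

first = units("\U00012345")    # lead d808, trail df45
second = units("\U0001F60A")   # lead d83d, trail de0a

# Index-based slicing on code units can pair up halves of different characters.
spliced = (first[0] + second[1]).decode("utf-16-le")
print(hex(ord(spliced)))
```

The spliced pair is perfectly well-formed UTF-16, so a check for unpaired surrogates cannot catch it.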

ChrisA


#51172

From	Michael Torrie <torriem@gmail.com>
Date	2013-07-24 16:59 -0600
Message-ID	<mailman.5069.1374706766.3114.python-list@python.org>
In reply to	#51131

On 07/24/2013 04:19 PM, Chris Angelico wrote:
> I'm referring here to objections like jmf's, and also to threads like this:
> 
> http://mozilla.6506.n7.nabble.com/Flexible-String-Representation-full-Unicode-for-ES6-td267585.html
> 
> According to the ECMAScript people, UTF-16 and exposing surrogates to
> the application is a critical feature to be maintained. I disagree.
> But it's not my language, so I'm stuck with it. (I ended up writing a
> little wrapper function in C that detects unpaired surrogates, but
> that still doesn't deal with the possibility that character indexing
> can create a new character that was never there to start with.)

This is starting to drift off topic here now, but after reading your
comments on that post, and others' objections, I don't fully understand
why making strings simply "unicode" in javascript breaks compatibility
with older scripts.  What operations are performed on strings that
making unicode an abstract type would break?  Is it just in the input
and output of text that must be decoded and encoded?  Why should a script
care about the internal representation of unicode strings?  Is it
because the incorrect behavior of UTF-16 and the exposed surrogates (and
subsequent incorrect indexing) are actually depended on by some scripts?


#51173

From	Chris Angelico <rosuav@gmail.com>
Date	2013-07-25 09:24 +1000
Message-ID	<mailman.5070.1374708283.3114.python-list@python.org>
In reply to	#51131

On Thu, Jul 25, 2013 at 8:59 AM, Michael Torrie <torriem@gmail.com> wrote:
> I don't fully understand
> why making strings simply "unicode" in javascript breaks compatibility
> with older scripts.  What operations are performed on strings that
> making unicode an abstract type would break?

Imagine this in JavaScript and Python (apart from the fact that JS
doesn't do backslash escapes past 0x10000):

a = "asdf\U00012345qwer";
b = a[[..10];

What will this do? It depends on whether UTF-16 is used or not.
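
A small sketch of how the same string looks under the two models, assuming Python 3.3+ (where str indexes by code point) and simulating the UTF-16 view by encoding. The variable names are illustrative only:

```python
a = "asdf\U00012345qwer"

# UTF-16 view: split the encoded bytes into 16-bit code units.
raw = a.encode("utf-16-le")
utf16 = [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

code_points = len(a)            # length as seen by a code point API
code_units = len(utf16)         # length as seen by a UTF-16 code unit API

fifth_as_code_point = ord(a[4])  # the whole astral character
fifth_as_code_unit = utf16[4]    # only the lead surrogate of the pair

print(code_points, code_units)
print(hex(fifth_as_code_point), hex(fifth_as_code_unit))
```

Every index after the astral character is shifted by one between the two views, so any script that hard-codes offsets sees different results.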

ChrisA


#51188

From	Serhiy Storchaka <storchaka@gmail.com>
Date	2013-07-25 08:49 +0300
Message-ID	<mailman.5080.1374731378.3114.python-list@python.org>
In reply to	#51131

24.07.13 21:15, Chris Angelico написав(ла):
> To my mind, exposing UTF-16
> surrogates to the application is a bug to be fixed, not a feature to
> be maintained.

Python 3 uses code points from U+DC80 to U+DCFF (which are in surrogates 
area) to represent undecodable bytes with surrogateescape error handler.
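
For anyone who has not seen it in action, a short illustration using only the standard codec machinery (the sample bytes are arbitrary):

```python
raw = b"abc\xff\x80"   # the last two bytes are not valid UTF-8

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")
escaped = [hex(ord(ch)) for ch in text[3:]]

# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode("utf-8", errors="surrogateescape")

print(escaped)
print(round_trip == raw)
```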


#51194

From	Chris Angelico <rosuav@gmail.com>
Date	2013-07-25 15:58 +1000
Message-ID	<mailman.5082.1374732265.3114.python-list@python.org>
In reply to	#51131

On Thu, Jul 25, 2013 at 3:49 PM, Serhiy Storchaka <storchaka@gmail.com> wrote:
> 24.07.13 21:15, Chris Angelico написав(ла):
>
>> To my mind, exposing UTF-16
>> surrogates to the application is a bug to be fixed, not a feature to
>> be maintained.
>
>
> Python 3 uses code points from U+DC80 to U+DCFF (which are in surrogates
> area) to represent undecodable bytes with surrogateescape error handler.

That's a deliberate and conscious use of the codepoints; that's not
what I'm talking about here. Suppose you read a UTF-8 stream of bytes
from a file, and decode them into your language's standard string
type. At this point, you should be working with a string of Unicode
codepoints:

"\22\341\210\264\360\222\215\205"

-->

"\x12\u1234\U00012345"

The incoming byte stream has a length of 8, the resulting character
stream has a length of 3. Now, if the language wants to use UTF-16
internally, it's free to do so:

0012 1234 d808 df45

When I referred to exposing surrogates to the application, this is
what I'm talking about. If decoding the above byte stream results in a
length 4 string where the last two are \ud808 and \udf45, then it's
exposing them. If it's a length 3 string where the last is \U00012345,
then it's hiding them. To be honest, I don't imagine I'll ever see a
language that stores strings in UTF-16 and then exposes them to the
application as UTF-32; there's very little point. But such *is*
possible, and if it's working closely with libraries that demand
UTF-16, it might well make sense to do things that way.
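
The numbers above can be checked with a few lines of Python 3.3+ (a sketch; the variable names are mine):

```python
data = b"\22\341\210\264\360\222\215\205"   # the UTF-8 byte stream, octal escapes

text = data.decode("utf-8")                 # string of code points

# What a UTF-16 representation of the same text would hold internally.
raw16 = text.encode("utf-16-be")
units = [raw16[i:i + 2].hex() for i in range(0, len(raw16), 2)]

print(len(data), len(text))
print(" ".join(units))
```

Here Python hides the surrogates: len(text) counts code points, and the surrogate pair only shows up once the string is explicitly encoded.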

ChrisA


#51217

From	Jeremy Sanders <jeremy@jeremysanders.net>
Date	2013-07-25 14:36 +0100
Message-ID	<mailman.5094.1374759404.3114.python-list@python.org>
In reply to	#51131

wxjmfauth@gmail.com wrote:

> Short example. Writing an editor with something like the
> FSR is simply impossible (properly).

http://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html#Text-Representations

"To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are 
codepoints of text characters within buffers and strings. Rather, Emacs uses a 
variable-length internal representation of characters, that stores each 
character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of 
its codepoint[1]. For example, any ASCII character takes up only 1 byte, a 
Latin-1 character takes up 2 bytes, etc. We call this representation of text 
multibyte.

...

[1] This internal representation is based on one of the encodings defined by 
the Unicode Standard, called UTF-8, for representing any Unicode codepoint, but 
Emacs extends UTF-8 to represent the additional codepoints it uses for raw 
8-bit bytes and characters not unified with Unicode.

"

Jeremy


#51233

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-07-25 15:26 +0000
Message-ID	<51f14395$0$29971$c3e8da3$5496439d@news.astraweb.com>
In reply to	#51217

On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote:

> wxjmfauth@gmail.com wrote:
> 
>> Short example. Writing an editor with something like the FSR is simply
>> impossible (properly).
> 
> http://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html#Text-Representations
> 
> "To conserve memory, Emacs does not hold fixed-length 22-bit numbers
> that are codepoints of text characters within buffers and strings.
> Rather, Emacs uses a variable-length internal representation of
> characters, that stores each character as a sequence of 1 to 5 8-bit
> bytes, depending on the magnitude of its codepoint[1]. For example, any
> ASCII character takes up only 1 byte, a Latin-1 character takes up 2
> bytes, etc. We call this representation of text multibyte.

Well, you've just proven what Vim users have always suspected: Emacs 
doesn't really exist.


> [1] This internal representation is based on one of the encodings
> defined by the Unicode Standard, called UTF-8, for representing any
> Unicode codepoint, but Emacs extends UTF-8 to represent the additional
> codepoints it uses for raw 8-bit bytes and characters not unified with
> Unicode.
> "

Do you know what those characters not unified with Unicode are? Is there 
a list somewhere? I've read all of the pages from here to no avail:

http://www.gnu.org/software/emacs/manual/html_node/elisp/Non_002dASCII-Characters.html



-- 
Steven


#51234

From	Chris Angelico <rosuav@gmail.com>
Date	2013-07-26 01:36 +1000
Message-ID	<mailman.5106.1374766576.3114.python-list@python.org>
In reply to	#51233

On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote:
>> "To conserve memory, Emacs does not hold fixed-length 22-bit numbers
>> that are codepoints of text characters within buffers and strings.
>> Rather, Emacs uses a variable-length internal representation of
>> characters, that stores each character as a sequence of 1 to 5 8-bit
>> bytes, depending on the magnitude of its codepoint[1]. For example, any
>> ASCII character takes up only 1 byte, a Latin-1 character takes up 2
>> bytes, etc. We call this representation of text multibyte.
>
> Well, you've just proven what Vim users have always suspected: Emacs
> doesn't really exist.

... lolwut?

ChrisA


#51247

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-07-25 17:18 +0000
Message-ID	<51f15e03$0$29971$c3e8da3$5496439d@news.astraweb.com>
In reply to	#51234

On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote:

> On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote:
>>> "To conserve memory, Emacs does not hold fixed-length 22-bit numbers
>>> that are codepoints of text characters within buffers and strings.
>>> Rather, Emacs uses a variable-length internal representation of
>>> characters, that stores each character as a sequence of 1 to 5 8-bit
>>> bytes, depending on the magnitude of its codepoint[1]. For example,
>>> any ASCII character takes up only 1 byte, a Latin-1 character takes up
>>> 2 bytes, etc. We call this representation of text multibyte.
>>
>> Well, you've just proven what Vim users have always suspected: Emacs
>> doesn't really exist.
> 
> ... lolwut?


JMF has explained that it is impossible, impossible I say!, to write an 
editor using a flexible string representation. Since Emacs uses such a 
flexible string representation, Emacs is impossible, and therefore Emacs 
doesn't exist.

QED.



-- 
Steven


#51248

From	Chris Angelico <rosuav@gmail.com>
Date	2013-07-26 03:27 +1000
Message-ID	<mailman.5113.1374773662.3114.python-list@python.org>
In reply to	#51247

On Fri, Jul 26, 2013 at 3:18 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote:
>
>> On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano
>> <steve+comp.lang.python@pearwood.info> wrote:
>>> On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote:
>>>> "To conserve memory, Emacs does not hold fixed-length 22-bit numbers
>>>> that are codepoints of text characters within buffers and strings.
>>>> Rather, Emacs uses a variable-length internal representation of
>>>> characters, that stores each character as a sequence of 1 to 5 8-bit
>>>> bytes, depending on the magnitude of its codepoint[1]. For example,
>>>> any ASCII character takes up only 1 byte, a Latin-1 character takes up
>>>> 2 bytes, etc. We call this representation of text multibyte.
>>>
>>> Well, you've just proven what Vim users have always suspected: Emacs
>>> doesn't really exist.
>>
>> ... lolwut?
>
>
> JMF has explained that it is impossible, impossible I say!, to write an
> editor using a flexible string representation. Since Emacs uses such a
> flexible string representation, Emacs is impossible, and therefore Emacs
> doesn't exist.
>
> QED.

Quad Error Demonstrated.

I never got past the level of Canis Latinicus in debating class.

ChrisA


#51260

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2013-07-25 15:45 -0500
Message-ID	<mailman.5121.1374785646.3114.python-list@python.org>
In reply to	#51247

On Thu, Jul 25, 2013 at 12:18 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote:
>
>> On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano
>> <steve+comp.lang.python@pearwood.info> wrote:
>>> On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote:
>>>> "To conserve memory, Emacs does not hold fixed-length 22-bit numbers
>>>> that are codepoints of text characters within buffers and strings.
>>>> Rather, Emacs uses a variable-length internal representation of
>>>> characters, that stores each character as a sequence of 1 to 5 8-bit
>>>> bytes, depending on the magnitude of its codepoint[1]. For example,
>>>> any ASCII character takes up only 1 byte, a Latin-1 character takes up
>>>> 2 bytes, etc. We call this representation of text multibyte.
>>>
>>> Well, you've just proven what Vim users have always suspected: Emacs
>>> doesn't really exist.
>>
>> ... lolwut?
>
>
> JMF has explained that it is impossible, impossible I say!, to write an
> editor using a flexible string representation. Since Emacs uses such a
> flexible string representation, Emacs is impossible, and therefore Emacs
> doesn't exist.
>
> QED.

Except that the described representation used by Emacs is a variant of
UTF-8, not an FSR.  It doesn't have three different possible encodings
for the letter 'a' depending on what other characters happen to be in
the string.
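
The "three encodings for 'a'" effect can be observed from Python itself. This is a sketch that assumes CPython 3.3 or later and leans on sys.getsizeof, which is implementation-specific; bytes_per_char is a made-up helper name:

```python
import sys

def bytes_per_char(widest, n=1000):
    """Estimate storage per 'a' in a string whose widest character is `widest`.

    Two strings that differ only in the number of 'a's share the same
    header overhead, so the size difference divided by n is the per-character
    cost. CPython-specific; other implementations may report something else.
    """
    short = "a" + widest
    long = "a" * (n + 1) + widest
    return (sys.getsizeof(long) - sys.getsizeof(short)) / n

for widest in ("a", "\u20ac", "\U0001F60A"):
    print(hex(ord(widest)), bytes_per_char(widest))
```

The same letter costs a different number of bytes depending on the widest character elsewhere in the string, which is the property that distinguishes the FSR from a UTF-8 style encoding.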

As I understand it, jmf would be perfectly happy if Python used UTF-8
(or presumably the Emacs variant) as its internal string
representation.
