| Started by | Devyn Collier Johnson <devyncjohnson@gmail.com> |
|---|---|
| First post | 2013-07-11 19:44 -0400 |
| Last post | 2013-07-18 13:17 -0700 |
| Articles | 20 on this page of 136 — 25 participants |
RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-11 19:44 -0400
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-12 02:23 -0700
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-12 19:27 +1000
Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-12 10:39 +0100
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-12 19:40 +1000
Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-12 06:45 -0400
Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-12 16:59 +0100
Re: RE Module Performance Peter Otten <__peter__@web.de> - 2013-07-12 18:15 +0200
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-13 02:21 +1000
Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-12 13:58 -0400
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-13 05:37 +0000
Re: RE Module Performance 88888 Dihedral <dihedral88888@gmail.com> - 2013-07-14 11:17 -0700
Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-15 06:06 -0400
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-15 12:36 +0000
Dihedral Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-15 08:52 -0400
Re: Dihedral Joel Goldstick <joel.goldstick@gmail.com> - 2013-07-15 09:03 -0400
Re: Dihedral Wayne Werner <wayne@waynewerner.com> - 2013-07-15 17:43 -0500
Re: Dihedral Fábio Santos <fabiosantosart@gmail.com> - 2013-07-15 23:54 +0100
Re: Dihedral Chris Angelico <rosuav@gmail.com> - 2013-07-16 08:59 +1000
Re: Dihedral Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-16 16:06 +1000
Re: Dihedral Stefan Behnel <stefan_ml@behnel.de> - 2013-07-24 20:08 +0200
Re: Dihedral Chris Angelico <rosuav@gmail.com> - 2013-07-25 04:23 +1000
Re: Dihedral Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-07-24 20:15 -0400
Re: RE Module Performance Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-13 08:16 +1000
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-12 17:13 -0600
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-24 06:40 -0700
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-24 23:48 +1000
Re: RE Module Performance David Hutto <dwightdhutto@gmail.com> - 2013-07-24 10:17 -0400
Re: RE Module Performance David Hutto <dwightdhutto@gmail.com> - 2013-07-24 10:19 -0400
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 00:34 +1000
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 07:02 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 17:39 +1000
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-24 08:47 -0600
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-25 02:27 -0700
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 20:14 +1000
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-25 12:07 -0700
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-26 05:18 +1000
RE: RE Module Performance "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2013-07-25 19:30 +0000
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-25 21:06 -0600
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-24 09:00 -0600
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 05:56 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 00:56 +1000
Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-24 13:52 -0400
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 04:15 +1000
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 07:15 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 17:58 +1000
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 09:22 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 20:07 +1000
Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-24 18:09 -0400
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 08:19 +1000
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-24 16:59 -0600
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 09:24 +1000
Re: RE Module Performance Serhiy Storchaka <storchaka@gmail.com> - 2013-07-25 08:49 +0300
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 15:58 +1000
Re: RE Module Performance Jeremy Sanders <jeremy@jeremysanders.net> - 2013-07-25 14:36 +0100
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 15:26 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-26 01:36 +1000
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 17:18 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-26 03:27 +1000
Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-25 15:45 -0500
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-26 02:48 +0000
Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-25 21:20 -0600
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 06:36 -0700
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 08:46 -0700
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-27 06:28 +0000
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-27 03:37 +0000
Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-26 22:12 -0600
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-27 05:04 +0000
Re: RE Module Performance Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-07-27 12:13 -0400
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 06:19 -0700
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-25 21:09 -0600
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 06:21 -0700
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-26 20:05 -0600
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-27 11:21 -0700
Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-27 21:53 -0600
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-28 11:13 -0700
Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-28 20:04 +0100
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-28 12:30 -0700
Re: RE Module Performance Lele Gaifax <lele@metapensiero.it> - 2013-07-28 22:45 +0200
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 22:01 +0200
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-30 07:01 -0700
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-30 16:38 +0200
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-30 15:45 +0100
Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-30 17:13 +0100
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-30 18:39 +0200
Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-30 18:14 +0100
Re: RE Module Performance Neil Hodgson <nhodgson@iinet.net.au> - 2013-07-31 13:09 +1000
Re: RE Module Performance Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-31 03:27 +1000
Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-30 18:40 +0100
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-30 20:19 +0200
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-30 12:09 -0700
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-30 21:04 +0100
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-30 21:54 -0600
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-31 05:45 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-31 08:17 +0100
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-31 13:15 -0700
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-31 21:41 +0100
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-31 10:11 +0200
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-31 01:32 -0700
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-31 10:59 +0200
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-31 08:44 -0600
Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-30 17:05 -0400
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-30 21:30 -0600
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-31 09:23 +0200
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-31 08:27 -0600
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 10:45 +0200
FSR and unicode compliance - was Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-28 09:52 -0600
Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-28 12:23 -0700
Re: FSR and unicode compliance - was Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-28 20:44 +0100
Re: FSR and unicode compliance - was Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 21:55 +0200
Re: FSR and unicode compliance - was Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-28 20:52 +0000
Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 04:43 -0700
Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-29 12:57 +0100
Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 05:56 -0700
Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 07:20 -0700
Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-29 15:49 +0100
Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 09:31 -0700
Re: FSR and unicode compliance - was Re: RE Module Performance Heiko Wundram <modelnine@modelnine.org> - 2013-07-29 14:06 +0200
Re: FSR and unicode compliance - was Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-29 08:43 -0400
Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-28 18:03 +0100
Re: FSR and unicode compliance - was Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-28 13:36 -0400
Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 06:36 -0700
Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-28 19:03 +0100
Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-28 19:19 +0100
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-28 19:29 +0100
Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-28 15:06 -0400
Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-28 23:14 +0100
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 20:51 +0200
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-29 00:07 +0100
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-26 22:38 +0200
Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-25 09:44 -0400
Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-25 15:53 -0500
Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-13 00:16 +0100
Re: RE Module Performance Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-14 05:34 +1000
Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-16 06:30 -0400
Re: RE Module Performance 88888 Dihedral <dihedral88888@gmail.com> - 2013-07-18 13:17 -0700
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-07-25 05:56 +0000 |
| Message-ID | <51f0be1e$0$29971$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #51139 |
On Wed, 24 Jul 2013 09:00:39 -0600, Michael Torrie wrote about JMF:

> His most recent argument that Python should use UTF as a representation
> is very strange to be honest.

He's not arguing for anything, he is just hating on anything that gives
even the tiniest benefit to ASCII users. This isn't about Python 3.3
hurting non-ASCII users, because that is demonstrably untrue: they are
*better off* in Python 3.3. This is about denying even a tiny benefit to
ASCII users.

In Python 3.3, non-ASCII users have these advantages compared to previous
versions:

- strings will usually take less memory, and aside from trivial changes
  to the object header, they never take more memory than a wide build
  would use;

- consequently nearly all objects will take less memory (especially
  builtins and standard library objects, which are all ASCII), since
  objects contain dozens of internal strings (attribute and method names
  in __dict__, class name, etc.);

- consequently whole-application benchmarks show most applications will
  use significantly less memory, which leads to faster speeds;

- you cannot break surrogate pairs apart by accident, which you can do in
  narrow builds;

- in previous versions, code which works when run in a wide build may
  fail in a narrow build, but that is no longer an issue since the
  distinction between wide and narrow builds is gone;

- Latin1 users, including JMF himself, will likewise see memory savings,
  since Latin1 strings will take half the size of narrow builds and a
  quarter the size of wide builds.

The cost of all these benefits is a small overhead when creating a string
in the first place, and some purely internal added complication to the
string implementation.

I'm the first to argue against complication unless there is a
corresponding benefit. This is a case where the benefit has proven itself
doubly: Python 3.3's Unicode implementation is *more correct* than
before, and it uses less memory to do so.

> The cons of UTF are apparent and widely
> known. The main con is that UTF strings are O(n) for indexing a
> position within the string.

Not so for UTF-32.

--
Steven
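The memory claims in this post can be checked on one's own interpreter. The following is a minimal sketch, assuming CPython 3.3 or later (where PEP 393 applies); the sample strings and the `samples`/`sizes` names are illustrative, and exact byte counts differ between versions, so none are shown here:

```python
import sys

# Four 1000-character strings whose widest code point needs 1, 1, 2 and 4 bytes.
samples = {
    "ASCII": "a" * 1000,
    "Latin-1": "\u00e9" * 1000,      # é
    "BMP": "\u20ac" * 1000,          # €
    "astral": "\U0001F60A" * 1000,   # 😊
}

# sys.getsizeof reports the size of the string object in bytes, header included.
sizes = {name: sys.getsizeof(text) for name, text in samples.items()}

for name, size in sizes.items():
    print(f"{name:8} {size} bytes")
```

On such an interpreter the ASCII and Latin-1 strings should come out at roughly one byte per character, the BMP string at roughly two, and the astral string at roughly four.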
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-07-25 00:56 +1000 |
| Message-ID | <mailman.5043.1374678213.3114.python-list@python.org> |
| In reply to | #51131 |
On Thu, Jul 25, 2013 at 12:47 AM, Michael Torrie <torriem@gmail.com> wrote:
> On 07/24/2013 07:40 AM, wxjmfauth@gmail.com wrote:
>> Sorry, you are not understanding Unicode. What is a Unicode
>> Transformation Format (UTF), what is the goal of a UTF and
>> why it is important for an implementation to work with a UTF.
>
> Really? Enlighten me.
>
> Personally, I would never use UTF as a representation *in memory* for a
> unicode string if it were up to me. Why? Because UTF characters are
> not uniform in byte width so accessing positions within the string is
> terribly slow and has to always be done by starting at the beginning of
> the string. That's at minimum O(n) compared to FSR's O(1). Surely you
> understand this. Do you dispute this fact?

Take care here; UTF is a general term for Unicode Transformation Formats,
of which one (UTF-32) is fixed-width. Every other UTF-n is variable
width, though, so your point still stands. UTF-32 is the basis for
Python's FSR.

ChrisA
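The fixed-width versus variable-width distinction is easy to see by encoding one character at a time. This is a small illustrative sketch; the helper name `byte_widths` and the sample text are assumptions made for the example, not anything from the thread:

```python
def byte_widths(text, encoding):
    """Return the number of bytes each character of text needs in encoding."""
    return [len(ch.encode(encoding)) for ch in text]

# One character from each width class: ASCII, Latin-1, BMP, astral.
text = "a\u00e9\u20ac\U0001F60A"   # a, é, €, 😊

# The "-le" codec variants are used so that no byte-order mark is added.
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    print(encoding, byte_widths(text, encoding))

# utf-8 [1, 2, 3, 4]
# utf-16-le [2, 2, 2, 4]
# utf-32-le [4, 4, 4, 4]
```

Only UTF-32 gives every character the same width, which is what makes direct indexing by position possible without scanning.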
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2013-07-24 13:52 -0400 |
| Message-ID | <mailman.5056.1374688374.3114.python-list@python.org> |
| In reply to | #51131 |
On 7/24/2013 11:00 AM, Michael Torrie wrote:
> On 07/24/2013 08:34 AM, Chris Angelico wrote:
>> Frankly, Python's strings are a *terrible* internal representation
>> for an editor widget - not because of PEP 393, but simply because
>> they are immutable, and every keypress would result in a rebuilding
>> of the string. On the flip side, I could quite plausibly imagine
>> using a list of strings;

I used exactly this, a list of strings, for a Python-coded text-only
mock editor to replace the tk Text widget in idle tests. It works fine
for the purpose. For small test texts, the inefficiency of immutable
strings is not relevant.

Tk apparently uses a C-coded btree rather than a Python list. All
details are hidden, unless one finds and reads the source ;-), but it
uses C arrays rather than Python strings.

>> In this usage, the FSR is beneficial, as it's possible to have
>> different strings at different widths.

For my purpose, the mock Text works the same in 2.7 and 3.3+.

> Maybe, but simply thinking logically, FSR and UCS-4 are equivalent in
> pros and cons,

They both have the pro that indexing is direct *and correct*. The cons
are different.

> and the cons of using UCS-2 (the old narrow builds) are
> well known. UCS-2 simply cannot represent all of unicode correctly.

Python's narrow builds, at least for several releases, were in between
UCS-2 and UTF-16 in that they used surrogates to represent all unicodes
but did not correct indexing for the presence of astral chars. This is a
nuisance for those who do use astral chars, such as emotes and CJK name
chars, on an everyday basis.

--
Terry Jan Reedy
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-07-25 04:15 +1000 |
| Message-ID | <mailman.5059.1374689751.3114.python-list@python.org> |
| In reply to | #51131 |
On Thu, Jul 25, 2013 at 3:52 AM, Terry Reedy <tjreedy@udel.edu> wrote:
> On 7/24/2013 11:00 AM, Michael Torrie wrote:
>>
>> On 07/24/2013 08:34 AM, Chris Angelico wrote:
>>>
>>> Frankly, Python's strings are a *terrible* internal representation
>>> for an editor widget - not because of PEP 393, but simply because
>>> they are immutable, and every keypress would result in a rebuilding
>>> of the string. On the flip side, I could quite plausibly imagine
>>> using a list of strings;
>
> I used exactly this, a list of strings, for a Python-coded text-only mock
> editor to replace the tk Text widget in idle tests. It works fine for the
> purpose. For small test texts, the inefficiency of immutable strings is not
> relevant.
>
> Tk apparently uses a C-coded btree rather than a Python list. All details
> are hidden, unless one finds and reads the source ;-), but it uses C
> arrays rather than Python strings.
>
>>> In this usage, the FSR is beneficial, as it's possible to have
>>> different strings at different widths.
>
> For my purpose, the mock Text works the same in 2.7 and 3.3+.

Thanks for that report! And yes, it's going to behave exactly the same
way, because its underlying structure is an ordered list of ordered
lists of Unicode codepoints, ergo 3.3/PEP 393 is merely a question of
performance. But if you put your code onto a narrow build, you'll have
issues as seen below.

>> Maybe, but simply thinking logically, FSR and UCS-4 are equivalent in
>> pros and cons,
>
> They both have the pro that indexing is direct *and correct*. The cons are
> different.

They're close enough, though. It's simply a performance tradeoff - use
the memory all the time, or take a bit of overhead to give yourself
the option of using less memory. The difference is negligible compared
to...

>> and the cons of using UCS-2 (the old narrow builds) are
>> well known. UCS-2 simply cannot represent all of unicode correctly.
>
> Python's narrow builds, at least for several releases, were in between UCS-2
> and UTF-16 in that they used surrogates to represent all unicodes but did
> not correct indexing for the presence of astral chars. This is a nuisance
> for those who do use astral chars, such as emotes and CJK name chars, on an
> everyday basis.

... this.

If nobody had ever thought of doing a multi-format string
representation, I could well imagine the Python core devs debating
whether the cost of UTF-32 strings is worth the correctness and
consistency improvements... and most likely concluding that narrow
builds get abolished. And if any other language (eg ECMAScript)
decides to move from UTF-16 to UTF-32, I would wholeheartedly support
the move, even if it broke code to do so. To my mind, exposing UTF-16
surrogates to the application is a bug to be fixed, not a feature to
be maintained. But since we can get the best of both worlds with only
a small amount of overhead, I really don't see why anyone should be
objecting.

ChrisA
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-07-25 07:15 +0000 |
| Message-ID | <51f0d0a0$0$29971$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #51159 |
On Thu, 25 Jul 2013 04:15:42 +1000, Chris Angelico wrote:

> If nobody had ever thought of doing a multi-format string
> representation, I could well imagine the Python core devs debating
> whether the cost of UTF-32 strings is worth the correctness and
> consistency improvements... and most likely concluding that narrow
> builds get abolished. And if any other language (eg ECMAScript) decides
> to move from UTF-16 to UTF-32, I would wholeheartedly support the move,
> even if it broke code to do so.

Unfortunately, so long as most language designers are European-centric,
there is going to be a lot of push-back against any attempt to fix (say)
Javascript, or Java just for the sake of "a bunch of dead languages" in
the SMPs. Thank goodness for emoji. Wait til the young kids start
complaining that their emoticons and emoji are broken in Javascript, and
eventually it will get fixed. It may take a decade, for the young kids to
grow up and take over Javascript from the old-codgers, but it will happen.

> To my mind, exposing UTF-16 surrogates
> to the application is a bug to be fixed, not a feature to be maintained.

This, times a thousand.

It is *possible* to have non-buggy string routines using UTF-16, but the
implementation is a lot more complex than most language developers can be
bothered with. I'm not aware of any language that uses UTF-16 internally
that doesn't give wrong results for surrogate pairs.

--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-07-25 17:58 +1000 |
| Message-ID | <mailman.5084.1374739093.3114.python-list@python.org> |
| In reply to | #51200 |
On Thu, Jul 25, 2013 at 5:15 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Thu, 25 Jul 2013 04:15:42 +1000, Chris Angelico wrote:
>
>> If nobody had ever thought of doing a multi-format string
>> representation, I could well imagine the Python core devs debating
>> whether the cost of UTF-32 strings is worth the correctness and
>> consistency improvements... and most likely concluding that narrow
>> builds get abolished. And if any other language (eg ECMAScript) decides
>> to move from UTF-16 to UTF-32, I would wholeheartedly support the move,
>> even if it broke code to do so.
>
> Unfortunately, so long as most language designers are European-centric,
> there is going to be a lot of push-back against any attempt to fix (say)
> Javascript, or Java just for the sake of "a bunch of dead languages" in
> the SMPs. Thank goodness for emoji. Wait til the young kids start
> complaining that their emoticons and emoji are broken in Javascript, and
> eventually it will get fixed. It may take a decade, for the young kids to
> grow up and take over Javascript from the old-codgers, but it will happen.
I don't know that that'll happen like that. Emoticons aren't broken in
Javascript - you can use them just fine. You only start seeing
problems when you index into that string. People will start to wonder
why, for instance, a "500 character maximum" field deducts two from
the limit when an emoticon goes in. Example:
Type here:<br><textarea id=content oninput="showlimit(this)"></textarea>
<br>You have <span id=limit1>500</span> characters left (self.value.length).
<br>You have <span id=limit2>500</span> characters left (self.textLength).
<script>
function showlimit(self)
{
    document.getElementById("limit1").innerHTML=500-self.value.length;
    document.getElementById("limit2").innerHTML=500-self.textLength;
}
</script>
I've included an attribute documented here[1] as the "codepoint length
of the control's value", but in Chrome on Windows, it still counts
UTF-16 code units. However, I very much doubt that this will result in
language changes. People will just live with it. Chinese and Japanese
users will complain, perhaps, and the developers will write it off as
whinging, and just say "That's what the internet does". Maybe, if
you're really lucky, they'll acknowledge that "that's what JavaScript
does", but even then I doubt it'd result in language changes.
>> To my mind, exposing UTF-16 surrogates
>> to the application is a bug to be fixed, not a feature to be maintained.
>
> This, times a thousand.
>
> It is *possible* to have non-buggy string routines using UTF-16, but the
> implementation is a lot more complex than most language developers can be
> bothered with. I'm not aware of any language that uses UTF-16 internally
> that doesn't give wrong results for surrogate pairs.
The problem isn't the underlying representation, the problem is what
gets exposed to the application. Once you've decided to expose
codepoints to the app (abstracting over your UTF-16 underlying
representation), the change to using UTF-32, or mimicking PEP 393, or
some other structure, is purely internal and an optimization. So I
doubt any language will use UTF-16 internally and UTF-32 to the app.
It'd be needlessly complex.
ChrisA
[1] https://developer.mozilla.org/en-US/docs/Web/API/HTMLTextAreaElement
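The "deducts two from the limit" effect can be reproduced from Python by counting UTF-16 code units, which is what JavaScript's string `length` reports. A minimal sketch; the helper name `utf16_length`, the `LIMIT` constant and the sample message are assumptions made for illustration:

```python
def utf16_length(text):
    """Count UTF-16 code units, the quantity JavaScript's String.length reports."""
    # utf-16-le adds no byte-order mark, and every code unit is two bytes.
    return len(text.encode("utf-16-le")) // 2

LIMIT = 500  # the hypothetical "500 character maximum" field

message = "hello \U0001F60A"   # "hello " followed by one emoji outside the BMP

print(len(message))                    # code points: 7
print(utf16_length(message))           # UTF-16 code units: 8
print(LIMIT - utf16_length(message))   # remaining, counted in code units: 492
```

Python 3.3+ counts the emoji as one character, while a code-unit count charges it as two.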
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-07-25 09:22 +0000 |
| Message-ID | <51f0ee48$0$29971$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #51203 |
On Thu, 25 Jul 2013 17:58:10 +1000, Chris Angelico wrote:
> On Thu, Jul 25, 2013 at 5:15 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> On Thu, 25 Jul 2013 04:15:42 +1000, Chris Angelico wrote:
>>
>>> If nobody had ever thought of doing a multi-format string
>>> representation, I could well imagine the Python core devs debating
>>> whether the cost of UTF-32 strings is worth the correctness and
>>> consistency improvements... and most likely concluding that narrow
>>> builds get abolished. And if any other language (eg ECMAScript)
>>> decides to move from UTF-16 to UTF-32, I would wholeheartedly support
>>> the move, even if it broke code to do so.
>>
>> Unfortunately, so long as most language designers are European-centric,
>> there is going to be a lot of push-back against any attempt to fix
>> (say) Javascript, or Java just for the sake of "a bunch of dead
>> languages" in the SMPs. Thank goodness for emoji. Wait til the young
>> kids start complaining that their emoticons and emoji are broken in
>> Javascript, and eventually it will get fixed. It may take a decade, for
>> the young kids to grow up and take over Javascript from the
>> old-codgers, but it will happen.
>
> I don't know that that'll happen like that. Emoticons aren't broken in
> Javascript - you can use them just fine. You only start seeing problems
> when you index into that string. People will start to wonder why, for
> instance, a "500 character maximum" field deducts two from the limit
> when an emoticon goes in.
I get that. I meant *Javascript developers*, not end-users. The young
kids today who become Javascript developers tomorrow will grow up in a
world where they expect to be able to write band names like
"▼□■□■□■" (yes, really, I didn't make that one up) and have it just work.
Okay, all those characters are in the BMP, but emoji aren't, and I
guarantee that even as we speak some new hipster band is trying to decide
whether to name themselves "Smiling 😢" or "Crying 😊".
:-)
>> It is *possible* to have non-buggy string routines using UTF-16, but
>> the implementation is a lot more complex than most language developers
>> can be bothered with. I'm not aware of any language that uses UTF-16
>> internally that doesn't give wrong results for surrogate pairs.
>
> The problem isn't the underlying representation, the problem is what
> gets exposed to the application. Once you've decided to expose
> codepoints to the app (abstracting over your UTF-16 underlying
> representation), the change to using UTF-32, or mimicking PEP 393, or
> some other structure, is purely internal and an optimization. So I doubt
> any language will use UTF-16 internally and UTF-32 to the app. It'd be
> needlessly complex.
To be honest, I don't understand what you are trying to say.
What I'm trying to say is that it is possible to use UTF-16 internally,
but *not* assume that every code point (character) is represented by a
single 2-byte unit. For example, the len() of a UTF-16 string should not
be calculated by counting the number of bytes and dividing by two. You
actually need to walk the string, inspecting each double-byte:
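# calculate length, validating surrogate pairs along the way
# (here a "lower" surrogate is the first unit of a pair, 0xD800-0xDBFF,
#  and an "upper" surrogate is the second unit, 0xDC00-0xDFFF)
count = 0
inside_surrogate = False
for bb in buffer:  # get two bytes at a time
    if is_lower_surrogate(bb):
        if inside_surrogate:
            # two first-halves in a row
            raise ValueError("missing upper surrogate")
        inside_surrogate = True
        continue
    if is_upper_surrogate(bb):
        if inside_surrogate:
            count += 1  # a complete pair is one code point
            inside_surrogate = False
            continue
        raise ValueError("missing lower surrogate")
    if inside_surrogate:
        break  # first half followed by an ordinary unit; reported below
    count += 1
if inside_surrogate:
    raise ValueError("missing upper surrogate")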
Given immutable strings, you could validate the string once, on creation,
and from then on assume they are well-formed:
# calculate length, assuming the string is well-formed:
count = 0
skip = False
for bb in buffer:  # get two bytes at a time
    if skip:
        # second half of a surrogate pair, already counted
        skip = False
        continue
    if is_surrogate(bb):
        skip = True  # first half of a pair: count it once, skip the next unit
    count += 1
String operations such as slicing become much more complex once you can
no longer assume a 1:1 relationship between code points and code units,
whether they are 1, 2 or 4 bytes. Most (all?) language developers don't
handle that complexity, and push responsibility for it back onto the
coder using the language.
--
Steven
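For readers who want to run the idea, here is a self-contained sketch of the validating length calculation over real UTF-16 data. It is an illustration under stated assumptions, not code from the thread: the input is taken to be little-endian UTF-16 bytes without a byte-order mark, the helper names are hypothetical, and the standard terms "lead" (0xD800-0xDBFF) and "trail" (0xDC00-0xDFFF) are used for the two halves of a pair.

```python
import struct

def utf16_code_units(data):
    """Split little-endian UTF-16 bytes into a tuple of 16-bit code units."""
    if len(data) % 2:
        raise ValueError("odd number of bytes")
    return struct.unpack("<%dH" % (len(data) // 2), data)

def is_lead_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF

def is_trail_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF

def count_code_points(data):
    """Count code points in UTF-16-LE bytes, rejecting unpaired surrogates."""
    count = 0
    expecting_trail = False
    for unit in utf16_code_units(data):
        if expecting_trail:
            if not is_trail_surrogate(unit):
                raise ValueError("lead surrogate without trail surrogate")
            expecting_trail = False   # the pair was counted at its lead unit
        elif is_lead_surrogate(unit):
            expecting_trail = True
            count += 1
        elif is_trail_surrogate(unit):
            raise ValueError("trail surrogate without lead surrogate")
        else:
            count += 1
    if expecting_trail:
        raise ValueError("input ends inside a surrogate pair")
    return count

encoded = "a\U0001F60Ab".encode("utf-16-le")
print(len(encoded) // 2)            # code units: 4
print(count_code_points(encoded))   # code points: 3
```

The walk is still O(n) in the number of code units, which is the cost the surrounding discussion is about.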
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-07-25 20:07 +1000 |
| Message-ID | <mailman.5089.1374746869.3114.python-list@python.org> |
| In reply to | #51208 |
On Thu, Jul 25, 2013 at 7:22 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> What I'm trying to say is that it is possible to use UTF-16 internally,
> but *not* assume that every code point (character) is represented by a
> single 2-byte unit. For example, the len() of a UTF-16 string should not
> be calculated by counting the number of bytes and dividing by two. You
> actually need to walk the string, inspecting each double-byte

Anything's possible. But since underlying representations can be
changed fairly easily (relative term of course - it's a lot of work,
but it can be changed in a single release, no deprecation required or
anything), there's very little reason to continue using UTF-16
underneath. May as well switch to UTF-32 for convenience, or PEP 393
for convenience and efficiency, or maybe some other system that's
still mostly fixed-width.

ChrisA
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2013-07-24 18:09 -0400 |
| Message-ID | <mailman.5067.1374703769.3114.python-list@python.org> |
| In reply to | #51131 |
On 7/24/2013 2:15 PM, Chris Angelico wrote:
> On Thu, Jul 25, 2013 at 3:52 AM, Terry Reedy <tjreedy@udel.edu> wrote:
>> For my purpose, the mock Text works the same in 2.7 and 3.3+.
>
> Thanks for that report! And yes, it's going to behave exactly the same
> way, because its underlying structure is an ordered list of ordered
> lists of Unicode codepoints, ergo 3.3/PEP 393 is merely a question of
> performance. But if you put your code onto a narrow build, you'll have
> issues as seen below.

I carefully said 'For my purpose', which is to replace the tk Text widget. Up to 8.5, Tk's text is something like Python's narrow-build unicode. If I put astral chars into the toy editor, then yes, it would not work on narrow builds, but would on 3.3+.

...

> If nobody had ever thought of doing a multi-format string
> representation, I could well imagine the Python core devs debating
> whether the cost of UTF-32 strings is worth the correctness and
> consistency improvements... and most likely concluding that narrow
> builds get abolished. And if any other language (eg ECMAScript)
> decides to move from UTF-16 to UTF-32, I would wholeheartedly support
> the move, even if it broke code to do so.

Making a UTF-16 implementation correct requires converting abstract 'character' array indexes to concrete double byte array indexes. The simple O(n) method of scanning the string from the beginning for each index operation is too slow. When PEP393 was being discussed, I devised a much faster way to do the conversion.

The key idea is to add an auxiliary array of the abstract indexes of the astral chars in the abstract array. This is easily created when the string is created and can be done afterward with one linear scan (which is how I experimented with Python code). The length of that array is the number of surrogate pairs in the concrete 16-bit codepoint array. Subtracting that number from the length of the concrete array gives the length of the abstract array.

Given a target index of a character in the abstract array, use the auxiliary array to determine k, the number of astral characters that precede the target character. That can be done with either a O(k) linear scan or O(log k) binary search. Add k to the abstract index to get the corresponding index in the concrete array, since each preceding astral character occupies one extra 16-bit unit. When slicing a string with i0 and i1, slice the auxiliary array with k0 and k1 and adjust the contained indexes downward to get the corresponding auxiliary array.

> To my mind, exposing UTF-16 surrogates to the application is a bug
> to be fixed, not a feature to be maintained.

It is definitely not a feature, but a proper UTF-16 implementation would not expose them except to codecs, just as with the PEP 393 implementation. (In both cases, I am excluding the sys size function as 'exposing to the application'.)

> But since we can get the best of both worlds with only
> a small amount of overhead, I really don't see why anyone should be
> objecting.

I presume you are referring to the PEP 393 1-2-4 byte implementation. Given how well it has been optimized, I think it was the right choice for Python. But a language that now uses UCS-2 or defective UTF-16 on all platforms might find the auxiliary array an easier fix.

--
Terry Jan Reedy
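The auxiliary-array idea can be sketched in a few lines of Python. This is not Terry's experimental code: the class `IndexedUTF16` and its attribute names are hypothetical, and it assumes the input contains no lone surrogates. Indexes are counted in 16-bit units, so each astral character before the target adds exactly one extra unit and the concrete index is `i + k`.

```python
from bisect import bisect_left


class IndexedUTF16:
    """Toy UTF-16 string with an auxiliary array of astral positions."""

    def __init__(self, text):
        self.units = []    # concrete array of 16-bit code units
        self.astral = []   # abstract indexes of astral characters, ascending
        for i, ch in enumerate(text):
            cp = ord(ch)
            if cp >= 0x10000:
                cp -= 0x10000
                self.units.append(0xD800 + (cp >> 10))     # lead surrogate
                self.units.append(0xDC00 + (cp & 0x3FF))   # trail surrogate
                self.astral.append(i)
            else:
                self.units.append(cp)

    def __len__(self):
        # Each surrogate pair uses two units for one character.
        return len(self.units) - len(self.astral)

    def concrete_index(self, i):
        # k = number of astral characters strictly before abstract index i,
        # found by O(log k) binary search.
        k = bisect_left(self.astral, i)
        return i + k

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        j = self.concrete_index(i)
        unit = self.units[j]
        if 0xD800 <= unit <= 0xDBFF:
            trail = self.units[j + 1]
            return chr(0x10000 + ((unit - 0xD800) << 10) + (trail - 0xDC00))
        return chr(unit)
```

For a string with no astral characters the auxiliary array is empty, so indexing costs one extra binary search over an empty list and nothing more.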
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-07-25 08:19 +1000 |
| Message-ID | <mailman.5068.1374704365.3114.python-list@python.org> |
| In reply to | #51131 |
On Thu, Jul 25, 2013 at 8:09 AM, Terry Reedy <tjreedy@udel.edu> wrote:
> On 7/24/2013 2:15 PM, Chris Angelico wrote:
>> To my mind, exposing UTF-16 surrogates to the application is a bug
>> to be fixed, not a feature to be maintained.
>
> It is definitely not a feature, but a proper UTF-16 implementation would not
> expose them except to codecs, just as with the PEP 393 implementation. (In
> both cases, I am excluding the sys size function as 'exposing to the
> application'.)
>
>> But since we can get the best of both worlds with only
>> a small amount of overhead, I really don't see why anyone should be
>> objecting.
>
> I presume you are referring to the PEP 393 1-2-4 byte implementation. Given
> how well it has been optimized, I think it was the right choice for Python.
> But a language that now uses UCS-2 or defective UTF-16 on all platforms might
> find the auxiliary array an easier fix.
>

I'm referring here to objections like jmf's, and also to threads like this:

http://mozilla.6506.n7.nabble.com/Flexible-String-Representation-full-Unicode-for-ES6-td267585.html

According to the ECMAScript people, UTF-16 and exposing surrogates to the application is a critical feature to be maintained. I disagree. But it's not my language, so I'm stuck with it. (I ended up writing a little wrapper function in C that detects unpaired surrogates, but that still doesn't deal with the possibility that character indexing can create a new character that was never there to start with.)

ChrisA
| From | Michael Torrie <torriem@gmail.com> |
|---|---|
| Date | 2013-07-24 16:59 -0600 |
| Message-ID | <mailman.5069.1374706766.3114.python-list@python.org> |
| In reply to | #51131 |
On 07/24/2013 04:19 PM, Chris Angelico wrote:
> I'm referring here to objections like jmf's, and also to threads like this:
>
> http://mozilla.6506.n7.nabble.com/Flexible-String-Representation-full-Unicode-for-ES6-td267585.html
>
> According to the ECMAScript people, UTF-16 and exposing surrogates to
> the application is a critical feature to be maintained. I disagree.
> But it's not my language, so I'm stuck with it. (I ended up writing a
> little wrapper function in C that detects unpaired surrogates, but
> that still doesn't deal with the possibility that character indexing
> can create a new character that was never there to start with.)

This is starting to drift off topic here now, but after reading your comments on that post, and others' objections, I don't fully understand why making strings simply "unicode" in javascript breaks compatibility with older scripts. What operations are performed on strings that making unicode an abstract type would break? Is it just in the input and output of text that must be decoded and encoded? Why should a script care about the internal representation of unicode strings? Is it because the incorrect behavior of UTF-16 and the exposed surrogates (and subsequent incorrect indexing) are actually depended on by some scripts?
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-07-25 09:24 +1000 |
| Message-ID | <mailman.5070.1374708283.3114.python-list@python.org> |
| In reply to | #51131 |
On Thu, Jul 25, 2013 at 8:59 AM, Michael Torrie <torriem@gmail.com> wrote:
> I don't fully understand
> why making strings simply "unicode" in javascript breaks compatibility
> with older scripts. What operations are performed on strings that
> making unicode an abstract type would break?

Imagine this in JavaScript and Python (apart from the fact that JS doesn't do backslash escapes past 0x10000):

a = "asdf\U00012345qwer";
b = a[[..10];

What will this do? It depends on whether UTF-16 is used or not.

ChrisA
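To make the difference concrete, here is a small sketch that looks at the same string both ways in Python 3.3+. The variable names are only illustrative; the UTF-16 view is obtained by encoding, which is roughly what a language indexing by code unit would see.

```python
a = "asdf\U00012345qwer"

# Code point view: the astral character is a single item.
code_points = [ord(ch) for ch in a]

# UTF-16 view: the same text as 16-bit code units.
data = a.encode("utf-16-le")
units = [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

print(len(code_points))              # 9
print(len(units))                    # 10
print(hex(code_points[4]))           # 0x12345
print(hex(units[4]), hex(units[5]))  # 0xd808 0xdf45
```

Any index or slice boundary past position 4 therefore refers to a different place in the text depending on which view the language exposes, and a boundary at unit 5 would split the surrogate pair.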
| From | Serhiy Storchaka <storchaka@gmail.com> |
|---|---|
| Date | 2013-07-25 08:49 +0300 |
| Message-ID | <mailman.5080.1374731378.3114.python-list@python.org> |
| In reply to | #51131 |
24.07.13 21:15, Chris Angelico wrote:
> To my mind, exposing UTF-16
> surrogates to the application is a bug to be fixed, not a feature to
> be maintained.

Python 3 uses code points from U+DC80 to U+DCFF (which are in the surrogates area) to represent undecodable bytes with the surrogateescape error handler.
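A short illustration of that behaviour, using only the standard `surrogateescape` error handler. The sample bytes are made up for the example.

```python
raw = b"caf\xe9 \x80"   # Latin-1 style bytes, not valid UTF-8

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode("utf-8", "surrogateescape")
escaped = [hex(ord(ch)) for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]
print(escaped)          # ['0xdce9', '0xdc80']

# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode("utf-8", "surrogateescape")
print(round_trip == raw)   # True
```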
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-07-25 15:58 +1000 |
| Message-ID | <mailman.5082.1374732265.3114.python-list@python.org> |
| In reply to | #51131 |
On Thu, Jul 25, 2013 at 3:49 PM, Serhiy Storchaka <storchaka@gmail.com> wrote:
> 24.07.13 21:15, Chris Angelico wrote:
>
>> To my mind, exposing UTF-16
>> surrogates to the application is a bug to be fixed, not a feature to
>> be maintained.
>
>
> Python 3 uses code points from U+DC80 to U+DCFF (which are in surrogates
> area) to represent undecodable bytes with surrogateescape error handler.

That's a deliberate and conscious use of the codepoints; that's not what I'm talking about here. Suppose you read a UTF-8 stream of bytes from a file, and decode them into your language's standard string type. At this point, you should be working with a string of Unicode codepoints:

"\22\341\210\264\360\222\215\205"

-->

"\x12\u1234\U00012345"

The incoming byte stream has a length of 8, the resulting character stream has a length of 3. Now, if the language wants to use UTF-16 internally, it's free to do so:

0012 1234 d808 df45

When I referred to exposing surrogates to the application, this is what I'm talking about. If decoding the above byte stream results in a length 4 string where the last two are \ud808 and \udf45, then it's exposing them. If it's a length 3 string where the last is \U00012345, then it's hiding them. To be honest, I don't imagine I'll ever see a language that stores strings in UTF-16 and then exposes them to the application as UTF-32; there's very little point. But such *is* possible, and if it's working closely with libraries that demand UTF-16, it might well make sense to do things that way.

ChrisA
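The numbers in this example can be checked directly in Python 3 (`bytes.hex()` needs 3.5 or later). The variable names are only for illustration.

```python
raw = b"\22\341\210\264\360\222\215\205"   # octal escapes, as in the post

text = raw.decode("utf-8")
print(len(raw), len(text))                 # 8 3
print(text == "\x12\u1234\U00012345")      # True

# The same three code points as UTF-16 code units (big-endian for readability).
utf16 = text.encode("utf-16-be")
units = [utf16[i:i + 2].hex() for i in range(0, len(utf16), 2)]
print(units)                               # ['0012', '1234', 'd808', 'df45']
```

Python 3.3+ reports a length of 3 here, so in the terms used above it hides the surrogates; a language reporting 4 would be exposing them.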
| From | Jeremy Sanders <jeremy@jeremysanders.net> |
|---|---|
| Date | 2013-07-25 14:36 +0100 |
| Message-ID | <mailman.5094.1374759404.3114.python-list@python.org> |
| In reply to | #51131 |
wxjmfauth@gmail.com wrote:
> Short example. Writing an editor with something like the
> FSR is simply impossible (properly).

http://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html#Text-Representations

"To conserve memory, Emacs does not hold fixed-length 22-bit numbers
that are codepoints of text characters within buffers and strings.
Rather, Emacs uses a variable-length internal representation of
characters, that stores each character as a sequence of 1 to 5 8-bit
bytes, depending on the magnitude of its codepoint[1]. For example, any
ASCII character takes up only 1 byte, a Latin-1 character takes up 2
bytes, etc. We call this representation of text multibyte.

...

[1] This internal representation is based on one of the encodings
defined by the Unicode Standard, called UTF-8, for representing any
Unicode codepoint, but Emacs extends UTF-8 to represent the additional
codepoints it uses for raw 8-bit bytes and characters not unified with
Unicode.
"

Jeremy
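The byte counts quoted from the manual match standard UTF-8, which is easy to confirm from Python. This only shows plain UTF-8; it says nothing about the Emacs extensions for raw bytes and non-Unicode characters, and the sample characters are arbitrary.

```python
samples = ["a", "\xe9", "\u1234", "\U00012345"]   # ASCII, Latin-1, BMP, astral

# Number of bytes each character needs in standard UTF-8.
widths = [len(ch.encode("utf-8")) for ch in samples]

for ch, width in zip(samples, widths):
    print("U+%04X takes %d byte(s)" % (ord(ch), width))
```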
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-07-25 15:26 +0000 |
| Message-ID | <51f14395$0$29971$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #51217 |
On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote:
> wxjmfauth@gmail.com wrote:
>
>> Short example. Writing an editor with something like the FSR is simply
>> impossible (properly).
>
> http://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html#Text-Representations
>
> "To conserve memory, Emacs does not hold fixed-length 22-bit numbers
> that are codepoints of text characters within buffers and strings.
> Rather, Emacs uses a variable-length internal representation of
> characters, that stores each character as a sequence of 1 to 5 8-bit
> bytes, depending on the magnitude of its codepoint[1]. For example, any
> ASCII character takes up only 1 byte, a Latin-1 character takes up 2
> bytes, etc. We call this representation of text multibyte.

Well, you've just proven what Vim users have always suspected: Emacs doesn't really exist.

> [1] This internal representation is based on one of the encodings
> defined by the Unicode Standard, called UTF-8, for representing any
> Unicode codepoint, but Emacs extends UTF-8 to represent the additional
> codepoints it uses for raw 8-bit bytes and characters not unified with
> Unicode.
> "

Do you know what those characters not unified with Unicode are? Is there a list somewhere? I've read all of the pages from here to no avail:

http://www.gnu.org/software/emacs/manual/html_node/elisp/Non_002dASCII-Characters.html

--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-07-26 01:36 +1000 |
| Message-ID | <mailman.5106.1374766576.3114.python-list@python.org> |
| In reply to | #51233 |
On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote:
>> "To conserve memory, Emacs does not hold fixed-length 22-bit numbers
>> that are codepoints of text characters within buffers and strings.
>> Rather, Emacs uses a variable-length internal representation of
>> characters, that stores each character as a sequence of 1 to 5 8-bit
>> bytes, depending on the magnitude of its codepoint[1]. For example, any
>> ASCII character takes up only 1 byte, a Latin-1 character takes up 2
>> bytes, etc. We call this representation of text multibyte.
>
> Well, you've just proven what Vim users have always suspected: Emacs
> doesn't really exist.

... lolwut?

ChrisA
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-07-25 17:18 +0000 |
| Message-ID | <51f15e03$0$29971$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #51234 |
On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote:
> On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote:
>>> "To conserve memory, Emacs does not hold fixed-length 22-bit numbers
>>> that are codepoints of text characters within buffers and strings.
>>> Rather, Emacs uses a variable-length internal representation of
>>> characters, that stores each character as a sequence of 1 to 5 8-bit
>>> bytes, depending on the magnitude of its codepoint[1]. For example,
>>> any ASCII character takes up only 1 byte, a Latin-1 character takes up
>>> 2 bytes, etc. We call this representation of text multibyte.
>>
>> Well, you've just proven what Vim users have always suspected: Emacs
>> doesn't really exist.
>
> ... lolwut?

JMF has explained that it is impossible, impossible I say!, to write an editor using a flexible string representation. Since Emacs uses such a flexible string representation, Emacs is impossible, and therefore Emacs doesn't exist.

QED.

--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-07-26 03:27 +1000 |
| Message-ID | <mailman.5113.1374773662.3114.python-list@python.org> |
| In reply to | #51247 |
On Fri, Jul 26, 2013 at 3:18 AM, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote:
>
>> On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano
>> <steve+comp.lang.python@pearwood.info> wrote:
>>> On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote:
>>>> "To conserve memory, Emacs does not hold fixed-length 22-bit numbers
>>>> that are codepoints of text characters within buffers and strings.
>>>> Rather, Emacs uses a variable-length internal representation of
>>>> characters, that stores each character as a sequence of 1 to 5 8-bit
>>>> bytes, depending on the magnitude of its codepoint[1]. For example,
>>>> any ASCII character takes up only 1 byte, a Latin-1 character takes up
>>>> 2 bytes, etc. We call this representation of text multibyte.
>>>
>>> Well, you've just proven what Vim users have always suspected: Emacs
>>> doesn't really exist.
>>
>> ... lolwut?
>
>
> JMF has explained that it is impossible, impossible I say!, to write an
> editor using a flexible string representation. Since Emacs uses such a
> flexible string representation, Emacs is impossible, and therefore Emacs
> doesn't exist.
>
> QED.

Quad Error Demonstrated.

I never got past the level of Canis Latinicus in debating class.

ChrisA
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2013-07-25 15:45 -0500 |
| Message-ID | <mailman.5121.1374785646.3114.python-list@python.org> |
| In reply to | #51247 |
On Thu, Jul 25, 2013 at 12:18 PM, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote:
>
>> On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano
>> <steve+comp.lang.python@pearwood.info> wrote:
>>> On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote:
>>>> "To conserve memory, Emacs does not hold fixed-length 22-bit numbers
>>>> that are codepoints of text characters within buffers and strings.
>>>> Rather, Emacs uses a variable-length internal representation of
>>>> characters, that stores each character as a sequence of 1 to 5 8-bit
>>>> bytes, depending on the magnitude of its codepoint[1]. For example,
>>>> any ASCII character takes up only 1 byte, a Latin-1 character takes up
>>>> 2 bytes, etc. We call this representation of text multibyte.
>>>
>>> Well, you've just proven what Vim users have always suspected: Emacs
>>> doesn't really exist.
>>
>> ... lolwut?
>
>
> JMF has explained that it is impossible, impossible I say!, to write an
> editor using a flexible string representation. Since Emacs uses such a
> flexible string representation, Emacs is impossible, and therefore Emacs
> doesn't exist.
>
> QED.

Except that the described representation used by Emacs is a variant of UTF-8, not an FSR. It doesn't have three different possible encodings for the letter 'a' depending on what other characters happen to be in the string.

As I understand it, jmf would be perfectly happy if Python used UTF-8 (or presumably the Emacs variant) as its internal string representation.
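The "three different possible encodings" can be observed indirectly on CPython 3.3+ through `sys.getsizeof`. This is a sketch that relies on CPython's PEP 393 implementation; the exact byte counts vary between versions and other Python implementations may behave differently, so only the relative sizes matter.

```python
import sys

base = "a" * 1000
samples = {
    "ASCII only": base,
    "plus U+1234 (BMP)": base + "\u1234",
    "plus U+12345 (astral)": base + "\U00012345",
}

# One wider character changes the storage width used for every 'a'.
sizes = {label: sys.getsizeof(s) for label, s in samples.items()}
for label, size in sizes.items():
    print(label, size)

# The visible behaviour is unchanged: indexing still yields 'a'.
print(all(s[0] == "a" for s in samples.values()))
```

On CPython the three sizes come out at roughly 1, 2 and 4 bytes per character plus a fixed header, which is the width switch the FSR objection is about.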