Groups > comp.lang.python > #50503 > unrolled thread
| Started by | Devyn Collier Johnson <devyncjohnson@gmail.com> |
|---|---|
| First post | 2013-07-11 19:44 -0400 |
| Last post | 2013-07-18 13:17 -0700 |
| Articles | 20 on this page of 136 — 25 participants |
RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-11 19:44 -0400
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-12 02:23 -0700
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-12 19:27 +1000
Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-12 10:39 +0100
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-12 19:40 +1000
Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-12 06:45 -0400
Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-12 16:59 +0100
Re: RE Module Performance Peter Otten <__peter__@web.de> - 2013-07-12 18:15 +0200
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-13 02:21 +1000
Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-12 13:58 -0400
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-13 05:37 +0000
Re: RE Module Performance 88888 Dihedral <dihedral88888@gmail.com> - 2013-07-14 11:17 -0700
Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-15 06:06 -0400
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-15 12:36 +0000
Dihedral Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-15 08:52 -0400
Re: Dihedral Joel Goldstick <joel.goldstick@gmail.com> - 2013-07-15 09:03 -0400
Re: Dihedral Wayne Werner <wayne@waynewerner.com> - 2013-07-15 17:43 -0500
Re: Dihedral Fábio Santos <fabiosantosart@gmail.com> - 2013-07-15 23:54 +0100
Re: Dihedral Chris Angelico <rosuav@gmail.com> - 2013-07-16 08:59 +1000
Re: Dihedral Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-16 16:06 +1000
Re: Dihedral Stefan Behnel <stefan_ml@behnel.de> - 2013-07-24 20:08 +0200
Re: Dihedral Chris Angelico <rosuav@gmail.com> - 2013-07-25 04:23 +1000
Re: Dihedral Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-07-24 20:15 -0400
Re: RE Module Performance Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-13 08:16 +1000
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-12 17:13 -0600
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-24 06:40 -0700
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-24 23:48 +1000
Re: RE Module Performance David Hutto <dwightdhutto@gmail.com> - 2013-07-24 10:17 -0400
Re: RE Module Performance David Hutto <dwightdhutto@gmail.com> - 2013-07-24 10:19 -0400
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 00:34 +1000
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 07:02 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 17:39 +1000
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-24 08:47 -0600
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-25 02:27 -0700
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 20:14 +1000
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-25 12:07 -0700
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-26 05:18 +1000
RE: RE Module Performance "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2013-07-25 19:30 +0000
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-25 21:06 -0600
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-24 09:00 -0600
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 05:56 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 00:56 +1000
Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-24 13:52 -0400
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 04:15 +1000
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 07:15 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 17:58 +1000
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 09:22 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 20:07 +1000
Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-24 18:09 -0400
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 08:19 +1000
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-24 16:59 -0600
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 09:24 +1000
Re: RE Module Performance Serhiy Storchaka <storchaka@gmail.com> - 2013-07-25 08:49 +0300
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 15:58 +1000
Re: RE Module Performance Jeremy Sanders <jeremy@jeremysanders.net> - 2013-07-25 14:36 +0100
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 15:26 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-26 01:36 +1000
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 17:18 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-26 03:27 +1000
Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-25 15:45 -0500
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-26 02:48 +0000
Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-25 21:20 -0600
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 06:36 -0700
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 08:46 -0700
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-27 06:28 +0000
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-27 03:37 +0000
Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-26 22:12 -0600
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-27 05:04 +0000
Re: RE Module Performance Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-07-27 12:13 -0400
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 06:19 -0700
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-25 21:09 -0600
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 06:21 -0700
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-26 20:05 -0600
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-27 11:21 -0700
Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-27 21:53 -0600
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-28 11:13 -0700
Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-28 20:04 +0100
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-28 12:30 -0700
Re: RE Module Performance Lele Gaifax <lele@metapensiero.it> - 2013-07-28 22:45 +0200
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 22:01 +0200
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-30 07:01 -0700
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-30 16:38 +0200
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-30 15:45 +0100
Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-30 17:13 +0100
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-30 18:39 +0200
Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-30 18:14 +0100
Re: RE Module Performance Neil Hodgson <nhodgson@iinet.net.au> - 2013-07-31 13:09 +1000
Re: RE Module Performance Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-31 03:27 +1000
Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-30 18:40 +0100
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-30 20:19 +0200
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-30 12:09 -0700
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-30 21:04 +0100
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-30 21:54 -0600
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-31 05:45 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-31 08:17 +0100
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-31 13:15 -0700
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-31 21:41 +0100
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-31 10:11 +0200
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-31 01:32 -0700
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-31 10:59 +0200
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-31 08:44 -0600
Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-30 17:05 -0400
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-30 21:30 -0600
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-31 09:23 +0200
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-31 08:27 -0600
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 10:45 +0200
FSR and unicode compliance - was Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-28 09:52 -0600
Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-28 12:23 -0700
Re: FSR and unicode compliance - was Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-28 20:44 +0100
Re: FSR and unicode compliance - was Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 21:55 +0200
Re: FSR and unicode compliance - was Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-28 20:52 +0000
Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 04:43 -0700
Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-29 12:57 +0100
Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 05:56 -0700
Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 07:20 -0700
Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-29 15:49 +0100
Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 09:31 -0700
Re: FSR and unicode compliance - was Re: RE Module Performance Heiko Wundram <modelnine@modelnine.org> - 2013-07-29 14:06 +0200
Re: FSR and unicode compliance - was Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-29 08:43 -0400
Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-28 18:03 +0100
Re: FSR and unicode compliance - was Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-28 13:36 -0400
Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 06:36 -0700
Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-28 19:03 +0100
Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-28 19:19 +0100
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-28 19:29 +0100
Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-28 15:06 -0400
Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-28 23:14 +0100
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 20:51 +0200
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-29 00:07 +0100
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-26 22:38 +0200
Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-25 09:44 -0400
Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-25 15:53 -0500
Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-13 00:16 +0100
Re: RE Module Performance Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-14 05:34 +1000
Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-16 06:30 -0400
Re: RE Module Performance 88888 Dihedral <dihedral88888@gmail.com> - 2013-07-18 13:17 -0700
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-07-26 02:48 +0000 |
| Message-ID | <51f1e371$0$29971$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #51260 |
On Thu, 25 Jul 2013 15:45:38 -0500, Ian Kelly wrote:
> On Thu, Jul 25, 2013 at 12:18 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote:
>>
>>> On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano
>>> <steve+comp.lang.python@pearwood.info> wrote:
>>>> On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote:
>>>>> "To conserve memory, Emacs does not hold fixed-length 22-bit numbers
>>>>> that are codepoints of text characters within buffers and strings.
>>>>> Rather, Emacs uses a variable-length internal representation of
>>>>> characters, that stores each character as a sequence of 1 to 5 8-bit
>>>>> bytes, depending on the magnitude of its codepoint[1]. For example,
>>>>> any ASCII character takes up only 1 byte, a Latin-1 character takes
>>>>> up 2 bytes, etc. We call this representation of text multibyte.
>>>>
>>>> Well, you've just proven what Vim users have always suspected: Emacs
>>>> doesn't really exist.
>>>
>>> ... lolwut?
>>
>>
>> JMF has explained that it is impossible, impossible I say!, to write an
>> editor using a flexible string representation. Since Emacs uses such a
>> flexible string representation, Emacs is impossible, and therefore
>> Emacs doesn't exist.
>>
>> QED.
>
> Except that the described representation used by Emacs is a variant of
> UTF-8, not an FSR. It doesn't have three different possible encodings
> for the letter 'a' depending on what other characters happen to be in
> the string.
>
> As I understand it, jmf would be perfectly happy if Python used UTF-8
> (or presumably the Emacs variant) as its internal string representation.
UTF-8 uses a flexible representation on a character-by-character basis.
When parsing UTF-8, one needs to look at EVERY character to decide how
many bytes you need to read. In Python 3, the flexible representation is
on a string-by-string basis: once Python has looked at the string header,
it can tell whether the *entire* string takes 1, 2 or 4 bytes per
character, and the string is then fixed-width. You can't do that with
UTF-8.
To put it in terms of pseudo-code:

    # Python 3.3
    def parse_string(astring):
        # Decision gets made once per string.
        if astring uses 1 byte:
            count = 1
        elif astring uses 2 bytes:
            count = 2
        else:
            count = 4
        while not done:
            char = convert(next(count bytes))

    # UTF-8
    def parse_string(astring):
        while not done:
            b = next(1 byte)
            # Decision gets made for every single char
            if uses 1 byte:
                char = convert(b)
            elif uses 2 bytes:
                char = convert(b, next(1 byte))
            elif uses 3 bytes:
                char = convert(b, next(2 bytes))
            else:
                char = convert(b, next(3 bytes))

So UTF-8 requires much more runtime overhead than Python 3.3, and Emacs'
variation can in fact require more bytes per character than either.
(UTF-8 and Python 3.3 can require up to four bytes, Emacs up to five.)
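
The two pseudo-code routines above can be made concrete with a rough, runnable Python sketch. The names `decode_fixed_width` and `decode_utf8` are illustrative only and are not CPython internals. The fixed-width decoder assumes little-endian storage, and the UTF-8 decoder assumes well-formed input and does no validation.

```python
def decode_fixed_width(data: bytes, width: int) -> str:
    """Decode a buffer in which every character occupies `width` bytes."""
    # The width is decided once, for the whole buffer (assumed little-endian).
    chars = []
    for i in range(0, len(data), width):
        chars.append(chr(int.from_bytes(data[i:i + width], "little")))
    return "".join(chars)


def decode_utf8(data: bytes) -> str:
    """Decode well-formed UTF-8; the width is decided at every character."""
    chars = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:             # 0xxxxxxx: 1 byte
            n, cp = 1, b
        elif b >> 5 == 0b110:    # 110xxxxx: 2 bytes
            n, cp = 2, b & 0x1F
        elif b >> 4 == 0b1110:   # 1110xxxx: 3 bytes
            n, cp = 3, b & 0x0F
        else:                    # 11110xxx: 4 bytes (input assumed valid)
            n, cp = 4, b & 0x07
        # Fold in the low six bits of each continuation byte.
        for cont in data[i + 1:i + n]:
            cp = (cp << 6) | (cont & 0x3F)
        chars.append(chr(cp))
        i += n
    return "".join(chars)


text = "a\u00e9\u20ac\U0001F600"
round_trip_utf8 = decode_utf8(text.encode("utf-8"))
round_trip_fixed = decode_fixed_width(text.encode("utf-32-le"), 4)
```

Both decoders recover the same text; the difference is only where the width decision is made.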
I'm not surprised that JMF would prefer UTF-8 -- he is completely out of
his depth, and is a fine example of the Dunning-Kruger effect in action.
He is so sure he is right based on so little evidence.
One advantage of UTF-8 is that for some BMP characters, you can get away
with only three bytes instead of four. For transmitting data over the
wire, or storage on disk, that's potentially up to a 25% reduction in
space, which is not to be sneezed at. (Although in practice it's usually
much less than that, since the most common characters are encoded to 1 or
2 bytes, not 4). But that comes at the cost of much more runtime
overhead, which in my opinion makes UTF-8 a second-class string
representation compared to fixed-width representations.
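
As a small worked example of the 25% figure, the sketch below compares the encoded size of one three-byte BMP character in UTF-8 against a four-byte fixed-width encoding. The helper name `saving_vs_utf32` is made up for illustration, and UTF-32 is used here as a stand-in for "four bytes per character".

```python
def saving_vs_utf32(text: str) -> float:
    """Fraction of space saved by UTF-8 relative to 4 bytes per character."""
    utf8_len = len(text.encode("utf-8"))
    utf32_len = len(text.encode("utf-32-le"))  # "-le" avoids counting a BOM
    return 1 - utf8_len / utf32_len


euro_saving = saving_vs_utf32("\u20ac")   # U+20AC needs 3 bytes in UTF-8
astral_saving = saving_vs_utf32("\U0001F600")  # outside the BMP: 4 bytes
```

For a character that takes three UTF-8 bytes the saving is one byte in four; for a character outside the BMP there is no saving at all.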
--
Steven
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2013-07-25 21:20 -0600 |
| Message-ID | <mailman.5129.1374808894.3114.python-list@python.org> |
| In reply to | #51272 |
On Thu, Jul 25, 2013 at 8:48 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> UTF-8 uses a flexible representation on a character-by-character basis.
> When parsing UTF-8, one needs to look at EVERY character to decide how
> many bytes you need to read. In Python 3, the flexible representation is
> on a string-by-string basis: once Python has looked at the string header,
> it can tell whether the *entire* string takes 1, 2 or 4 bytes per
> character, and the string is then fixed-width. You can't do that with
> UTF-8.

UTF-8 does not use a flexible representation. A codec that is
encoding a string in UTF-8 and examining a particular character does
not have any choice of how to encode that character; there is exactly
one sequence of bits that is the UTF-8 encoding for the character.
Further, for any given sequence of code points there is exactly one
sequence of bytes that is the UTF-8 encoding of those code points. In
contrast, with the FSR there are as many as three different sequences
of bytes that encode a sequence of code points, with one of them (the
shortest) being canonical. That's what makes it flexible.

Anyway, my point was just that Emacs is not a counter-example to jmf's
claim about implementing text editors, because UTF-8 is not what he
(or anybody else) is referring to when speaking of the FSR or
"something like the FSR".
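
The distinction drawn here can be illustrated with a short Python sketch. It is only a model: the three fixed-width forms are produced with the `latin-1`, `utf-16-le` and `utf-32-le` codecs as stand-ins for the 1-, 2- and 4-byte storage widths, whereas the real in-memory layout uses native byte order. The helper name `utf8_of_first_char` is hypothetical.

```python
def utf8_of_first_char(text: str) -> bytes:
    """Return the UTF-8 bytes that encode the first character of `text`."""
    first_len = len(text[0].encode("utf-8"))
    return text.encode("utf-8")[:first_len]


# 'a' followed by neighbours of increasing code-point magnitude.
samples = ["a", "a\u00e9", "a\u20ac", "a\U0001F600"]

# UTF-8: the bytes for 'a' never depend on the rest of the string.
utf8_forms = {utf8_of_first_char(s) for s in samples}

# Fixed-width storage: the same 'a' has three possible byte sequences,
# depending on which width the whole string was given.
fixed_width_forms = [
    "a".encode(codec) for codec in ("latin-1", "utf-16-le", "utf-32-le")
]
```

Here `utf8_forms` collapses to a single byte string, while `fixed_width_forms` holds three different ones for the same letter.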
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-07-26 06:36 -0700 |
| Message-ID | <1ca6bb15-ce10-4a23-82fc-aa0af0f7ac97@googlegroups.com> |
| In reply to | #51277 |
On Friday, 26 July 2013 05:20:45 UTC+2, Ian wrote:
> On Thu, Jul 25, 2013 at 8:48 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> UTF-8 uses a flexible representation on a character-by-character basis.
>> When parsing UTF-8, one needs to look at EVERY character to decide how
>> many bytes you need to read. In Python 3, the flexible representation is
>> on a string-by-string basis: once Python has looked at the string header,
>> it can tell whether the *entire* string takes 1, 2 or 4 bytes per
>> character, and the string is then fixed-width. You can't do that with
>> UTF-8.
>
> UTF-8 does not use a flexible representation. A codec that is
> encoding a string in UTF-8 and examining a particular character does
> not have any choice of how to encode that character; there is exactly
> one sequence of bits that is the UTF-8 encoding for the character.
> Further, for any given sequence of code points there is exactly one
> sequence of bytes that is the UTF-8 encoding of those code points. In
> contrast, with the FSR there are as many as three different sequences
> of bytes that encode a sequence of code points, with one of them (the
> shortest) being canonical. That's what makes it flexible.
>
> Anyway, my point was just that Emacs is not a counter-example to jmf's
> claim about implementing text editors, because UTF-8 is not what he
> (or anybody else) is referring to when speaking of the FSR or
> "something like the FSR".

--------

BTW, it is not necessary to use an endorsed Unicode coding scheme
(utf*); a string literal would have been possible, but then one runs
into memory issues. All these UTFs follow the basic coding scheme.

I repeat: a coding scheme works with a unique set of characters, and
its implementation works with a unique set of encoded code points (the
UTFs, in the case of Unicode).

And again, that is why we live today with all these coding schemes; or,
to take the problem from the other side, it is because one has to work
with a unique set of encoded code points that all these coding schemes
had to be created.

The UTFs were not created by newbies ;-)

jmf
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-07-26 08:46 -0700 |
| Message-ID | <d790ab57-2b96-4ae7-a86d-4229484115e1@googlegroups.com> |
| In reply to | #51277 |
On Friday, 26 July 2013 05:20:45 UTC+2, Ian wrote:
> On Thu, Jul 25, 2013 at 8:48 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> UTF-8 uses a flexible representation on a character-by-character basis.
>> When parsing UTF-8, one needs to look at EVERY character to decide how
>> many bytes you need to read. In Python 3, the flexible representation is
>> on a string-by-string basis: once Python has looked at the string header,
>> it can tell whether the *entire* string takes 1, 2 or 4 bytes per
>> character, and the string is then fixed-width. You can't do that with
>> UTF-8.
>
> UTF-8 does not use a flexible representation. A codec that is
> encoding a string in UTF-8 and examining a particular character does
> not have any choice of how to encode that character; there is exactly
> one sequence of bits that is the UTF-8 encoding for the character.
> Further, for any given sequence of code points there is exactly one
> sequence of bytes that is the UTF-8 encoding of those code points. In
> contrast, with the FSR there are as many as three different sequences
> of bytes that encode a sequence of code points, with one of them (the
> shortest) being canonical. That's what makes it flexible.
>
> Anyway, my point was just that Emacs is not a counter-example to jmf's
> claim about implementing text editors, because UTF-8 is not what he
> (or anybody else) is referring to when speaking of the FSR or
> "something like the FSR".

-----

Let's be clear. I understand perfectly well what UTF-8 is, and it is
for that precise reason that I put the "editor" on the table as an
example. This FSR is not *a* coding scheme. It is more of a composite
coding scheme. (And from there, all the problems.)

BTW, I'm pleased to read "sequence of bits" and not bytes. Again, utf
transformers are producing sequences of bits, called Unicode
Transformation Units, with lengths of 8/16/32 *bits*, from there the
names utf8/16/32. UCS transformers are (were) producing bytes, from
there the names ucs-2/4.

jmf
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-07-27 06:28 +0000 |
| Message-ID | <51f368a8$0$29971$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #51311 |
On Fri, 26 Jul 2013 08:46:58 -0700, wxjmfauth wrote:
> BTW, I'm pleased to read "sequence of bits" and not bytes. Again, utf
> transformers are producing sequences of bits, called Unicode
> Transformation Units, with lengths of 8/16/32 *bits*, from there the
> names utf8/16/32. UCS transformers are (were) producing bytes, from
> there the names ucs-2/4.

Not only does your distinction between bits and bytes make no practical
difference on nearly all hardware in common use today[1], but the Unicode
Consortium disagrees with you, and defines UTF in terms of bytes:

"A Unicode transformation format (UTF) is an algorithmic mapping from
every Unicode code point (except surrogate code points) to a unique byte
sequence."

http://www.unicode.org/faq/utf_bom.html#gen2

[1] There may still be some old supercomputers where a byte is more than
8 bits in use, but they're unlikely to support Unicode.
--
Steven
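
To see how the 8/16/32-bit units and the resulting byte sequences relate in practice, here is a minimal Python sketch. The helper `units_and_bytes` is a hypothetical name; the little-endian codecs are used only so that no byte-order mark is counted.

```python
def units_and_bytes(text: str, codec: str, unit_bytes: int) -> tuple:
    """Return (number of code units, number of bytes) for `text` in `codec`."""
    data = text.encode(codec)
    return len(data) // unit_bytes, len(data)


euro = "\u20ac"          # a BMP character
face = "\U0001F600"      # a character outside the BMP

euro_utf8 = units_and_bytes(euro, "utf-8", 1)
euro_utf16 = units_and_bytes(euro, "utf-16-le", 2)
euro_utf32 = units_and_bytes(euro, "utf-32-le", 4)
face_utf16 = units_and_bytes(face, "utf-16-le", 2)
```

Whatever the unit size, each encoding ends up as a whole number of 8-bit bytes, which is the form the quoted FAQ definition describes.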
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-07-27 03:37 +0000 |
| Message-ID | <51f3406f$0$29971$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #51277 |
On Thu, 25 Jul 2013 21:20:45 -0600, Ian Kelly wrote:
> On Thu, Jul 25, 2013 at 8:48 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> UTF-8 uses a flexible representation on a character-by-character basis.
>> When parsing UTF-8, one needs to look at EVERY character to decide how
>> many bytes you need to read. In Python 3, the flexible representation
>> is on a string-by-string basis: once Python has looked at the string
>> header, it can tell whether the *entire* string takes 1, 2 or 4 bytes
>> per character, and the string is then fixed-width. You can't do that
>> with UTF-8.
>
> UTF-8 does not use a flexible representation.

I disagree, and so does Jeremy Sanders who first pointed out the
similarity between Emacs' UTF-8 and Python's FSR. I'll quote from the
Emacs documentation again:

"To conserve memory, Emacs does not hold fixed-length 22-bit numbers that
are codepoints of text characters within buffers and strings. Rather,
Emacs uses a variable-length internal representation of characters, that
stores each character as a sequence of 1 to 5 8-bit bytes, depending on
the magnitude of its codepoint. For example, any ASCII character takes
up only 1 byte, a Latin-1 character takes up 2 bytes, etc."

And the Python FSR:

"To conserve memory, Python does not hold fixed-length 21-bit numbers that
are codepoints of text characters within buffers and strings. Rather,
Python uses a variable-length internal representation of characters, that
stores each character as a sequence of 1 to 4 8-bit bytes, depending on
the magnitude of the largest codepoint in the string. For example, any
all-ASCII or all-Latin1 string takes up only 1 byte per character, an
all-BMP string takes up 2 bytes per character, etc."

See the similarity now? Both flexibly change the width used by
code-points, UTF-8 based on the code-point itself regardless of the rest
of the string, Python based on the largest code-point in the string.

[...]
> Anyway, my point was just that Emacs is not a counter-example to jmf's
> claim about implementing text editors, because UTF-8 is not what he (or
> anybody else) is referring to when speaking of the FSR or "something
> like the FSR".

Whether JMF can see the similarities between different implementations of
strings or not is beside the point, those similarities do exist. As do
the differences, of course, but in this case the differences are in
favour of Python's FSR. Even if your string is entirely Latin1, a UTF-8
implementation *cannot know that*, and still has to walk the string
byte-by-byte checking whether the current code point requires 1, 2, 3, or
4 bytes, while a FSR implementation can simply record the fact that the
string is pure Latin1 at creation time, and then treat it as fixed-width
from then on.

JMF claims that FSR is "impossible" to use efficiently, and yet he
supports encoding schemes which are *less* efficient. Go figure. He tells
us he has no problem with any of the established UTF encodings, and yet
the FSR internally uses UTF-16 and UTF-32. (Technically, it's UCS-2, not
UTF-16, since there are no surrogate pairs. But the difference is
insignificant.)

Having watched this issue from Day One when JMF first complained about
it, I believe this is entirely about denying any benefit to ASCII users.
Had Python implemented a system identical to the current FSR except that
it added a fourth category, "all ASCII", which used an eight-byte
encoding scheme (thus making ASCII strings twice as expensive as strings
including code points from the Supplementary Multilingual Planes), JMF
would be the scheme's number one champion.

I cannot see any other rational explanation for why JMF prefers broken,
buggy Unicode implementations, or implementations which are equally
expensive for all strings, over one which is demonstrably correct,
demonstrably saves memory, and for realistic, non-contrived benchmarks,
demonstrably faster, except that he wants to punish ASCII users more than
he wants to support Unicode users.
--
Steven
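
The per-string width described in this post can be observed empirically, at least on CPython 3.3 or later, with `sys.getsizeof`. This is a rough measurement sketch rather than a statement about any other Python implementation; the helper name `bytes_per_char` is made up, and subtracting two sizes is just a way to cancel out the fixed object header.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate storage per character for a string made only of `ch`.

    Assumes CPython 3.3+; other implementations may report other sizes.
    """
    short = ch * n
    long_ = ch * (2 * n)
    # The header cost cancels out, leaving n extra characters of payload.
    return (sys.getsizeof(long_) - sys.getsizeof(short)) // n


ascii_width = bytes_per_char("a")
latin1_width = bytes_per_char("\u00e9")
bmp_width = bytes_per_char("\u20ac")
astral_width = bytes_per_char("\U0001F600")
```

On such an interpreter the four results should come out as 1, 1, 2 and 4 bytes per character, matching the all-ASCII, all-Latin1, all-BMP and wider cases.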
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2013-07-26 22:12 -0600 |
| Message-ID | <mailman.5161.1374898818.3114.python-list@python.org> |
| In reply to | #51331 |
On Fri, Jul 26, 2013 at 9:37 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> See the similarity now? Both flexibly change the width used by
> code-points, UTF-8 based on the code-point itself regardless of the rest
> of the string, Python based on the largest code-point in the string.

No, I think we're just using the word "flexible" differently. In my
view, simply being variable-width does not make an encoding "flexible"
in the sense of the FSR. But I'm not going to keep repeating myself in
order to argue about it.

> Having watched this issue from Day One when JMF first complained about
> it, I believe this is entirely about denying any benefit to ASCII users.
> Had Python implemented a system identical to the current FSR except that
> it added a fourth category, "all ASCII", which used an eight-byte
> encoding scheme (thus making ASCII strings twice as expensive as strings
> including code points from the Supplementary Multilingual Planes), JMF
> would be the scheme's number one champion.

I agree. In fact I made a similar observation back in December:

http://mail.python.org/pipermail/python-list/2012-December/636942.html
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-07-27 05:04 +0000 |
| Message-ID | <51f354c3$0$29971$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #51334 |
On Fri, 26 Jul 2013 22:12:36 -0600, Ian Kelly wrote:
> On Fri, Jul 26, 2013 at 9:37 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> See the similarity now? Both flexibly change the width used by
>> code-points, UTF-8 based on the code-point itself regardless of the
>> rest of the string, Python based on the largest code-point in the
>> string.
>
> No, I think we're just using the word "flexible" differently. In my
> view, simply being variable-width does not make an encoding "flexible"
> in the sense of the FSR. But I'm not going to keep repeating myself in
> order to argue about it.

But I paid for the full half hour!

http://en.wikipedia.org/wiki/The_Argument_Sketch
--
Steven
| From | Dennis Lee Bieber <wlfraed@ix.netcom.com> |
|---|---|
| Date | 2013-07-27 12:13 -0400 |
| Message-ID | <mailman.5179.1374941632.3114.python-list@python.org> |
| In reply to | #51331 |
On 27 Jul 2013 03:37:20 GMT, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> declaimed the following:
>I disagree, and so does Jeremy Sanders who first pointed out the
>similarity between Emacs' UTF-8 and Python's FSR. I'll quote from the
>Emacs documentation again:
>
>"To conserve memory, Emacs does not hold fixed-length 22-bit numbers that
>are codepoints of text characters within buffers and strings. Rather,
>Emacs uses a variable-length internal representation of characters, that
>stores each character as a sequence of 1 to 5 8-bit bytes, depending on
>the magnitude of its codepoint. For example, any ASCII character takes
>up only 1 byte, a Latin-1 character takes up 2 bytes, etc."
>
>And the Python FSR:
>
>"To conserve memory, Python does not hold fixed-length 21-bit numbers that
>are codepoints of text characters within buffers and strings. Rather,
>Python uses a variable-length internal representation of characters, that
>stores each character as a sequence of 1 to 4 8-bit bytes, depending on
>the magnitude of the largest codepoint in the string. For example, any
>all-ASCII or all-Latin1 string takes up only 1 byte per character, an all-
>BMP string takes up 2 bytes per character, etc."
>
As I read those: Python states "any all-ASCII or all-Latin1 string
takes up only 1 byte per character", etc. I.e., the width used for the
entire STRING is the minimal size that can encode all characters in the
string. The EMACS statement doesn't specify a "string"; it implies, in "any
ASCII character takes up only 1 byte, a Latin-1 character takes up 2 bytes,
etc.", that a string can contain mixed length characters.
--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com HTTP://wlfraed.home.netcom.com/
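The whole-string versus per-character distinction described above can be
checked directly. This is a minimal sketch, assuming CPython 3.3 or later
(the PEP 393 representation); fsr_width and utf8_widths are illustrative
names, not anything used in the thread.

```python
import sys


def fsr_width(text):
    """Bytes CPython spends per character of `text` (assumes CPython >= 3.3)."""
    # Doubling the string and subtracting cancels the fixed object header,
    # leaving only the per-character storage for len(text) characters.
    return (sys.getsizeof(text * 2) - sys.getsizeof(text)) // len(text)


def utf8_widths(text):
    """Bytes UTF-8 spends on each individual character of `text`."""
    return [len(ch.encode("utf-8")) for ch in text]


samples = [
    "abc",            # all ASCII
    "ab\u00fc",       # u-umlaut, still within Latin-1
    "ab\u20ac",       # euro sign, in the BMP
    "ab\U0001F600",   # a code point outside the BMP
]
for s in samples:
    # One width for the whole string versus one width per character.
    print(ascii(s), fsr_width(s), utf8_widths(s))
```

A single wide character raises the width for every character in the
string under the FSR, while under UTF-8 only that character grows.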
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-07-26 06:19 -0700 |
| Message-ID | <606b75ca-e1eb-4a69-a23d-6f0372004114@googlegroups.com> |
| In reply to | #51260 |
On Thursday, 25 July 2013 22:45:38 UTC+2, Ian wrote:
> On Thu, Jul 25, 2013 at 12:18 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
> > On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote:
> >
> >> On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano
> >> <steve+comp.lang.python@pearwood.info> wrote:
> >>> On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote:
> >>>> "To conserve memory, Emacs does not hold fixed-length 22-bit numbers
> >>>> that are codepoints of text characters within buffers and strings.
> >>>> Rather, Emacs uses a variable-length internal representation of
> >>>> characters, that stores each character as a sequence of 1 to 5 8-bit
> >>>> bytes, depending on the magnitude of its codepoint[1]. For example,
> >>>> any ASCII character takes up only 1 byte, a Latin-1 character takes up
> >>>> 2 bytes, etc. We call this representation of text multibyte.
> >>>
> >>> Well, you've just proven what Vim users have always suspected: Emacs
> >>> doesn't really exist.
> >>
> >> ... lolwut?
> >
> >
> > JMF has explained that it is impossible, impossible I say!, to write an
> > editor using a flexible string representation. Since Emacs uses such a
> > flexible string representation, Emacs is impossible, and therefore Emacs
> > doesn't exist.
> >
> > QED.
>
> Except that the described representation used by Emacs is a variant of
> UTF-8, not an FSR. It doesn't have three different possible encodings
> for the letter 'a' depending on what other characters happen to be in
> the string.
>
> As I understand it, jfm would be perfectly happy if Python used UTF-8
> (or presumably the Emacs variant) as its internal string
> representation.
------
And emacs is probably working smoothly.
Your comment summarized all this stuff very correctly and very briefly.
utf8/16/32? I do not care. They are all working correctly, smoothly
and efficiently. In fact, these utf's are already doing correctly
what this FSR is doing in a wrong way.
My preference? utf32. Why? It is the simplest and consequently
the best performing choice. I'm not a narrow-minded ascii user.
(I do not pretend to belong to those who are solving the quadrature
of the circle; I pretend to belong to those who know the quadrature
of the circle is not solvable.)
Note: text processing tools or tools that have to process
characters, and the tools to build these tools, are all moving
to utf32, if not already done. There are technical reasons behind
this, which go beyond pure raw unicode. They are, however, still
100% Unicode compliant.
jmf
| From | Michael Torrie <torriem@gmail.com> |
|---|---|
| Date | 2013-07-25 21:09 -0600 |
| Message-ID | <mailman.5127.1374808181.3114.python-list@python.org> |
| In reply to | #51247 |
On 07/25/2013 11:18 AM, Steven D'Aprano wrote:
> JMF has explained that it is impossible, impossible I say!, to write an
> editor using a flexible string representation. Since Emacs uses such a
> flexible string representation, Emacs is impossible, and therefore Emacs
> doesn't exist.
Now I'm even more confused. He once pointed to Go as an example of how
unicode should be done in a language. Yet Go uses UTF-8, I think.
But I don't think UTF-8 is what JMF refers to as "flexible string
representation." FSR does use 1, 2 or 4 bytes per character, but each
character in the string uses the same width. That's different from
UTF-8 or UTF-16, which are variable width per character.
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-07-26 06:21 -0700 |
| Message-ID | <8203e802-9dc5-44c5-9547-6e1947ee224b@googlegroups.com> |
| In reply to | #51274 |
On Friday, 26 July 2013 05:09:34 UTC+2, Michael Torrie wrote:
> On 07/25/2013 11:18 AM, Steven D'Aprano wrote:
> > JMF has explained that it is impossible, impossible I say!, to write an
> > editor using a flexible string representation. Since Emacs uses such a
> > flexible string representation, Emacs is impossible, and therefore Emacs
> > doesn't exist.
>
> Now I'm even more confused. He once pointed to Go as an example of how
> unicode should be done in a language. yet Go uses UTF-8 I think.
>
> But I don't think UTF-8 is what JMF refers to as "flexible string
> representation." FSR does use 1,2 or 4 bytes per character, but each
> character in the string uses the same width. That's different from
> UTF-8 or UTF-16, which is variable width per character.
-----
>>> sys.getsizeof('––') - sys.getsizeof('–')
I have already explained / commented this.
--------
Hint: To understand Unicode (and every coding scheme), you should
understand "utf". The how and the *why*.
jmf
| From | Michael Torrie <torriem@gmail.com> |
|---|---|
| Date | 2013-07-26 20:05 -0600 |
| Message-ID | <mailman.5160.1374890711.3114.python-list@python.org> |
| In reply to | #51300 |
On 07/26/2013 07:21 AM, wxjmfauth@gmail.com wrote:
>>>> sys.getsizeof('––') - sys.getsizeof('–')
>
> I have already explained / commented this.
Maybe it got lost in translation, but I don't understand your point with
that.
> Hint: To understand Unicode (and every coding scheme), you should
> understand "utf". The how and the *why*.
Hmm, so if python used utf-8 internally to represent unicode strings
would not that punish *all* users (not just non-ascii users) since
searching a string for a certain character position requires an O(n)
operation? UTF-32 I could see (and indeed that's essentially what FSR
uses when necessary does it not?), but not utf-8 or utf-16.
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-07-27 11:21 -0700 |
| Message-ID | <f4bb2528-930e-4c0a-820e-66f00ac2b5b6@googlegroups.com> |
| In reply to | #51328 |
On Saturday, 27 July 2013 04:05:03 UTC+2, Michael Torrie wrote:
> On 07/26/2013 07:21 AM, wxjmfauth@gmail.com wrote:
> >>>> sys.getsizeof('––') - sys.getsizeof('–')
> >
> > I have already explained / commented this.
>
> Maybe it got lost in translation, but I don't understand your point with
> that.
>
> > Hint: To understand Unicode (and every coding scheme), you should
> > understand "utf". The how and the *why*.
>
> Hmm, so if python used utf-8 internally to represent unicode strings
> would not that punish *all* users (not just non-ascii users) since
> searching a string for a certain character position requires an O(n)
> operation? UTF-32 I could see (and indeed that's essentially what FSR
> uses when necessary does it not?), but not utf-8 or utf-16.
------
Did you read my previous link? Unicode Character Encoding Model.
Did you understand it?
Unicode only - No FSR (I skip some points while still attempting
to be correct.)
Unicode is a four-step process.
[ {unique set of characters} --> {unique set of code points, the
"labels"} --> {unique set of encoded code points} ] --> implementation
(bytes)
First point to notice. "pure unicode", [...], is different from
the "implementation". *This is a deliberate choice*.
The critical step is the path {unique set of characters} --->
{unique set of encoded code points} in such a way that
the implementation can "work comfortably" with this *unique* set
of encoded code points. Conceptually, the implementation works
with a unique set of "already prepared encoded code points".
This is a very critical step. To explain it in a dirty way:
in the above chain, this problem is "already" eliminated and
solved. It is like byte/char coding schemes, where this step is
a no-op.
Now, and if you wish, this is a separate/different problem.
To create this unique set of encoded code points, "Unicode"
uses these "utf(s)". I repeat again, a confusing name, for the
process and the result of the process. (I neglect ucs).
What are these? Chunks of bits, groups of 8/16/32 bits, words.
It is up to the implementation to convert these sequences
of bits into bytes, ***if you wish to convert these into bytes!***.
Surprise! Why not put two of the 32-bit words in a 64-bit
"machine"? (see golang / rune / int32).
Back to utf. utfs are not only elements of a unique set of encoded
code points. They have an interesting feature. Each "utf chunk"
intrinsically holds the character (in fact the code point) it is
supposed to represent. In utf-32, the obvious case, it is just
the code point. In utf-8, it is the first chunk that helps, and
utf-16 is a mixed case (utf-8 / utf-32). In other words, in an
implementation using bytes, for any pointer position it is always
possible to find the corresponding encoded code point and from this
the corresponding character without any "programmed" information. See
my editor example: how to find the char under the caret? In fact,
a silly example: how can the caret be positioned or moved if
the underlying corresponding encoded code point cannot be
discerned!
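The self-locating property described in this paragraph can be sketched
for utf-8 as follows. This is an illustration only, assuming valid UTF-8
input; code_point_at_byte is a made-up name. From any byte offset, the
start of the enclosing code point is found by stepping back over
continuation bytes (those of the form 10xxxxxx).

```python
def code_point_at_byte(data, offset):
    """Return the character whose UTF-8 encoding covers `data[offset]`.

    Assumes `data` is valid UTF-8 and `offset` is within range.
    """
    start = offset
    # Continuation bytes look like 10xxxxxx; back up until a lead byte.
    while start > 0 and (data[start] & 0xC0) == 0x80:
        start -= 1
    lead = data[start]
    # The lead byte alone says how many bytes the sequence occupies.
    if lead < 0x80:
        length = 1
    elif lead < 0xE0:
        length = 2
    elif lead < 0xF0:
        length = 3
    else:
        length = 4
    return data[start:start + length].decode("utf-8")


data = "a\u20acz".encode("utf-8")  # 'a', euro sign (3 bytes), 'z'
print([code_point_at_byte(data, i) for i in range(len(data))])
```

Note that this answers "which character is at this byte?", which is a
different question from "where is character number n?".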
Next step, and another separate problem.
Why all these utf versions? It is always the
same story. Some prefer the universality (utf-32) and
some prefer, well, some kind of conservatism. utf-8 is
more complicated, it demands more work and logically,
in an expected way, some performance regression.
utf-8 is more suited to producing bytes, utf16/32 to
internal processing. utf-8 had no choice but to lose the
indexing. And so on.
Fact: all these coding schemes are working with a unique
set of encoded code points (surprise again, it's like a byte
string!). The loss of performance of utf-8 is very minimal
compared to the loss of performance one can get with
a multiple coding scheme. This kind of work has been done,
and if my information is correct, even by the creators
of utf-8. (There are sometimes good scientists).
There are plenty of advantages in using utf instead of
something else and advantages in other fields than just
the pure coding.
utf-16/32 schemes have the advantage of ditching ascii
for ever. The ascii concept no longer exists.
One should also understand that all this stuff has
not been created from scratch. It was a balance between
existing technologies. MS stuck with the idea, no more
ascii, let's use ucs-2, and the *x world slowed Unicode
adoption as much as possible. utf-8 is one of the compromises
for the adoption of Unicode. Retrospectively, a not so good
compromise.
Computer scientists are funny scientists. They do love
to solve the problems they created themselves.
-----
Quickly. sys.getsizeof() in the light of what I explained.
1) As this FSR works with multiple encodings, it has to keep
track of the encoding. It puts it in the overhead of the str
class (overhead = real overhead + encoding). In such
an absurd way that a
>>> sys.getsizeof('€')
40
needs 14 bytes more than a
>>> sys.getsizeof('z')
26
You may vary the length of the str. The problem is
still here. Not bad for a coding scheme.
2) Take a look at this. Get rid of the overhead.
>>> sys.getsizeof('b'*1000000 + 'c')
1000026
>>> sys.getsizeof('b'*1000000 + '€')
2000040
What does it mean? It means that Python has to
reencode a str every time it is necessary because
it works with multiple codings.
This FSR is not even a copy of the utf-8.
>>> len(('b'*1000000 + '€').encode('utf-8'))
1000003
utf-8 or any (utf) never needs to reencode and never spends
time reencoding.
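For comparison, the character payload of that same string under each
scheme can be tallied as below. This is a rough sketch: it counts only
the character data, not the fixed object header or terminator that
sys.getsizeof() includes, and the function names are invented for the
illustration. The FSR width rule used here is the 1/2/4-byte rule
described earlier in the thread.

```python
def fsr_char_width(text):
    """Width in bytes the FSR would pick: 1, 2 or 4, from the largest code point."""
    top = max(map(ord, text), default=0)
    if top < 0x100:
        return 1
    if top < 0x10000:
        return 2
    return 4


def payload_bytes(text):
    """Bytes of character data under each scheme (headers and BOMs excluded)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
        "fsr": len(text) * fsr_char_width(text),
    }


print(payload_bytes("b" * 1000000 + "c"))
print(payload_bytes("b" * 1000000 + "\u20ac"))  # same string ending in a euro sign
```

For the euro example the utf-8 figure matches the 1000003 shown above,
the FSR payload equals the utf-16 one, and utf-32 is twice that again.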
3) Unicode compliance. We know retrospectively that latin-1
was a bad choice. Unusable for 17 European languages.
Believe it or not, 20 years of Unicode incubation is not
long enough to learn it. When discussing once with a French
Python core dev, one with commit access, he did not know one
cannot use latin-1 for the French language! BTW, I proposed
to the French devs to test the FSR with the set of characters
recognized by the "Imprimerie Nationale", some kind of
legal French authority regarding characters and typography.
Never heard about it. Of course, I did it.
In short
FSR = bad performance + bad memory management + non unicode
compliance.
Good point. FSR, nice tool for those who wish to teach
Unicode. It is not every day, one has such an opportunity.
---------
I'm practically no longer programming or writing applications.
I have still been active, observing for a decade and more all this
unicode world, languages (go, c#, Python, Ruby), text
processing systems (esp. Unicode TeX engines) and font technology.
Very, very interesting.
jmf
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2013-07-27 21:53 -0600 |
| Message-ID | <mailman.5188.1374983652.3114.python-list@python.org> |
| In reply to | #51340 |
On Sat, Jul 27, 2013 at 12:21 PM, <wxjmfauth@gmail.com> wrote:
> Back to utf. utfs are not only elements of a unique set of encoded
> code points. They have an interesting feature. Each "utf chunk"
> holds intrisically the character (in fact the code point) it is
> supposed to represent. In utf-32, the obvious case, it is just
> the code point. In utf-8, that's the first chunk which helps and
> utf-16 is a mixed case (utf-8 / utf-32). In other words, in an
> implementation using bytes, for any pointer position it is always
> possible to find the corresponding encoded code point and from this
> the corresponding character without any "programmed" information. See
> my editor example, how to find the char under the caret? In fact,
> a silly example, how can the caret can be positioned or moved, if
> the underlying corresponding encoded code point can not be
> dicerned!
Yes, given a pointer location into a utf-8 or utf-16 string, it is
easy to determine the identity of the code point at that location.
But this is not often a useful operation, save for resynchronization
in the case that the string data is corrupted. The caret of an editor
does not conceptually correspond to a pointer location, but to a
character index. Given a particular character index (e.g. 127504), an
editor must be able to determine the identity and/or the memory
location of the character at that index, and for UTF-8 and UTF-16
without an auxiliary data structure that is an O(n) operation.
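One possible shape for such an auxiliary data structure is a table of
byte offsets recorded every so many characters. The sketch below is
purely illustrative: Utf8Index and its stride parameter are invented
here, it assumes valid UTF-8, and it ignores what happens when the text
is edited.

```python
class Utf8Index:
    """Character indexing over UTF-8 bytes using periodic checkpoints."""

    def __init__(self, data, stride=64):
        self.data = data
        self.stride = stride
        self.checkpoints = []  # byte offset of every stride-th character
        count = 0
        for pos, byte in enumerate(data):
            if (byte & 0xC0) != 0x80:  # not a continuation byte: a new character
                if count % stride == 0:
                    self.checkpoints.append(pos)
                count += 1
        self.length = count

    def __getitem__(self, n):
        if not 0 <= n < self.length:
            raise IndexError(n)
        # Jump to the nearest earlier checkpoint, then scan at most stride - 1 characters.
        pos = self.checkpoints[n // self.stride]
        for _ in range(n % self.stride):
            pos += 1
            while (self.data[pos] & 0xC0) == 0x80:
                pos += 1
        end = pos + 1
        while end < len(self.data) and (self.data[end] & 0xC0) == 0x80:
            end += 1
        return self.data[pos:end].decode("utf-8")


text = "a\u00fc\u20ac\U0001F600" * 50
index = Utf8Index(text.encode("utf-8"), stride=16)
print(index[127] == text[127], len(index.checkpoints))
```

The table trades memory for lookup time, and building it is itself a
full pass over the data, so the linear cost is moved rather than removed.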
> 2) Take a look at this. Get rid of the overhead.
>
>>>> sys.getsizeof('b'*1000000 + 'c')
> 1000026
>>>> sys.getsizeof('b'*1000000 + '€')
> 2000040
>
> What does it mean? It means that Python has to
> reencode a str every time it is necessary because
> it works with multiple codings.
Large strings in practical usage do not need to be resized like this
often. Python 3.3 has been in production use for months now, and you
still have yet to produce any real-world application code that
demonstrates a performance regression. If there is no real-world
regression, then there is no problem.
> 3) Unicode compliance. We know retrospectively, latin-1,
> is was a bad choice. Unusable for 17 European languages.
> Believe of not. 20 years of Unicode of incubation is not
> long enough to learn it. When discussing once with a French
> Python core dev, one with commit access, he did not know one
> can not use latin-1 for the French language!
Probably because for many French strings, one can. As far as I am
aware, the only characters that are missing from Latin-1 are the Euro
sign (an unfortunate victim of history), the ligature œ (I have no
doubt that many users just type oe anyway), and the rare capital Ÿ
(the minuscule version is present in Latin-1). All French strings
that are fortunate enough to lack these characters can be
represented in Latin-1 and so will have a 1-byte width in the FSR.
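The characters named above are easy to check with the standard latin-1
codec. A small sketch follows; fits_latin1 is just an illustrative
helper name.

```python
def fits_latin1(text):
    """True if every character of `text` can be encoded as Latin-1."""
    try:
        text.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True


candidates = {
    "euro sign": "\u20ac",
    "ligature oe": "\u0153",
    "capital Y with diaeresis": "\u0178",
    "small y with diaeresis": "\u00ff",
    "e with acute": "\u00e9",
}
for name, ch in candidates.items():
    print(name, fits_latin1(ch))
```

A French string containing none of the first three characters stays in
the Latin-1 range, while a single œ or € takes it out of that range.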
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-07-28 11:13 -0700 |
| Message-ID | <4117e08f-941a-42d5-87b6-09e66f8c7b60@googlegroups.com> |
| In reply to | #51380 |
On Sunday, 28 July 2013 05:53:22 UTC+2, Ian wrote:
> On Sat, Jul 27, 2013 at 12:21 PM, <wxjmfauth@gmail.com> wrote:
> > Back to utf. utfs are not only elements of a unique set of encoded
> > code points. They have an interesting feature. Each "utf chunk"
> > holds intrisically the character (in fact the code point) it is
> > supposed to represent. In utf-32, the obvious case, it is just
> > the code point. In utf-8, that's the first chunk which helps and
> > utf-16 is a mixed case (utf-8 / utf-32). In other words, in an
> > implementation using bytes, for any pointer position it is always
> > possible to find the corresponding encoded code point and from this
> > the corresponding character without any "programmed" information. See
> > my editor example, how to find the char under the caret? In fact,
> > a silly example, how can the caret can be positioned or moved, if
> > the underlying corresponding encoded code point can not be
> > dicerned!
>
> Yes, given a pointer location into a utf-8 or utf-16 string, it is
> easy to determine the identity of the code point at that location.
> But this is not often a useful operation, save for resynchronization
> in the case that the string data is corrupted. The caret of an editor
> does not conceptually correspond to a pointer location, but to a
> character index. Given a particular character index (e.g. 127504), an
> editor must be able to determine the identity and/or the memory
> location of the character at that index, and for UTF-8 and UTF-16
> without an auxiliary data structure that is a O(n) operation.
>
> > 2) Take a look at this. Get rid of the overhead.
> >
> >>>> sys.getsizeof('b'*1000000 + 'c')
> > 1000026
> >>>> sys.getsizeof('b'*1000000 + '€')
> > 2000040
> >
> > What does it mean? It means that Python has to
> > reencode a str every time it is necessary because
> > it works with multiple codings.
>
> Large strings in practical usage do not need to be resized like this
> often. Python 3.3 has been in production use for months now, and you
> still have yet to produce any real-world application code that
> demonstrates a performance regression. If there is no real-world
> regression, then there is no problem.
>
> > 3) Unicode compliance. We know retrospectively, latin-1,
> > is was a bad choice. Unusable for 17 European languages.
> > Believe of not. 20 years of Unicode of incubation is not
> > long enough to learn it. When discussing once with a French
> > Python core dev, one with commit access, he did not know one
> > can not use latin-1 for the French language!
>
> Probably because for many French strings, one can. As far as I am
> aware, the only characters that are missing from Latin-1 are the Euro
> sign (an unfortunate victim of history), the ligature œ (I have no
> doubt that many users just type oe anyway), and the rare capital Ÿ
> (the miniscule version is present in Latin-1). All French strings
> that are fortunate enough to be absent these characters can be
> represented in Latin-1 and so will have a 1-byte width in the FSR.
------
latin-1? that's not even true.
>>> sys.getsizeof('a')
26
>>> sys.getsizeof('ü')
38
>>> sys.getsizeof('aa')
27
>>> sys.getsizeof('aü')
39
jmf
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2013-07-28 20:04 +0100 |
| Message-ID | <mailman.5200.1375038295.3114.python-list@python.org> |
| In reply to | #51392 |
On 28/07/2013 19:13, wxjmfauth@gmail.com wrote:
> On Sunday, 28 July 2013 05:53:22 UTC+2, Ian wrote:
>> On Sat, Jul 27, 2013 at 12:21 PM, <wxjmfauth@gmail.com> wrote:
>> > Back to utf. utfs are not only elements of a unique set of encoded
>> > code points. They have an interesting feature. Each "utf chunk"
>> > holds intrisically the character (in fact the code point) it is
>> > supposed to represent. In utf-32, the obvious case, it is just
>> > the code point. In utf-8, that's the first chunk which helps and
>> > utf-16 is a mixed case (utf-8 / utf-32). In other words, in an
>> > implementation using bytes, for any pointer position it is always
>> > possible to find the corresponding encoded code point and from this
>> > the corresponding character without any "programmed" information. See
>> > my editor example, how to find the char under the caret? In fact,
>> > a silly example, how can the caret can be positioned or moved, if
>> > the underlying corresponding encoded code point can not be
>> > dicerned!
>>
>> Yes, given a pointer location into a utf-8 or utf-16 string, it is
>> easy to determine the identity of the code point at that location.
>> But this is not often a useful operation, save for resynchronization
>> in the case that the string data is corrupted. The caret of an editor
>> does not conceptually correspond to a pointer location, but to a
>> character index. Given a particular character index (e.g. 127504), an
>> editor must be able to determine the identity and/or the memory
>> location of the character at that index, and for UTF-8 and UTF-16
>> without an auxiliary data structure that is a O(n) operation.
>>
>> > 2) Take a look at this. Get rid of the overhead.
>> >
>> >>>> sys.getsizeof('b'*1000000 + 'c')
>> > 1000026
>> >>>> sys.getsizeof('b'*1000000 + '€')
>> > 2000040
>> >
>> > What does it mean? It means that Python has to
>> > reencode a str every time it is necessary because
>> > it works with multiple codings.
>>
>> Large strings in practical usage do not need to be resized like this
>> often. Python 3.3 has been in production use for months now, and you
>> still have yet to produce any real-world application code that
>> demonstrates a performance regression. If there is no real-world
>> regression, then there is no problem.
>>
>> > 3) Unicode compliance. We know retrospectively, latin-1,
>> > is was a bad choice. Unusable for 17 European languages.
>> > Believe of not. 20 years of Unicode of incubation is not
>> > long enough to learn it. When discussing once with a French
>> > Python core dev, one with commit access, he did not know one
>> > can not use latin-1 for the French language!
>>
>> Probably because for many French strings, one can. As far as I am
>> aware, the only characters that are missing from Latin-1 are the Euro
>> sign (an unfortunate victim of history), the ligature œ (I have no
>> doubt that many users just type oe anyway), and the rare capital Ÿ
>> (the miniscule version is present in Latin-1). All French strings
>> that are fortunate enough to be absent these characters can be
>> represented in Latin-1 and so will have a 1-byte width in the FSR.
>
> ------
>
> latin-1? that's not even truth.
>
>>>> sys.getsizeof('a')
> 26
>>>> sys.getsizeof('ü')
> 38
>>>> sys.getsizeof('aa')
> 27
>>>> sys.getsizeof('aü')
> 39
>
>>> sys.getsizeof('aa') - sys.getsizeof('a')
1
One byte per codepoint.
>>> sys.getsizeof('üü') - sys.getsizeof('ü')
1
Also one byte per codepoint.
>>> sys.getsizeof('ü') - sys.getsizeof('a')
12
Clearly there's more going on here.
FSR is an optimisation. You'll always be able to find some
circumstances where an optimisation makes things worse, but what
matters is the overall result.
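The two effects in the transcript above, a fixed per-object part and a
per-character part, can be separated with a small helper. This is a
sketch assuming CPython 3.3 or later; size_breakdown is a hypothetical
name, and the fixed figures vary by version and platform (the 12-byte
gap above is from the build used in that session).

```python
import sys


def size_breakdown(ch, n=1000):
    """Split sys.getsizeof() for the string ch * n into (fixed, per_char) bytes."""
    # Growth from one extra character is the per-character cost.
    per_char = sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)
    # Whatever remains is fixed overhead: object header plus terminator.
    fixed = sys.getsizeof(ch * n) - n * per_char
    return fixed, per_char


ascii_fixed, ascii_per_char = size_breakdown("a")
latin_fixed, latin_per_char = size_breakdown("\u00fc")  # u-umlaut
print(ascii_per_char, latin_per_char)  # per-character cost of each
print(latin_fixed - ascii_fixed)       # extra fixed bytes for the non-ASCII string
```

The per-character cost is the same for both strings; only the fixed
part differs, so the gap does not grow with the length of the string.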
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-07-28 12:30 -0700 |
| Message-ID | <95b91473-b707-4288-860c-d02fda7af1ea@googlegroups.com> |
| In reply to | #51397 |
On Sunday, 28 July 2013 21:04:56 UTC+2, MRAB wrote:
> On 28/07/2013 19:13, wxjmfauth@gmail.com wrote:
> > On Sunday, 28 July 2013 05:53:22 UTC+2, Ian wrote:
> >> On Sat, Jul 27, 2013 at 12:21 PM, <wxjmfauth@gmail.com> wrote:
> >> > Back to utf. utfs are not only elements of a unique set of encoded
> >> > code points. They have an interesting feature. Each "utf chunk"
> >> > holds intrisically the character (in fact the code point) it is
> >> > supposed to represent. In utf-32, the obvious case, it is just
> >> > the code point. In utf-8, that's the first chunk which helps and
> >> > utf-16 is a mixed case (utf-8 / utf-32). In other words, in an
> >> > implementation using bytes, for any pointer position it is always
> >> > possible to find the corresponding encoded code point and from this
> >> > the corresponding character without any "programmed" information. See
> >> > my editor example, how to find the char under the caret? In fact,
> >> > a silly example, how can the caret can be positioned or moved, if
> >> > the underlying corresponding encoded code point can not be
> >> > dicerned!
> >>
> >> Yes, given a pointer location into a utf-8 or utf-16 string, it is
> >> easy to determine the identity of the code point at that location.
> >> But this is not often a useful operation, save for resynchronization
> >> in the case that the string data is corrupted. The caret of an editor
> >> does not conceptually correspond to a pointer location, but to a
> >> character index. Given a particular character index (e.g. 127504), an
> >> editor must be able to determine the identity and/or the memory
> >> location of the character at that index, and for UTF-8 and UTF-16
> >> without an auxiliary data structure that is a O(n) operation.
> >>
> >> > 2) Take a look at this. Get rid of the overhead.
> >> >
> >> >>>> sys.getsizeof('b'*1000000 + 'c')
> >> > 1000026
> >> >>>> sys.getsizeof('b'*1000000 + '€')
> >> > 2000040
> >> >
> >> > What does it mean? It means that Python has to
> >> > reencode a str every time it is necessary because
> >> > it works with multiple codings.
> >>
> >> Large strings in practical usage do not need to be resized like this
> >> often. Python 3.3 has been in production use for months now, and you
> >> still have yet to produce any real-world application code that
> >> demonstrates a performance regression. If there is no real-world
> >> regression, then there is no problem.
> >>
> >> > 3) Unicode compliance. We know retrospectively, latin-1,
> >> > is was a bad choice. Unusable for 17 European languages.
> >> > Believe of not. 20 years of Unicode of incubation is not
> >> > long enough to learn it. When discussing once with a French
> >> > Python core dev, one with commit access, he did not know one
> >> > can not use latin-1 for the French language!
> >>
> >> Probably because for many French strings, one can. As far as I am
> >> aware, the only characters that are missing from Latin-1 are the Euro
> >> sign (an unfortunate victim of history), the ligature œ (I have no
> >> doubt that many users just type oe anyway), and the rare capital Ÿ
> >> (the miniscule version is present in Latin-1). All French strings
> >> that are fortunate enough to be absent these characters can be
> >> represented in Latin-1 and so will have a 1-byte width in the FSR.
> >
> > ------
> >
> > latin-1? that's not even truth.
> >
> >>>> sys.getsizeof('a')
> > 26
> >>>> sys.getsizeof('ü')
> > 38
> >>>> sys.getsizeof('aa')
> > 27
> >>>> sys.getsizeof('aü')
> > 39
> >
> >>> sys.getsizeof('aa') - sys.getsizeof('a')
> 1
> One byte per codepoint.
> >>> sys.getsizeof('üü') - sys.getsizeof('ü')
> 1
> Also one byte per codepoint.
> >>> sys.getsizeof('ü') - sys.getsizeof('a')
> 12
> Clearly there's more going on here.
>
> FSR is an optimisation. You'll always be able to find some
> circumstances where an optimisation makes things worse, but what
> matters is the overall result.
----
Yes, I know my examples are always wrong, never
real examples.
I point to long strings; I should point to short strings.
I point to a short string (char); it is not long enough.
Strings as dict keys? No, the problem is in Python's dict.
Performance? No, that's a memory issue.
Memory? No, it's a question of keeping performance.
I am using this char? No, you should not, it's not common.
The nabla operator in a TeX file? Who is so stupid as to use
that char?
Many times, I'm just mimicking 'BDFL' examples, just
by replacing "his" ascii chars with non-ascii chars ;-)
And so on.
To be short, this is *never* the FSR, always something
else.
Suggestion. Start by solving all these "micro-benchmarks",
all the memory cases. It's a good start, no?
jmf
| From | Lele Gaifax <lele@metapensiero.it> |
|---|---|
| Date | 2013-07-28 22:45 +0200 |
| Message-ID | <mailman.5205.1375044339.3114.python-list@python.org> |
| In reply to | #51402 |
wxjmfauth@gmail.com writes:
> Suggestion. Start by solving all these "micro-benchmarks".
> all the memory cases. It a good start, no?
Since you seem the only one who has this dramatic problem with such
micro-benchmarks, that BTW have nothing to do with "unicode compliance",
I'd suggest *you* should find a better implementation and propose it to
the core devs.
An even better suggestion, with due respect, is to get a life and find
something more interesting to do, or at least better arguments :-)
ciao, lele.
--
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
lele@metapensiero.it  |                 -- Fortunato Depero, 1929.
| From | Antoon Pardon <antoon.pardon@rece.vub.ac.be> |
|---|---|
| Date | 2013-07-28 22:01 +0200 |
| Message-ID | <mailman.5216.1375082822.3114.python-list@python.org> |
| In reply to | #51402 |
On 28-07-13 21:30, wxjmfauth@gmail.com wrote:
> To be short, this is *never* the FSR, always something
> else.
>
> Suggestion. Start by solving all these "micro-benchmarks".
> all the memory cases. It a good start, no?
>
There is nothing to solve. Unicode doesn't force implementations
to use the same size of memory for strings of the same length. So
you pointing out examples of same length strings that don't use
the same size of memory doesn't point at something that must be
solved.
--
Antoon Pardon