| Started by | wxjmfauth@gmail.com |
|---|---|
| First post | 2012-08-23 05:47 -0700 |
| Last post | 2012-08-25 07:23 -0400 |
| Articles | 20 on this page of 95 — 21 participants |
Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-23 05:47 -0700
Re: Flexible string representation, unicode, typography, ... Neil Hodgson <nhodgson@iinet.net.au> - 2012-08-23 23:57 +1000
Re: Flexible string representation, unicode, typography, ... MRAB <python@mrabarnett.plus.com> - 2012-08-23 16:11 +0100
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-23 09:19 -0600
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-23 11:33 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-23 13:22 -0600
Re: Flexible string representation, unicode, typography, ... rusi <rustompmody@gmail.com> - 2012-08-24 09:06 -0700
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-24 17:47 +0100
Re: Flexible string representation, unicode, typography, ... Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-08-24 14:34 -0400
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-23 20:34 +0100
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-23 15:18 +0100
Re: Flexible string representation, unicode, typography, ... Ramchandra Apte <maniandram01@gmail.com> - 2012-08-24 07:38 -0700
Re: Flexible string representation, unicode, typography, ... Antoine Pitrou <solipsis@pitrou.net> - 2012-08-25 00:24 +0000
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 00:27 -0700
Re: Flexible string representation, unicode, typography, ... Ben Finney <ben+python@benfinney.id.au> - 2012-08-25 17:54 +1000
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 00:27 -0700
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-25 09:58 +0100
Re: Flexible string representation, unicode, typography, ... Frank Millman <frank@chagford.com> - 2012-08-25 11:46 +0200
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 08:47 -0700
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 08:47 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-25 16:26 -0600
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 23:59 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 09:50 -0600
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 23:59 -0700
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 11:49 +0000
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 09:40 -0600
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 20:13 +0000
Re: Flexible string representation, unicode, typography, ... Dan Sommers <dan@tombstonezero.net> - 2012-08-26 13:45 -0700
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 12:16 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-27 14:14 -0600
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 13:37 -0700
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:38 -0700
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:38 -0700
Re: Flexible string representation, unicode, typography, ... Neil Hodgson <nhodgson@iinet.net.au> - 2012-08-28 09:54 +1000
Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-29 13:59 +1000
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-28 22:15 -0600
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-29 08:05 +0000
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:40 -0700
Re: Flexible string representation, unicode, typography, ... Dave Angel <d@davea.name> - 2012-08-29 08:01 -0400
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 08:43 -0700
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-30 06:55 +0000
Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-30 18:59 +1000
Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-08-30 07:02 -0400
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-30 16:00 +0000
Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-08-30 16:44 -0400
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-31 12:32 +0000
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-31 09:13 -0600
Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-08-31 08:43 -0400
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-31 14:54 +0000
Re: Flexible string representation, unicode, typography, ... Antoine Pitrou <solipsis@pitrou.net> - 2012-08-30 15:01 +0000
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 00:36 -0700
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 09:58 +0100
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-02 03:06 -0600
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 11:58 -0700
Re: Flexible string representation, unicode, typography, ... Michael Torrie <torriem@gmail.com> - 2012-09-02 13:45 -0600
Re: Flexible string representation, unicode, typography, ... Dave Angel <d@davea.name> - 2012-09-02 16:07 -0400
Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-02 16:38 -0400
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-03 01:42 +0000
Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:26 +0300
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-04 00:53 +0000
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 11:58 -0700
Re: Flexible string representation, unicode, typography, ... Peter Otten <__peter__@web.de> - 2012-09-02 11:52 +0200
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 11:36 +0100
Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-02 15:00 +0300
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 22:39 -0700
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-03 07:11 +0100
Re: Flexible string representation, unicode, typography, ... Peter Otten <__peter__@web.de> - 2012-09-03 08:15 +0200
Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-03 04:38 -0400
Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:56 +0300
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 22:39 -0700
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 13:23 +0100
Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-09-02 08:35 -0400
Re: Flexible string representation, unicode, typography, ... Ramchandra Apte <maniandram01@gmail.com> - 2012-09-02 06:48 -0700
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 15:46 +0100
Re: Flexible string representation, unicode, typography, ... Ramchandra Apte <maniandram01@gmail.com> - 2012-09-02 06:48 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-03 12:33 -0600
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 00:36 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-30 10:27 -0600
Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-02 23:38 +0300
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-03 01:54 +0000
Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-02 22:33 -0400
Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-09-03 11:24 -0400
Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:41 +0300
Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 00:45 +0300
Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-30 01:54 +1000
Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-29 22:34 +1000
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:40 -0700
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 12:16 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 15:42 -0600
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 23:31 +0000
Re: Flexible string representation, unicode, typography, ... Paul Rubin <no.email@nospam.invalid> - 2012-08-26 17:47 -0700
Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-25 21:04 +1000
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-25 12:05 +0100
Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-25 21:19 +1000
Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-08-25 07:23 -0400
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-08-25 16:26 -0600 |
| Message-ID | <mailman.3816.1345933655.4697.python-list@python.org> |
| In reply to | #27878 |
On Sat, Aug 25, 2012 at 9:47 AM, <wxjmfauth@gmail.com> wrote:
> For those who do not know, the Go language has introduced
> the rune type. As far as I know, nobody is complaining, I
> have not even seen a discussion related to this subject.
Python has that also. We call it "int".
More seriously, strings in Go are not sequences of runes. They're
actually arrays of UTF-8 bytes. That means that they're quite
efficient for ASCII strings, at the expense of other characters, like
Chinese (wait, this sounds familiar for some reason). It also means
that you have to bend over backwards if you want to work with actual
runes instead of bytes. Want to know how many characters are in your
string? Don't call len() on it -- that will only tell you how many
bytes are in it. Don't try to index or slice it either -- that will
(accidentally) work for ASCII strings, but for other strings your
indexes will be wrong. If you're unlucky you might even split up the
string in the middle of a character, and now your string has invalid
characters in it. The right way to do it looks something like this:
    len([]rune("白鵬翔"))          // get the length of the string in characters
    string([]rune("白鵬翔")[0:2])  // get the substring containing the first two characters
It reminds me of working in Python 2.X, except that instead of an
actual unicode type you just have arrays of ints.
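For comparison, here is a rough Python 3 counterpart to the Go snippet above. It is only a sketch with illustrative variable names: it contrasts counting and slicing by character with counting and slicing the UTF-8 bytes, which is roughly what plain `len()` and slicing do on a Go string.

```python
s = "白鵬翔"
raw = s.encode("utf-8")

# A Python 3 str is indexed by code point, so len() and slicing work per character.
char_count = len(s)
first_two = s[0:2]

# The UTF-8 encoding uses three bytes for each of these characters.
byte_count = len(raw)

# Slicing the bytes at an arbitrary offset can cut a character in half,
# leaving a sequence that is no longer valid UTF-8.
broken = raw[0:2]
try:
    broken.decode("utf-8")
    split_is_valid = True
except UnicodeDecodeError:
    split_is_valid = False

print(char_count, byte_count, first_two, split_is_valid)
```

The byte-level slice is the failure mode described above: it works by accident for ASCII and breaks as soon as a multi-byte character is involved.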
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-08-25 23:59 -0700 |
| Message-ID | <4853fddf-5e4d-4c11-9a19-5a1dbe4cbc20@googlegroups.com> |
| In reply to | #27888 |
On Sunday, 26 August 2012 00:26:56 UTC+2, Ian wrote:
> On Sat, Aug 25, 2012 at 9:47 AM, <wxjmfauth@gmail.com> wrote:
> > For those who do not know, the Go language has introduced
> > the rune type. As far as I know, nobody is complaining, I
> > have not even seen a discussion related to this subject.
>
> Python has that also. We call it "int".
>
> More seriously, strings in Go are not sequences of runes. They're
> actually arrays of UTF-8 bytes. That means that they're quite
> efficient for ASCII strings, at the expense of other characters, like
> Chinese (wait, this sounds familiar for some reason). It also means
> that you have to bend over backwards if you want to work with actual
> runes instead of bytes. Want to know how many characters are in your
> string? Don't call len() on it -- that will only tell you how many
> bytes are in it. Don't try to index or slice it either -- that will
> (accidentally) work for ASCII strings, but for other strings your
> indexes will be wrong. If you're unlucky you might even split up the
> string in the middle of a character, and now your string has invalid
> characters in it. The right way to do it looks something like this:
>
> len([]rune("白鵬翔")) // get the length of the string in characters
> string([]rune("白鵬翔")[0:2]) // get the substring containing the first two characters
>
> It reminds me of working in Python 2.X, except that instead of an
> actual unicode type you just have arrays of ints.
Sorry, you do not get it.
The rune is an alias for int32. A sequence of runes is a
sequence of int32's. Go does not spend its time using
machinery to work with, to differentiate, or to keep in memory
this sequence according to the *characters* composing this
"array of code points".
The message is even stronger. Use runes to work comfortably [*]
with unicode:
rune -> int32 -> utf32 -> unicode (the perfect scheme, can't be
better)
[*] This is beyond my skill and my knowledge, but if I understood
correctly, the rune is even technically optimized to ensure it is
always an int32.
len() and slicing have nothing to do with this.
My experience with Go is equal to zero + epsilon.
jmf
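The "rune -> int32 -> utf32" chain described above can be illustrated in Python 3, used here only as a neutral stand-in. The sketch assumes nothing about Go's internals; it just shows that the list of code points of a string matches the 32-bit units of its UTF-32 encoding. The sample string and variable names are made up for the example.

```python
import struct

# "a", the Euro sign, and U+24B62, written with escapes to avoid encoding surprises.
s = "a\u20ac\U00024B62"

# One integer per code point, comparable to a sequence of runes.
code_points = [ord(ch) for ch in s]

# UTF-32 (little-endian, no BOM) stores each code point in exactly four bytes.
utf32 = s.encode("utf-32-le")
units = list(struct.unpack("<%dI" % len(s), utf32))

print(code_points == units, len(utf32))
```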
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-08-26 09:50 -0600 |
| Message-ID | <mailman.3842.1345996272.4697.python-list@python.org> |
| In reply to | #27906 |
On Sun, Aug 26, 2012 at 12:59 AM, <wxjmfauth@gmail.com> wrote:
> Sorry, you do not get it.
>
> The rune is an alias for int32. A sequence of runes is a
> sequence of int32's. Go do not spend its time in using a
> machinery to work with, to differentiate, to keep in memory
> this sequence according to the *characters* composing this
> "array of code points".
>
> The message is even stronger. Use runes to work comfortably [*]
> with unicode:
> rune -> int32 -> utf32 -> unicode (the perfect scheme, can't be
> better)

I understand what rune is. I think you've missed my complaint, which is that although rune is the basic building block of Unicode strings -- representing a single Unicode character -- strings in Go are not built from runes but from bytes. If you want to do any actual work with Unicode strings, then you have to first convert them to runes or arrays of runes. The conceptual cost of this is that the object you're working with is no longer a string.

You call this the "perfect scheme" for working with Unicode. Why does the "perfect scheme" for Unicode make it *easier* to write buggy code that only works for ASCII than to write correct code that works for all characters?

This is IMO where Python 3 gets it right. When you want to work with Unicode strings, you just work with Unicode strings -- none of this nonsense of first explicitly converting the string to an array of ints that looks nothing like a string at a high level. The only place Python 3 makes you worry about converting strings is at the boundaries of your program, where decoding from bytes to strings and back is necessary.
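The "convert only at the boundaries" pattern from the last paragraph might look like the following in Python 3. This is a minimal sketch: the `name=...` input format and all variable names are invented for the example, and the bytes are built inline rather than read from a real file or socket.

```python
# Bytes arrive from outside the program (a file, a socket, ...).
incoming = "name=白鵬翔".encode("utf-8")

# Decode once, at the input boundary.
text = incoming.decode("utf-8")

# Inside the program, work with str only: splitting and len() are per character.
key, value = text.split("=")
reply = "{} has {} characters".format(value, len(value))

# Encode once, at the output boundary.
outgoing = reply.encode("utf-8")

print(reply)
```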
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-08-26 11:49 +0000 |
| Message-ID | <503a0d51$0$6574$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #27907 |
On Sat, 25 Aug 2012 23:59:34 -0700, wxjmfauth wrote:
> On Sunday, 26 August 2012 00:26:56 UTC+2, Ian wrote:
>> More seriously, strings in Go are not sequences of runes. They're
>> actually arrays of UTF-8 bytes.

Actually, it's worse than that. Strings in Go aren't even proper UTF-8. They are arbitrary bytes, which means you can create strings which are invalid Unicode.

Go looks like an interesting language, but it seems to me that they have totally screwed up strings. At least Python had the excuse that it is 20 years old and carrying the old ASCII baggage. Nobody used Unicode in 1992 when Python was invented. What is Google's excuse for getting Unicode wrong?

In Go, strings are UTF-8 encoded sequences of bytes, except when they're not, in which case they're arbitrary bytes. You can't tell if a string is valid UTF-8 unless you carefully inspect every single character and decide for yourself if it is valid. Don't know the rules for valid UTF-8? Too bad.

This also means that basic string operations like slicing are both *slow* and *wrong* -- they are slow, because you have to track character boundaries yourself. And they are wrong, because most people won't bother, they'll just assume each character is one byte.

See here for more information:

http://comments.gmane.org/gmane.comp.lang.go.general/56245

Some useful quotes:

- "Strings are *not* required to be UTF-8."
- "If the string must always be valid UTF-8 then relatively expensive validation is required for many operations. Plus making those operations able to fail complicates the interface."
- "In almost all cases strings are just byte arrays."
- "Go simply doesn't have 8-bit Unicode strings"
- "Python3 can afford the luxury of storing strings in UCS-2/UCS-4, Go can't."

I don't question that Go needs a type for arbitrary bytes. But that should be "bytes", not "string", and it should be there for the advanced programmers who *need* to worry about bytes.

Programmers who want to include strings in their applications (i.e. all of them) shouldn't need to care that "$" is one byte, "¢" is two, "€" is three, and "𤭢" (U+24B62) is four. With Python 3.3, it *just works*. With Go, it doesn't.

In my not-so-humble opinion, Go has made a silly design error. Go programmers will be paying for this mistake for at least a decade. What they should have done is create two data types:

1) Strings which are guaranteed to be valid Unicode. That could be UTF-32 or a PEP 393 approach, depending on how much memory you want to use, or even UTF-16 if you don't mind the complication of surrogate pairs.

2) Bytes which are not guaranteed to be valid Unicode but let the programmer work with arbitrary bytes.

(If this sounds familiar, it should -- it is exactly what Python 3 does. We have a string type that guarantees to be valid Unicode, and a bytes type that doesn't.)

As given, *every single programmer* who wants to use Unicode in Go is now responsible for doing all the hard work of validating UTF-8, converting from bytes to strings, etc. Sure, eventually Go will have libraries to do that, but not yet, and even when it does, many people will not use them and their code will fail to handle Unicode correctly.

Right now, every Go programmer who wants Unicode has to pay the cost of the freedom to have arbitrary byte sequences, whether they need those arbitrary bytes or not. The consequence is that instead of Go making Unicode as trivial and easy to use as it should be, it will be hard to get right, annoying, slow and painful. Another generation of programmers will grow up thinking that Unicode is all too difficult and we should stick to just plain ASCII.

Since Go doesn't have Unicode strings, you can never trust that a string is valid UTF-8, you can't slice it efficiently, you can't get the length in characters, you can't write it to a file and have other applications be able to read it. Sure, sometimes it will work, and then somebody will input a Euro sign into your application, and it will blow up.

Why am I not surprised that JMF misunderstands both Go byte-strings and Python Unicode strings?

> Sorry, you do not get it.
>
> The rune is an alias for int32. A sequence of runes is a sequence of
> int32's.

It certainly is not. Runes are variable-width. Here, for example, are a number of Go functions which return a single rune and its width in bytes:

http://golang.org/pkg/unicode/utf8/

> Go do not spend its time in using a machinery to work with, to
> differentiate, to keep in memory this sequence according to the
> *characters* composing this "array of code points".
>
> The message is even stronger. Use runes to work comfortably [*] with
> unicode:
> rune -> int32 -> utf32 -> unicode (the perfect scheme, can't be better)

Runes are not int32, and int32 is not UTF-32.

Whether UTF-32 is the "perfect scheme" for Unicode is a matter of opinion.

--
Steven
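The byte widths quoted above, and the split between a text type and a bytes type, can be checked in Python 3. The sketch below is illustrative only; `is_valid_utf8` is a hypothetical helper written for this example, not a standard-library function.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# The four example characters: "$", the cent sign, the Euro sign and U+24B62.
samples = ["$", "\u00a2", "\u20ac", "\U00024B62"]

# Their UTF-8 encodings take a different number of bytes each...
widths = [len(ch.encode("utf-8")) for ch in samples]

# ...but each is still exactly one character as far as str is concerned.
lengths = [len(ch) for ch in samples]

# Arbitrary bytes belong in a bytes object; they need not be valid UTF-8.
arbitrary = b"\xff\xfe\x80"

print(widths, lengths, is_valid_utf8(arbitrary))
```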
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-08-26 09:40 -0600 |
| Message-ID | <mailman.3841.1345995646.4697.python-list@python.org> |
| In reply to | #27913 |
On Sun, Aug 26, 2012 at 5:49 AM, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
>> Sorry, you do not get it.
>>
>> The rune is an alias for int32. A sequence of runes is a sequence of
>> int32's.
>
> It certainly is not. Runes are variable-width. Here, for example, are a
> number of Go functions which return a single rune and its width in bytes:
>
> http://golang.org/pkg/unicode/utf8/

I think the documentation for those functions is simply badly worded. The "width in bytes" it returns is not the width of the rune (which as jmf notes is simply an alias for int32 that stores a single code point). It means the UTF-8 width of the character, i.e. the number of UTF-8 bytes the function "consumed", presumably so that the caller can then reslice the data with that many bytes fewer.
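To make the "code point plus bytes consumed" idea concrete, here is a small Python 3 sketch of a function with that shape. It is not a port of the Go functions; `decode_first` is a hypothetical name, and it leans on Python's own UTF-8 decoder for validation after working out the width from the lead byte.

```python
from typing import Tuple

def decode_first(data: bytes) -> Tuple[int, int]:
    """Return (code point, number of UTF-8 bytes consumed) for the first
    character in non-empty data. Hypothetical helper for illustration."""
    lead = data[0]
    if lead < 0x80:
        width = 1            # 0xxxxxxx: ASCII
    elif lead >> 5 == 0b110:
        width = 2            # 110xxxxx
    elif lead >> 4 == 0b1110:
        width = 3            # 1110xxxx
    elif lead >> 3 == 0b11110:
        width = 4            # 11110xxx
    else:
        raise ValueError("not a valid UTF-8 lead byte: 0x%02x" % lead)
    # decode() raises UnicodeDecodeError if the sequence is malformed or truncated.
    char = data[:width].decode("utf-8")
    return ord(char), width

# Walk a byte string one character at a time, reslicing by the consumed width.
data = "a\u20ac".encode("utf-8")
consumed = []
while data:
    code_point, width = decode_first(data)
    consumed.append((code_point, width))
    data = data[width:]

print(consumed)
```

The returned code point is always a single integer; only the number of bytes it occupied in the encoded input varies.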
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-08-26 20:13 +0000 |
| Message-ID | <503a8361$0$6574$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #27931 |
On Sun, 26 Aug 2012 09:40:13 -0600, Ian Kelly wrote:
> I think the documentation for those functions is simply badly worded.
> The "width in bytes" it returns is not the width of the rune (which as
> jmf notes is simply an alias for int32 that stores a single code point).

Is this documented somewhere?

I can't tell you how long I spent unsuccessfully googling for variations on "go language runes", which unsurprisingly mostly came back with pages about Germanic runes and elf runes but not Go runes. I read the golang FAQs, which mentioned Unicode *once* and runes not at all. Obviously Go language programmers don't care much about Unicode.

> It means the UTF-8 width of the character, i.e. the number of UTF-8
> bytes the function "consumed", presumably so that the caller can then
> reslice the data with that many bytes fewer.

That makes sense, given the lousy string implementation and API they're working with.

I note that not all 32-bit ints are valid code points. I suppose I can see sense in having rune be a 32-bit integer value limited to those valid code points. (But, dammit, why not call it a code point?) But if rune is merely an alias for int32, why not just call it int32?

--
Steven
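The remark that not every 32-bit integer is a valid code point can be checked from Python 3. The sketch below uses a hypothetical helper, `is_encodable_code_point`, which asks whether an integer can be turned into a character and encoded as UTF-8; that rules out values above 0x10FFFF, negative values, and the surrogate range 0xD800-0xDFFF.

```python
def is_encodable_code_point(value: int) -> bool:
    """Hypothetical check: can this integer become a character in valid UTF-8?"""
    try:
        # chr() rejects values outside range(0x110000);
        # encode() rejects lone surrogates (0xD800-0xDFFF).
        chr(value).encode("utf-8")
    except (ValueError, OverflowError):
        return False
    return True

candidates = [0x20AC, 0x10FFFF, 0x110000, 0xD800, -1, 0x7FFFFFFF]
results = [is_encodable_code_point(v) for v in candidates]

print(results)
```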
| From | Dan Sommers <dan@tombstonezero.net> |
|---|---|
| Date | 2012-08-26 13:45 -0700 |
| Message-ID | <mailman.3853.1346014938.4697.python-list@python.org> |
| In reply to | #27946 |
On 2012-08-26 at 20:13:21 +0000, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> I note that not all 32-bit ints are valid code points. I suppose I can
> see sense in having rune be a 32-bit integer value limited to those
> valid code points. (But, dammit, why not call it a code point?) But if
> rune is merely an alias for int32, why not just call it int32?

Having a "code point" type is a good idea. If nothing else, human code readers can tell that you're doing something with characters rather than something with integers. If your language provides any sort of type safety, then you get that, too.

Calling your code points int32 is a bad idea for the same reason that it turned out to be a bad idea to call all my old ASCII characters int8. Or all my pointers int<n> (or unsigned int<n>), for n in 16, 20, 24, 32, 36, 48, or 64 (or I'm sure other values of n that I never had the pain or pleasure of using).

Dan
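As a loose illustration of a named code-point type, Python's `typing.NewType` gives an integer a distinct name that readers and static type checkers can see. This is only a sketch: `CodePoint`, `code_points` and `describe` are hypothetical names, and at run time a `CodePoint` is still a plain `int`.

```python
from typing import List, NewType

# Hypothetical distinct name for integers that stand for Unicode code points.
CodePoint = NewType("CodePoint", int)

def code_points(text: str) -> List[CodePoint]:
    """Return the code points of text, labelled as CodePoint rather than int."""
    return [CodePoint(ord(ch)) for ch in text]

def describe(cp: CodePoint) -> str:
    """Format a code point in the usual U+XXXX notation."""
    return "U+{:04X}".format(cp)

labels = [describe(cp) for cp in code_points("a\u20ac")]
print(labels)
```

A static checker would flag `describe(97)` because a bare `int` is not a `CodePoint`, which is the readability and safety benefit described above.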
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-08-27 12:16 -0700 |
| Message-ID | <2e92da71-fbd2-467f-9088-1c79fa7bcf69@googlegroups.com> |
| In reply to | #27947 |
On Sunday, 26 August 2012 22:45:09 UTC+2, Dan Sommers wrote:
> On 2012-08-26 at 20:13:21 +0000,
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
>
> > I note that not all 32-bit ints are valid code points. I suppose I can
> > see sense in having rune be a 32-bit integer value limited to those
> > valid code points. (But, dammit, why not call it a code point?) But if
> > rune is merely an alias for int32, why not just call it int32?
>
> Having a "code point" type is a good idea. If nothing else, human code
> readers can tell that you're doing something with characters rather than
> something with integers. If your language provides any sort of type
> safety, then you get that, too.
>
> Calling your code points int32 is a bad idea for the same reason that it
> turned out to be a bad idea to call all my old ASCII characters int8.
> Or all my pointers int<n> (or unsigned int<n>), for n in 16, 20, 24, 32,
> 36, 48, or 64 (or I'm sure other values of n that I never had the pain
> or pleasure of using).

And this is precisely the concept of the rune: a real int which is a name for a Unicode code point.

Go "has" the integers int32 and int64. A rune ensures the usage of int32. "Text libs" use runes. Go has only bytes and runes.

If you do not like the word "perfection", this mechanism has at least an ideal simplicity (with probably a lot of positive consequences).

rune -> int32 -> utf32 -> unicode code points.

- Why int32 and not uint32? No idea, I tried to find an answer without asking.
- I find the name "rune" elegant. "char" would have been too confusing.

End. This is supposed to be a Python forum.

jmf
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-08-27 14:14 -0600 |
| Message-ID | <mailman.3884.1346098483.4697.python-list@python.org> |
| In reply to | #27994 |
On Mon, Aug 27, 2012 at 1:16 PM, <wxjmfauth@gmail.com> wrote:
> - Why int32 and not uint32? No idea, I tried to find an
> answer without asking.

UCS-4 is technically only a 31-bit encoding. The sign bit is not used, so the choice of int32 vs. uint32 is inconsequential.

(In fact, since they made the decision to limit Unicode to the range 0 - 0x0010FFFF, one might even point out that the *entire high-order byte* as well as 3 bits of the next byte are irrelevant. Truly, UTF-32 is not designed for memory efficiency.)
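The bit counting in the parenthetical can be worked through in a few lines of Python 3. This is just arithmetic on the stated upper limit 0x10FFFF, plus a size comparison for a short ASCII string; the variable names are illustrative.

```python
MAX_CODE_POINT = 0x10FFFF

# Number of bits needed to represent the largest code point.
bits_needed = MAX_CODE_POINT.bit_length()

# Bits of a 32-bit unit that are never used: the whole top byte (8 bits)
# plus the top 3 bits of the next byte.
unused_bits = 32 - bits_needed

# Memory cost for plain ASCII text under the two encodings (no BOM).
ascii_text = "hello"
utf8_size = len(ascii_text.encode("utf-8"))
utf32_size = len(ascii_text.encode("utf-32-le"))

print(bits_needed, unused_bits, utf8_size, utf32_size)
```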
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-08-27 13:37 -0700 |
| Message-ID | <mailman.3885.1346099824.4697.python-list@python.org> |
| In reply to | #27998 |
On Monday, 27 August 2012 22:14:07 UTC+2, Ian wrote:
> On Mon, Aug 27, 2012 at 1:16 PM, <wxjmfauth@gmail.com> wrote:
> > - Why int32 and not uint32? No idea, I tried to find an
> > answer without asking.
>
> UCS-4 is technically only a 31-bit encoding. The sign bit is not used,
> so the choice of int32 vs. uint32 is inconsequential.
>
> (In fact, since they made the decision to limit Unicode to the range 0
> - 0x0010FFFF, one might even point out that the *entire high-order
> byte* as well as 3 bits of the next byte are irrelevant. Truly,
> UTF-32 is not designed for memory efficiency.)

I know all this. The question is rather: why not a uint32, knowing that there are only positive code points? It seems more "natural" to me.
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-08-29 04:38 -0700 |
| Message-ID | <e49d21ac-b14e-4e9d-befa-8f0008c87c58@googlegroups.com> |
| In reply to | #27998 |
On Monday, 27 August 2012 22:37:03 UTC+2, (unknown) wrote:
> On Monday, 27 August 2012 22:14:07 UTC+2, Ian wrote:
> > On Mon, Aug 27, 2012 at 1:16 PM, <wxjmfauth@gmail.com> wrote:
> > > - Why int32 and not uint32? No idea, I tried to find an
> > > answer without asking.
> >
> > UCS-4 is technically only a 31-bit encoding. The sign bit is not used,
> > so the choice of int32 vs. uint32 is inconsequential.
> >
> > (In fact, since they made the decision to limit Unicode to the range 0
> > - 0x0010FFFF, one might even point out that the *entire high-order
> > byte* as well as 3 bits of the next byte are irrelevant. Truly,
> > UTF-32 is not designed for memory efficiency.)
>
> I know all this. The question is more: why not a uint32, knowing
> there are only positive code points? It seems to me more "natural".

Answer found. In short: using negative ints simplifies internal tasks.
| From | Neil Hodgson <nhodgson@iinet.net.au> |
|---|---|
| Date | 2012-08-28 09:54 +1000 |
| Message-ID | <UIOdnTQtcNTRlKHNnZ2dnUVZ_vednZ2d@westnet.com.au> |
| In reply to | #27994 |
wxjmfauth@gmail.com:
> Go "has" the integers int32 and int64. A rune ensure
> the usage of int32. "Text libs" use runes. Go has only
> bytes and runes.
Go's text libraries use UTF-8 encoded byte strings. Not arrays of
runes. See, for example,
http://golang.org/pkg/regexp/
Are you claiming that UTF-8 is the optimum string representation and
therefore should be used by Python?
Neil
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-08-29 13:59 +1000 |
| Message-ID | <mailman.3919.1346212776.4697.python-list@python.org> |
| In reply to | #28007 |
On Wed, Aug 29, 2012 at 12:42 PM, rusi <rustompmody@gmail.com> wrote:
> Clearly there are 3 string-engines in the python 3 world:
> - 3.2 narrow
> - 3.2 wide
> - 3.3 (flexible)
>
> How difficult would it be to giving the choice of string engine as a
> command-line flag?
> This would avoid the nuisance of having two binaries -- narrow and
> wide.
> And it would give the python programmer a choice of efficiency
> profiles.

To what benefit?

3.2 narrow is, I would have to say, buggy. It handles everything up to
\uFFFF without problems, but once you have any character beyond that,
your indexing and slicing are wrong.

3.2 wide is fine but memory-inefficient.

3.3 is never worse than 3.2 except for some tiny checks, and will be
more memory-efficient in many cases.

Supporting narrow would require fixing the handling of surrogates.
Potentially a huge job, and you'll end up with ridiculous performance
in many cases.

So what you're really asking for is a command-line option to force all
strings to have their 'kind' set to binary 11 (the 4-byte value in
PEP 393), i.e. UCS-4 storage. That would be doable, I suppose; it
wouldn't require many changes (just a quick check in string creation
functions). But what would be the advantage? Every string requires 4
bytes per character to store; an optimization has been lost.

ChrisA
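The narrow-build problem can no longer be reproduced on Python 3.3 or later, but it can be simulated by looking at the UTF-16 code units that a narrow build would have indexed. A minimal sketch, in which `utf16_code_units` is a helper invented for this example:

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints (hypothetical helper)."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


s = "a\U0001F600b"  # 'a', one character outside the BMP, 'b'

# Python 3.3+ indexes by code point.
print(len(s), repr(s[2]))

# A narrow build indexed by 16-bit code unit, so the same text had one more
# "character", and index 1 landed on half of a surrogate pair.
units = utf16_code_units(s)
print(len(units), [hex(u) for u in units])
```

Here `len(s)` is 3 while the code-unit list has 4 entries, and the entry at index 1 is the lone high surrogate 0xD83D, which is the kind of wrong indexing and slicing described above.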
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-08-28 22:15 -0600 |
| Message-ID | <mailman.3920.1346213765.4697.python-list@python.org> |
| In reply to | #28007 |
On Tue, Aug 28, 2012 at 8:42 PM, rusi <rustompmody@gmail.com> wrote:
> In summary:
> 1. The problem is not on jmf's computer
> 2. It is not windows-only
> 3. It is not directly related to latin-1 encodable or not
>
> The only question which is not yet clear is this:
> Given a typical string operation that is complexity O(n), in more
> detail it is going to be O(a + bn)
> If only a is worse going 3.2 to 3.3, it may be a small issue.
> If b is worse by even a tiny amount, it is likely to be a significant
> regression for some use-cases.
As has been pointed out repeatedly already, this is a microbenchmark.
jmf is focusing on one particular area (string construction) where
Python 3.3 happens to be slower than Python 3.2, ignoring the fact
that real code usually does lots of things other than building
strings, many of which are slower to begin with. In the real-world
benchmarks that I've seen, 3.3 is as fast as or faster than 3.2.
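The quoted question about the fixed cost a and the per-character cost b in a + bn can be made concrete with a small fitting sketch. Everything here is an assumption made for illustration: the helper names `measure_times` and `fit_linear_cost`, the choice of `str.upper` as the operation, and the sizes. A single operation timed this way is still a microbenchmark.

```python
import timeit


def measure_times(op, sizes, number=200, repeat=3):
    """Best per-call time of op(text) for an ASCII string of each given length."""
    times = []
    for n in sizes:
        text = "a" * n
        best = min(timeit.repeat(lambda: op(text), number=number, repeat=repeat))
        times.append(best / number)
    return times


def fit_linear_cost(sizes, times):
    """Ordinary least-squares fit of times ~ a + b * n; returns (a, b)."""
    count = len(sizes)
    mean_n = sum(sizes) / count
    mean_t = sum(times) / count
    sxx = sum((n - mean_n) ** 2 for n in sizes)
    sxy = sum((n - mean_n) * (t - mean_t) for n, t in zip(sizes, times))
    b = sxy / sxx
    a = mean_t - b * mean_n
    return a, b


if __name__ == "__main__":
    sizes = [1000, 2000, 4000, 8000]
    a, b = fit_linear_cost(sizes, measure_times(str.upper, sizes))
    print("fixed cost a = %.3g s, per-character cost b = %.3g s" % (a, b))
```

Running the same script under two interpreters and comparing the two (a, b) pairs would show whether a difference sits in the constant term or in the slope.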
Here's a much more realistic benchmark that nonetheless still focuses
on strings: word counting.
Source: http://pastebin.com/RDeDsgPd
C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc" "wc.wc('unilang8.htm')"
1000 loops, best of 3: 310 usec per loop
C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc" "wc.wc('unilang8.htm')"
1000 loops, best of 3: 302 usec per loop
"unilang8.htm" is an arbitrary UTF-8 document containing a broad swath
of Unicode characters that I pulled off the web. Even though this
program is still mostly string processing, Python 3.3 wins. Of
course, that's not really a very good test -- since it reads the file
on every pass, it probably spends more time in I/O than it does in
actual processing. Let's try it again with prepared string data:
C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_str(t)"
10000 loops, best of 3: 87.3 usec per loop
C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_str(t)"
10000 loops, best of 3: 84.6 usec per loop
Nope, 3.3 still wins. And just for the sake of my own curiosity, I
decided to try it again using str.split() instead of a StringIO.
Since str.split() creates more strings, I expect Python 3.2 might
actually win this time.
C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_split(t)"
10000 loops, best of 3: 88 usec per loop
C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_split(t)"
10000 loops, best of 3: 76.5 usec per loop
Interestingly, although Python 3.2 performs the splits in about the
same time as the StringIO operation, Python 3.3 is significantly
*faster* using str.split(), at least on this data set.
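The pastebin source is not reproduced in this thread, so the following is only a hypothetical reconstruction of what a `wc` module with the three timed entry points might look like. The function names come from the timeit commands above; the bodies are assumptions based on the description (one variant reading through a StringIO, one using str.split()).

```python
import io


def wc_str(text):
    """Count whitespace-separated words, reading the text line by line via StringIO."""
    total = 0
    for line in io.StringIO(text):
        total += len(line.split())
    return total


def wc_split(text):
    """Count words with a single str.split() call over the whole text."""
    return len(text.split())


def wc(path):
    """Read a UTF-8 encoded file and count its words (file I/O on every call)."""
    with open(path, "r", encoding="utf-8") as handle:
        return wc_str(handle.read())


sample = "alpha beta\ngamma  \u03b4elta\n"
print(wc_str(sample), wc_split(sample))
```

Both string variants return the same count for the same input, so any timing difference between them comes from how the intermediate strings are produced.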
> So doing some arm-chair thinking (I dont know the code and difficulty
> involved):
>
> Clearly there are 3 string-engines in the python 3 world:
> - 3.2 narrow
> - 3.2 wide
> - 3.3 (flexible)
>
> How difficult would it be to giving the choice of string engine as a
> command-line flag?
> This would avoid the nuisance of having two binaries -- narrow and
> wide.
Quite difficult. Even if we avoid having two or three separate
binaries, we would still have separate binary representations of the
string structs. It makes the maintainability of the software go down
instead of up.
> And it would give the python programmer a choice of efficiency
> profiles.
So instead of having just one test for my Unicode-handling code, I'll
now have to run that same test *three times* -- once for each possible
string engine option. Choice isn't always a good thing.
Cheers,
Ian
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-08-29 08:05 +0000 |
| Message-ID | <503dcd35$0$9416$c3e8da3$76491128@news.astraweb.com> |
| In reply to | #28044 |
On Tue, 28 Aug 2012 22:15:31 -0600, Ian Kelly wrote:
> On Tue, Aug 28, 2012 at 8:42 PM, rusi <rustompmody@gmail.com> wrote:
>> How difficult would it be to giving the choice of string engine as a
>> command-line flag?
>> This would avoid the nuisance of having two binaries -- narrow and
>> wide.
>
> Quite difficult. Even if we avoid having two or three separate
> binaries, we would still have separate binary representations of the
> string structs. It makes the maintainability of the software go down
> instead of up.
In fairness, there are already multiple binary representations of strings
in Python 3.3:
- ASCII-only strings use a 1-byte format (PyASCIIObject);
- Compact Unicode objects (PyCompactObject), which if I'm reading
correctly, appears to use a non-fixed width UTF-8 format, but are only
used when the string length and maximum character are known ahead of
time;
- Legacy string objects (PyUnicodeObject), which are not compact, and
which may use as their internal format:
* 1-byte characters for Latin1-compatible strings;
* 2-byte UCS-2 characters for strings in the Basic Multilingual Plane;
* 4-byte UCS-4 characters for strings with at least one non-BMP
character.
http://www.python.org/dev/peps/pep-0393/#specification
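Whatever the exact struct layout, the storage width per character can be probed empirically. This sketch assumes CPython 3.3 or later and relies on `sys.getsizeof` reflecting the string's character buffer; `bytes_per_char` is a helper invented for this example, and other implementations may report different numbers.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate storage per character by growing a string of ch by one character."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


samples = [
    ("ASCII", "a"),
    ("Latin-1", "\xe9"),
    ("BMP", "\u20ac"),
    ("non-BMP", "\U0001F600"),
]

for label, ch in samples:
    print(label, bytes_per_char(ch))
```

On CPython 3.3+ this is expected to report 1, 1, 2 and 4 bytes respectively, which suggests that the character data itself is stored at a fixed width chosen per string.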
By my calculations, that makes *five* different internal formats for
strings, at least two of which are capable of representing all Unicode
characters. I don't think it would add that much additional complexity to
have a runtime option --always-wide-strings to always use the UCS-4
format. For, you know, crazy people with more memory than sense.
But I don't think there's any point in exposing further runtime options
to choose the string representation:
- neither the ASCII nor Latin1 representations can store arbitrary
Unicode chars, so they're out;
- the UTF-8 format is only used under restrictive circumstances, and so
is (probably?) unsuitable for all strings.
- the UCS-2 format can, by using surrogate pairs, but that's troublesome
to get right, some might even say buggy.
>> And it would give the python programmer a choice of efficiency
>> profiles.
>
> So instead of having just one test for my Unicode-handling code, I'll
> now have to run that same test *three times* -- once for each possible
> string engine option. Choice isn't always a good thing.
There is that too.
--
Steven
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-08-29 04:40 -0700 |
| Message-ID | <62566024-df1d-4948-a27a-45c7820ddc6c@googlegroups.com> |
| In reply to | #28044 |
On Wednesday, 29 August 2012 06:16:05 UTC+2, Ian wrote:
> On Tue, Aug 28, 2012 at 8:42 PM, rusi <rustompmody@gmail.com> wrote:
> > In summary:
> > 1. The problem is not on jmf's computer
> > 2. It is not windows-only
> > 3. It is not directly related to latin-1 encodable or not
> >
> > The only question which is not yet clear is this:
> > Given a typical string operation that is complexity O(n), in more
> > detail it is going to be O(a + bn)
> > If only a is worse going 3.2 to 3.3, it may be a small issue.
> > If b is worse by even a tiny amount, it is likely to be a significant
> > regression for some use-cases.
>
> As has been pointed out repeatedly already, this is a microbenchmark.
> jmf is focusing on one particular area (string construction) where
> Python 3.3 happens to be slower than Python 3.2, ignoring the fact
> that real code usually does lots of things other than building
> strings, many of which are slower to begin with. In the real-world
> benchmarks that I've seen, 3.3 is as fast as or faster than 3.2.
> Here's a much more realistic benchmark that nonetheless still focuses
> on strings: word counting.
>
> Source: http://pastebin.com/RDeDsgPd
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc" "wc.wc('unilang8.htm')"
> 1000 loops, best of 3: 310 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc" "wc.wc('unilang8.htm')"
> 1000 loops, best of 3: 302 usec per loop
>
> "unilang8.htm" is an arbitrary UTF-8 document containing a broad swath
> of Unicode characters that I pulled off the web. Even though this
> program is still mostly string processing, Python 3.3 wins. Of
> course, that's not really a very good test -- since it reads the file
> on every pass, it probably spends more time in I/O than it does in
> actual processing. Let's try it again with prepared string data:
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_str(t)"
> 10000 loops, best of 3: 87.3 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_str(t)"
> 10000 loops, best of 3: 84.6 usec per loop
>
> Nope, 3.3 still wins. And just for the sake of my own curiosity, I
> decided to try it again using str.split() instead of a StringIO.
> Since str.split() creates more strings, I expect Python 3.2 might
> actually win this time.
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_split(t)"
> 10000 loops, best of 3: 88 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_split(t)"
> 10000 loops, best of 3: 76.5 usec per loop
>
> Interestingly, although Python 3.2 performs the splits in about the
> same time as the StringIO operation, Python 3.3 is significantly
> *faster* using str.split(), at least on this data set.
>
> > So doing some arm-chair thinking (I dont know the code and difficulty
> > involved):
> >
> > Clearly there are 3 string-engines in the python 3 world:
> > - 3.2 narrow
> > - 3.2 wide
> > - 3.3 (flexible)
> >
> > How difficult would it be to giving the choice of string engine as a
> > command-line flag?
> > This would avoid the nuisance of having two binaries -- narrow and
> > wide.
>
> Quite difficult. Even if we avoid having two or three separate
> binaries, we would still have separate binary representations of the
> string structs. It makes the maintainability of the software go down
> instead of up.
>
> > And it would give the python programmer a choice of efficiency
> > profiles.
>
> So instead of having just one test for my Unicode-handling code, I'll
> now have to run that same test *three times* -- once for each possible
> string engine option. Choice isn't always a good thing.
Forget Python and all these benchmarks. The problem
is on another level: coding schemes, typography,
usage of characters, ...
For a given coding scheme, all code points/characters are
equivalent. Expecting to handle a sub-range in a coding
scheme without shaking that coding scheme is impossible.
If a coding scheme does not give satisfaction, the only
valid solution is to create a new coding scheme, cp1252,
mac-roman, EBCDIC, ... or the interesting "TeX" case, where
the "internal" coding depends on the fonts!
Unicode (utf***), as just another coding scheme, does
not escape this rule.
This "Flexible String Representation" fails. Not only
is it unable to stick with a coding scheme, it is
a mixture of coding schemes, the worst of all possible
implementations.
jmf
| From | Dave Angel <d@davea.name> |
|---|---|
| Date | 2012-08-29 08:01 -0400 |
| Message-ID | <mailman.3929.1346241717.4697.python-list@python.org> |
| In reply to | #28055 |
On 08/29/2012 07:40 AM, wxjmfauth@gmail.com wrote:
> <snip>
> Forget Python and all these benchmarks. The problem is on an other
> level. Coding schemes, typography, usage of characters, ... For a
> given coding scheme, all code points/characters are equivalent.
> Expecting to handle a sub-range in a coding scheme without shaking
> that coding scheme is impossible. If a coding scheme does not give
> satisfaction, the only valid solution is to create a new coding
> scheme, cp1252, mac-roman, EBCDIC, ... or the interesting "TeX" case,
> where the "internal" coding depends on the fonts! Unicode (utf***), as
> just one another coding scheme, does not escape to this rule. This
> "Flexible String Representation" fails. Not only it is unable to stick
> with a coding scheme, it is a mixing of coding schemes, the worst of
> all possible implementations. jmf

Nonsense. The discussion was not about an encoding scheme, but an
internal representation. That representation does not change the
programmer's interface in any way other than performance (cpu and memory
usage). Most of the rest of your babble is unsupported opinion.

Plonk.

--
DaveA
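The claim that the internal representation does not change the programmer's interface can be illustrated on Python 3.3 or later. This is a small sketch with variable names chosen for the example: strings compare, hash and index by code point no matter how wide their neighbours or origins were.

```python
ascii_only = "abc"
from_wide = ("abc" + "\U0001F600")[:3]  # sliced out of a string that needed wide storage

# Same text, same behaviour, regardless of where the string came from.
assert ascii_only == from_wide
assert hash(ascii_only) == hash(from_wide)

# Length and indexing count code points, also beyond the BMP.
wide = "a\U0001F600"
assert len(wide) == 2
assert wide[1] == "\U0001F600"

# One string may mix characters from every range; ord() returns the code
# point, never a storage-dependent value.
mixed = "\xe9\u20ac\U0001F600"
code_points = [hex(ord(c)) for c in mixed]
print(code_points)
```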
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-08-29 08:43 -0700 |
| Message-ID | <mailman.3938.1346254994.4697.python-list@python.org> |
| In reply to | #28059 |
On Wednesday, 29 August 2012 14:01:57 UTC+2, Dave Angel wrote:
> On 08/29/2012 07:40 AM, wxjmfauth@gmail.com wrote:
> > <snip>
> > Forget Python and all these benchmarks. The problem is on an other
> > level. Coding schemes, typography, usage of characters, ... For a
> > given coding scheme, all code points/characters are equivalent.
> > Expecting to handle a sub-range in a coding scheme without shaking
> > that coding scheme is impossible. If a coding scheme does not give
> > satisfaction, the only valid solution is to create a new coding
> > scheme, cp1252, mac-roman, EBCDIC, ... or the interesting "TeX" case,
> > where the "internal" coding depends on the fonts! Unicode (utf***), as
> > just one another coding scheme, does not escape to this rule. This
> > "Flexible String Representation" fails. Not only it is unable to stick
> > with a coding scheme, it is a mixing of coding schemes, the worst of
> > all possible implementations. jmf
>
> Nonsense. The discussion was not about an encoding scheme, but an
> internal representation. That representation does not change the
> programmer's interface in any way other than performance (cpu and memory
> usage). Most of the rest of your babble is unsupported opinion.

I can hit the nail a little harder.
I have an even better idea, and I'm serious. If "Python" has
found a new way to cover the set of Unicode characters,
why not propose it to the Unicode consortium?
Unicode already has three schemes covering practically all cases:
memory consumption, maximum flexibility and an intermediate solution.
It would be too bad not to share it.
What do you think? ;-)
jmf
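The "three schemes" are presumably the three Unicode encoding forms UTF-8, UTF-16 and UTF-32. Their memory trade-offs are easy to compare in Python; the sample strings and the helper name `encoded_sizes` are chosen only for this illustration.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")  # "-le" variants avoid a BOM in the count

samples = {
    "ASCII": "hello",
    "Latin-1": "d\xe9j\xe0 vu",
    "BMP": "\u20acuro",
    "non-BMP": "\U0001F600",
}


def encoded_sizes(text):
    """Return the encoded length in bytes of text for each encoding form."""
    return {encoding: len(text.encode(encoding)) for encoding in ENCODINGS}


for label, text in samples.items():
    print(label, len(text), encoded_sizes(text))
```

UTF-32 always costs four bytes per code point, UTF-8 ranges from one to four, and UTF-16 uses two or four, which is why none of the three is the smallest for every text.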