Groups > comp.lang.python > #27843 > unrolled thread

Re: Flexible string representation, unicode, typography, ...

Started by	Antoine Pitrou <solipsis@pitrou.net>
First post	2012-08-25 00:24 +0000
Last post	2012-08-25 07:23 -0400
Articles	20 on this page of 83 — 18 participants

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: Flexible string representation, unicode, typography, ... Antoine Pitrou <solipsis@pitrou.net> - 2012-08-25 00:24 +0000
    Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 00:27 -0700
      Re: Flexible string representation, unicode, typography, ... Ben Finney <ben+python@benfinney.id.au> - 2012-08-25 17:54 +1000
    Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 00:27 -0700
      Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-25 09:58 +0100
      Re: Flexible string representation, unicode, typography, ... Frank Millman <frank@chagford.com> - 2012-08-25 11:46 +0200
        Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 08:47 -0700
        Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 08:47 -0700
          Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-25 16:26 -0600
            Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 23:59 -0700
              Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 09:50 -0600
            Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 23:59 -0700
              Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 11:49 +0000
                Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 09:40 -0600
                  Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 20:13 +0000
                    Re: Flexible string representation, unicode, typography, ... Dan Sommers <dan@tombstonezero.net> - 2012-08-26 13:45 -0700
                      Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 12:16 -0700
                        Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-27 14:14 -0600
                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 13:37 -0700
                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:38 -0700
                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:38 -0700
                        Re: Flexible string representation, unicode, typography, ... Neil Hodgson <nhodgson@iinet.net.au> - 2012-08-28 09:54 +1000
                          Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-29 13:59 +1000
                          Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-28 22:15 -0600
                            Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-29 08:05 +0000
                            Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:40 -0700
                              Re: Flexible string representation, unicode, typography, ... Dave Angel <d@davea.name> - 2012-08-29 08:01 -0400
                                Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 08:43 -0700
                                  Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-30 06:55 +0000
                                    Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-30 18:59 +1000
                                    Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-08-30 07:02 -0400
                                      Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-30 16:00 +0000
                                        Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-08-30 16:44 -0400
                                          Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-31 12:32 +0000
                                            Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-31 09:13 -0600
                                        Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-08-31 08:43 -0400
                                          Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-31 14:54 +0000
                                    Re: Flexible string representation, unicode, typography, ... Antoine Pitrou <solipsis@pitrou.net> - 2012-08-30 15:01 +0000
                                      Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 00:36 -0700
                                        Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 09:58 +0100
                                        Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-02 03:06 -0600
                                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 11:58 -0700
                                            Re: Flexible string representation, unicode, typography, ... Michael Torrie <torriem@gmail.com> - 2012-09-02 13:45 -0600
                                            Re: Flexible string representation, unicode, typography, ... Dave Angel <d@davea.name> - 2012-09-02 16:07 -0400
                                            Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-02 16:38 -0400
                                            Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-03 01:42 +0000
                                              Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:26 +0300
                                                Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-04 00:53 +0000
                                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 11:58 -0700
                                        Re: Flexible string representation, unicode, typography, ... Peter Otten <__peter__@web.de> - 2012-09-02 11:52 +0200
                                        Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 11:36 +0100
                                        Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-02 15:00 +0300
                                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 22:39 -0700
                                            Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-03 07:11 +0100
                                            Re: Flexible string representation, unicode, typography, ... Peter Otten <__peter__@web.de> - 2012-09-03 08:15 +0200
                                            Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-03 04:38 -0400
                                            Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:56 +0300
                                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 22:39 -0700
                                        Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 13:23 +0100
                                          Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-09-02 08:35 -0400
                                          Re: Flexible string representation, unicode, typography, ... Ramchandra Apte <maniandram01@gmail.com> - 2012-09-02 06:48 -0700
                                            Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 15:46 +0100
                                          Re: Flexible string representation, unicode, typography, ... Ramchandra Apte <maniandram01@gmail.com> - 2012-09-02 06:48 -0700
                                        Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-03 12:33 -0600
                                      Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 00:36 -0700
                                    Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-30 10:27 -0600
                                    Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-02 23:38 +0300
                                      Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-03 01:54 +0000
                                        Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-02 22:33 -0400
                                        Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-09-03 11:24 -0400
                                        Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:41 +0300
                                    Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 00:45 +0300
                                Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-30 01:54 +1000
                              Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-29 22:34 +1000
                            Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:40 -0700
                      Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 12:16 -0700
                    Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 15:42 -0600
                      Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 23:31 +0000
                        Re: Flexible string representation, unicode, typography, ... Paul Rubin <no.email@nospam.invalid> - 2012-08-26 17:47 -0700
      Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-25 21:04 +1000
      Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-25 12:05 +0100
      Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-25 21:19 +1000
      Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-08-25 07:23 -0400

#27843 — Re: Flexible string representation, unicode, typography, ...

From	Antoine Pitrou <solipsis@pitrou.net>
Date	2012-08-25 00:24 +0000
Subject	Re: Flexible string representation, unicode, typography, ...
Message-ID	<mailman.3784.1345854291.4697.python-list@python.org>

Ramchandra Apte <maniandram01 <at> gmail.com> writes:
> 
> The zen of python is simply a guideline

What's more, the Zen guides the language's design, not its implementation.
People who think CPython is a complicated implementation can take a look at PyPy 
:-)

Regards

Antoine.

-- 
Software development and contracting: http://pro.pitrou.net

#27853

From	wxjmfauth@gmail.com
Date	2012-08-25 00:27 -0700
Message-ID	<mailman.3788.1345879639.4697.python-list@python.org>
In reply to	#27843

On Saturday 25 August 2012 02:24:35 UTC+2, Antoine Pitrou wrote:
> Ramchandra Apte <maniandram01 <at> gmail.com> writes:
> >
> > The zen of python is simply a guideline
>
> What's more, the Zen guides the language's design, not its implementation.
> People who think CPython is a complicated implementation can take a look at PyPy
> :-)

Unicode design: a flat table of code points, where all code
points are "equals".
As soon as one attempts to escape from this rule, one has to
"pay" for it.
The creator of this machinery (flexible string representation)
can not even benefit from it in his native language (I think
I'm correctly informed).

Hint: Google -> "Das grosse Eszett"

jmf

#27855

From	Ben Finney <ben+python@benfinney.id.au>
Date	2012-08-25 17:54 +1000
Message-ID	<87sjbbe78w.fsf@benfinney.id.au>
In reply to	#27853

wxjmfauth@gmail.com writes:

> Unicode design: a flat table of code points, where all code
> points are "equals".

Yes, Unicode's design entails a flat table of hundreds of thousands of
code points, expansible in future.

This is in direct conflict with the design of all significant computers
we need to write software for: data stored and transported as 8-bit
bytes, which can only ever hold 256 different values, no expansion.
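The tension between code points and 8-bit bytes can be seen in a few lines of Python. This is only a rough sketch: the helper name `utf8_length` is made up for illustration, and the two sharp-s characters were picked because the thread mentions "Das grosse Eszett".

```python
import sys

# Code points are plain integers; the Unicode code space ends at U+10FFFF.
small_eszett = "\u00df"    # LATIN SMALL LETTER SHARP S
capital_eszett = "\u1e9e"  # LATIN CAPITAL LETTER SHARP S

def utf8_length(ch):
    """Number of 8-bit bytes needed to store one code point as UTF-8."""
    return len(ch.encode("utf-8"))

print(hex(sys.maxunicode))       # highest code point Python accepts
print(ord(small_eszett))         # below 256, so it fits the range of one byte
print(ord(capital_eszett))       # far above 255, so one byte cannot hold it
print(utf8_length(small_eszett), utf8_length(capital_eszett))
```

A single byte covers only the first 256 code points, so any encoding has to spend several bytes on everything else; which characters get the cheap slots is the trade-off being argued about.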

> As soon as one attempts to escape from this rule, one has to
> "pay" for it.

Yes, in either direction; the conflict means that trade-offs need to be
made.

See this presentation by Ned Batchelder, “Pragmatic Unicode”
<URL:http://nedbatchelder.com/text/unipain.html>, which lays out the
fundamental conflict of representing human text in computer data; and
several practical approaches to deal with it.

-- 
 \      “I busted a mirror and got seven years bad luck, but my lawyer |
  `\                        thinks he can get me five.” —Steven Wright |
_o__)                                                                  |
Ben Finney

#27854

From	wxjmfauth@gmail.com
Date	2012-08-25 00:27 -0700
Message-ID	<1cb3f062-eb45-4b0c-977b-76afb099923c@googlegroups.com>
In reply to	#27843

On Saturday 25 August 2012 02:24:35 UTC+2, Antoine Pitrou wrote:
> Ramchandra Apte <maniandram01 <at> gmail.com> writes:
> >
> > The zen of python is simply a guideline
>
> What's more, the Zen guides the language's design, not its implementation.
> People who think CPython is a complicated implementation can take a look at PyPy
> :-)

Unicode design: a flat table of code points, where all code
points are "equals".
As soon as one attempts to escape from this rule, one has to
"pay" for it.
The creator of this machinery (flexible string representation)
can not even benefit from it in his native language (I think
I'm correctly informed).

Hint: Google -> "Das grosse Eszett"

jmf

#27858

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2012-08-25 09:58 +0100
Message-ID	<mailman.3791.1345885204.4697.python-list@python.org>
In reply to	#27854

On 25/08/2012 08:27, wxjmfauth@gmail.com wrote:
> On Saturday 25 August 2012 02:24:35 UTC+2, Antoine Pitrou wrote:
>> Ramchandra Apte <maniandram01 <at> gmail.com> writes:
>>>
>>> The zen of python is simply a guideline
>>
>> What's more, the Zen guides the language's design, not its implementation.
>> People who think CPython is a complicated implementation can take a look at PyPy
>> :-)
>
> Unicode design: a flat table of code points, where all code
> points are "equals".
> As soon as one attempts to escape from this rule, one has to
> "pay" for it.
> The creator of this machinery (flexible string representation)
> can not even benefit from it in his native language (I think
> I'm correctly informed).
>
> Hint: Google -> "Das grosse Eszett"
>
> jmf
>

It's Saturday morning, I'm stone cold sober, had a good sleep and I'm 
still baffled as to the point if any.  Could someone please enlighten me?

-- 
Cheers.

Mark Lawrence.

#27860

From	Frank Millman <frank@chagford.com>
Date	2012-08-25 11:46 +0200
Message-ID	<mailman.3793.1345888006.4697.python-list@python.org>
In reply to	#27854

On 25/08/2012 10:58, Mark Lawrence wrote:
> On 25/08/2012 08:27, wxjmfauth@gmail.com wrote:
>>
>> Unicode design: a flat table of code points, where all code
>> points are "equals".
>> As soon as one attempts to escape from this rule, one has to
>> "pay" for it.
>> The creator of this machinery (flexible string representation)
>> can not even benefit from it in his native language (I think
>> I'm correctly informed).
>>
>> Hint: Google -> "Das grosse Eszett"
>>
>> jmf
>>
>
> It's Saturday morning, I'm stone cold sober, had a good sleep and I'm
> still baffled as to the point if any.  Could someone please enlighten me?
>

Here's what I think he is saying. I am posting this to test the water. I 
am also confused, and if I have got it wrong hopefully someone will 
correct me.

In python 3.3, unicode strings are now stored as follows -
   if all characters can be represented by 1 byte, the entire string is
   composed of 1-byte characters
   else if all characters can be represented by 1 or 2 bytes, the entire
   string is composed of 2-byte characters
   else the entire string is composed of 4-byte characters
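The three cases above can be observed, roughly, with `sys.getsizeof`. This is a sketch that assumes CPython 3.3 or later; the helper name `bytes_per_char` is invented for illustration, and the numbers are an implementation detail rather than a language guarantee.

```python
import sys

def bytes_per_char(ch):
    """Estimate storage per code point for a string made only of `ch`.

    Comparing two lengths of the same string cancels out the fixed
    object header, leaving only the per-character cost.
    """
    short, longer = ch * 10, ch * 20
    return (sys.getsizeof(longer) - sys.getsizeof(short)) // 10

for label, ch in [("ASCII", "a"), ("Latin-1", "\u00e9"),
                  ("BMP", "\u767d"), ("astral", "\U0001F600")]:
    print(label, bytes_per_char(ch))
```

Note that the width is chosen per string, so a single wide character makes every character in that string take the wider slot.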

There is an overhead in making this choice, to detect the lowest number 
of bytes required.

jmfauth believes that this only benefits 'english-speaking' users, as 
the rest of the world will tend to have strings where at least one 
character requires 2 or 4 bytes. So they incur the overhead, without 
getting any benefit.

Therefore, I think he is saying that he would have preferred that python 
standardise on 4-byte characters, on the grounds that the saving in 
memory does not justify the performance overhead.

Frank Millman

#27876

From	wxjmfauth@gmail.com
Date	2012-08-25 08:47 -0700
Message-ID	<mailman.3805.1345909675.4697.python-list@python.org>
In reply to	#27860

On Saturday 25 August 2012 11:46:34 UTC+2, Frank Millman wrote:
> On 25/08/2012 10:58, Mark Lawrence wrote:
> > On 25/08/2012 08:27, wxjmfauth@gmail.com wrote:
> >>
> >> Unicode design: a flat table of code points, where all code
> >> points are "equals".
> >> As soon as one attempts to escape from this rule, one has to
> >> "pay" for it.
> >> The creator of this machinery (flexible string representation)
> >> can not even benefit from it in his native language (I think
> >> I'm correctly informed).
> >>
> >> Hint: Google -> "Das grosse Eszett"
> >>
> >> jmf
> >>
> >
> > It's Saturday morning, I'm stone cold sober, had a good sleep and I'm
> > still baffled as to the point if any.  Could someone please enlighten me?
> >
>
> Here's what I think he is saying. I am posting this to test the water. I
> am also confused, and if I have got it wrong hopefully someone will
> correct me.
>
> In python 3.3, unicode strings are now stored as follows -
>    if all characters can be represented by 1 byte, the entire string is
>    composed of 1-byte characters
>    else if all characters can be represented by 1 or 2 bytes, the entire
>    string is composed of 2-byte characters
>    else the entire string is composed of 4-byte characters
>
> There is an overhead in making this choice, to detect the lowest number
> of bytes required.
>
> jmfauth believes that this only benefits 'english-speaking' users, as
> the rest of the world will tend to have strings where at least one
> character requires 2 or 4 bytes. So they incur the overhead, without
> getting any benefit.
>
> Therefore, I think he is saying that he would have preferred that python
> standardise on 4-byte characters, on the grounds that the saving in
> memory does not justify the performance overhead.
>
> Frank Millman

Very well explained. Thanks.

More precisely, those affected are not only the 'english-speaking'
users, but all users who use non-latin-1 characters.
(See the title of this topic, ... typography).

Being latin-1 compliant and unicode compliant at the same time is
a plain absurdity in the mathematical sense.

---

For those who do not know, the go language has introduced
the rune type. As far as I know, nobody is complaining, I
have not even seen a discussion related to this subject.

100% Unicode compliant from day 0. Congratulations.

jmf

#27878

From	wxjmfauth@gmail.com
Date	2012-08-25 08:47 -0700
Message-ID	<f6266544-d67c-4589-a3ed-c14428ead237@googlegroups.com>
In reply to	#27860

On Saturday 25 August 2012 11:46:34 UTC+2, Frank Millman wrote:
> On 25/08/2012 10:58, Mark Lawrence wrote:
> > On 25/08/2012 08:27, wxjmfauth@gmail.com wrote:
> >>
> >> Unicode design: a flat table of code points, where all code
> >> points are "equals".
> >> As soon as one attempts to escape from this rule, one has to
> >> "pay" for it.
> >> The creator of this machinery (flexible string representation)
> >> can not even benefit from it in his native language (I think
> >> I'm correctly informed).
> >>
> >> Hint: Google -> "Das grosse Eszett"
> >>
> >> jmf
> >>
> >
> > It's Saturday morning, I'm stone cold sober, had a good sleep and I'm
> > still baffled as to the point if any.  Could someone please enlighten me?
> >
>
> Here's what I think he is saying. I am posting this to test the water. I
> am also confused, and if I have got it wrong hopefully someone will
> correct me.
>
> In python 3.3, unicode strings are now stored as follows -
>    if all characters can be represented by 1 byte, the entire string is
>    composed of 1-byte characters
>    else if all characters can be represented by 1 or 2 bytes, the entire
>    string is composed of 2-byte characters
>    else the entire string is composed of 4-byte characters
>
> There is an overhead in making this choice, to detect the lowest number
> of bytes required.
>
> jmfauth believes that this only benefits 'english-speaking' users, as
> the rest of the world will tend to have strings where at least one
> character requires 2 or 4 bytes. So they incur the overhead, without
> getting any benefit.
>
> Therefore, I think he is saying that he would have preferred that python
> standardise on 4-byte characters, on the grounds that the saving in
> memory does not justify the performance overhead.
>
> Frank Millman

Very well explained. Thanks.

More precisely, those affected are not only the 'english-speaking'
users, but all users who use non-latin-1 characters.
(See the title of this topic, ... typography).

Being latin-1 compliant and unicode compliant at the same time is
a plain absurdity in the mathematical sense.

---

For those who do not know, the go language has introduced
the rune type. As far as I know, nobody is complaining, I
have not even seen a discussion related to this subject.

100% Unicode compliant from day 0. Congratulations.

jmf

#27888

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-08-25 16:26 -0600
Message-ID	<mailman.3816.1345933655.4697.python-list@python.org>
In reply to	#27878

On Sat, Aug 25, 2012 at 9:47 AM,  <wxjmfauth@gmail.com> wrote:
> For those who do not know, the go language has introduced
> the rune type. As far as I know, nobody is complaining, I
> have not even seen a discussion related to this subject.

Python has that also.  We call it "int".

More seriously, strings in Go are not sequences of runes.  They're
actually arrays of UTF-8 bytes.  That means that they're quite
efficient for ASCII strings, at the expense of other characters, like
Chinese (wait, this sounds familiar for some reason).  It also means
that you have to bend over backwards if you want to work with actual
runes instead of bytes.  Want to know how many characters are in your
string?  Don't call len() on it -- that will only tell you how many
bytes are in it.  Don't try to index or slice it either -- that will
(accidentally) work for ASCII strings, but for other strings your
indexes will be wrong.  If you're unlucky you might even split up the
string in the middle of a character, and now your string has invalid
characters in it.  The right way to do it looks something like this:

len([]rune("白鵬翔"))  // get the length of the string in characters
string([]rune("白鵬翔")[0:2])  // get the substring containing the first two characters

It reminds me of working in Python 2.X, except that instead of an
actual unicode type you just have arrays of ints.
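For comparison, the same pitfall can be reproduced in Python 3 by dropping down to the encoded bytes. This is an illustrative sketch only; the variable names are made up, and `raw` merely stands in for the byte-level view that a Go string exposes.

```python
name = "白鵬翔"
raw = name.encode("utf-8")   # byte-level view of the same text

char_count = len(name)       # counts code points
byte_count = len(raw)        # counts UTF-8 bytes

first_two_chars = name[0:2]  # slicing str works per code point
first_two_bytes = raw[0:2]   # slicing bytes can cut a character in half

try:
    first_two_bytes.decode("utf-8")
    split_is_valid = True
except UnicodeDecodeError:
    # The slice ended in the middle of a multi-byte sequence.
    split_is_valid = False

print(char_count, byte_count, first_two_chars, split_is_valid)
```

With `str`, the obvious calls give character-level answers; the byte-level surprises only show up once the text has been explicitly encoded.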

#27906

From	wxjmfauth@gmail.com
Date	2012-08-25 23:59 -0700
Message-ID	<4853fddf-5e4d-4c11-9a19-5a1dbe4cbc20@googlegroups.com>
In reply to	#27888

On Sunday 26 August 2012 00:26:56 UTC+2, Ian wrote:
> On Sat, Aug 25, 2012 at 9:47 AM,  <wxjmfauth@gmail.com> wrote:
> > For those who do not know, the go language has introduced
> > the rune type. As far as I know, nobody is complaining, I
> > have not even seen a discussion related to this subject.
>
> Python has that also.  We call it "int".
>
> More seriously, strings in Go are not sequences of runes.  They're
> actually arrays of UTF-8 bytes.  That means that they're quite
> efficient for ASCII strings, at the expense of other characters, like
> Chinese (wait, this sounds familiar for some reason).  It also means
> that you have to bend over backwards if you want to work with actual
> runes instead of bytes.  Want to know how many characters are in your
> string?  Don't call len() on it -- that will only tell you how many
> bytes are in it.  Don't try to index or slice it either -- that will
> (accidentally) work for ASCII strings, but for other strings your
> indexes will be wrong.  If you're unlucky you might even split up the
> string in the middle of a character, and now your string has invalid
> characters in it.  The right way to do it looks something like this:
>
> len([]rune("白鵬翔"))  // get the length of the string in characters
> string([]rune("白鵬翔")[0:2])  // get the substring containing the first two characters
>
> It reminds me of working in Python 2.X, except that instead of an
> actual unicode type you just have arrays of ints.


Sorry, you do not get it.

The rune is an alias for int32. A sequence of runes is a
sequence of int32's. Go does not spend its time on
machinery to work with, to differentiate, or to keep this
sequence in memory according to the *characters* composing this
"array of code points".

The message is even stronger. Use runes to work comfortably [*]
with unicode:
rune -> int32 -> utf32 -> unicode (the perfect scheme, can't be
better)
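The "rune -> int32 -> utf32" chain can be mimicked in Python for illustration. This is only a sketch with made-up variable names; it shows that a list of code points and a UTF-32 byte string carry the same numbers, not how Go or CPython store text internally.

```python
import struct

# Characters that need 1, 2, 3 and 4 bytes respectively in UTF-8.
text = "a\u00e9\u767d\U0001F600"

# A list of code points, roughly what a slice of runes would hold.
code_points = [ord(ch) for ch in text]

# UTF-32 (little-endian, no byte order mark) spends 4 bytes on every code point.
utf32 = text.encode("utf-32-le")
unpacked = list(struct.unpack("<%dI" % len(text), utf32))

# Converting the integers back gives the original string.
round_trip = "".join(chr(cp) for cp in code_points)

print(code_points, len(utf32), unpacked == code_points, round_trip == text)
```

The fixed width makes indexing trivial, at the cost of four bytes even for characters that would fit in one.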

[*] This is beyond my skill and my knowledge, but if I understood correctly,
this rune is even technically optimized to ensure it is always
an int32.

len() and slicing have nothing to do with this.

My experience with go is equal to zero + epsilon.

jmf

#27932

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-08-26 09:50 -0600
Message-ID	<mailman.3842.1345996272.4697.python-list@python.org>
In reply to	#27906

On Sun, Aug 26, 2012 at 12:59 AM,  <wxjmfauth@gmail.com> wrote:
> Sorry, you do not get it.
>
> The rune is an alias for int32. A sequence of runes is a
> sequence of int32's. Go does not spend its time on
> machinery to work with, to differentiate, or to keep this
> sequence in memory according to the *characters* composing this
> "array of code points".
>
> The message is even stronger. Use runes to work comfortably [*]
> with unicode:
> rune -> int32 -> utf32 -> unicode (the perfect scheme, can't be
> better)

I understand what rune is.  I think you've missed my complaint, which
is that although rune is the basic building block of Unicode strings
-- representing a single Unicode character -- strings in Go are not
built from runes but from bytes.  If you want to do any actual work
with Unicode strings, then you have to first convert them to runes or
arrays of runes.  The conceptual cost of this is that the object
you're working with is no longer a string.

You call this the "perfect scheme" for working with Unicode.  Why does
the "perfect scheme" for Unicode make it *easier* to write buggy code
that only works for ASCII than to write correct code that works for
all characters?  This is IMO where Python 3 gets it right.  When you
want to work with Unicode strings, you just work with Unicode strings
-- none of this nonsense of first explicitly converting the string to
an array of ints that looks nothing like a string at a high level.
The only place Python 3 makes you worry about converting strings is at
the boundaries of your program, where decoding from bytes to strings
and back is necessary.
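The "convert only at the boundaries" pattern might look like the following in Python 3. It is a minimal sketch: the sample text and variable names are invented, and the `raw` bytes stand in for data arriving from a file or socket.

```python
# Bytes arriving from outside the program (assumed here to be UTF-8).
raw = "na\u00efve caf\u00e9".encode("utf-8")

# Boundary in: decode once, as early as possible.
text = raw.decode("utf-8")

# Inside the program: ordinary string operations, character by character.
shouted = text.upper()
length = len(text)

# Boundary out: encode once, as late as possible.
out = shouted.encode("utf-8")

print(length, len(raw), shouted)
```

Between the two boundaries nothing in the code needs to know which encoding the data used.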

#27907

From	wxjmfauth@gmail.com
Date	2012-08-25 23:59 -0700
Message-ID	<mailman.3831.1345964382.4697.python-list@python.org>
In reply to	#27888

On Sunday 26 August 2012 00:26:56 UTC+2, Ian wrote:
> On Sat, Aug 25, 2012 at 9:47 AM,  <wxjmfauth@gmail.com> wrote:
> > For those who do not know, the go language has introduced
> > the rune type. As far as I know, nobody is complaining, I
> > have not even seen a discussion related to this subject.
>
> Python has that also.  We call it "int".
>
> More seriously, strings in Go are not sequences of runes.  They're
> actually arrays of UTF-8 bytes.  That means that they're quite
> efficient for ASCII strings, at the expense of other characters, like
> Chinese (wait, this sounds familiar for some reason).  It also means
> that you have to bend over backwards if you want to work with actual
> runes instead of bytes.  Want to know how many characters are in your
> string?  Don't call len() on it -- that will only tell you how many
> bytes are in it.  Don't try to index or slice it either -- that will
> (accidentally) work for ASCII strings, but for other strings your
> indexes will be wrong.  If you're unlucky you might even split up the
> string in the middle of a character, and now your string has invalid
> characters in it.  The right way to do it looks something like this:
>
> len([]rune("白鵬翔"))  // get the length of the string in characters
> string([]rune("白鵬翔")[0:2])  // get the substring containing the first two characters
>
> It reminds me of working in Python 2.X, except that instead of an
> actual unicode type you just have arrays of ints.


Sorry, you do not get it.

The rune is an alias for int32. A sequence of runes is a
sequence of int32's. Go does not spend its time using
machinery to work with, to differentiate, to keep in memory
this sequence according to the *characters* composing this
"array of code points".

The message is even stronger. Use runes to work comfortably [*]
with unicode:
rune -> int32 -> utf32 -> unicode (the perfect scheme, can't be
better)

[*] This is beyond my skill and my knowledge, but if I understood
correctly, the rune is even technically optimized to ensure it is
always an int32.

len() and slices have nothing to do with this.

My experience with go is equal to zero + epsilon.

jmf

#27913

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-08-26 11:49 +0000
Message-ID	<503a0d51$0$6574$c3e8da3$5496439d@news.astraweb.com>
In reply to	#27907

On Sat, 25 Aug 2012 23:59:34 -0700, wxjmfauth wrote:

> On Sunday, 26 August 2012 00:26:56 UTC+2, Ian wrote:

>> More seriously, strings in Go are not sequences of runes.  They're
>> actually arrays of UTF-8 bytes.

Actually, it's worse than that. Strings in Go aren't even proper UTF-8. 
They are arbitrary bytes, which means you can create strings which are 
invalid Unicode.

Go looks like an interesting language, but it seems to me that they have 
totally screwed up strings. At least Python had the excuse that it is 20 
years old and carrying the old ASCII baggage. Nobody used Unicode in 1992 
when Python was invented. What is Google's excuse for getting Unicode 
wrong?

In Go, strings are UTF-8 encoded sequences of bytes, except when they're 
not, in which case they're arbitrary bytes. You can't tell if a string is 
valid UTF-8 unless you carefully inspect every single character and 
decide for yourself if it is valid. Don't know the rules for valid UTF-8? 
Too bad.

This also means that basic string operations like slicing are both *slow* 
and *wrong* -- they are slow, because you have to track character 
boundaries yourself. And they are wrong, because most people won't 
bother, they'll just assume each character is one byte.
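
To make the failure mode concrete, here is a small Python 3 sketch of what happens when UTF-8 bytes are sliced as if one byte were one character. It only imitates the byte-level behaviour described above using Python's bytes type; it is not Go code, and the names are illustrative.

```python
text = "10€"
raw = text.encode("utf-8")   # 5 bytes for 3 characters

# Slicing the bytes as if one byte were one character cuts the
# euro sign in the middle of its three-byte sequence.
truncated = raw[:3]

try:
    truncated.decode("utf-8")
    split_detected = False
except UnicodeDecodeError:
    # The slice ends with an incomplete multi-byte sequence.
    split_detected = True

print(len(text), len(raw), split_detected)
```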

See here for more information:

http://comments.gmane.org/gmane.comp.lang.go.general/56245

Some useful quotes:

-  "Strings are *not* required to be UTF-8."

- "If the string must always be valid UTF-8 then relatively expensive
   validation is required for many operations. Plus making those
   operations able to fail complicates the interface."

- "In almost all cases strings are just byte arrays."

- "Go simply doesn't have 8-bit Unicode strings"

- "Python3 can afford the luxury of storing strings in UCS-2/UCS-4, 
  Go can't."

I don't question that Go needs a type for arbitrary bytes. But that 
should be "bytes", not "string", and it should be there for the advanced 
programmers who *need* to worry about bytes. Programmers who want to 
include strings in their applications (i.e. all of them) shouldn't need 
to care that "$" is one byte, "¢" is two, "€" is three, and "𤭢" 
(U+24B62) is four. With Python 3.3, it *just works*. With Go, it doesn't.
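
The byte counts quoted above can be checked with a short Python 3 sketch (assuming Python 3.3 or later, where a str always counts one item per code point):

```python
# The four sample characters from the paragraph above.
samples = ["$", "¢", "€", "\U00024B62"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # Each sample is one character, whatever its encoded size.
    print(f"U+{ord(ch):04X}: {len(ch)} character, {len(encoded)} UTF-8 byte(s)")

widths = [len(ch.encode("utf-8")) for ch in samples]
print(widths)
```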

In my not-so-humble opinion, Go has made a silly design error. Go 
programmers will be paying for this mistake for at least a decade. What 
they should have done is create two data types:

1) Strings which are guaranteed to be valid Unicode. That could be UTF-32 
or a PEP 393 approach, depending on how much memory you want to use, or 
even UTF-16 if you don't mind the complication of surrogate pairs.

2) Bytes which are not guaranteed to be valid Unicode but let the 
programmer work with arbitrary bytes.

(If this sounds familiar, it should -- it is exactly what Python 3 does. 
We have a string type that guarantees to be valid Unicode, and a bytes 
type that doesn't.)
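
A rough Python 3 sketch of that separation; the byte values are arbitrary ones chosen for illustration. One caveat: a Python str can still hold lone surrogate code points, so "valid Unicode" here means "a sequence of code points", not "always encodable".

```python
payload = bytes([0xFF, 0xFE, 0x41])   # arbitrary bytes, not valid UTF-8

# bytes -> str is an explicit step, and it validates the data.
try:
    payload.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

# str and bytes never mix implicitly.
try:
    "abc" + payload
    mixed = True
except TypeError:
    mixed = False

# A lenient decode is available, but it has to be asked for by name.
replaced = payload.decode("utf-8", errors="replace")

print(is_valid_utf8, mixed, replaced.endswith("A"))
```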

As given, *every single programmer* who wants to use Unicode in Go is now 
responsible for doing all the hard work of validating UTF-8, converting 
from bytes to strings, etc. Sure, eventually Go will have libraries to do 
that, but not yet, and even when it does, many people will not use them 
and their code will fail to handle Unicode correctly.

Right now, every Go programmer who wants Unicode has to pay the cost of 
the freedom to have arbitrary byte sequences, whether they need those 
arbitrary bytes or not. The consequence is that instead of Go making 
Unicode as trivial and easy to use as it should be, it will be hard to 
get right, annoying, slow and painful. Another generation of programmers 
will grow up thinking that Unicode is all too difficult and we should 
stick to just plain ASCII.

Since Go doesn't have Unicode strings, you can never trust that a string 
is valid UTF-8, you can't slice it efficiently, you can't get the length 
in characters, you can't write it to a file and have other applications 
be able to read it. Sure, sometimes it will work, and then somebody 
will input a Euro sign into your application, and it will blow up.

Why am I not surprised that JMF misunderstands both Go byte-strings and 
Python Unicode strings?

> Sorry, you do not get it.
> 
> The rune is an alias for int32. A sequence of runes is a sequence of
> int32's.

It certainly is not. Runes are variable-width. Here, for example, are a 
number of Go functions which return a single rune and its width in bytes:

http://golang.org/pkg/unicode/utf8/

> Go does not spend its time using machinery to work with, to
> differentiate, to keep in memory this sequence according to the
> *characters* composing this "array of code points".
> 
> The message is even stronger. Use runes to work comfortably [*] with
> unicode:
> rune -> int32 -> utf32 -> unicode (the perfect scheme, can't be better)

Runes are not int32, and int32 is not UTF-32.

Whether UTF-32 is the "perfect scheme" for Unicode is a matter of opinion.

-- 
Steven

#27931

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-08-26 09:40 -0600
Message-ID	<mailman.3841.1345995646.4697.python-list@python.org>
In reply to	#27913

On Sun, Aug 26, 2012 at 5:49 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
>> Sorry, you do not get it.
>>
>> The rune is an alias for int32. A sequence of runes is a sequence of
>> int32's.
>
> It certainly is not. Runes are variable-width. Here, for example, are a
> number of Go functions which return a single rune and its width in bytes:
>
> http://golang.org/pkg/unicode/utf8/

I think the documentation for those functions is simply badly worded.
The "width in bytes" it returns is not the width of the rune (which as
jmf notes is simply an alias for int32 that stores a single code
point).  It means the UTF-8 width of the character, i.e. the number of
UTF-8 bytes the function "consumed", presumably so that the caller can
then reslice the data with that many bytes fewer.
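
The "code point plus bytes consumed" pattern can be sketched in Python 3. The helper `decode_first` is hypothetical and only loosely modelled on the shape of Go's `utf8.DecodeRune`; raising an exception on invalid input is a choice made for this sketch, not a description of Go's behaviour.

```python
from typing import List, Tuple


def decode_first(data: bytes) -> Tuple[int, int]:
    """Return (code point, UTF-8 bytes consumed) for the first character."""
    if not data:
        raise ValueError("empty input")
    # A UTF-8 sequence is 1 to 4 bytes long; try the shortest prefix first.
    for width in range(1, 5):
        try:
            ch = data[:width].decode("utf-8")
        except UnicodeDecodeError:
            continue
        return ord(ch), width
    raise ValueError("no valid UTF-8 sequence at start of data")


# The caller reslices past the consumed bytes, as described above.
data = "a€".encode("utf-8")
runes: List[Tuple[int, int]] = []
while data:
    code_point, width = decode_first(data)
    runes.append((code_point, width))
    data = data[width:]

print(runes)
```

The width belongs to the encoded form in the byte buffer; the code point itself is just an integer.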

#27946

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-08-26 20:13 +0000
Message-ID	<503a8361$0$6574$c3e8da3$5496439d@news.astraweb.com>
In reply to	#27931

On Sun, 26 Aug 2012 09:40:13 -0600, Ian Kelly wrote:

> I think the documentation for those functions is simply badly worded.
> The "width in bytes" it returns is not the width of the rune (which as
> jmf notes is simply an alias for int32 that stores a single code point).

Is this documented somewhere?

I can't tell you how long I spent unsuccessfully googling for variations 
on "go language runes", which unsurprisingly mostly came back with pages 
about Germanic runes and elf runes but not Go runes. I read the golang 
FAQs, which mentioned Unicode *once* and runes not at all. Obviously Go 
language programmers don't care much about Unicode.

>  It means the UTF-8 width of the character, i.e. the number of UTF-8
> bytes the function "consumed", presumably so that the caller can then
> reslice the data with that many bytes fewer.

That makes sense, given the lousy string implementation and API they're 
working with.

I note that not all 32-bit ints are valid code points. I suppose I can 
see sense in having rune be a 32-bit integer value limited to those valid 
code points. (But, dammit, why not call it a code point?) But if rune is 
merely an alias for int32, why not just call it int32?
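
As an illustration of "not all 32-bit ints are valid code points", here is a Python 3 sketch. The helper names are hypothetical; the range limit and the surrogate block come from the Unicode standard.

```python
MAX_CODE_POINT = 0x10FFFF


def is_code_point(n: int) -> bool:
    """True if n lies in the Unicode code space."""
    return 0 <= n <= MAX_CODE_POINT


def is_scalar_value(n: int) -> bool:
    """True if n is a code point outside the surrogate block."""
    return is_code_point(n) and not (0xD800 <= n <= 0xDFFF)


# Python's chr() enforces the same upper bound.
try:
    chr(MAX_CODE_POINT + 1)
    chr_rejects_out_of_range = False
except ValueError:
    chr_rejects_out_of_range = True

print(is_code_point(0x24B62), is_code_point(2**31 - 1), chr_rejects_out_of_range)
```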

-- 
Steven

#27947

From	Dan Sommers <dan@tombstonezero.net>
Date	2012-08-26 13:45 -0700
Message-ID	<mailman.3853.1346014938.4697.python-list@python.org>
In reply to	#27946

On 2012-08-26 at 20:13:21 +0000,
Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:

> I note that not all 32-bit ints are valid code points. I suppose I can
> see sense in having rune be a 32-bit integer value limited to those
> valid code points. (But, dammit, why not call it a code point?) But if
> rune is merely an alias for int32, why not just call it int32?

Having a "code point" type is a good idea.  If nothing else, human code
readers can tell that you're doing something with characters rather than
something with integers.  If your language provides any sort of type
safety, then you get that, too.

Calling your code points int32 is a bad idea for the same reason that it
turned out to be a bad idea to call all my old ASCII characters int8.
Or all my pointers int<n> (or unsigned int<n>), for n in 16, 20, 24, 32,
36, 48, or 64 (or I'm sure other values of n that I never had the pain
or pleasure of using).
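
In Python terms, the same idea could be sketched with `typing.NewType`. The names `CodePoint` and `describe` are made up for illustration, and the distinction is enforced only by a static type checker, not at run time.

```python
from typing import NewType

# A distinct name for "an int that is meant to be a code point".
CodePoint = NewType("CodePoint", int)


def describe(cp: CodePoint) -> str:
    """Format a code point the way Unicode charts do."""
    return f"U+{cp:04X} {chr(cp)!r}"


euro = CodePoint(0x20AC)
print(describe(euro))

# At run time the value is still a plain int; a type checker such as
# mypy would flag describe(42) because 42 is not declared a CodePoint.
print(isinstance(euro, int))
```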

Dan

#27994

From	wxjmfauth@gmail.com
Date	2012-08-27 12:16 -0700
Message-ID	<2e92da71-fbd2-467f-9088-1c79fa7bcf69@googlegroups.com>
In reply to	#27947

On Sunday, 26 August 2012 22:45:09 UTC+2, Dan Sommers wrote:
> On 2012-08-26 at 20:13:21 +0000,
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
>
> > I note that not all 32-bit ints are valid code points. I suppose I can
> > see sense in having rune be a 32-bit integer value limited to those
> > valid code points. (But, dammit, why not call it a code point?) But if
> > rune is merely an alias for int32, why not just call it int32?
>
> Having a "code point" type is a good idea.  If nothing else, human code
> readers can tell that you're doing something with characters rather than
> something with integers.  If your language provides any sort of type
> safety, then you get that, too.
>
> Calling your code points int32 is a bad idea for the same reason that it
> turned out to be a bad idea to call all my old ASCII characters int8.
> Or all my pointers int<n> (or unsigned int<n>), for n in 16, 20, 24, 32,
> 36, 48, or 64 (or I'm sure other values of n that I never had the pain
> or pleasure of using).

And this is precisely the concept of rune, a real int which
is a name for a Unicode code point.

Go "has" the integers int32 and int64. A rune ensures
the usage of int32. "Text libs" use runes. Go has only
bytes and runes.

If you do not like the word "perfection", this mechanism
has at least an ideal simplicity (with probably a lot
of positive consequences).

rune -> int32 -> utf32 -> unicode code points.

- Why int32 and not uint32? No idea, I tried to find an
answer without asking.
- I find the name "rune" elegant. "char" would have been
too confusing.

End. This is supposed to be a Python forum.
jmf

#27998

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-08-27 14:14 -0600
Message-ID	<mailman.3884.1346098483.4697.python-list@python.org>
In reply to	#27994

On Mon, Aug 27, 2012 at 1:16 PM,  <wxjmfauth@gmail.com> wrote:
> - Why int32 and not uint32? No idea, I tried to find an
> answer without asking.

UCS-4 is technically only a 31-bit encoding. The sign bit is not used,
so the choice of int32 vs. uint32 is inconsequential.

(In fact, since they made the decision to limit Unicode to the range 0
- 0x0010FFFF, one might even point out that the *entire high-order
byte* as well as 3 bits of the next byte are irrelevant.  Truly,
UTF-32 is not designed for memory efficiency.)
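
The bit arithmetic behind that remark can be checked with a few lines of Python 3 (variable names are illustrative):

```python
MAX_CODE_POINT = 0x10FFFF

bits_needed = MAX_CODE_POINT.bit_length()   # bits required for the largest code point
unused_bits = 32 - bits_needed              # bits of a 32-bit unit that are never set

# Split the unused bits into whole bytes plus leftover bits.
whole_unused_bytes, leftover_bits = divmod(unused_bits, 8)

# The largest code point also fits below the sign bit of an int32.
fits_in_int32 = MAX_CODE_POINT < 2**31

print(bits_needed, unused_bits, whole_unused_bytes, leftover_bits, fits_in_int32)
```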

#27999

From	wxjmfauth@gmail.com
Date	2012-08-27 13:37 -0700
Message-ID	<mailman.3885.1346099824.4697.python-list@python.org>
In reply to	#27998

On Monday, 27 August 2012 22:14:07 UTC+2, Ian wrote:
> On Mon, Aug 27, 2012 at 1:16 PM,  <wxjmfauth@gmail.com> wrote:
> > - Why int32 and not uint32? No idea, I tried to find an
> > answer without asking.
>
> UCS-4 is technically only a 31-bit encoding. The sign bit is not used,
> so the choice of int32 vs. uint32 is inconsequential.
>
> (In fact, since they made the decision to limit Unicode to the range 0
> - 0x0010FFFF, one might even point out that the *entire high-order
> byte* as well as 3 bits of the next byte are irrelevant.  Truly,
> UTF-32 is not designed for memory efficiency.)

I know all this. The question is more, why not a uint32 knowing
there are only positive code points. It seems to me more "natural".

#28053

From	wxjmfauth@gmail.com
Date	2012-08-29 04:38 -0700
Message-ID	<e49d21ac-b14e-4e9d-befa-8f0008c87c58@googlegroups.com>
In reply to	#27998

On Monday, 27 August 2012 22:37:03 UTC+2, (unknown) wrote:
> On Monday, 27 August 2012 22:14:07 UTC+2, Ian wrote:
> > On Mon, Aug 27, 2012 at 1:16 PM,  <wxjmfauth@gmail.com> wrote:
> > > - Why int32 and not uint32? No idea, I tried to find an
> > > answer without asking.
> >
> > UCS-4 is technically only a 31-bit encoding. The sign bit is not used,
> > so the choice of int32 vs. uint32 is inconsequential.
> >
> > (In fact, since they made the decision to limit Unicode to the range 0
> > - 0x0010FFFF, one might even point out that the *entire high-order
> > byte* as well as 3 bits of the next byte are irrelevant.  Truly,
> > UTF-32 is not designed for memory efficiency.)
>
> I know all this. The question is more, why not a uint32 knowing
> there are only positive code points. It seems to me more "natural".

Answer found. In short: using negative ints
simplifies internal tasks.
