
Flexible string representation, unicode, typography, ...

Started by	wxjmfauth@gmail.com
First post	2012-08-23 05:47 -0700
Last post	2012-08-25 07:23 -0400
Articles	15 on this page of 95 — 21 participants


  Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-23 05:47 -0700
    Re: Flexible string representation, unicode, typography, ... Neil Hodgson <nhodgson@iinet.net.au> - 2012-08-23 23:57 +1000
      Re: Flexible string representation, unicode, typography, ... MRAB <python@mrabarnett.plus.com> - 2012-08-23 16:11 +0100
      Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-23 09:19 -0600
      Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-23 11:33 -0700
        Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-23 13:22 -0600
          Re: Flexible string representation, unicode, typography, ... rusi <rustompmody@gmail.com> - 2012-08-24 09:06 -0700
            Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-24 17:47 +0100
            Re: Flexible string representation, unicode, typography, ... Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-08-24 14:34 -0400
        Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-23 20:34 +0100
    Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-23 15:18 +0100
    Re: Flexible string representation, unicode, typography, ... Ramchandra Apte <maniandram01@gmail.com> - 2012-08-24 07:38 -0700
      Re: Flexible string representation, unicode, typography, ... Antoine Pitrou <solipsis@pitrou.net> - 2012-08-25 00:24 +0000
        Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 00:27 -0700
          Re: Flexible string representation, unicode, typography, ... Ben Finney <ben+python@benfinney.id.au> - 2012-08-25 17:54 +1000
        Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 00:27 -0700
          Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-25 09:58 +0100
          Re: Flexible string representation, unicode, typography, ... Frank Millman <frank@chagford.com> - 2012-08-25 11:46 +0200
            Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 08:47 -0700
            Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 08:47 -0700
              Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-25 16:26 -0600
                Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 23:59 -0700
                  Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 09:50 -0600
                Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 23:59 -0700
                  Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 11:49 +0000
                    Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 09:40 -0600
                      Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 20:13 +0000
                        Re: Flexible string representation, unicode, typography, ... Dan Sommers <dan@tombstonezero.net> - 2012-08-26 13:45 -0700
                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 12:16 -0700
                            Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-27 14:14 -0600
                              Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 13:37 -0700
                              Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:38 -0700
                              Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:38 -0700
                            Re: Flexible string representation, unicode, typography, ... Neil Hodgson <nhodgson@iinet.net.au> - 2012-08-28 09:54 +1000
                              Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-29 13:59 +1000
                              Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-28 22:15 -0600
                                Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-29 08:05 +0000
                                Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:40 -0700
                                  Re: Flexible string representation, unicode, typography, ... Dave Angel <d@davea.name> - 2012-08-29 08:01 -0400
                                    Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 08:43 -0700
                                      Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-30 06:55 +0000
                                        Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-30 18:59 +1000
                                        Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-08-30 07:02 -0400
                                          Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-30 16:00 +0000
                                            Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-08-30 16:44 -0400
                                              Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-31 12:32 +0000
                                                Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-31 09:13 -0600
                                            Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-08-31 08:43 -0400
                                              Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-31 14:54 +0000
                                        Re: Flexible string representation, unicode, typography, ... Antoine Pitrou <solipsis@pitrou.net> - 2012-08-30 15:01 +0000
                                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 00:36 -0700
                                            Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 09:58 +0100
                                            Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-02 03:06 -0600
                                              Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 11:58 -0700
                                                Re: Flexible string representation, unicode, typography, ... Michael Torrie <torriem@gmail.com> - 2012-09-02 13:45 -0600
                                                Re: Flexible string representation, unicode, typography, ... Dave Angel <d@davea.name> - 2012-09-02 16:07 -0400
                                                Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-02 16:38 -0400
                                                Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-03 01:42 +0000
                                                  Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:26 +0300
                                                    Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-04 00:53 +0000
                                              Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 11:58 -0700
                                            Re: Flexible string representation, unicode, typography, ... Peter Otten <__peter__@web.de> - 2012-09-02 11:52 +0200
                                            Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 11:36 +0100
                                            Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-02 15:00 +0300
                                              Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 22:39 -0700
                                                Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-03 07:11 +0100
                                                Re: Flexible string representation, unicode, typography, ... Peter Otten <__peter__@web.de> - 2012-09-03 08:15 +0200
                                                Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-03 04:38 -0400
                                                Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:56 +0300
                                              Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 22:39 -0700
                                            Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 13:23 +0100
                                              Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-09-02 08:35 -0400
                                              Re: Flexible string representation, unicode, typography, ... Ramchandra Apte <maniandram01@gmail.com> - 2012-09-02 06:48 -0700
                                                Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 15:46 +0100
                                              Re: Flexible string representation, unicode, typography, ... Ramchandra Apte <maniandram01@gmail.com> - 2012-09-02 06:48 -0700
                                            Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-03 12:33 -0600
                                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 00:36 -0700
                                        Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-30 10:27 -0600
                                        Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-02 23:38 +0300
                                          Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-03 01:54 +0000
                                            Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-02 22:33 -0400
                                            Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-09-03 11:24 -0400
                                            Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:41 +0300
                                        Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 00:45 +0300
                                    Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-30 01:54 +1000
                                  Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-29 22:34 +1000
                                Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:40 -0700
                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 12:16 -0700
                        Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 15:42 -0600
                          Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 23:31 +0000
                            Re: Flexible string representation, unicode, typography, ... Paul Rubin <no.email@nospam.invalid> - 2012-08-26 17:47 -0700
          Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-25 21:04 +1000
          Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-25 12:05 +0100
          Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-25 21:19 +1000
          Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-08-25 07:23 -0400


#28334

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-09-02 22:33 -0400
Message-ID	<mailman.124.1346639651.27098.python-list@python.org>
In reply to	#28333

On 9/2/2012 9:54 PM, Steven D'Aprano wrote:
> On Sun, 02 Sep 2012 23:38:49 +0300, Serhiy Storchaka wrote:
>
>> On 30.08.12 09:55, Steven D'Aprano wrote:
>>> And Python's solution uses those: UCS-2, UCS-4, and UTF-8.
>>
>> I see that this misconception is widespread.
>
> I am not familiar enough with the C implementation to tell what Python
> 3.3 actually does, and the PEP assumes a fair amount of familiarity with
> the CPython source. So I welcome corrections.
>
>
>> In fact Python 3.3 uses four kinds of ready strings.
>>
>> * ASCII. All codes <= U+007F.
>> * UCS1. All codes <= U+00FF, at least one code > U+007F.
>> * UCS2. All codes <= U+FFFF, at least one code > U+00FF.
>> * UCS4. All codes <= U+0010FFFF, at least one code > U+FFFF.
>
> Where UCS1 is equivalent to Latin-1, correct?
>
> UCS2 is what Python 3.2 narrow builds use for all strings, including
> codes > U+FFFF using surrogate pairs.
>
> UCS4 is what Python 3.2 wide builds use for all strings.
>
> This means that Python 3.3 will no longer have surrogate pairs.

Basically, yes. I believe CPython will only use surrogate code points if
one requests errors='surrogateescape' on decoding or explicitly puts them
in a literal (\unnnn or \Ummmmmmmm). The consequences fall under the
'consenting adults' policy.

-- 
Terry Jan Reedy
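
A minimal sketch of the two routes Terry mentions, assuming CPython 3.3 or later. The byte value and the sample character are arbitrary choices made for illustration.

```python
# Assumes CPython 3.3+ (PEP 393 strings).

# A character outside the BMP is a single code point, not a surrogate pair.
astral = "\U0001F600"
print(len(astral))                    # 1

# Route 1: undecodable bytes are carried through as lone surrogates
# when the 'surrogateescape' error handler is requested.
raw = b"abc\xff"
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))                    # 'abc\udcff'

# The same handler restores the original bytes on encoding.
roundtrip = text.encode("utf-8", errors="surrogateescape")
print(roundtrip == raw)               # True

# Route 2: a surrogate written explicitly in a literal.
lone = "\ud800"
print(hex(ord(lone)))                 # 0xd800
```

In both cases the surrogate only appears because the programmer asked for it, which is the 'consenting adults' part.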


#28358

From	Roy Smith <roy@panix.com>
Date	2012-09-03 11:24 -0400
Message-ID	<roy-4C0CCA.11245603092012@news.panix.com>
In reply to	#28333

In article <50440de2$0$29967$c3e8da3$5496439d@news.astraweb.com>,
 Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:

> > Indexing is O(0) for any string.
> 
> I think you mean O(1) for constant-time lookups.

Why settle for constant-time, when you can have zero-time instead :-)


#28360

From	Serhiy Storchaka <storchaka@gmail.com>
Date	2012-09-03 18:41 +0300
Message-ID	<mailman.149.1346686927.27098.python-list@python.org>
In reply to	#28333

On 03.09.12 04:54, Steven D'Aprano wrote:
> This means that Python 3.3 will no longer have surrogate pairs.
>
> Am I right?

As Terry said, basically, yes. Python 3.3 does not need surrogate
pairs, but does not prevent their creation. You can create a surrogate
code point (U+D800..U+DFFF) intentionally (just as you can create a lone
combining accent or some other character that is meaningless on its own),
but you are less likely to get them unintentionally.
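
A short sketch of that point, assuming Python 3.3 or later. The specific code points (U+D800 and U+0301) are picked here only as examples.

```python
import unicodedata

lone_surrogate = chr(0xD800)          # created on purpose; nothing prevents it
lone_accent = "\u0301"                # COMBINING ACUTE ACCENT with no base character

print(unicodedata.category(lone_surrogate))   # Cs (surrogate)
print(unicodedata.category(lone_accent))      # Mn (nonspacing mark)

# A strict UTF-8 encoder refuses the lone surrogate...
try:
    lone_surrogate.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)                      # False

# ...whereas the lone accent, however odd on its own, encodes normally.
print(lone_accent.encode("utf-8"))    # b'\xcc\x81'
```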


#28323

From	Serhiy Storchaka <storchaka@gmail.com>
Date	2012-09-03 00:45 +0300
Message-ID	<mailman.117.1346622353.27098.python-list@python.org>
In reply to	#28092

On 02.09.12 23:38, Serhiy Storchaka wrote:
> Indexing is O(0) for any string.

Typo. O(1)
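
Why a fixed width per string gives constant-time indexing can be sketched with a toy model. This is not CPython's implementation: `code_point_at` is a hypothetical helper, and a UTF-32 buffer stands in for the widest of the internal layouts.

```python
def code_point_at(buf: bytes, width: int, i: int) -> int:
    """Return the i-th code point of a fixed-width, little-endian buffer."""
    start = i * width                 # one multiplication, no scanning
    return int.from_bytes(buf[start:start + width], "little")

s = "a\u20ac\U0001F600z"
buf = s.encode("utf-32-le")           # 4 bytes for every code point

print([hex(code_point_at(buf, 4, i)) for i in range(len(s))])
# ['0x61', '0x20ac', '0x1f600', '0x7a']

# The built-in str indexing agrees, whatever the character.
print(all(code_point_at(buf, 4, i) == ord(s[i]) for i in range(len(s))))  # True
```

With a variable-width layout such as UTF-8, the offset of item `i` cannot be computed this way without scanning the preceding bytes.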


#28068

From	Chris Angelico <rosuav@gmail.com>
Date	2012-08-30 01:54 +1000
Message-ID	<mailman.3939.1346255667.4697.python-list@python.org>
In reply to	#28059

On Thu, Aug 30, 2012 at 1:43 AM,  <wxjmfauth@gmail.com> wrote:
> If "Python" has found a new way to cover the set
> of the Unicode characters, why not propose it
> to the Unicode consortium?

Python's open source. If some other language wants to borrow the idea,
they can look at the code, or alternatively, just read PEP 393 and
implement something similar. It's a free world.

By the way, can you please trim the quoted text in your replies? It's
rather lengthy.

ChrisA


#28060

From	Chris Angelico <rosuav@gmail.com>
Date	2012-08-29 22:34 +1000
Message-ID	<mailman.3930.1346243680.4697.python-list@python.org>
In reply to	#28055

On Wed, Aug 29, 2012 at 9:40 PM,  <wxjmfauth@gmail.com> wrote:
> For a given coding scheme, all code points/characters are
> equivalent. Expecting to handle a sub-range in a coding
> scheme without shaking that coding scheme is impossible.

Not all codepoints are equally likely. That's the whole point behind
variable-length encodings like Huffman compression (eg deflation as
used in zip/gzip), UTF-8, quoted-printable, and Morse code. They
handle a sub-range efficiently and the rest of the range less
efficiently.
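
For UTF-8 in particular, the trade-off is easy to see by counting encoded bytes. A small sketch; the four sample characters are arbitrary choices, one from each length class.

```python
samples = {
    "a": "LATIN SMALL LETTER A",
    "\u00e9": "LATIN SMALL LETTER E WITH ACUTE",
    "\u20ac": "EURO SIGN",
    "\U0001F600": "GRINNING FACE",
}

# Number of bytes each character needs in UTF-8.
utf8_lengths = {name: len(ch.encode("utf-8")) for ch, name in samples.items()}
for name, n in utf8_lengths.items():
    print(f"{name}: {n} byte(s)")
# LATIN SMALL LETTER A: 1 byte(s)
# LATIN SMALL LETTER E WITH ACUTE: 2 byte(s)
# EURO SIGN: 3 byte(s)
# GRINNING FACE: 4 byte(s)
```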

> If a coding scheme does not give satisfaction, the only
> valid solution is to create a new coding scheme, cp1252,
> mac-roman, EBCDIC, ... or the interesting "TeX" case, where
> the "internal" coding depends on the fonts!

http://xkcd.com/927/

> This "Flexible String Representation" fails. Not only
> is it unable to stick with a coding scheme, it is
> a mixture of coding schemes, the worst of all possible
> implementations.

I propose, then, that we abolish files. Who *knows* how many different
things might be represented in a file! We need a single coding scheme
that can handle everything, without changing representation. This
ridiculous state of affairs must not go on; the same representation
can be used for bitmapped images or raw audio data!

ChrisA


#28056

From	wxjmfauth@gmail.com
Date	2012-08-29 04:40 -0700
Message-ID	<mailman.3927.1346240457.4697.python-list@python.org>
In reply to	#28044

On Wednesday, 29 August 2012 06:16:05 UTC+2, Ian wrote:
> On Tue, Aug 28, 2012 at 8:42 PM, rusi <rustompmody@gmail.com> wrote:
> > In summary:
> > 1. The problem is not on jmf's computer
> > 2. It is not windows-only
> > 3. It is not directly related to latin-1 encodable or not
> >
> > The only question which is not yet clear is this:
> > Given a typical string operation that is complexity O(n), in more
> > detail it is going to be O(a + bn)
> > If only a is worse going 3.2 to 3.3, it may be a small issue.
> > If b is worse by even a tiny amount, it is likely to be a significant
> > regression for some use-cases.
>
> As has been pointed out repeatedly already, this is a microbenchmark.
> jmf is focusing on one particular area (string construction) where
> Python 3.3 happens to be slower than Python 3.2, ignoring the fact
> that real code usually does lots of things other than building
> strings, many of which are slower to begin with.  In the real-world
> benchmarks that I've seen, 3.3 is as fast as or faster than 3.2.
> Here's a much more realistic benchmark that nonetheless still focuses
> on strings: word counting.
>
> Source: http://pastebin.com/RDeDsgPd
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc" "wc.wc('unilang8.htm')"
> 1000 loops, best of 3: 310 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc" "wc.wc('unilang8.htm')"
> 1000 loops, best of 3: 302 usec per loop
>
> "unilang8.htm" is an arbitrary UTF-8 document containing a broad swath
> of Unicode characters that I pulled off the web.  Even though this
> program is still mostly string processing, Python 3.3 wins.  Of
> course, that's not really a very good test -- since it reads the file
> on every pass, it probably spends more time in I/O than it does in
> actual processing.  Let's try it again with prepared string data:
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_str(t)"
> 10000 loops, best of 3: 87.3 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_str(t)"
> 10000 loops, best of 3: 84.6 usec per loop
>
> Nope, 3.3 still wins.  And just for the sake of my own curiosity, I
> decided to try it again using str.split() instead of a StringIO.
> Since str.split() creates more strings, I expect Python 3.2 might
> actually win this time.
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_split(t)"
> 10000 loops, best of 3: 88 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_split(t)"
> 10000 loops, best of 3: 76.5 usec per loop
>
> Interestingly, although Python 3.2 performs the splits in about the
> same time as the StringIO operation, Python 3.3 is significantly
> *faster* using str.split(), at least on this data set.
>
> > So doing some arm-chair thinking (I don't know the code and difficulty
> > involved):
> >
> > Clearly there are 3 string-engines in the python 3 world:
> > - 3.2 narrow
> > - 3.2 wide
> > - 3.3 (flexible)
> >
> > How difficult would it be to give the choice of string engine as a
> > command-line flag?
> > This would avoid the nuisance of having two binaries -- narrow and
> > wide.
>
> Quite difficult.  Even if we avoid having two or three separate
> binaries, we would still have separate binary representations of the
> string structs.  It makes the maintainability of the software go down
> instead of up.
>
> > And it would give the python programmer a choice of efficiency
> > profiles.
>
> So instead of having just one test for my Unicode-handling code, I'll
> now have to run that same test *three times* -- once for each possible
> string engine option.  Choice isn't always a good thing.
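
The benchmarked `wc` module lives behind the pastebin link and is not reproduced in the thread. The sketch below is only a guess at what functions with these names might look like: the bodies of `wc_str` and `wc_split` are assumptions made for illustration, not Ian's code.

```python
import io
from collections import Counter

def wc_str(text):
    """Count words by iterating over the lines of an in-memory text stream."""
    counts = Counter()
    for line in io.StringIO(text):
        counts.update(line.split())
    return counts

def wc_split(text):
    """Count words by splitting the whole string at once."""
    return Counter(text.split())

sample = "spam eggs spam\nham spam eggs\n"
print(wc_str(sample) == wc_split(sample))   # True
print(wc_split(sample)["spam"])             # 3
```

Either variant could be timed with `python -m timeit` in the same way as the commands quoted above.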

Forget Python and all these benchmarks. The problem
is on another level. Coding schemes, typography,
usage of characters, ...

For a given coding scheme, all code points/characters are
equivalent. Expecting to handle a sub-range in a coding
scheme without shaking that coding scheme is impossible.

If a coding scheme does not give satisfaction, the only
valid solution is to create a new coding scheme, cp1252,
mac-roman, EBCDIC, ... or the interesting "TeX" case, where
the "internal" coding depends on the fonts!

Unicode (utf***), as just another coding scheme, does
not escape this rule.

This "Flexible String Representation" fails. Not only
is it unable to stick with a coding scheme, it is
a mixture of coding schemes, the worst of all possible
implementations.

jmf


#27995

From	wxjmfauth@gmail.com
Date	2012-08-27 12:16 -0700
Message-ID	<mailman.3882.1346094990.4697.python-list@python.org>
In reply to	#27947

On Sunday, 26 August 2012 22:45:09 UTC+2, Dan Sommers wrote:
> On 2012-08-26 at 20:13:21 +0000,
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
>
> > I note that not all 32-bit ints are valid code points. I suppose I can
> > see sense in having rune be a 32-bit integer value limited to those
> > valid code points. (But, dammit, why not call it a code point?) But if
> > rune is merely an alias for int32, why not just call it int32?
>
> Having a "code point" type is a good idea.  If nothing else, human code
> readers can tell that you're doing something with characters rather than
> something with integers.  If your language provides any sort of type
> safety, then you get that, too.
>
> Calling your code points int32 is a bad idea for the same reason that it
> turned out to be a bad idea to call all my old ASCII characters int8.
> Or all my pointers int<n> (or unsigned int<n>), for n in 16, 20, 24, 32,
> 36, 48, or 64 (or I'm sure other values of n that I never had the pain
> or pleasure of using).

And this is precisely the concept of a rune: a real int that
serves as a name for a Unicode code point.

Go "has" the integers int32 and int64. A rune ensures
the use of int32. "Text libs" use runes. Go has only
bytes and runes.

If you do not like the word "perfection", this mechanism
has at least an ideal simplicity (with probably a lot
of positive consequences).

rune -> int32 -> utf32 -> unicode code points.

- Why int32 and not uint32? No idea, I tried to find an
answer without asking.
- I find the name "rune" elegant. "char" would have been
too confusing.

End. This is supposed to be a Python forum.
jmf
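
The idea of a dedicated code point type, as opposed to a bare 32-bit integer, can be sketched in Python. `CodePoint` is a hypothetical class invented for this illustration; the range check reflects Steven D'Aprano's remark that not every 32-bit integer is a valid code point. Whether surrogates should be rejected as well is a design choice left out here.

```python
class CodePoint(int):
    """An int restricted to the Unicode code space U+0000..U+10FFFF."""

    MAX = 0x10FFFF

    def __new__(cls, value):
        value = int(value)
        if not 0 <= value <= cls.MAX:
            raise ValueError(f"not a Unicode code point: {value:#x}")
        return super().__new__(cls, value)

    def __repr__(self):
        return f"CodePoint(U+{int(self):04X})"

euro = CodePoint(ord("\u20ac"))
print(repr(euro))                     # CodePoint(U+20AC)
print(chr(euro) == "\u20ac")          # True

try:
    CodePoint(0x110000)               # fits in an int32, but is out of range
    rejected = False
except ValueError:
    rejected = True
print(rejected)                       # True
```

A reader of code using such a type can tell that a value is meant as a character rather than as a number, which is the readability argument Dan Sommers makes.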


#27949

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-08-26 15:42 -0600
Message-ID	<mailman.3855.1346017353.4697.python-list@python.org>
In reply to	#27946

On Sun, Aug 26, 2012 at 2:13 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Sun, 26 Aug 2012 09:40:13 -0600, Ian Kelly wrote:
>
>> I think the documentation for those functions is simply badly worded.
>> The "width in bytes" it returns is not the width of the rune (which as
>> jmf notes is simply an alias for int32 that stores a single code point).
>
> Is this documented somewhere?

http://golang.org/ref/spec#Numeric_types


#27955

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-08-26 23:31 +0000
Message-ID	<503ab1d9$0$1555$c3e8da3$76491128@news.astraweb.com>
In reply to	#27949

On Sun, 26 Aug 2012 15:42:00 -0600, Ian Kelly wrote:

> On Sun, Aug 26, 2012 at 2:13 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> On Sun, 26 Aug 2012 09:40:13 -0600, Ian Kelly wrote:
>>
>>> I think the documentation for those functions is simply badly worded.
>>> The "width in bytes" it returns is not the width of the rune (which as
>>> jmf notes is simply an alias for int32 that stores a single code
>>> point).
>>
>> Is this documented somewhere?
> 
> http://golang.org/ref/spec#Numeric_types

Thanks.

Well that's just plain nuts.


-- 
Steven


#27956

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-08-26 17:47 -0700
Message-ID	<7x6285kvp5.fsf@ruckus.brouhaha.com>
In reply to	#27955

Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
>> http://golang.org/ref/spec#Numeric_types
> Thanks.
> Well that's just plain nuts.

I'm not sure how Rust handles Unicode, but overall I think it is more
clueful than Go while having sort of comparable goals.  See:
http://rust-lang.org .


#27864

From	Chris Angelico <rosuav@gmail.com>
Date	2012-08-25 21:04 +1000
Message-ID	<mailman.3796.1345892653.4697.python-list@python.org>
In reply to	#27854

On Sat, Aug 25, 2012 at 7:46 PM, Frank Millman <frank@chagford.com> wrote:
> Therefore, I think he is saying that he would have preferred that python
> standardise on 4-byte characters, on the grounds that the saving in memory
> does not justify the performance overhead.

If that's indeed the argument, then at least it's something to argue.
What gets difficult is when people complain about the expansion from a
2-byte narrow build to the current 1/2/4-byte representation, which
will indeed use more memory if there are a small number of >0xFFFF
codepoints. But there's a correctness difference there.

ChrisA
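
The correctness difference can be illustrated without a narrow build by treating UTF-16 code units as a stand-in for what a 3.2 narrow build stored. This is an emulation under that assumption, not output from Python 3.2.

```python
s = "x\U0001F600y"                    # one character above U+FFFF

# Python 3.3+: the string is a sequence of code points.
print(len(s))                         # 3
print(s[1] == "\U0001F600")           # True

# Emulated narrow build: count and index 16-bit code units instead.
units = s.encode("utf-16-le")
unit_count = len(units) // 2
print(unit_count)                     # 4

second_unit = int.from_bytes(units[2:4], "little")
print(hex(second_unit))               # 0xd83d, a high surrogate, not a character
```

The extra memory in 3.3 buys a length and an index that refer to characters rather than to halves of a surrogate pair.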


#27865

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2012-08-25 12:05 +0100
Message-ID	<mailman.3797.1345892703.4697.python-list@python.org>
In reply to	#27854

On 25/08/2012 10:46, Frank Millman wrote:
> On 25/08/2012 10:58, Mark Lawrence wrote:
>> On 25/08/2012 08:27, wxjmfauth@gmail.com wrote:
>>>
>>> Unicode design: a flat table of code points, where all code
>>> points are "equals".
>>> As soon as one attempts to escape from this rule, one has to
>>> "pay" for it.
>>> The creator of this machinery (flexible string representation)
>>> can not even benefit from it in his native language (I think
>>> I'm correctly informed).
>>>
>>> Hint: Google -> "Das grosse Eszett"
>>>
>>> jmf
>>>
>>
>> It's Saturday morning, I'm stone cold sober, had a good sleep and I'm
>> still baffled as to the point if any.  Could someone please enlighten me?
>>
>
> Here's what I think he is saying. I am posting this to test the water. I
> am also confused, and if I have got it wrong hopefully someone will
> correct me.
>
> In python 3.3, unicode strings are now stored as follows -
>    if all characters can be represented by 1 byte, the entire string is
> composed of 1-byte characters
>    else if all characters can be represented by 1 or 2 bytes, the entire
> string is composed of 2-byte characters
>    else the entire string is composed of 4-byte characters
>
> There is an overhead in making this choice, to detect the lowest number
> of bytes required.
>
> jmfauth believes that this only benefits 'english-speaking' users, as
> the rest of the world will tend to have strings where at least one
> character requires 2 or 4 bytes. So they incur the overhead, without
> getting any benefit.
>
> Therefore, I think he is saying that he would have preferred that python
> standardise on 4-byte characters, on the grounds that the saving in
> memory does not justify the performance overhead.
>
> Frank Millman
>
>

I thought Terry Reedy had shot down any claims about performance 
overhead, and that the memory savings in many cases must be substantial 
and therefore worthwhile.  Or have I misread something?  Or what?

-- 
Cheers.

Mark Lawrence.
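
The storage rule Frank Millman describes can be observed from Python itself. A sketch assuming CPython 3.3 or later: `bytes_per_char` is a hypothetical helper that measures how much one extra character adds to `sys.getsizeof`, which sidesteps the fixed object header whose size varies between versions.

```python
import sys

def bytes_per_char(ch: str) -> int:
    """Growth in object size when a string of `ch` gets one character longer."""
    return sys.getsizeof(ch * 1001) - sys.getsizeof(ch * 1000)

widths = {
    "ASCII": bytes_per_char("a"),
    "Latin-1": bytes_per_char("\u00e9"),
    "BMP": bytes_per_char("\u20ac"),
    "astral": bytes_per_char("\U0001F600"),
}
print(widths)   # on CPython 3.3+: {'ASCII': 1, 'Latin-1': 1, 'BMP': 2, 'astral': 4}

# A single wide character widens the whole string.
mostly_ascii = "a" * 1000 + "\U0001F600"
print(sys.getsizeof(mostly_ascii) > 4 * 1000)   # True
```

Strings whose characters all fit in Latin-1 still use one byte per character, so the one-byte case is not limited to pure ASCII text.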


#27866

From	Chris Angelico <rosuav@gmail.com>
Date	2012-08-25 21:19 +1000
Message-ID	<mailman.3798.1345893580.4697.python-list@python.org>
In reply to	#27854

On Sat, Aug 25, 2012 at 9:05 PM, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
> I thought Terry Reedy had shot down any claims about performance overhead,
> and that the memory savings in many cases must be substantial and therefore
> worthwhile.  Or have I misread something?  Or what?

My reading of the thread(s) is/are that there are two reasons for the
debate to continue to rage:

1) Comparisons with a "narrow build" in which most characters take two
bytes but there are one or two characters that get encoded with
surrogates. The new system will allocate four bytes per character for
the whole string.

2) Arguments on the basis of huge strings that represent _all the
data_ that your program's working with, forgetting that there are
numerous strings all through everything that are ASCII-only.

ChrisA
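
Both points can be put into rough numbers. The sketch below uses encoded lengths as a proxy for payload size: UTF-16 for a 3.2 narrow build, and the 1-, 2- and 4-byte layouts for 3.3. `payload_33` and `payload_narrow` are hypothetical helpers, and object headers are ignored.

```python
def payload_33(s: str) -> int:
    """Approximate payload bytes under the 1/2/4-byte rule (headers ignored)."""
    widest = max(map(ord, s), default=0)
    width = 1 if widest <= 0xFF else 2 if widest <= 0xFFFF else 4
    return width * len(s)

def payload_narrow(s: str) -> int:
    """Approximate payload bytes in a UTF-16-based narrow build."""
    return len(s.encode("utf-16-le"))

# Point 1: mostly 2-byte text with a single character above U+FFFF.
mixed = "\u20ac" * 1000 + "\U0001F600"
print(payload_narrow(mixed), payload_33(mixed))             # 2004 4004

# Point 2: the ASCII-only strings found throughout a typical program.
ascii_only = "identifier_or_dict_key" * 50
print(payload_narrow(ascii_only), payload_33(ascii_only))   # 2200 1100
```

The first case is the unfavourable one for 3.3; the second is the common one, and there the flexible representation halves the payload.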


#27867

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-08-25 07:23 -0400
Message-ID	<mailman.3799.1345893825.4697.python-list@python.org>
In reply to	#27854

On 8/25/2012 7:05 AM, Mark Lawrence wrote:

> I thought Terry Reedy had shot down any claims about performance
> overhead, and that the memory savings in many cases must be substantial
> and therefore worthwhile.  Or have I misread something?

No, you have correctly read what I and others have said. Jim appears to
not be interested in dialog. Let's leave it at that.

-- 
Terry Jan Reedy

