
Unicode 7

Started by	wxjmfauth@gmail.com
First post	2014-04-29 10:37 -0700
Last post	2014-04-30 23:00 -0700
Articles	20 on this page of 56 — 16 participants


  Unicode 7 wxjmfauth@gmail.com - 2014-04-29 10:37 -0700
    Re: Unicode 7 Tim Chase <python.list@tim.thechases.com> - 2014-04-29 12:59 -0500
      Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-04-29 21:53 -0700
        Re: Unicode 7 Steven D'Aprano <steve@pearwood.info> - 2014-05-01 05:00 +0000
          Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-05-01 11:04 -0700
            Re: Unicode 7 Terry Reedy <tjreedy@udel.edu> - 2014-05-01 18:38 -0400
              Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-05-01 19:29 -0700
                Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-05-01 19:39 -0700
                Re: Unicode 7 Chris Angelico <rosuav@gmail.com> - 2014-05-02 13:01 +1000
                  Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-05-01 20:16 -0700
                Re: Unicode 7 Terry Reedy <tjreedy@udel.edu> - 2014-05-02 01:05 -0400
              Re: Unicode 7 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-02 03:15 +0000
            Re: Unicode 7 MRAB <python@mrabarnett.plus.com> - 2014-05-02 00:33 +0100
              Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-05-01 19:02 -0700
                Re: Unicode 7 Ben Finney <ben@benfinney.id.au> - 2014-05-02 12:39 +1000
                  Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-05-01 19:59 -0700
                Re: Unicode 7 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-02 08:45 +0000
                  Re: Unicode 7 Chris Angelico <rosuav@gmail.com> - 2014-05-02 19:08 +1000
                    Re: Unicode 7 Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2014-05-02 13:04 +0300
                  Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-05-02 03:39 -0700
                    Re: Unicode 7 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-02 11:55 +0000
                      Re: Unicode 7 Marko Rauhamaa <marko@pacujo.net> - 2014-05-02 15:19 +0300
                        Re: Unicode 7 Ben Finney <ben@benfinney.id.au> - 2014-05-03 07:07 +1000
                          Re: Unicode 7 Roy Smith <roy@panix.com> - 2014-05-02 17:13 -0400
                      Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-05-02 09:03 -0700
                      Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-05-02 09:50 -0700
                        Re: Unicode 7 Michael Torrie <torriem@gmail.com> - 2014-05-02 11:39 -0600
                        Re: Unicode 7 Ned Batchelder <ned@nedbatchelder.com> - 2014-05-02 13:46 -0400
                        Re: Unicode 7 Peter Otten <__peter__@web.de> - 2014-05-02 20:07 +0200
                          Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-05-02 17:58 -0700
                            Re: Unicode 7 Ned Batchelder <ned@nedbatchelder.com> - 2014-05-02 21:18 -0400
                              Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-05-02 18:42 -0700
                                Re: Unicode 7 Chris Angelico <rosuav@gmail.com> - 2014-05-03 11:54 +1000
                                  Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-05-02 19:02 -0700
                            Re: Unicode 7 Chris Angelico <rosuav@gmail.com> - 2014-05-03 11:15 +1000
                            Re: Unicode 7 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-03 02:02 +0000
                              Re: Unicode 7 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-03 02:04 +0000
                              Re: Unicode 7 Chris Angelico <rosuav@gmail.com> - 2014-05-03 12:17 +1000
                            Re: Unicode 7 Terry Reedy <tjreedy@udel.edu> - 2014-05-02 22:19 -0400
                      Re: Unicode 7 Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2014-05-03 12:57 -0400
                  Re: Unicode 7 Tim Chase <python.list@tim.thechases.com> - 2014-05-02 07:58 -0500
                Re: Unicode 7 MRAB <python@mrabarnett.plus.com> - 2014-05-02 17:52 +0100
            Re: Unicode 7 Terry Reedy <tjreedy@udel.edu> - 2014-05-02 00:16 -0400
              Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-05-01 21:42 -0700
                Re: Unicode 7 Chris Angelico <rosuav@gmail.com> - 2014-05-02 14:54 +1000
                Re: Unicode 7 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-02 08:08 +0000
                  Re: Unicode 7 Chris Angelico <rosuav@gmail.com> - 2014-05-02 19:01 +1000
                    Re: Unicode 7 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-02 11:52 +0000
                  Re: Unicode 7 Ben Finney <ben@benfinney.id.au> - 2014-05-02 19:16 +1000
                    Re: Unicode 7 Marko Rauhamaa <marko@pacujo.net> - 2014-05-02 13:05 +0300
                  Re: Unicode 7 Chris Angelico <rosuav@gmail.com> - 2014-05-02 19:24 +1000
                  Re: Unicode 7 MRAB <python@mrabarnett.plus.com> - 2014-05-02 18:07 +0100
    Re: Unicode 7 MRAB <python@mrabarnett.plus.com> - 2014-04-29 19:12 +0100
      Re: Unicode 7 wxjmfauth@gmail.com - 2014-04-30 00:06 -0700
        Re: Unicode 7 Tim Chase <python.list@tim.thechases.com> - 2014-04-30 13:48 -0500
          Re: Unicode 7 wxjmfauth@gmail.com - 2014-04-30 23:00 -0700


#70722 — Unicode 7

From	wxjmfauth@gmail.com
Date	2014-04-29 10:37 -0700
Subject	Unicode 7
Message-ID	<d6e81de5-a82b-491f-b2f0-7ab4a24cff03@googlegroups.com>

Let's see how ready Python is for the next Unicode version
(Unicode 7.0.0.Beta).


>>> import timeit
>>> timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = 'z'")
[1.4027834829454946, 1.38714224331963, 1.3822586635296261]
>>> timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = '\u0fce'")
[5.462776291480395, 5.4479432055423445, 5.447874284053398]
>>> 
>>> 
>>> # more interesting
>>> timeit.repeat("(x*1000 + y)[:-1]",\
...     setup="x = 'abc'.encode('utf-8'); y = '\u0fce'.encode('utf-8')")
[1.3496489533188765, 1.328654286266783, 1.3300913977710707]
>>> 

Note 1:  "lookup" is not the problem.

Note 2: From Unicode.org : "[...] We strongly encourage [...] and test
them with their programs [...]"

-> Done.

jmf


#70723

From	Tim Chase <python.list@tim.thechases.com>
Date	2014-04-29 12:59 -0500
Message-ID	<mailman.9579.1398794381.18130.python-list@python.org>
In reply to	#70722

On 2014-04-29 10:37, wxjmfauth@gmail.com wrote:
> >>> timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = 'z'")
> [1.4027834829454946, 1.38714224331963, 1.3822586635296261]
> >>> timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = '\u0fce'")
> [5.462776291480395, 5.4479432055423445, 5.447874284053398]
> >>>
> >>>
> >>> # more interesting
> >>> timeit.repeat("(x*1000 + y)[:-1]",\
> ...     setup="x = 'abc'.encode('utf-8'); y = '\u0fce'.encode('utf-8')")
> [1.3496489533188765, 1.328654286266783, 1.3300913977710707]
> >>>

While I dislike feeding the troll, what I see here is:  on your
machine, all unicode manipulations in the test should take ~5.4
seconds.  But Python notices that some of your strings *don't*
require a full 32-bits and thus optimizes those operations, cutting
about 75% of the processing time (wow...4-bytes-per-char to
1-byte-per-char, I wonder where that 75% savings comes from).

So rather than highlight any *problem* with Python, your [mostly
worthless microbenchmark non-realworld] tests show that Python's
unicode implementation is awesome.

Still waiting to see an actual bug-report as mentioned on the other
thread.
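
The storage tiers behind that saving can be observed directly. What follows is a minimal sketch, assuming CPython 3.3 or later (where strings use the flexible representation); `bytes_per_char` is a hypothetical helper written for this illustration, and `sys.getsizeof` reports implementation-specific sizes. Note that U+0FCE sits in the Basic Multilingual Plane, so under these assumptions it should land in the 2-byte tier rather than the 4-byte one.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate storage per character by comparing two string sizes.

    Subtracting the sizes cancels the fixed object header, leaving only
    the per-character cost. Hypothetical helper, CPython-specific.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n


# ASCII, Latin-1, BMP and astral characters respectively.
for ch in ('z', '\xe9', '\u0fce', '\U0001F600'):
    print('U+%04X' % ord(ch), bytes_per_char(ch))
```

Under the stated assumptions the loop should report 1, 1, 2 and 4 bytes per character, which is the tiering the benchmark above is indirectly measuring.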

-tkc


#70763

From	Rustom Mody <rustompmody@gmail.com>
Date	2014-04-29 21:53 -0700
Message-ID	<ac9b2a50-3b5d-4ee8-8954-9f0f1ab490b6@googlegroups.com>
In reply to	#70723

On Tuesday, April 29, 2014 11:29:23 PM UTC+5:30, Tim Chase wrote:
> While I dislike feeding the troll, what I see here is: 

<snipped>

Since it's Unicode-troll time, here's my contribution
http://blog.languager.org/2014/04/unicode-and-unix-assumption.html

:-)

More seriously, since I've quoted some esteemed members of this list
explicitly (Steven) and the list in general, please let me know if
something is inaccurate or inappropriate.


#70807

From	Steven D'Aprano <steve@pearwood.info>
Date	2014-05-01 05:00 +0000
Message-ID	<5361d4f9$0$11109$c3e8da3@news.astraweb.com>
In reply to	#70763

On Tue, 29 Apr 2014 21:53:22 -0700, Rustom Mody wrote:

> On Tuesday, April 29, 2014 11:29:23 PM UTC+5:30, Tim Chase wrote:
>> While I dislike feeding the troll, what I see here is:
> 
> <snipped>
> 
> Since its Unicode-troll time, here's my contribution
> http://blog.languager.org/2014/04/unicode-and-unix-assumption.html

I disagree with much of your characterisation of the Unix assumption, and 
I point out that out of the two most widespread flavours of OS today, 
Linux/Unix and Windows, it is *Windows* and not Unix which still 
regularly uses legacy encodings.

Also your link to Joel On Software mistakenly links to me instead of Joel.

There's a missing apostrophe in "Ive" [sic] in Acknowledgment #2.

I didn't notice any other typos.

-- 
Steven


#70818

From	Rustom Mody <rustompmody@gmail.com>
Date	2014-05-01 11:04 -0700
Message-ID	<82067b83-a6f5-4b16-b012-385535ea5607@googlegroups.com>
In reply to	#70807

On Thursday, May 1, 2014 10:30:43 AM UTC+5:30, Steven D'Aprano wrote:
> On Tue, 29 Apr 2014 21:53:22 -0700, Rustom Mody wrote:

> > On Tuesday, April 29, 2014 11:29:23 PM UTC+5:30, Tim Chase wrote:
> >> While I dislike feeding the troll, what I see here is:
> > Since its Unicode-troll time, here's my contribution
> > http://blog.languager.org/2014/04/unicode-and-unix-assumption.html

> Also your link to Joel On Software mistakenly links to me instead of Joel.
> There's a missing apostrophe in "Ive" [sic] in Acknowledgment #2.

Done, Done.

> I didn't notice any other typos.

Thank you sir!

> I point out that out of the two most widespread flavours of OS today, 
> Linux/Unix and Windows, it is *Windows* and not Unix which still 
> regularly uses legacy encodings.

Not sure what you are suggesting... 
That (I am suggesting that) 8859 is legacy and 1252 is not?

> I disagree with much of your characterisation of the Unix assumption,

I'd be interested to know the details -- Contents? Details? Tone? Tenor? Blaspheming the sacred scripture?
(if you are so inclined of course)


#70829

From	Terry Reedy <tjreedy@udel.edu>
Date	2014-05-01 18:38 -0400
Message-ID	<mailman.9637.1398983969.18130.python-list@python.org>
In reply to	#70818

On 5/1/2014 2:04 PM, Rustom Mody wrote:

>>> Since its Unicode-troll time, here's my contribution
>>> http://blog.languager.org/2014/04/unicode-and-unix-assumption.html

I will not comment on the Unix-assumption part, but I think you go wrong 
with this:  "Unicode is a Headache". The major headache is that unicode 
and its very few encodings are not universally used. The headache is all 
the non-unicode legacy encodings still being used. So you better title 
this section 'Non-Unicode is a Headache'.

The first sentence is this misleading tautology: "With ASCII, data is 
ASCII whether its file, core, terminal, or network; ie "ABC" is 
65,66,67." Let me translate: "If all text is ASCII encoded, then text 
data is ASCII, whether ..." But it was never the case that all text was 
ASCII encoded. IBM used 6-bit BCDIC and then 8-bit EBCDIC and I believe 
still uses the latter. Other mainframe makers used other encodings of 
A-Z + 0-9 + symbols + control codes. The all-ASCII paradise was never 
universal. You could have just as well said "With EBCDIC, data is 
EBCDIC, whether ..."

https://en.wikipedia.org/wiki/Ascii
https://en.wikipedia.org/wiki/EBCDIC

A crucial step in the spread of Ascii was its use for microcomputers, 
including the IBM PC. The latter was considered a toy by the mainframe 
guys. If they had known that PCs would partly take over the computing 
world, they might have suggested or insisted that it use EBCDIC.

"With unicode there are:
     encodings"
where 'encodings' is linked to
https://en.wikipedia.org/wiki/Character_encodings_in_HTML

If html 'always' used utf-8 (like xml), as has become common but not 
universal, all of the problems with *non-unicode* character sets and 
encodings would disappear. The pre-unicode declarations could then 
disappear. More truthful: "without unicode there are 100s of encodings 
and with unicode only 3 that we should worry about."

"in-memory formats"

These are not the concern of the using programmer as long as they do not 
introduce bugs or limitations (as do all the languages stuck on UCS-2 
and many using UTF-16, including old Python narrow builds). Using what 
should generally be the universal transmission format, UTF-8, as the 
internal format means either losing indexing and slicing, having those 
operations slow from O(1) to O(len(string)), or adding an index table 
that is not part of the unicode standard. Using UTF-32 avoids the above 
but usually wastes space -- up to 75%.
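
To make the indexing cost concrete, here is a minimal sketch of what "index into UTF-8" involves. `utf8_char_at` is a hypothetical helper, not a library function, and it assumes well-formed UTF-8 and a non-negative index. Because code points occupy one to four bytes, the only way to find the i-th one without an auxiliary table is to walk from the start.

```python
def utf8_char_at(data, index):
    """Return the code point at `index` by scanning UTF-8 bytes.

    Runs in O(index) time: each lead byte tells us how wide the
    current code point is, so we must hop through every earlier one.
    Assumes `data` is well-formed UTF-8 and `index` >= 0.
    """
    pos = 0    # byte offset into data
    seen = 0   # code points passed so far
    while pos < len(data):
        lead = data[pos]
        if lead < 0x80:
            width = 1      # ASCII
        elif lead < 0xE0:
            width = 2      # lead bytes 0xC2..0xDF
        elif lead < 0xF0:
            width = 3      # lead bytes 0xE0..0xEF
        else:
            width = 4      # lead bytes 0xF0..0xF4
        if seen == index:
            return data[pos:pos + width].decode('utf-8')
        pos += width
        seen += 1
    raise IndexError('code point index out of range')


text = 'a\u20ac\U0001F600z'
data = text.encode('utf-8')
print([utf8_char_at(data, i) == text[i] for i in range(len(text))])
```

With a fixed-width layout the same lookup is a single multiplication and a read, which is the O(1) behaviour the paragraph above contrasts it with.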

"strange beasties like python's FSR"

Have you really let yourself be poisoned by JMF's bizarre rants? The FSR 
is an *internal optimization* that benefits most unicode operations that 
people actually perform. It uses UTF-32 by default but adapts to the 
strings users create by compressing the internal format. The compression 
is trivial -- simply dropping leading null bytes common to all 
characters -- so each character is still readable as is. The string 
header records how many bytes are left.  Is the idea of algorithms that 
adapt to inputs really strange to you?

Like good adaptive algorithms, the FSR is invisible to the user except 
for reducing space or time or maybe both. Unicode operations are 
otherwise the same as with previous wide builds. People who used to use 
narrow-builds also benefit from bug elimination. The only 'headaches' 
involved might have been those of the developers who optimized previous 
wide builds.

CPython has many other functions with special-case optimizations and 
'fast paths' for common, simple cases. For instance, (some? all?) number 
operations are optimized for pairs of integers.  Do you call these 
'strange beasties'?

PyPy is faster than CPython, when it is, because it is even more 
adaptable to particular computations by creating new fast paths. The 
mechanism to create these 'strange beasties' might have been a headache 
for the writers, but when it works, which it now seems to, it is not for 
the users.

-- 
Terry Jan Reedy


#70837

From	Rustom Mody <rustompmody@gmail.com>
Date	2014-05-01 19:29 -0700
Message-ID	<8c30f6fc-8493-419b-a4c8-dfe4a9d30de0@googlegroups.com>
In reply to	#70829

On Friday, May 2, 2014 4:08:35 AM UTC+5:30, Terry Reedy wrote:
> On 5/1/2014 2:04 PM, Rustom Mody wrote:

> >>> Since its Unicode-troll time, here's my contribution
> >>> http://blog.languager.org/2014/04/unicode-and-unix-assumption.html

> I will not comment on the Unix-assumption part, but I think you go wrong 
> with this:  "Unicode is a Headache". The major headache is that unicode 
> and its very few encodings are not universally used. The headache is all 
> the non-unicode legacy encodings still being used. So you better title 
> this section 'Non-Unicode is a Headache'.

> The first sentence is this misleading tautology: "With ASCII, data is 
> ASCII whether its file, core, terminal, or network; ie "ABC" is 
> 65,66,67." Let me translate: "If all text is ASCII encoded, then text 
> data is ASCII, whether ..." But it was never the case that all text was 
> ASCII encoded. IBM used 6-bit BCDIC and then 8-bit EBCDIC and I believe 
> still uses the latter. Other mainframe makers used other encodings of 
> A-Z + 0-9 + symbols + control codes. The all-ASCII paradise was never 
> universal. You could have just as well said "With EBCDIC, data is 
> EBCDIC, whether ..."

> https://en.wikipedia.org/wiki/Ascii
> https://en.wikipedia.org/wiki/EBCDIC

> A crucial step in the spread of Ascii was its use for microcomputers, 
> including the IBM PC. The latter was considered a toy by the mainframe 
> guys. If they had known that PCs would partly take over the computing 
> world, they might have suggested or insisted that it use EBCDIC.

> "With unicode there are:
>      encodings"
> where 'encodings' is linked to
> https://en.wikipedia.org/wiki/Character_encodings_in_HTML

> If html 'always' used utf-8 (like xml), as has become common but not 
> universal, all of the problems with *non-unicode* character sets and 
> encodings would disappear. The pre-unicode declarations could then 
> disappear. More truthful: "without unicode there are 100s of encodings 
> and with unicode only 3 that we should worry about."

> "in-memory formats"

> These are not the concern of the using programmer as long as they do not 
> introduce bugs or limitations (as do all the languages stuck on UCS-2 
> and many using UTF-16, including old Python narrow builds). Using what 
> should generally be the universal transmission format, UTF-8, as the 
> internal format means either losing indexing and slicing, having those 
> operations slow from O(1) to O(len(string)), or adding an index table 
> that is not part of the unicode standard. Using UTF-32 avoids the above 
> but usually wastes space -- up to 75%.

> "strange beasties like python's FSR"

> Have you really let yourself be poisoned by JMF's bizarre rants? The FSR 
> is an *internal optimization* that benefits most unicode operations that 
> people actually perform. It uses UTF-32 by default but adapts to the 
> strings users create by compressing the internal format. The compression 
> is trivial -- simply dropping leading null bytes common to all 
> characters -- so each character is still readable as is. The string 
> header records how many bytes are left.  Is the idea of algorithms that 
> adapt to inputs really strange to you?

> Like good adaptive algorithms, the FSR is invisible to the user except 
> for reducing space or time or maybe both. Unicode operations are 
> otherwise the same as with previous wide builds. People who used to use 
> narrow-builds also benefit from bug elimination. The only 'headaches' 
> involved might have been those of the developers who optimized previous 
> wide builds.

> CPython has many other functions with special-case optimizations and 
> 'fast paths' for common, simple cases. For instance, (some? all?) number 
> operations are optimized for pairs of integers.  Do you call these 
> 'strange beasties'?

Here is an instance of someone who would like a certain optimization to be
dis-able-able

https://mail.python.org/pipermail/python-list/2014-February/667169.html

To the best of my knowledge it has nothing to do with unicode or with jmf.

Why, if optimizations are always desirable, do C compilers have
-O0, -O1, -O2, -O3 and zillions of more specific flags?

JFTR I have no issue with FSR.  What we have to hand to jmf - willingly
or otherwise - is that many more people have heard of FSR thanks to him. [I am one of them]

I don't even know whether jmf has a real
technical (as he calls it 'mathematical') issue or whether it's entirely political:

"Why should I pay more for a EURO sign than a $ sign?"

Well perhaps that is more related to the exchange rate than to python!


#70838

From	Rustom Mody <rustompmody@gmail.com>
Date	2014-05-01 19:39 -0700
Message-ID	<51602756-7019-4f79-b168-d12b2b801a8e@googlegroups.com>
In reply to	#70837

On Friday, May 2, 2014 7:59:55 AM UTC+5:30, Rustom Mody wrote:
> "Why should I pay more for a EURO sign than a $ sign?"

A unicode 'headache' there:
I typed the Euro sign (trying again € ) not EURO

Somebody -- I guess it's GG in overhelpful mode -- converted it
and made my post:
Content-Type: text/plain; charset=ISO-8859-1

Will some Devanagari vowels help it stop being helpful?
अ आ इ ई उ ऊ ए ऐ
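
One plausible reason for the substitution (an assumption; nothing in this thread confirms what Google Groups actually did) is that ISO-8859-1 simply has no code for the Euro sign, so a post labelled with that charset cannot carry it. A minimal sketch using only Python's standard codecs; `try_encode` is a hypothetical helper:

```python
euro = '\u20ac'  # EURO SIGN


def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


# ISO-8859-1 predates the Euro; ISO-8859-15 and cp1252 each added it
# at a different byte value, and UTF-8 needs three bytes for it.
for codec in ('iso-8859-1', 'iso-8859-15', 'cp1252', 'utf-8'):
    print(codec, try_encode(euro, codec))
```

The three legacy results disagree with one another, which is a small instance of the "non-Unicode is a headache" point made earlier in the thread.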


#70841

From	Chris Angelico <rosuav@gmail.com>
Date	2014-05-02 13:01 +1000
Message-ID	<mailman.9644.1398999726.18130.python-list@python.org>
In reply to	#70837

On Fri, May 2, 2014 at 12:29 PM, Rustom Mody <rustompmody@gmail.com> wrote:
> Here is an instance of someone who would like a certain optimization to be
> dis-able-able
>
> https://mail.python.org/pipermail/python-list/2014-February/667169.html
>
> To the best of my knowledge its nothing to do with unicode or with jmf.

It doesn't, and it has only to do with testing. I've had similar
issues at times; for instance, trying to benchmark one language or
language construct against another often means fighting against an
optimizer. (How, for instance, do you figure out what loop overhead
is, when an empty loop is completely optimized out?) This is nothing
whatsoever to do with Unicode, nor to do with the optimization that
Python and Pike (and maybe other languages) do with the storage of
Unicode strings.

ChrisA


#70843

From	Rustom Mody <rustompmody@gmail.com>
Date	2014-05-01 20:16 -0700
Message-ID	<17d23e0b-7876-49b3-a785-4601f58414c6@googlegroups.com>
In reply to	#70841

On Friday, May 2, 2014 8:31:56 AM UTC+5:30, Chris Angelico wrote:
> On Fri, May 2, 2014 at 12:29 PM, Rustom Mody wrote:
> > Here is an instance of someone who would like a certain optimization to be
> > dis-able-able
> > https://mail.python.org/pipermail/python-list/2014-February/667169.html
> > To the best of my knowledge its nothing to do with unicode or with jmf.

> It doesn't, and it has only to do with testing. I've had similar
> issues at times; for instance, trying to benchmark one language or
> language construct against another often means fighting against an
> optimizer. (How, for instance, do you figure out what loop overhead
> is, when an empty loop is completely optimized out?) This is nothing
> whatsoever to do with Unicode, nor to do with the optimization that
> Python and Pike (and maybe other languages) do with the storage of
> Unicode strings.

This was said in response to Terry's

> CPython has many other functions with special-case optimizations and
> 'fast paths' for common, simple cases. For instance, (some? all?) number
> operations are optimized for pairs of integers.  Do you call these
> 'strange beasties'?

which evidently vanished -- optimized out :D -- in multiple levels of quoting


#70848

From	Terry Reedy <tjreedy@udel.edu>
Date	2014-05-02 01:05 -0400
Message-ID	<mailman.9648.1399007203.18130.python-list@python.org>
In reply to	#70837

On 5/1/2014 10:29 PM, Rustom Mody wrote:

> Here is an instance of someone who would like a certain optimization to be
> dis-able-able
>
> https://mail.python.org/pipermail/python-list/2014-February/667169.html
>
> To the best of my knowledge its nothing to do with unicode or with jmf.

Right. Ned has an actual technical reason to complain, even though the 
developers do not consider it strong enough to act.

> Why if optimizations are always desirable do C compilers have:
> -O0 O1 O2 O3 and zillions of more specific flags?

One reason is that many optimizations sometimes introduce bugs, or to 
put it another way, they are based on assumptions that are not true for 
all code. For instance, some people have suggested that CPython should 
have an optional optimization based on the assumption that builtin names 
are never rebound. That is true for perhaps many code files, but 
definitely not all. Guido does not seem to like such conditional 
optimizations.

I can think of three reasons for not adding to the numerous options 
CPython already has.
1. We do not have the developer resources to handle the added 
complications of multiple optimization options.
2. Zillions of options and flags confuse users. As it is, most options 
are seldom used.
3. Optimization options are easily misused, possibly leading to silently 
buggy results, or mysterious failures. For instance, people sometimes 
rebind builtins without realizing what they have done, such as using 
'id' as a parameter name. Being in the habit of routinely using the 
'assume no rebinding option' would lead to problems.
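
The accidental rebinding described in point 3 is easy to reproduce. A minimal sketch, with `lookup` as a hypothetical function name; the 'assume no rebinding' switch itself is the hypothetical option discussed above and does not exist in CPython, so this only shows the name shadowing that such an option would have to cope with:

```python
def lookup(id):
    # Inside this function the parameter `id` hides the builtin id(),
    # so this call targets whatever the caller passed in.
    return id(object())


try:
    lookup(42)
except TypeError as exc:
    # An int is not callable, so the shadowing surfaces as a TypeError.
    print('rebinding bit us:', exc)
```

Here the failure is at least loud. An optimizer that silently assumed `id` still meant the builtin could instead produce a wrong answer without any error.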

I am rather sure that the string (unicode) test suite was reviewed and 
the performance of 3.2 wide builds recorded before the new 
implementation was committed.

The tracker currently has 37 behavior (bug) issues marked for the 
unicode component. In a quick review, I do not see that any have 
anything to do with using standard UTF-32 versus adaptive UTF-32. 
Indeed, I believe a majority of the 37 were filed before 3.3 or are 2.7 
specific. Problems with FSR itself have been fixed as discovered.

> JFTR I have no issue with FSR.  What we have to hand to jmf - willingly
> or otherwise - is that many more people have heard of FSR thanks to him. [I am one of them]

Somewhat ironically, I suppose you are right.

> I dont even know whether jmf has a real
> technical (as he calls it 'mathematical') issue or its entirely political:

I would call his view personal or philosophical. I only object to 
endless repetition and the deception of claiming that personal views are 
mathematical facts.

-- 
Terry Jan Reedy


#70842

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-05-02 03:15 +0000
Message-ID	<53630dcc$0$29965$c3e8da3$5496439d@news.astraweb.com>
In reply to	#70829

On Thu, 01 May 2014 18:38:35 -0400, Terry Reedy wrote:

> "strange beasties like python's FSR"
> 
> Have you really let yourself be poisoned by JMF's bizarre rants? The FSR
> is an *internal optimization* that benefits most unicode operations that
> people actually perform. It uses UTF-32 by default but adapts to the
> strings users create by compressing the internal format. The compression
> is trivial -- simply dropping leading null bytes common to all
> characters -- so each character is still readable as is.

For anyone who, like me, wasn't convinced that Unicode worked that way, 
you can see for yourself that it does. You don't need Python 3.3, any 
version of 3.x will work. In Python 2.7, it should work if you just 
change the calls from "chr()" to "unichr()":

py> for i in range(256):
...     c = chr(i)
...     u = c.encode('utf-32-be')
...     assert u[:3] == b'\0\0\0'
...     assert u[3:] == c.encode('latin-1')
...
py> for i in range(256, 0xFFFF+1):
...     if 0xD800 <= i <= 0xDFFF:
...         continue  # surrogates: utf-16/utf-32 codecs reject them on 3.4+
...     c = chr(i)
...     u = c.encode('utf-32-be')
...     assert u[:2] == b'\0\0'
...     assert u[2:] == c.encode('utf-16-be')
...
py> 

So Terry is correct: dropping leading zeroes, and treating the remainder 
as either Latin-1 or UTF-16, works fine, and potentially saves a lot of 
memory.
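
The "drop the leading zero bytes" idea can be modelled in a few lines. This is a sketch of the principle only, not CPython's actual implementation; `compact` and `expand` are hypothetical names, and the input is assumed to contain no lone surrogates:

```python
def compact(s):
    """Pick the narrowest of 1, 2 or 4 bytes per code point and pack s."""
    top = max(map(ord, s)) if s else 0
    if top < 0x100:
        width = 1
    elif top < 0x10000:
        width = 2
    else:
        width = 4
    wide = s.encode('utf-32-be')  # always 4 bytes per code point
    # Keep only the low `width` bytes of every 4-byte unit.
    packed = b''.join(wide[i + 4 - width:i + 4]
                      for i in range(0, len(wide), 4))
    return width, packed


def expand(width, packed):
    """Restore the string by putting the dropped zero bytes back."""
    pad = b'\0' * (4 - width)
    wide = b''.join(pad + packed[i:i + width]
                    for i in range(0, len(packed), width))
    return wide.decode('utf-32-be')


for s in ('abc', 'caf\xe9', 'abc\u0fce', 'x\U0001F600'):
    width, packed = compact(s)
    print(repr(s), width, len(packed), expand(width, packed) == s)
```

Every character in a given string keeps the same width, so indexing stays a simple offset calculation; only the width recorded alongside the data changes from string to string.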

-- 
Steven D'Aprano
http://import-that.dreamwidth.org/


#70831

From	MRAB <python@mrabarnett.plus.com>
Date	2014-05-02 00:33 +0100
Message-ID	<mailman.9639.1398987208.18130.python-list@python.org>
In reply to	#70818

On 2014-05-01 23:38, Terry Reedy wrote:
> On 5/1/2014 2:04 PM, Rustom Mody wrote:
>
>>>> Since its Unicode-troll time, here's my contribution
>>>> http://blog.languager.org/2014/04/unicode-and-unix-assumption.html
>
> I will not comment on the Unix-assumption part, but I think you go wrong
> with this:  "Unicode is a Headache". The major headache is that unicode
> and its very few encodings are not universally used. The headache is all
> the non-unicode legacy encodings still being used. So you better title
> this section 'Non-Unicode is a Headache'.
>
[snip]
I think he's right when he says "Unicode is a headache", but only
because it's being used to handle languages which are, themselves, a
"headache": left-to-right versus right-to-left, sometimes on the same
line; diacritics, possibly several on a glyph; etc.


#70834

From	Rustom Mody <rustompmody@gmail.com>
Date	2014-05-01 19:02 -0700
Message-ID	<eb56fd65-4729-42db-bcd4-179c19aaf485@googlegroups.com>
In reply to	#70831

On Friday, May 2, 2014 5:03:21 AM UTC+5:30, MRAB wrote:
> On 2014-05-01 23:38, Terry Reedy wrote:
> > On 5/1/2014 2:04 PM, Rustom Mody wrote:
> >>>> Since its Unicode-troll time, here's my contribution
> >>>> http://blog.languager.org/2014/04/unicode-and-unix-assumption.html
> > I will not comment on the Unix-assumption part, but I think you go wrong
> > with this:  "Unicode is a Headache". The major headache is that unicode
> > and its very few encodings are not universally used. The headache is all
> > the non-unicode legacy encodings still being used. So you better title
> > this section 'Non-Unicode is a Headache'.
> [snip]
> I think he's right when he says "Unicode is a headache", but only
> because it's being used to handle languages which are, themselves, a
> "headache": left-to-right versus right-to-left, sometimes on the same
> line; diacritics, possibly several on a glyph; etc.

Yes, the headaches go a little further back than Unicode.
There is a certain large old book...
In which is described the building of a 'tower that reached up to heaven'...

At which point 'it was decided'¶ to do something to prevent that.

And our headaches started.

I don't know how one causally connects the 'headaches' but I've seen
- mojibake
- unicode 'number-boxes' (what are these called?)
- Worst of all what we *don't* see -- how many others don't see what we see?

I never knew of any of this in the good ol' days of ASCII

¶ Passive voice is often the best choice in the interests of political correctness

It would be a pleasant surprise if everyone sees a pilcrow at start of line above


#70839

From	Ben Finney <ben@benfinney.id.au>
Date	2014-05-02 12:39 +1000
Message-ID	<mailman.9643.1398998400.18130.python-list@python.org>
In reply to	#70834

Rustom Mody <rustompmody@gmail.com> writes:

> Yes, the headaches go a little further back than Unicode.

Okay, so can you change your article to reflect the fact that the
headaches both pre-date Unicode, and are made much easier by Unicode?

> There is a certain large old book...

Ah yes, the neo-Sumerian story “Enmerkar and the Lord of Aratta”
<URL:https://en.wikipedia.org/wiki/Enmerkar_and_the_Lord_of_Aratta>.
Probably inspired by stories older than that, of course.

> In which is described the building of a 'tower that reached up to heaven'...
> At which point 'it was decided'¶ to do something to prevent that.
> And our headaches started.

And other myths with fantastic reasons for the diversity of language
<URL:https://en.wikipedia.org/wiki/Mythical_origins_of_language>.

> I never knew of any of this in the good ol days of ASCII

Yes, by ignoring all other writing systems except one's own – and
thereby excluding most of the world's people – the system can be made
simpler.

Hopefully the proportion of programmers who still feel they can make
such a parochial choice is rapidly shrinking.

-- 
 \     “Why doesn't Python warn that it's not 100% perfect? Are people |
  `\         just supposed to “know” this, magically?” —Mitya Sirenef, |
_o__)                                     comp.lang.python, 2012-12-27 |
Ben Finney


#70840

From	Rustom Mody <rustompmody@gmail.com>
Date	2014-05-01 19:59 -0700
Message-ID	<92004436-36a8-49ce-b4ec-dc0237b04bac@googlegroups.com>
In reply to	#70839

On Friday, May 2, 2014 8:09:44 AM UTC+5:30, Ben Finney wrote:
> Rustom Mody writes:

> > Yes, the headaches go a little further back than Unicode.

> Okay, so can you change your article to reflect the fact that the
> headaches both pre-date Unicode, and are made much easier by Unicode?

Predate: Yes
Made easier: No

> > There is a certain large old book...

> Ah yes, the neo-Sumerian story "Enmerkar_and_the_Lord_of_Aratta"
> <URL:https://en.wikipedia.org/wiki/Enmerkar_and_the_Lord_of_Aratta>.
> Probably inspired by stories older than that, of course.

Thanks for that link

> > In which is described the building of a 'tower that reached up to heaven'...
> > At which point 'it was decided'¶ to do something to prevent that.
> > And our headaches started.

> And other myths with fantastic reasons for the diversity of language
> <URL:https://en.wikipedia.org/wiki/Mythical_origins_of_language>.

This one takes the cake - see 1st para
http://hilgart.org/enformy/BronsonRekindling.pdf


> > I never knew of any of this in the good ol days of ASCII

> Yes, by ignoring all other writing systems except one's own - and
> thereby excluding most of the world's people - the system can be made
> simpler.

> Hopefully the proportion of programmers who still feel they can make
> such a parochial choice is rapidly shrinking.

See link above: Ethnic differences and chauvinism are invariably linked


#70853

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-05-02 08:45 +0000
Message-ID	<53635b34$0$29965$c3e8da3$5496439d@news.astraweb.com>
In reply to	#70834

On Thu, 01 May 2014 19:02:48 -0700, Rustom Mody wrote:

> I don't know how one causally connects the 'headaches' but I've seen -
> mojibake

Mojibake is certainly more common with multiple encodings, but the 
solution to that is Unicode, not ASCII.

In fact, in your blog post you even link to a post of mine where I 
explain that ASCII has gone through multiple backwards incompatible 
changes over the decades, which means you can have a limited form of 
mojibake even in pure ASCII. Between changes over various versions of 
ASCII, and ambiguous characters allowed by the standard, you needed some 
sort of out-of-band metadata to tell you whether they intended an @ or a 
`, a | or a ¬, a £ or a #, to mention only a few.

It's only since the 1980s that ASCII, actual 7-bit US ASCII, has become 
an unambiguous standard. But that's okay, because that merely allowed 
people to create dozens of 7-bit and 8-bit variations on ASCII, all 
incompatible with each other, and *call them ASCII* regardless of the 
actual standard name.

Between ambiguities in actual ASCII, and common practice to label non-
ASCII as ASCII, I can categorically say that mojibake has always been 
possible in so-called "plain text". If you haven't noticed it, it was 
because you were only exchanging documents with people who happened to 
use the same set of characters as you.
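
For anyone who wants to reproduce the effect, here is a minimal sketch, assuming Python 3. The sample string and the choice of codecs are illustrative assumptions, not taken from any post in this thread. It shows the two usual routes to mojibake: decoding bytes with the wrong codec, and the same byte meaning different characters in two 8-bit code pages.

```python
# Illustrative only: the sample text and the codec pairs are assumptions.
text = "£5 café"

# Bytes written as UTF-8, then misread as Latin-1: classic mojibake.
garbled = text.encode("utf-8").decode("latin-1")

# Undoing the wrong decode recovers the original, because Latin-1 maps
# every byte value to exactly one code point.
recovered = garbled.encode("latin-1").decode("utf-8")

# One byte, two 8-bit code pages, two different characters.
byte = b"\x9c"
as_cp437 = byte.decode("cp437")    # pound sign in the old DOS code page
as_cp1252 = byte.decode("cp1252")  # oe ligature in the Windows code page

print(repr(garbled), repr(recovered), as_cp437, as_cp1252)
```

Without out-of-band metadata naming the encoding, nothing in the bytes themselves says which of these readings was intended.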

> - unicode 'number-boxes' (what are these called?) 

They are missing character glyphs, and they have nothing to do with 
Unicode. They are due to deficiencies in the text font you are using.

Admittedly with Unicode's 0x110000 possible code points (actually more 
glyphs, since a single code point can have several) it isn't surprising 
that most font designers have neither the time, skill nor desire to create 
a glyph for every single code point. But then the same applies even for 
more restrictive 8-bit encodings -- sometimes font designers don't even 
bother providing glyphs for *ASCII* characters.

(E.g. they may only provide glyphs for uppercase A...Z, not lowercase.)

> - Worst of all what we
> *don't* see -- how many others don't see what we see?

Again, this is a deficiency of the font. There are very few code points in 
Unicode which are intended to be invisible, e.g. space, newline, zero-
width joiner, control characters, etc., but they ought to be equally 
invisible to everyone. No printable character should ever be invisible in 
any decent font.

> I never knew of any of this in the good ol days of ASCII

You must have been happy with a very impoverished set of symbols, then.

> ¶ Passive voice is often the best choice in the interests of political
> correctness
> 
> It would be a pleasant surprise if everyone sees a pilcrow at start of
> line above

I do.

-- 
Steven D'Aprano
http://import-that.dreamwidth.org/


#70856

From	Chris Angelico <rosuav@gmail.com>
Date	2014-05-02 19:08 +1000
Message-ID	<mailman.9650.1399021712.18130.python-list@python.org>
In reply to	#70853

On Fri, May 2, 2014 at 6:45 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
>> - unicode 'number-boxes' (what are these called?)
>
> They are missing character glyphs, and they have nothing to do with
> Unicode. They are due to deficiencies in the text font you are using.
>
> Admittedly with Unicode's 0x110000 possible code points (actually more
> glyphs, since a single code point can have several) it isn't surprising
> that most font designers have neither the time, skill nor desire to create
> a glyph for every single code point. But then the same applies even for
> more restrictive 8-bit encodings -- sometimes font designers don't even
> bother providing glyphs for *ASCII* characters.
>
> (E.g. they may only provide glyphs for uppercase A...Z, not lowercase.)

This is another area where Unicode has given us "a great improvement
over the old method of giving satisfaction". Back in the 1990s on
OS/2, DOS, and Windows, a missing glyph might be (a) blank, (b) a
simple square with no information, or (c) copied from some other font
(common with dingbats fonts). With Unicode, the standard is to show a
little box *with the hex digits in it*. Granted, those boxes are a LOT
more readable for BMP characters than SMP (unless your text is huge,
six digits in the space of one character will make them pretty tiny),
and a "Unicode" font will generally include all (or at least most) of
the BMP, but it's still better than having no information at all.
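
To get the same digits without squinting at the box, a short sketch along these lines may help, assuming Python 3. The helper name `box_label` is hypothetical, and padding to six digits outside the BMP is an assumption about how fallback renderers typically lay out the box, not something any standard mandates.

```python
import unicodedata

def box_label(ch):
    """Hex digits a 'missing glyph' box might show (hypothetical helper)."""
    cp = ord(ch)
    # Four digits cover the BMP; beyond it, pad to six (an assumption
    # about typical fallback rendering).
    return f"{cp:04X}" if cp <= 0xFFFF else f"{cp:06X}"

# A pilcrow, one of the bracket characters, and a character outside the BMP.
for ch in "\u00b6\u27ee\U0001F600":
    print(box_label(ch), unicodedata.name(ch, "<unnamed>"))
```

The name lookup depends on the Unicode database shipped with the interpreter, so very new characters may come back as `<unnamed>`.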

ChrisA


#70860

From	Jussi Piitulainen <jpiitula@ling.helsinki.fi>
Date	2014-05-02 13:04 +0300
Message-ID	<qottx98o5so.fsf@ruuvi.it.helsinki.fi>
In reply to	#70856

Chris Angelico writes:

> (common with dingbats fonts). With Unicode, the standard is to show
> a little box *with the hex digits in it*. Granted, those boxes are a
> LOT more readable for BMP characters than SMP (unless your text is
> huge, six digits in the space of one character will make them pretty
> tiny), and a "Unicode" font will generally include all (or at least
> most) of the BMP, but it's still better than having no information

I needed to see such tiny numbers just today, just the four of them in
the tiny box. So I pressed C-+ a few times to _make_ the text huge,
obtained my information, and returned to my normal text size with C--.

Perfect. Usually all I need to know is that I have a character for
which I don't have a glyph, but this time I wanted to record the
number because I was testing things rather than reading the text.


#70862

From	Rustom Mody <rustompmody@gmail.com>
Date	2014-05-02 03:39 -0700
Message-ID	<0bdd2577-2893-4564-9857-fcfc6021dced@googlegroups.com>
In reply to	#70853

On Friday, May 2, 2014 2:15:41 PM UTC+5:30, Steven D'Aprano wrote:
> On Thu, 01 May 2014 19:02:48 -0700, Rustom Mody wrote:
> > - Worst of all what we
> > *don't* see -- how many others don't see what we see?

> Again, this is a deficiency of the font. There are very few code points in 
> Unicode which are intended to be invisible, e.g. space, newline, zero-
> width joiner, control characters, etc., but they ought to be equally 
> invisible to everyone. No printable character should ever be invisible in 
> any decent font.

That's not what I meant.

I wrote http://blog.languager.org/2014/04/unicoded-python.html
 – mostly on a Debian box.
Later, seeing it on a less heavily set-up Ubuntu box, I find that
 ⟮ ⟯ ⟬ ⟭ ⦇ ⦈ ⦉ ⦊
have become 'missing-glyph' boxes.

It leads me to ask: how much else of what I am writing has some random
reader simply not seen?
Quite simply, we can never know – because most are going to go away saying
"mojibaked/garbled rubbish".

Speaking of what you understood of what I said:
Yes, invisible chars are another problem I was recently bitten by.
I pasted something from Google into Emacs' org mode.
Following that link again, I kept getting a broken link.

Until I found that the link had an invisible char.

The problem was that Emacs was faithfully rendering that char according
to the standard, i.e. invisibly!
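
A rough way to catch this sort of thing before following a pasted link, sketched in Python 3. The helper name `invisible_chars` and the example URL are hypothetical; the post does not say which character was actually involved, so a ZERO WIDTH SPACE is used purely as an assumption.

```python
import unicodedata

def invisible_chars(s):
    """Return (index, code point, name) for control/format characters in s."""
    hits = []
    for i, ch in enumerate(s):
        # Cc = control characters, Cf = format characters
        # (zero-width space, zero-width joiner, byte order mark, ...).
        if unicodedata.category(ch) in ("Cc", "Cf"):
            name = unicodedata.name(ch, "<unnamed>")
            hits.append((i, f"U+{ord(ch):04X}", name))
    return hits

# Hypothetical pasted link with a trailing ZERO WIDTH SPACE.
url = "http://example.org/page\u200b"
print(invisible_chars(url))
```

Checking only the Cc and Cf categories is a deliberate simplification; it will not flag ordinary spaces or look-alike characters.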
