Groups > comp.lang.python > #70722 > unrolled thread
| Started by | wxjmfauth@gmail.com |
|---|---|
| First post | 2014-04-29 10:37 -0700 |
| Last post | 2014-04-30 23:00 -0700 |
| Articles | 20 on this page of 56 — 16 participants |
Unicode 7 wxjmfauth@gmail.com - 2014-04-29 10:37 -0700
Re: Unicode 7 Tim Chase <python.list@tim.thechases.com> - 2014-04-29 12:59 -0500
Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-04-29 21:53 -0700
Re: Unicode 7 Steven D'Aprano <steve@pearwood.info> - 2014-05-01 05:00 +0000
Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-05-01 11:04 -0700
Re: Unicode 7 Terry Reedy <tjreedy@udel.edu> - 2014-05-01 18:38 -0400
Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-05-01 19:29 -0700
Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-05-01 19:39 -0700
Re: Unicode 7 Chris Angelico <rosuav@gmail.com> - 2014-05-02 13:01 +1000
Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-05-01 20:16 -0700
Re: Unicode 7 Terry Reedy <tjreedy@udel.edu> - 2014-05-02 01:05 -0400
Re: Unicode 7 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-02 03:15 +0000
Re: Unicode 7 MRAB <python@mrabarnett.plus.com> - 2014-05-02 00:33 +0100
Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-05-01 19:02 -0700
Re: Unicode 7 Ben Finney <ben@benfinney.id.au> - 2014-05-02 12:39 +1000
Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-05-01 19:59 -0700
Re: Unicode 7 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-02 08:45 +0000
Re: Unicode 7 Chris Angelico <rosuav@gmail.com> - 2014-05-02 19:08 +1000
Re: Unicode 7 Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2014-05-02 13:04 +0300
Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-05-02 03:39 -0700
Re: Unicode 7 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-02 11:55 +0000
Re: Unicode 7 Marko Rauhamaa <marko@pacujo.net> - 2014-05-02 15:19 +0300
Re: Unicode 7 Ben Finney <ben@benfinney.id.au> - 2014-05-03 07:07 +1000
Re: Unicode 7 Roy Smith <roy@panix.com> - 2014-05-02 17:13 -0400
Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-05-02 09:03 -0700
Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-05-02 09:50 -0700
Re: Unicode 7 Michael Torrie <torriem@gmail.com> - 2014-05-02 11:39 -0600
Re: Unicode 7 Ned Batchelder <ned@nedbatchelder.com> - 2014-05-02 13:46 -0400
Re: Unicode 7 Peter Otten <__peter__@web.de> - 2014-05-02 20:07 +0200
Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-05-02 17:58 -0700
Re: Unicode 7 Ned Batchelder <ned@nedbatchelder.com> - 2014-05-02 21:18 -0400
Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-05-02 18:42 -0700
Re: Unicode 7 Chris Angelico <rosuav@gmail.com> - 2014-05-03 11:54 +1000
Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-05-02 19:02 -0700
Re: Unicode 7 Chris Angelico <rosuav@gmail.com> - 2014-05-03 11:15 +1000
Re: Unicode 7 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-03 02:02 +0000
Re: Unicode 7 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-03 02:04 +0000
Re: Unicode 7 Chris Angelico <rosuav@gmail.com> - 2014-05-03 12:17 +1000
Re: Unicode 7 Terry Reedy <tjreedy@udel.edu> - 2014-05-02 22:19 -0400
Re: Unicode 7 Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2014-05-03 12:57 -0400
Re: Unicode 7 Tim Chase <python.list@tim.thechases.com> - 2014-05-02 07:58 -0500
Re: Unicode 7 MRAB <python@mrabarnett.plus.com> - 2014-05-02 17:52 +0100
Re: Unicode 7 Terry Reedy <tjreedy@udel.edu> - 2014-05-02 00:16 -0400
Re: Unicode 7 Rustom Mody <rustompmody@gmail.com> - 2014-05-01 21:42 -0700
Re: Unicode 7 Chris Angelico <rosuav@gmail.com> - 2014-05-02 14:54 +1000
Re: Unicode 7 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-02 08:08 +0000
Re: Unicode 7 Chris Angelico <rosuav@gmail.com> - 2014-05-02 19:01 +1000
Re: Unicode 7 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-05-02 11:52 +0000
Re: Unicode 7 Ben Finney <ben@benfinney.id.au> - 2014-05-02 19:16 +1000
Re: Unicode 7 Marko Rauhamaa <marko@pacujo.net> - 2014-05-02 13:05 +0300
Re: Unicode 7 Chris Angelico <rosuav@gmail.com> - 2014-05-02 19:24 +1000
Re: Unicode 7 MRAB <python@mrabarnett.plus.com> - 2014-05-02 18:07 +0100
Re: Unicode 7 MRAB <python@mrabarnett.plus.com> - 2014-04-29 19:12 +0100
Re: Unicode 7 wxjmfauth@gmail.com - 2014-04-30 00:06 -0700
Re: Unicode 7 Tim Chase <python.list@tim.thechases.com> - 2014-04-30 13:48 -0500
Re: Unicode 7 wxjmfauth@gmail.com - 2014-04-30 23:00 -0700
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2014-04-29 10:37 -0700 |
| Subject | Unicode 7 |
| Message-ID | <d6e81de5-a82b-491f-b2f0-7ab4a24cff03@googlegroups.com> |
Let's see how ready Python is for the next Unicode version
(Unicode 7.0.0.Beta).
>>> import timeit
>>> timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = 'z'")
[1.4027834829454946, 1.38714224331963, 1.3822586635296261]
>>> timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = '\u0fce'")
[5.462776291480395, 5.4479432055423445, 5.447874284053398]
>>>
>>>
>>> # more interesting
>>> timeit.repeat("(x*1000 + y)[:-1]",\
... setup="x = 'abc'.encode('utf-8'); y = '\u0fce'.encode('utf-8')")
[1.3496489533188765, 1.328654286266783, 1.3300913977710707]
>>>
Note 1: "lookup" is not the problem.
Note 2: From Unicode.org : "[...] We strongly encourage [...] and test
them with their programs [...]"
-> Done.
jmf
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2014-04-29 12:59 -0500 |
| Message-ID | <mailman.9579.1398794381.18130.python-list@python.org> |
| In reply to | #70722 |
On 2014-04-29 10:37, wxjmfauth@gmail.com wrote:
> >>> timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = 'z'")
> [1.4027834829454946, 1.38714224331963, 1.3822586635296261]
> >>> timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = '\u0fce'")
> [5.462776291480395, 5.4479432055423445, 5.447874284053398]
> >>>
> >>>
> >>> # more interesting
> >>> timeit.repeat("(x*1000 + y)[:-1]",\
> ... setup="x = 'abc'.encode('utf-8'); y = '\u0fce'.encode('utf-8')")
> [1.3496489533188765, 1.328654286266783, 1.3300913977710707]
> >>>
While I dislike feeding the troll, what I see here is: on your
machine, all unicode manipulations in the test should take ~5.4
seconds. But Python notices that some of your strings *don't*
require a full 32-bits and thus optimizes those operations, cutting
about 75% of the processing time (wow...4-bytes-per-char to
1-byte-per-char, I wonder where that 75% savings comes from).
So rather than highlight any *problem* with Python, your [mostly
worthless microbenchmark non-realworld] tests show that Python's
unicode implementation is awesome.
Still waiting to see an actual bug-report as mentioned on the other
thread.
-tkc
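
A rough way to see the per-character storage tiers alluded to above is to measure how much a string object grows as characters are added. This is only a sketch: it assumes CPython 3.3 or later, relies on `sys.getsizeof` (whose absolute totals differ between versions), and the helper name `bytes_per_char` is invented for this illustration.

```python
import sys


def bytes_per_char(ch, count=1000):
    """Estimate storage per character by growing a string and measuring the delta."""
    small = ch * 1
    large = ch * (count + 1)
    # The fixed object header cancels out; only the per-character payload remains.
    return (sys.getsizeof(large) - sys.getsizeof(small)) // count


for label, ch in [("ASCII 'z'", "z"), ("U+0FCE", "\u0fce"), ("U+1F600", "\U0001F600")]:
    print(label, bytes_per_char(ch), "byte(s) per character")
```

On a CPython build with the flexible string representation, the three cases should come out at 1, 2 and 4 bytes per character respectively, so U+0FCE lands in the two-byte tier rather than the four-byte one.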
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-04-29 21:53 -0700 |
| Message-ID | <ac9b2a50-3b5d-4ee8-8954-9f0f1ab490b6@googlegroups.com> |
| In reply to | #70723 |
On Tuesday, April 29, 2014 11:29:23 PM UTC+5:30, Tim Chase wrote:
> While I dislike feeding the troll, what I see here is:
<snipped>
Since its Unicode-troll time, here's my contribution
http://blog.languager.org/2014/04/unicode-and-unix-assumption.html
:-)
More seriously, since Ive quoted some esteemed members of this list
explicitly (Steven) and the list in general, please let me know if
something is inaccurate or inappropriate
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2014-05-01 05:00 +0000 |
| Message-ID | <5361d4f9$0$11109$c3e8da3@news.astraweb.com> |
| In reply to | #70763 |
On Tue, 29 Apr 2014 21:53:22 -0700, Rustom Mody wrote:
> On Tuesday, April 29, 2014 11:29:23 PM UTC+5:30, Tim Chase wrote:
>> While I dislike feeding the troll, what I see here is:
>
> <snipped>
>
> Since its Unicode-troll time, here's my contribution
> http://blog.languager.org/2014/04/unicode-and-unix-assumption.html
I disagree with much of your characterisation of the Unix assumption,
and I point out that out of the two most widespread flavours of OS
today, Linux/Unix and Windows, it is *Windows* and not Unix which still
regularly uses legacy encodings.
Also your link to Joel On Software mistakenly links to me instead of
Joel.
There's a missing apostrophe in "Ive" [sic] in Acknowledgment #2.
I didn't notice any other typos.
--
Steven
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-05-01 11:04 -0700 |
| Message-ID | <82067b83-a6f5-4b16-b012-385535ea5607@googlegroups.com> |
| In reply to | #70807 |
On Thursday, May 1, 2014 10:30:43 AM UTC+5:30, Steven D'Aprano wrote:
> On Tue, 29 Apr 2014 21:53:22 -0700, Rustom Mody wrote:
> > On Tuesday, April 29, 2014 11:29:23 PM UTC+5:30, Tim Chase wrote:
> >> While I dislike feeding the troll, what I see here is:
> > Since its Unicode-troll time, here's my contribution
> > http://blog.languager.org/2014/04/unicode-and-unix-assumption.html
> Also your link to Joel On Software mistakenly links to me instead of Joel.
> There's a missing apostrophe in "Ive" [sic] in Acknowledgment #2.
Done, Done.
> I didn't notice any other typos.
Thank you sir!
> I point out that out of the two most widespread flavours of OS today,
> Linux/Unix and Windows, it is *Windows* and not Unix which still
> regularly uses legacy encodings.
Not sure what you are suggesting...
That (I am suggesting that) 8859 is legacy and 1252 is not?
> I disagree with much of your characterisation of the Unix assumption,
I'd be interested to know the details -- Contents? Details? Tone? Tenor?
Blaspheming the sacred scripture?
(if you are so inclined of course)
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2014-05-01 18:38 -0400 |
| Message-ID | <mailman.9637.1398983969.18130.python-list@python.org> |
| In reply to | #70818 |
On 5/1/2014 2:04 PM, Rustom Mody wrote:
>>> Since its Unicode-troll time, here's my contribution
>>> http://blog.languager.org/2014/04/unicode-and-unix-assumption.html
I will not comment on the Unix-assumption part, but I think you go wrong
with this: "Unicode is a Headache". The major headache is that unicode
and its very few encodings are not universally used. The headache is all
the non-unicode legacy encodings still being used. So you better title
this section 'Non-Unicode is a Headache'.
The first sentence is this misleading tautology: "With ASCII, data is
ASCII whether its file, core, terminal, or network; ie "ABC" is
65,66,67." Let me translate: "If all text is ASCII encoded, then text
data is ASCII, whether ..." But it was never the case that all text was
ASCII encoded. IBM used 6-bit BCDIC and then 8-bit EBCDIC and I believe
still uses the latter. Other mainframe makers used other encodings of
A-Z + 0-9 + symbols + control codes. The all-ASCII paradise was never
universal. You could have just as well said "With EBCDIC, data is
EBCDIC, whether ..."
https://en.wikipedia.org/wiki/Ascii
https://en.wikipedia.org/wiki/EBCDIC
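
As a small illustration of the point that "ABC is 65,66,67" only holds once ASCII is assumed, the following sketch encodes the same text with Python's `ascii` codec and with `cp037`, one of the EBCDIC code pages shipped with the standard library. The choice of `cp037` is an assumption made for the example; other EBCDIC variants exist.

```python
text = "ABC"

# The same three letters, stored under two different encodings.
ascii_codes = list(text.encode("ascii"))
ebcdic_codes = list(text.encode("cp037"))

print("ASCII :", ascii_codes)
print("EBCDIC:", ebcdic_codes)
```

The byte values differ even though the text is identical, which is why the encoding always has to be known out of band.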
A crucial step in the spread of Ascii was its use for microcomputers,
including the IBM PC. The latter was considered a toy by the mainframe
guys. If they had known that PCs would partly take over the computing
world, they might have suggested or insisted that it use EBCDIC.
"With unicode there are:
encodings"
where 'encodings' is linked to
https://en.wikipedia.org/wiki/Character_encodings_in_HTML
If html 'always' used utf-8 (like xml), as has become common but not
universal, all of the problems with *non-unicode* character sets and
encodings would disappear. The pre-unicode declarations could then
disappear. More truthful: "without unicode there are 100s of encodings
and with unicode only 3 that we should worry about."
"in-memory formats"
These are not the concern of the using programmer as long as they do not
introduce bugs or limitations (as do all the languages stuck on UCS-2
and many using UTF-16, including old Python narrow builds). Using what
should generally be the universal transmission format, UTF-8, as the
internal format means either losing indexing and slicing, having those
operations slow from O(1) to O(len(string)), or adding an index table
that is not part of the unicode standard. Using UTF-32 avoids the above
but usually wastes space -- up to 75%.
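
The trade-off described above can be sketched in a few lines. This is an illustration only, not how any interpreter stores strings: the helper names `utf8_char_at` and `utf32_char_at` are invented here, and the code works on encoded `bytes` to make the cost of each lookup visible.

```python
def utf8_char_at(data, index):
    """Return code point number `index` from UTF-8 bytes by scanning from the start."""
    seen = -1
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:          # lead byte: a new code point starts here
            seen += 1
            if seen == index:
                end = pos + 1
                while end < len(data) and data[end] & 0xC0 == 0x80:
                    end += 1             # swallow the continuation bytes
                return data[pos:end].decode("utf-8")
    raise IndexError(index)


def utf32_char_at(data, index):
    """Return code point number `index` from UTF-32-BE bytes with a single slice."""
    return data[4 * index : 4 * index + 4].decode("utf-32-be")


text = "a\u20acb"                        # 'a', EURO SIGN, 'b'
utf8, utf32 = text.encode("utf-8"), text.encode("utf-32-be")
print(utf8_char_at(utf8, 2), utf32_char_at(utf32, 2))

ascii_text = "hello"
wasted = 1 - len(ascii_text.encode("utf-8")) / len(ascii_text.encode("utf-32-be"))
print(f"UTF-32 overhead for ASCII text: {wasted:.0%}")
```

The UTF-8 lookup has to walk the bytes because characters have variable width, while the UTF-32 lookup is a fixed-offset slice; for pure ASCII text the fixed width costs three padding bytes out of every four.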
"strange beasties like python's FSR"
Have you really let yourself be poisoned by JMF's bizarre rants? The FSR
is an *internal optimization* that benefits most unicode operations that
people actually perform. It uses UTF-32 by default but adapts to the
strings users create by compressing the internal format. The compression
is trivial -- simply dropping leading null bytes common to all
characters -- so each character is still readable as is. The string
header records how many bytes are left. Is the idea of algorithms that
adapt to inputs really strange to you?
Like good adaptive algorithms, the FSR is invisible to the user except
for reducing space or time or maybe both. Unicode operations are
otherwise the same as with previous wide builds. People who used to use
narrow-builds also benefit from bug elimination. The only 'headaches'
involved might have been those of the developers who optimized previous
wide builds.
CPython has many other functions with special-case optimizations and
'fast paths' for common, simple cases. For instance, (some? all?) number
operations are optimized for pairs of integers. Do you call these
'strange beasties'?
PyPy is faster than CPython, when it is, because it is even more
adaptable to particular computations by creating new fast paths. The
mechanism to create these 'strange beasties' might have been a headache
for the writers, but when it works, which it now seems to, it is not for
the users.
--
Terry Jan Reedy
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-05-01 19:29 -0700 |
| Message-ID | <8c30f6fc-8493-419b-a4c8-dfe4a9d30de0@googlegroups.com> |
| In reply to | #70829 |
On Friday, May 2, 2014 4:08:35 AM UTC+5:30, Terry Reedy wrote:
> On 5/1/2014 2:04 PM, Rustom Mody wrote:
> >>> Since its Unicode-troll time, here's my contribution
> >>> http://blog.languager.org/2014/04/unicode-and-unix-assumption.html
> I will not comment on the Unix-assumption part, but I think you go wrong
> with this: "Unicode is a Headache". The major headache is that unicode
> and its very few encodings are not universally used. The headache is all
> the non-unicode legacy encodings still being used. So you better title
> this section 'Non-Unicode is a Headache'.
> The first sentence is this misleading tautology: "With ASCII, data is
> ASCII whether its file, core, terminal, or network; ie "ABC" is
> 65,66,67." Let me translate: "If all text is ASCII encoded, then text
> data is ASCII, whether ..." But it was never the case that all text was
> ASCII encoded. IBM used 6-bit BCDIC and then 8-bit EBCDIC and I believe
> still uses the latter. Other mainframe makers used other encodings of
> A-Z + 0-9 + symbols + control codes. The all-ASCII paradise was never
> universal. You could have just as well said "With EBCDIC, data is
> EBCDIC, whether ..."
> https://en.wikipedia.org/wiki/Ascii
> https://en.wikipedia.org/wiki/EBCDIC
> A crucial step in the spread of Ascii was its use for microcomputers,
> including the IBM PC. The latter was considered a toy by the mainframe
> guys. If they had known that PCs would partly take over the computing
> world, they might have suggested or insisted that it use EBCDIC.
> "With unicode there are:
> encodings"
> where 'encodings' is linked to
> https://en.wikipedia.org/wiki/Character_encodings_in_HTML
> If html 'always' used utf-8 (like xml), as has become common but not
> universal, all of the problems with *non-unicode* character sets and
> encodings would disappear. The pre-unicode declarations could then
> disappear. More truthful: "without unicode there are 100s of encodings
> and with unicode only 3 that we should worry about."
> "in-memory formats"
> These are not the concern of the using programmer as long as they do not
> introduce bugs or limitations (as do all the languages stuck on UCS-2
> and many using UTF-16, including old Python narrow builds). Using what
> should generally be the universal transmission format, UTF-8, as the
> internal format means either losing indexing and slicing, having those
> operations slow from O(1) to O(len(string)), or adding an index table
> that is not part of the unicode standard. Using UTF-32 avoids the above
> but usually wastes space -- up to 75%.
> "strange beasties like python's FSR"
> Have you really let yourself be poisoned by JMF's bizarre rants? The FSR
> is an *internal optimization* that benefits most unicode operations that
> people actually perform. It uses UTF-32 by default but adapts to the
> strings users create by compressing the internal format. The compression
> is trivial -- simply dropping leading null bytes common to all
> characters -- so each character is still readable as is. The string
> header records how many bytes are left. Is the idea of algorithms that
> adapt to inputs really strange to you?
> Like good adaptive algorithms, the FSR is invisible to the user except
> for reducing space or time or maybe both. Unicode operations are
> otherwise the same as with previous wide builds. People who used to use
> narrow-builds also benefit from bug elimination. The only 'headaches'
> involved might have been those of the developers who optimized previous
> wide builds.
> CPython has many other functions with special-case optimizations and
> 'fast paths' for common, simple cases. For instance, (some? all?) number
> operations are optimized for pairs of integers. Do you call these
> 'strange beasties'?
Here is an instance of someone who would like a certain optimization to be
dis-able-able
https://mail.python.org/pipermail/python-list/2014-February/667169.html
To the best of my knowledge its nothing to do with unicode or with jmf.
Why if optimizations are always desirable do C compilers have:
-O0 O1 O2 O3 and zillions of more specific flags?
JFTR I have no issue with FSR. What we have to hand to jmf - willingly
or otherwise - is that many more people have heard of FSR thanks to him.
[I am one of them]
I dont even know whether jmf has a real
technical (as he calls it 'mathematical') issue or its entirely political:
"Why should I pay more for a EURO sign than a $ sign?"
Well perhaps that is more related to the exchange rate than to python!
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-05-01 19:39 -0700 |
| Message-ID | <51602756-7019-4f79-b168-d12b2b801a8e@googlegroups.com> |
| In reply to | #70837 |
On Friday, May 2, 2014 7:59:55 AM UTC+5:30, Rustom Mody wrote:
> "Why should I pay more for a EURO sign than a $ sign?"
A unicode 'headache' there:
I typed the Euro sign (trying again € ) not EURO
Somebody -- I guess its GG in overhelpful mode -- converted it
And made my post: Content-Type: text/plain; charset=ISO-8859-1
Will some devanagari vowels help it stop being helpful?
अ आ इ ई उ ऊ ए ऐ
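
One plausible reason for the substitution is that the euro sign simply has no code in ISO-8859-1, so software that insists on that charset has to replace it with something. The sketch below checks this with Python's standard codecs; it says nothing about what Google Groups actually does internally, which is not known from this thread.

```python
euro = "\u20ac"  # EURO SIGN

results = {}
for codec in ("iso-8859-1", "iso-8859-15", "cp1252", "utf-8"):
    try:
        results[codec] = euro.encode(codec)
    except UnicodeEncodeError:
        results[codec] = None  # this charset has no code for the character

for codec, encoded in results.items():
    print(f"{codec:12} {encoded!r}")
```

ISO-8859-1 fails outright, while the later ISO-8859-15 and Windows-1252 each assign the euro sign a single byte (at different positions), and UTF-8 encodes it in three bytes.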
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-05-02 13:01 +1000 |
| Message-ID | <mailman.9644.1398999726.18130.python-list@python.org> |
| In reply to | #70837 |
On Fri, May 2, 2014 at 12:29 PM, Rustom Mody <rustompmody@gmail.com> wrote:
> Here is an instance of someone who would like a certain optimization to be
> dis-able-able
>
> https://mail.python.org/pipermail/python-list/2014-February/667169.html
>
> To the best of my knowledge its nothing to do with unicode or with jmf.
It doesn't, and it has only to do with testing. I've had similar
issues at times; for instance, trying to benchmark one language or
language construct against another often means fighting against an
optimizer. (How, for instance, do you figure out what loop overhead
is, when an empty loop is completely optimized out?) This is nothing
whatsoever to do with Unicode, nor to do with the optimization that
Python and Pike (and maybe other languages) do with the storage of
Unicode strings.
ChrisA
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-05-01 20:16 -0700 |
| Message-ID | <17d23e0b-7876-49b3-a785-4601f58414c6@googlegroups.com> |
| In reply to | #70841 |
On Friday, May 2, 2014 8:31:56 AM UTC+5:30, Chris Angelico wrote:
> On Fri, May 2, 2014 at 12:29 PM, Rustom Mody wrote:
> > Here is an instance of someone who would like a certain optimization to be
> > dis-able-able
> > https://mail.python.org/pipermail/python-list/2014-February/667169.html
> > To the best of my knowledge its nothing to do with unicode or with jmf.
> It doesn't, and it has only to do with testing. I've had similar
> issues at times; for instance, trying to benchmark one language or
> language construct against another often means fighting against an
> optimizer. (How, for instance, do you figure out what loop overhead
> is, when an empty loop is completely optimized out?) This is nothing
> whatsoever to do with Unicode, nor to do with the optimization that
> Python and Pike (and maybe other languages) do with the storage of
> Unicode strings.
This was said in response to Terry's
> CPython has many other functions with special-case optimizations and
> 'fast paths' for common, simple cases. For instance, (some? all?) number
> operations are optimized for pairs of integers. Do you call these
> 'strange beasties'?
which evidently vanished -- optimized out :D -- in multiple levels of quoting
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2014-05-02 01:05 -0400 |
| Message-ID | <mailman.9648.1399007203.18130.python-list@python.org> |
| In reply to | #70837 |
On 5/1/2014 10:29 PM, Rustom Mody wrote:
> Here is an instance of someone who would like a certain optimization to be
> dis-able-able
>
> https://mail.python.org/pipermail/python-list/2014-February/667169.html
>
> To the best of my knowledge its nothing to do with unicode or with jmf.
Right. Ned has an actual technical reason to complain, even though the
developers do not consider it strong enough to act.
> Why if optimizations are always desirable do C compilers have:
> -O0 O1 O2 O3 and zillions of more specific flags?
One reason is that many optimizations sometimes introduce bugs, or to
put it another way, they are based on assumptions that are not true for
all code. For instance, some people have suggested that CPython should
have an optional optimization based on the assumption that builtin names
are never rebound. That is true for perhaps many code files, but
definitely not all. Guido does not seem to like such conditional
optimizations.
I can think of three reasons for not adding to the numerous options
CPython already has.
1. We do not have the developer resources to handle the added
complications of multiple optimization options.
2. Zillions of options and flags confuse users. As it is, most options
are seldom used.
3. Optimization options are easily misused, possibly leading to silently
buggy results, or mysterious failures. For instance, people sometimes
rebind builtins without realizing what they have done, such as using
'id' as a parameter name. Being in the habit of routinely using the
'assume no rebinding' option would lead to problems.
I am rather sure that the string (unicode) test suite was reviewed and
the performance of 3.2 wide builds recorded before the new
implementation was committed.
The tracker currently has 37 behavior (bug) issues marked for the
unicode component. In a quick review, I do not see that any have
anything to do with using standard UTF-32 versus adaptive UTF-32.
Indeed, I believe a majority of the 37 were filed before 3.3 or are 2.7
specific. Problems with FSR itself have been fixed as discovered.
> JFTR I have no issue with FSR. What we have to hand to jmf - willingly
> or otherwise - is that many more people have heard of FSR thanks to him.
> [I am one of them]
Somewhat ironically, I suppose you are right.
> I dont even know whether jmf has a real
> technical (as he calls it 'mathematical') issue or its entirely political:
I would call his view personal or philosophical. I only object to
endless repetition and the deception of claiming that personal views are
mathematical facts.
--
Terry Jan Reedy
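
The remark about rebinding builtins can be made concrete with a short sketch. It compiles a tiny function into its own namespace (an arrangement chosen only so the demonstration does not disturb the real `len`) and then rebinds the name there; the names `count_items` and `namespace` are hypothetical.

```python
# Compile a tiny function into a private namespace so the demo does not
# touch the builtins of the surrounding program.
source = "def count_items(items):\n    return len(items)\n"
namespace = {}
exec(source, namespace)
count_items = namespace["count_items"]

before = count_items([1, 2, 3])      # `len` resolves to the builtin

# Rebind `len` in the function's globals, much as an accidental
# module-level `len = ...` would.
namespace["len"] = lambda items: -1
after = count_items([1, 2, 3])       # the same call now sees the new binding

print(before, after)
```

Because the name is looked up each time the function runs, an optimizer that had baked in the builtin `len` ahead of time would silently produce a different answer from the unoptimized program.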
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-05-02 03:15 +0000 |
| Message-ID | <53630dcc$0$29965$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #70829 |
On Thu, 01 May 2014 18:38:35 -0400, Terry Reedy wrote:
> "strange beasties like python's FSR"
>
> Have you really let yourself be poisoned by JMF's bizarre rants? The FSR
> is an *internal optimization* that benefits most unicode operations that
> people actually perform. It uses UTF-32 by default but adapts to the
> strings users create by compressing the internal format. The compression
> is trivial -- simply dropping leading null bytes common to all
> characters -- so each character is still readable as is.
For anyone who, like me, wasn't convinced that Unicode worked that way,
you can see for yourself that it does. You don't need Python 3.3, any
version of 3.x will work. In Python 2.7, it should work if you just
change the calls from "chr()" to "unichr()":
py> for i in range(256):
...     c = chr(i)
...     u = c.encode('utf-32-be')
...     assert u[:3] == b'\0\0\0'
...     assert u[3:] == c.encode('latin-1')
...
py> for i in range(256, 0xFFFF+1):
...     c = chr(i)
...     u = c.encode('utf-32-be')
...     assert u[:2] == b'\0\0'
...     assert u[2:] == c.encode('utf-16-be')
...
py>
So Terry is correct: dropping leading zeroes, and treating the remainder
as either Latin-1 or UTF-16, works fine, and potentially saves a lot of
memory.
--
Steven D'Aprano
http://import-that.dreamwidth.org/
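
Building on the check above, here is a toy model of the "drop the leading zero bytes" idea. It is a sketch only and not CPython's actual implementation: the function names `fsr_like_pack` and `fsr_like_char_at` are invented, the packing goes through an encoded UTF-32 buffer for clarity, and lone surrogates (which `utf-32-be` refuses to encode) are not handled.

```python
def fsr_like_pack(text):
    """Pack text at 1, 2 or 4 bytes per code point, chosen by the widest character."""
    widest = max(map(ord, text), default=0)
    width = 1 if widest < 0x100 else 2 if widest < 0x10000 else 4
    utf32 = text.encode("utf-32-be")
    # Keep only the last `width` bytes of every 4-byte unit, i.e. drop the
    # leading zero bytes that all characters in this string have in common.
    packed = b"".join(utf32[i + 4 - width : i + 4] for i in range(0, len(utf32), 4))
    return width, packed


def fsr_like_char_at(width, packed, index):
    """Fixed-width lookup: indexing stays a constant-time slice."""
    unit = packed[index * width : (index + 1) * width]
    return chr(int.from_bytes(unit, "big"))


for sample in ("abc", "a\u0fce", "a\U0001F600"):
    width, packed = fsr_like_pack(sample)
    print(repr(sample), "width", width, "bytes", len(packed))
```

Every character in a given string keeps the same width, so indexing remains O(1) while pure Latin-1 text occupies a quarter of the UTF-32 space.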
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2014-05-02 00:33 +0100 |
| Message-ID | <mailman.9639.1398987208.18130.python-list@python.org> |
| In reply to | #70818 |
On 2014-05-01 23:38, Terry Reedy wrote:
> On 5/1/2014 2:04 PM, Rustom Mody wrote:
>
>>>> Since its Unicode-troll time, here's my contribution
>>>> http://blog.languager.org/2014/04/unicode-and-unix-assumption.html
>
> I will not comment on the Unix-assumption part, but I think you go wrong
> with this: "Unicode is a Headache". The major headache is that unicode
> and its very few encodings are not universally used. The headache is all
> the non-unicode legacy encodings still being used. So you better title
> this section 'Non-Unicode is a Headache'.
>
[snip]
I think he's right when he says "Unicode is a headache", but only
because it's being used to handle languages which are, themselves, a
"headache": left-to-right versus right-to-left, sometimes on the same
line; diacritics, possibly several on a glyph; etc.
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-05-01 19:02 -0700 |
| Message-ID | <eb56fd65-4729-42db-bcd4-179c19aaf485@googlegroups.com> |
| In reply to | #70831 |
On Friday, May 2, 2014 5:03:21 AM UTC+5:30, MRAB wrote:
> On 2014-05-01 23:38, Terry Reedy wrote:
> > On 5/1/2014 2:04 PM, Rustom Mody wrote:
> >>>> Since its Unicode-troll time, here's my contribution
> >>>> http://blog.languager.org/2014/04/unicode-and-unix-assumption.html
> > I will not comment on the Unix-assumption part, but I think you go wrong
> > with this: "Unicode is a Headache". The major headache is that unicode
> > and its very few encodings are not universally used. The headache is all
> > the non-unicode legacy encodings still being used. So you better title
> > this section 'Non-Unicode is a Headache'.
> [snip]
> I think he's right when he says "Unicode is a headache", but only
> because it's being used to handle languages which are, themselves, a
> "headache": left-to-right versus right-to-left, sometimes on the same
> line; diacritics, possibly several on a glyph; etc.
Yes, the headaches go a little further back than Unicode.
There is a certain large old book...
In which is described the building of a 'tower that reached up to heaven'...
At which point 'it was decided'¶ to do something to prevent that.
And our headaches started.
I dont know how one causally connects the 'headaches' but Ive seen
- mojibake
- unicode 'number-boxes' (what are these called?)
- Worst of all what we *dont* see -- how many others dont see what we see?
I never knew of any of this in the good ol days of ASCII
¶ Passive voice is often the best choice in the interests of political correctness
It would be a pleasant surprise if everyone sees a pilcrow at start of line above
| From | Ben Finney <ben@benfinney.id.au> |
|---|---|
| Date | 2014-05-02 12:39 +1000 |
| Message-ID | <mailman.9643.1398998400.18130.python-list@python.org> |
| In reply to | #70834 |
Rustom Mody <rustompmody@gmail.com> writes:
> Yes, the headaches go a little further back than Unicode.
Okay, so can you change your article to reflect the fact that the
headaches both pre-date Unicode, and are made much easier by Unicode?
> There is a certain large old book...
Ah yes, the neo-Sumerian story “Enmerkar_and_the_Lord_of_Aratta”
<URL:https://en.wikipedia.org/wiki/Enmerkar_and_the_Lord_of_Aratta>.
Probably inspired by stories older than that, of course.
> In which is described the building of a 'tower that reached up to heaven'...
> At which point 'it was decided'¶ to do something to prevent that.
> And our headaches started.
And other myths with fantastic reasons for the diversity of language
<URL:https://en.wikipedia.org/wiki/Mythical_origins_of_language>.
> I never knew of any of this in the good ol days of ASCII
Yes, by ignoring all other writing systems except one's own – and
thereby excluding most of the world's people – the system can be made
simpler.
Hopefully the proportion of programmers who still feel they can make
such a parochial choice is rapidly shrinking.
--
 \     “Why doesn't Python warn that it's not 100% perfect? Are people |
  `\        just supposed to “know” this, magically?” —Mitya Sirenef, |
_o__)                                  comp.lang.python, 2012-12-27 |
Ben Finney
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-05-01 19:59 -0700 |
| Message-ID | <92004436-36a8-49ce-b4ec-dc0237b04bac@googlegroups.com> |
| In reply to | #70839 |
On Friday, May 2, 2014 8:09:44 AM UTC+5:30, Ben Finney wrote:
> Rustom Mody writes:
> > Yes, the headaches go a little further back than Unicode.
> Okay, so can you change your article to reflect the fact that the
> headaches both pre-date Unicode, and are made much easier by Unicode?
Predate: Yes
Made easier: No
> > There is a certain large old book...
> Ah yes, the neo-Sumerian story "Enmerkar_and_the_Lord_of_Aratta"
> <URL:https://en.wikipedia.org/wiki/Enmerkar_and_the_Lord_of_Aratta>.
> Probably inspired by stories older than that, of course.
Thanks for that link
> > In which is described the building of a 'tower that reached up to heaven'...
> > At which point 'it was decided'¶ to do something to prevent that.
> > And our headaches started.
> And other myths with fantastic reasons for the diversity of language
> <URL:https://en.wikipedia.org/wiki/Mythical_origins_of_language>.
This one takes the cake - see 1st para
http://hilgart.org/enformy/BronsonRekindling.pdf
> > I never knew of any of this in the good ol days of ASCII
> Yes, by ignoring all other writing systems except one's own - and
> thereby excluding most of the world's people - the system can be made
> simpler.
> Hopefully the proportion of programmers who still feel they can make
> such a parochial choice is rapidly shrinking.
See link above: Ethnic differences and chauvinism are invariably linked
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-05-02 08:45 +0000 |
| Message-ID | <53635b34$0$29965$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #70834 |
On Thu, 01 May 2014 19:02:48 -0700, Rustom Mody wrote:

> I dont know how one causally connects the 'headaches' but Ive seen -
> mojibake

Mojibake is certainly more common with multiple encodings, but the
solution to that is Unicode, not ASCII. In fact, in your blog post you
even link to a post of mine where I explain that ASCII has gone through
multiple backwards incompatible changes over the decades, which means you
can have a limited form of mojibake even in pure ASCII.

Between changes over various versions of ASCII, and ambiguous characters
allowed by the standard, you needed some sort of out-of-band metadata to
tell you whether they intended an @ or a `, a | or a ¬, a £ or a #, to
mention only a few.

It's only since the 1980s that ASCII, actual 7-bit US ASCII, has become
an unambiguous standard. But that's okay, because that merely allowed
people to create dozens of 7-bit and 8-bit variations on ASCII, all
incompatible with each other, and *call them ASCII* regardless of the
actual standard name.

Between ambiguities in actual ASCII, and common practice to label
non-ASCII as ASCII, I can categorically say that mojibake has always been
possible in so-called "plain text". If you haven't noticed it, it was
because you were only exchanging documents with people who happened to
use the same set of characters as you.

> - unicode 'number-boxes' (what are these called?)

They are missing character glyphs, and they have nothing to do with
Unicode. They are due to deficiencies in the text font you are using.

Admittedly with Unicode's 0x10FFFF possible characters (actually more,
since a single code point can have multiple glyphs) it isn't surprising
that most font designers have neither the time, skill or desire to create
a glyph for every single code point. But then the same applies even for
more restrictive 8-bit encodings -- sometimes font designers don't even
bother providing glyphs for *ASCII* characters.

(E.g. they may only provide glyphs for uppercase A...Z, not lowercase.)

> - Worst of all what we
> *dont* see -- how many others dont see what we see?

Again, this a deficiency of the font. There are very few code points in
Unicode which are intended to be invisible, e.g. space, newline,
zero-width joiner, control characters, etc., but they ought to be equally
invisible to everyone. No printable character should ever be invisible in
any decent font.

> I never knew of any of this in the good ol days of ASCII

You must have been happy with a very impoverished set of symbols, then.

> ¶ Passive voice is often the best choice in the interests of political
> correctness
>
> It would be a pleasant surprise if everyone sees a pilcrow at start of
> line above

I do.

--
Steven D'Aprano
http://import-that.dreamwidth.org/
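
As a rough illustration of the mojibake point in this post, here is a minimal Python sketch. The sample string and the choice of Latin-1 as the "wrong" codec are assumptions made for the example, not something taken from the thread. It shows the same bytes producing different text depending on which encoding the reader assumes.

```python
# Mojibake sketch: identical bytes, read under two different encodings.
# The sample text and the Latin-1 mismatch are illustrative assumptions.
text = "£ or #"

# The writer saves the text as UTF-8.
utf8_bytes = text.encode("utf-8")

# A reader who wrongly assumes Latin-1 gets different characters,
# and no error is raised to warn them.
garbled = utf8_bytes.decode("latin-1")

# A reader who knows the real encoding recovers the text exactly.
restored = utf8_bytes.decode("utf-8")

print(repr(garbled))
print(repr(restored))
```

The wrong decode succeeds silently, which is why some out-of-band statement of the encoding is needed whenever more than one encoding is in play.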
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-05-02 19:08 +1000 |
| Message-ID | <mailman.9650.1399021712.18130.python-list@python.org> |
| In reply to | #70853 |
On Fri, May 2, 2014 at 6:45 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
>> - unicode 'number-boxes' (what are these called?)
>
> They are missing character glyphs, and they have nothing to do with
> Unicode. They are due to deficiencies in the text font you are using.
>
> Admittedly with Unicode's 0x10FFFF possible characters (actually more,
> since a single code point can have multiple glyphs) it isn't surprising
> that most font designers have neither the time, skill or desire to create
> a glyph for every single code point. But then the same applies even for
> more restrictive 8-bit encodings -- sometimes font designers don't even
> bother providing glyphs for *ASCII* characters.
>
> (E.g. they may only provide glyphs for uppercase A...Z, not lowercase.)

This is another area where Unicode has given us "a great improvement
over the old method of giving satisfaction". Back in the 1990s on
OS/2, DOS, and Windows, a missing glyph might be (a) blank, (b) a
simple square with no information, or (c) copied from some other font
(common with dingbats fonts). With Unicode, the standard is to show
a little box *with the hex digits in it*. Granted, those boxes are a
LOT more readable for BMP characters than SMP (unless your text is
huge, six digits in the space of one character will make them pretty
tiny), and a "Unicode" font will generally include all (or at least
most) of the BMP, but it's still better than having no information
at all.

ChrisA
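
To make the BMP versus SMP remark concrete, the following Python sketch computes the hex digits such a fallback box might show and the plane a character belongs to. The helper names `missing_glyph_label` and `plane_of` are hypothetical, and the four-digit and six-digit widths are an assumption based on the description in this post; real renderers and fonts may lay the digits out differently.

```python
def missing_glyph_label(ch: str) -> str:
    """Hex digits a fallback box might show for ch (assumed layout)."""
    cp = ord(ch)
    # Assumption: four digits for the BMP, six for everything above it.
    width = 4 if cp <= 0xFFFF else 6
    return f"{cp:0{width}X}"


def plane_of(ch: str) -> int:
    """Unicode plane number: 0 is the BMP, 1 is the SMP."""
    return ord(ch) >> 16


# A pilcrow (BMP) and a musical G clef (SMP) as sample characters.
for ch in ("\u00b6", "\U0001D11E"):
    print(missing_glyph_label(ch), "plane", plane_of(ch))
```

Printing the code point this way is also a quick substitute for zooming in on the box when all that is needed is the number.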
| From | Jussi Piitulainen <jpiitula@ling.helsinki.fi> |
|---|---|
| Date | 2014-05-02 13:04 +0300 |
| Message-ID | <qottx98o5so.fsf@ruuvi.it.helsinki.fi> |
| In reply to | #70856 |
Chris Angelico writes:

> (common with dingbats fonts). With Unicode, the standard is to show
> a little box *with the hex digits in it*. Granted, those boxes are a
> LOT more readable for BMP characters than SMP (unless your text is
> huge, six digits in the space of one character will make them pretty
> tiny), and a "Unicode" font will generally include all (or at least
> most) of the BMP, but it's still better than having no information

I needed to see such tiny numbers just today, just the four of them in
the tiny box. So I pressed C-+ a few times to _make_ the text huge,
obtained my information, and returned to my normal text size with C--.
Perfect.

Usually all I need to know is that I have a character for which I
don't have a glyph, but this time I wanted to record the number
because I was testing things rather than reading the text.
| From | Rustom Mody <rustompmody@gmail.com> |
|---|---|
| Date | 2014-05-02 03:39 -0700 |
| Message-ID | <0bdd2577-2893-4564-9857-fcfc6021dced@googlegroups.com> |
| In reply to | #70853 |
On Friday, May 2, 2014 2:15:41 PM UTC+5:30, Steven D'Aprano wrote:
> On Thu, 01 May 2014 19:02:48 -0700, Rustom Mody wrote:
> > - Worst of all what we
> > *dont* see -- how many others dont see what we see?
> Again, this a deficiency of the font. There are very few code points in
> Unicode which are intended to be invisible, e.g. space, newline, zero-
> width joiner, control characters, etc., but they ought to be equally
> invisible to everyone. No printable character should ever be invisible in
> any decent font.

Thats not what I meant. I wrote
http://blog.languager.org/2014/04/unicoded-python.html
– mostly on a debian box.
Later on seeing it on a less heavily setup ubuntu box, I see
⟮ ⟯ ⟬ ⟭ ⦇ ⦈ ⦉ ⦊
have become 'missing-glyph' boxes.

It leads me to ask, how much else of what I am writing, some random reader
has simply not seen?
Quite simply we can never know – because most are going to go away saying
"mojibaked/garbled rubbish"

Speaking of what you understood of what I said:
Yes invisible chars is another problem I was recently bitten by.
I pasted something from google into emacs' org mode.
Following that link again I kept getting a broken link.
Until I found that the link had an invisible char.

The problem was that emacs was faithfully rendering that char according to
standard, ie invisibly!
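
For the pasted-link problem described here, one possible way to hunt for such characters is to scan the string with Python's `unicodedata` module. This is only a sketch: the function name `find_invisible`, the sample URL, and the choice of which general categories count as "invisible" are all assumptions made for illustration.

```python
import unicodedata

# Assumption: treat format (Cf), control (Cc), and line/paragraph
# separator (Zl, Zp) characters as "invisible", plus any space
# separator (Zs) other than the plain ASCII space.
SUSPECT_CATEGORIES = {"Cf", "Cc", "Zl", "Zp"}


def find_invisible(text):
    """Return (index, code point, name) for each suspect character."""
    hits = []
    for index, ch in enumerate(text):
        category = unicodedata.category(ch)
        if category in SUSPECT_CATEGORIES or (category == "Zs" and ch != " "):
            # Control characters have no name, so supply a fallback.
            name = unicodedata.name(ch, "<unnamed>")
            hits.append((index, f"U+{ord(ch):04X}", name))
    return hits


# Hypothetical link with a zero width space hiding after the slash.
link = "http://example.org/\u200bpage"
for hit in find_invisible(link):
    print(hit)
```

A check like this reports the position and name of the hidden character, which is usually enough to locate and delete it in the editor.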