Re: Unicode 7

From Terry Reedy <tjreedy@udel.edu>
Subject Re: Unicode 7
Date Thu, 01 May 2014 18:38:35 -0400
Newsgroups comp.lang.python
Message-ID <mailman.9637.1398983969.18130.python-list@python.org>

On 5/1/2014 2:04 PM, Rustom Mody wrote:

>>> Since its Unicode-troll time, here's my contribution
>>> http://blog.languager.org/2014/04/unicode-and-unix-assumption.html

I will not comment on the Unix-assumption part, but I think you go wrong 
with this:  "Unicode is a Headache". The major headache is that unicode 
and its very few encodings are not universally used. The headache is all 
the non-unicode legacy encodings still being used. So you would do better 
to title this section 'Non-Unicode is a Headache'.

The first sentence is this misleading tautology: "With ASCII, data is 
ASCII whether its file, core, terminal, or network; ie "ABC" is 
65,66,67." Let me translate: "If all text is ASCII encoded, then text 
data is ASCII, whether ..." But it was never the case that all text was 
ASCII encoded. IBM used 6-bit BCDIC and then 8-bit EBCDIC and I believe 
still uses the latter. Other mainframe makers used other encodings of 
A-Z + 0-9 + symbols + control codes. The all-ASCII paradise was never 
universal. You could have just as well said "With EBCDIC, data is 
EBCDIC, whether ..."

https://en.wikipedia.org/wiki/Ascii
https://en.wikipedia.org/wiki/EBCDIC
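
A minimal sketch of the "ABC" point, assuming Python's cp037 codec as a 
stand-in for EBCDIC (it is only one of several EBCDIC code pages). The 
same three letters come out as different numbers under each encoding:

```python
# "ABC" as code units under ASCII and under one EBCDIC code page.
# cp037 is the EBCDIC variant chosen for this sketch; others exist.
text = "ABC"

ascii_values = list(text.encode("ascii"))
ebcdic_values = list(text.encode("cp037"))

print(ascii_values)   # [65, 66, 67]
print(ebcdic_values)  # [193, 194, 195]
```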

A crucial step in the spread of ASCII was its use for microcomputers, 
including the IBM PC. The latter was considered a toy by the mainframe 
guys. If they had known that PCs would partly take over the computing 
world, they might have suggested or insisted that it use EBCDIC.

"With unicode there are:
     encodings"
where 'encodings' is linked to
https://en.wikipedia.org/wiki/Character_encodings_in_HTML

If html 'always' used utf-8 (like xml), as has become common but not 
universal, all of the problems with *non-unicode* character sets and 
encodings would disappear. The pre-unicode declarations could then 
disappear. More truthful: "without unicode there are 100s of encodings 
and with unicode only 3 that we should worry about."
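
To illustrate, here is a rough sketch of why the legacy encodings are 
the headache. The three code pages are picked arbitrarily for the 
example, and reading "only 3" as UTF-8, UTF-16 and UTF-32 is my 
assumption:

```python
# One byte, three legacy code pages, three different characters.
raw = b"\xe9"
legacy = {name: raw.decode(name) for name in ("latin-1", "cp1251", "cp437")}
for name, char in legacy.items():
    print(name, repr(char))

# The Unicode encodings each round-trip the same text without guessing.
text = "caf\u00e9"
for name in ("utf-8", "utf-16", "utf-32"):
    assert text.encode(name).decode(name) == text
```

Without an out-of-band declaration there is no way to tell which of the 
legacy readings of that byte was intended.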

"in-memory formats"

These are not the concern of the using programmer as long as they do not 
introduce bugs or limitations (as do all the languages stuck on UCS-2 
and many using UTF-16, including old Python narrow builds). Using what 
should generally be the universal transmission format, UTF-8, as the 
internal format means either losing indexing and slicing, having those 
operations slow from O(1) to O(len(string)), or adding an index table 
that is not part of the unicode standard. Using UTF-32 avoids the above 
but usually wastes space -- up to 75%.
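
A small sketch of both costs. The helper utf8_char_at is a hypothetical 
name made up for this example, not anything from CPython; it shows that 
finding the i-th character in UTF-8 bytes needs a scan, and the last few 
lines show where the 75% figure comes from for pure-ASCII text:

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Return the code point at `index` by scanning UTF-8 bytes: O(len(data))."""
    if index < 0:
        raise IndexError(index)
    count = -1
    start = None
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a code point starts here
            count += 1
            if count == index:
                start = pos
            elif count == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(index)
    return data[start:].decode("utf-8")


text = "na\u00efve \u20ac"
data = text.encode("utf-8")
assert utf8_char_at(data, 2) == text[2]

# Space cost of a fixed 4 bytes per character for ASCII-only text.
ascii_text = "hello"
utf8_size = len(ascii_text.encode("utf-8"))
utf32_size = len(ascii_text.encode("utf-32-le"))  # -le: no byte order mark
wasted = 1 - utf8_size / utf32_size
print(utf8_size, utf32_size, wasted)  # 5 20 0.75
```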

"strange beasties like python's FSR"

Have you really let yourself be poisoned by JMF's bizarre rants? The FSR 
is an *internal optimization* that benefits most unicode operations that 
people actually perform. It uses UTF-32 by default but adapts to the 
strings users create by compressing the internal format. The compression 
is trivial -- simply dropping leading null bytes common to all 
characters -- so each character is still readable as is. The string 
header records how many bytes are left.  Is the idea of algorithms that 
adapt to inputs really strange to you?
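
One way to watch the adaptation from the outside, as a sketch only: the 
numbers are a CPython 3.3+ implementation detail, and bytes_per_char is 
a helper invented for this example. It compares the sizes of two strings 
made of one repeated character, so the fixed header cancels out:

```python
import sys


def bytes_per_char(ch: str) -> int:
    """Estimate per-character storage by differencing two repeated strings."""
    return (sys.getsizeof(ch * 2000) - sys.getsizeof(ch * 1000)) // 1000


samples = {
    "ASCII": "a",
    "Latin-1": "\u00e9",
    "BMP": "\u20ac",
    "astral": "\U0001F600",
}
widths = {name: bytes_per_char(ch) for name, ch in samples.items()}
print(widths)  # on CPython 3.3+: 1, 1, 2 and 4 bytes per character
```

Indexing works the same way in every case; only the storage width 
changes with the widest character present in the string.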

Like good adaptive algorithms, the FSR is invisible to the user except 
for reducing space or time or maybe both. Unicode operations are 
otherwise the same as with previous wide builds. People who used to use 
narrow-builds also benefit from bug elimination. The only 'headaches' 
involved might have been those of the developers who optimized previous 
wide builds.

CPython has many other functions with special-case optimizations and 
'fast paths' for common, simple cases. For instance, (some? all?) number 
operations are optimized for pairs of integers.  Do you call these 
'strange beasties'?

PyPy is faster than CPython, when it is, because it is even more 
adaptable to particular computations by creating new fast paths. The 
mechanism to create these 'strange beasties' might have been a headache 
for the writers, but when it works, which it now seems to, it is not for 
the users.

-- 
Terry Jan Reedy
