To: python-list@python.org
From: Akira Li <4kir4.1i@gmail.com>
Subject: Re: Python 3.2 has some deadly infection
Date: Fri, 06 Jun 2014 12:03:51 +0400
Newsgroups: comp.lang.python
Message-ID: <mailman.10807.1402041845.18130.python-list@python.org>

Marko Rauhamaa <marko@pacujo.net> writes:

> Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
>
>> Nevertheless, there are important abstractions that are written on top
>> of the bytes layer, and in the Unix and Linux world, the most
>> important abstraction is *text*. In the Unix world, text formats and
>> text processing is much more common in user-space apps than binary
>> processing.
>
> That linux text is not the same thing as Python's text. Conceptually,
> Python text is a sequence of 32-bit integers. Linux text is a sequence
> of 8-bit integers.

_A Unicode string in Python is a sequence of Unicode codepoints_. It is
true that a 32-bit integer is enough to represent any Unicode
codepoint: \u0000...\U0010FFFF
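
A minimal sketch of the codepoint view, assuming Python 3.3 or later
(the variable names are only illustrative):

```python
import sys

# U+1F3A7 HEADPHONE lies outside the Basic Multilingual Plane.
headphones = "\U0001F3A7"

# The string holds exactly one codepoint on Python 3.3+.
code_point_count = len(headphones)
code_point = ord(headphones)

# The largest codepoint a str can hold on Python 3.3+.
max_code_point = sys.maxunicode

print(code_point_count, hex(code_point), hex(max_code_point))
```

The string is indexed by codepoint, whatever storage the interpreter
picks for it.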

That says *nothing* about how Unicode strings are represented
*internally* in Python. The representation may vary between versions
and build options, and may even depend on the content of a string at
runtime.
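
One way to observe this, assuming CPython 3.3 or later (the sizes are an
implementation detail; other interpreters may behave differently or not
support sys.getsizeof at all):

```python
import sys

# Three strings with the same number of codepoints but different
# maximum codepoint values.
ascii_text = "a" * 100            # all codepoints below U+0080
bmp_text = "\u0101" * 100         # codepoints above U+00FF, below U+10000
astral_text = "\U0001F3A7" * 100  # codepoints above U+FFFF

lengths = [len(s) for s in (ascii_text, bmp_text, astral_text)]
sizes = [sys.getsizeof(s) for s in (ascii_text, bmp_text, astral_text)]

# Same length in codepoints, different memory footprint (CPython detail).
print(lengths, sizes)
```

The abstraction (a sequence of codepoints) stays the same while the
storage changes underneath it.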

In the past, "narrow builds" could break this abstraction in some cases
(codepoints above U+FFFF were stored as surrogate pairs), which is why
Linux distributions used wide Python builds.


_A Unicode codepoint is not a Python concept_. It is defined by the
Unicode standard http://unicode.org Though instead of following its web
of self-referential definitions, I find it easier to learn from
examples such as http://codepoints.net/U+0041 (A) or
http://codepoints.net/U+1F3A7 (🎧)

_There is no such thing as 8-bit text_
http://www.joelonsoftware.com/articles/Unicode.html

If you insert a space after each byte (8 bits) in the input text, you
may get garbage, i.e., you can't assume that a character is a byte:

  $ echo "Hyvää yötä" | perl -pe's/.\K/ /g'
  H y v a � � � �   y � � t � �
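
Roughly the same point in Python, as a sketch that assumes the text is
encoded as UTF-8 (escapes are used so that the exact codepoints are
unambiguous):

```python
# "yötä" written with precomposed letters.
text = "y\u00f6t\u00e4"
encoded = text.encode("utf-8")

code_point_count = len(text)  # 4 codepoints
byte_count = len(encoded)     # 6 bytes: ö and ä take two bytes each

# Cutting in the middle of a multi-byte sequence yields garbage:
# the dangling lead byte becomes U+FFFD REPLACEMENT CHARACTER.
truncated = encoded[:2].decode("utf-8", errors="replace")

print(code_point_count, byte_count, ascii(truncated))
```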

In general, you can't assume that a character is a Unicode codepoint:

  $ echo "Hyvää yötä" | perl -C -pe's/.\K/ /g'
  H y v a ̈ ä   y ö t ä
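
A small stdlib sketch of why that happens: the same user-perceived
letter can be spelled with one codepoint or with two.

```python
import unicodedata

composed = "\u00e4"     # ä as a single precomposed codepoint
decomposed = "a\u0308"  # a followed by a combining diaeresis

# Both render as "ä", but they are different codepoint sequences.
same_string = composed == decomposed
composed_names = [unicodedata.name(c) for c in composed]
decomposed_names = [unicodedata.name(c) for c in decomposed]

print(len(composed), composed_names)
print(len(decomposed), decomposed_names)
```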

Extended grapheme clusters (user-perceived characters), matched by \X,
may be useful in this case:

  $ echo "Hyvää yötä" | perl -C -pe's/\X\K/ /g'
  H y v ä ä   y ö t ä

The \X pattern is supported by the `regex` module in Python but not by
the stdlib `re` module, i.e., you can't even iterate over characters (as
they are seen by a user) in Python using only the stdlib. The \w+
pattern is also broken for Unicode text
http://bugs.python.org/issue1693050 (it is fixed in the `regex` module),
i.e., you can't select a word in Unicode text using only the stdlib.
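
A rough stdlib-only sketch of both problems. The helper
approximate_clusters() is hypothetical and deliberately simplistic: it
only attaches combining marks with a non-zero combining class to the
preceding codepoint, which is *not* the full segmentation algorithm
from [1]. With the third-party `regex` module installed,
regex.findall(r"\X", word) would be the proper way to do this.

```python
import re
import unicodedata


def approximate_clusters(text):
    """Group each codepoint with the combining marks that follow it.

    A crude approximation of grapheme clusters, not UAX #29.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


# "yötä" written with combining diaereses (decomposed form).
word = "yo\u0308ta\u0308"

clusters = approximate_clusters(word)

# The stdlib \w does not match the combining marks, so the word is split.
stdlib_words = re.findall(r"\w+", word)

print(len(word), len(clusters), stdlib_words)
```

Here the six codepoints form four clusters, while re's \w+ splits the
single word into two pieces.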

\X alone is not enough in some cases, e.g., "“ch” may be considered a
grapheme cluster in Slovak, for processes such as collation" [1]
(sorting order). The `PyICU` module might be useful here.

Knowing about the concepts of Unicode normalization forms (NFC, NFKD,
etc.) http://unicode.org/reports/tr15/, Unicode text segmentation [1],
and the Unicode collation algorithm
http://www.unicode.org/reports/tr10/ is also useful if you want to work
with text.
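
For the normalization forms, the stdlib `unicodedata` module is enough.
A minimal sketch (the sample strings are only illustrative):

```python
import unicodedata

# NFC composes, NFD decomposes canonically equivalent sequences.
nfc = unicodedata.normalize("NFC", "a\u0308")  # a + combining diaeresis
nfd = unicodedata.normalize("NFD", "\u00e4")   # precomposed ä

# The compatibility forms (NFKC, NFKD) also replace compatibility
# characters such as the "fi" ligature U+FB01; NFC leaves it alone.
ligature = "\ufb01"
ligature_nfc = unicodedata.normalize("NFC", ligature)
ligature_nfkd = unicodedata.normalize("NFKD", ligature)

print(ascii(nfc), ascii(nfd), ascii(ligature_nfc), ascii(ligature_nfkd))
```

Normalizing both sides to the same form before comparing strings avoids
treating the two spellings of "ä" as different text.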

[1]: http://www.unicode.org/reports/tr29/


--
akira