From: Akira Li <4kir4.1i@gmail.com>
Subject: Re: Python 3.2 has some deadly infection
Date: Fri, 06 Jun 2014 12:03:51 +0400
Newsgroups: comp.lang.python

Marko Rauhamaa writes:

> Steven D'Aprano :
>
>> Nevertheless, there are important abstractions that are written on top
>> of the bytes layer, and in the Unix and Linux world, the most
>> important abstraction is *text*. In the Unix world, text formats and
>> text processing is much more common in user-space apps than binary
>> processing.
>
> That linux text is not the same thing as Python's text. Conceptually,
> Python text is a sequence of 32-bit integers. Linux text is a sequence
> of 8-bit integers.

_A Unicode string in Python is a sequence of Unicode codepoints_.

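A minimal sketch of that statement, assuming Python 3.3 or later (where the narrow/wide build distinction no longer exists). The sample string is chosen only for illustration:

```python
# A str is indexed by code point, however the interpreter stores it.
s = "A\U0001F3A7"  # U+0041 followed by U+1F3A7 (a non-BMP code point)

code_points = [ord(ch) for ch in s]   # integer value of each code point
utf8 = s.encode("utf-8")              # one possible byte encoding

print(len(s))                                  # 2 code points
print([f"U+{cp:04X}" for cp in code_points])   # ['U+0041', 'U+1F3A7']
print(len(utf8))                               # 5 bytes (1 + 4)
```

The length of the string and the length of its UTF-8 encoding differ, and neither number says anything about the in-memory layout.
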
It is correct that a 32-bit integer is enough to represent any Unicode
codepoint: \u0000...\U0010FFFF

That says *nothing* about how Unicode strings are represented
*internally* in Python. The representation may vary with the version and
the build options, and may even depend on the content of a string at
runtime. In the past, "narrow builds" could break the abstraction in
some cases, which is why Linux distributions used wide Python builds.

_Unicode codepoint is not a Python concept_. There is the Unicode
standard http://unicode.org Though instead of following its web of
self-referential definitions, I find it easier to learn from examples
such as http://codepoints.net/U+0041 (A) or
http://codepoints.net/U+1F3A7 (🎧)

_There is no such thing as 8-bit text_
http://www.joelonsoftware.com/articles/Unicode.html

If you insert a space after each byte (8-bit) in the input text then you
may get garbage, i.e., you can't assume that a character is a byte:

  $ echo "Hyvää yötä" | perl -pe's/.\K/ /g'
  H y v a � � � � y � � t � �

In general, you can't assume that a character is a Unicode codepoint
either:

  $ echo "Hyvää yötä" | perl -C -pe's/.\K/ /g'
  H y v a ̈ ä y ö t ä

The eXtended grapheme clusters (user-perceived characters) may be useful
in this case:

  $ echo "Hyvää yötä" | perl -C -pe's/\X\K/ /g'
  H y v ä ä y ö t ä

The \X pattern is supported by the third-party `regex` module in Python
but not by the stdlib `re` module, i.e., you can't even iterate over
characters (as they are seen by a user) in Python using only the stdlib.

The \w+ pattern is also broken for Unicode text
http://bugs.python.org/issue1693050 (it is fixed in the `regex` module),
i.e., you can't select a word in Unicode text using only the stdlib.

\X alone is not enough in some cases, e.g., "“ch” may be considered a
grapheme cluster in Slovak, for processes such as collation" [1]
(sorting order). The `PyICU` module might be useful here.

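The three perl one-liners above have a rough stdlib-only counterpart in Python. This is a sketch under stated assumptions: the string literal is my reading of the perl output (the first "ä" decomposed as a + U+0308, the other letters precomposed), and `approx_clusters` is a hypothetical helper that only attaches combining marks to the preceding code point. It is *not* the UAX #29 algorithm behind \X and will mis-split emoji sequences, Hangul jamo and the like:

```python
import unicodedata

# Assumed input: first "ä" is a + U+0308, the rest are precomposed.
text = "Hyva\u0308\u00e4 y\u00f6t\u00e4"


def approx_clusters(s):
    """Group each code point with the combining marks that follow it.

    Crude approximation of user-perceived characters; not UAX #29.
    """
    clusters = []
    for ch in s:
        # combining() returns a non-zero class for combining marks
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


print(len(text.encode("utf-8")))   # 15 bytes
print(len(text))                   # 11 code points
print(len(approx_clusters(text)))  # 10 approximate clusters
```

Three different lengths for the same ten user-perceived characters. For anything beyond a demonstration, \X from the `regex` module is the safer route.
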
Knowing about Unicode normalization forms (NFC, NFKD, etc.)
http://unicode.org/reports/tr15/, Unicode text segmentation [1], and the
Unicode collation algorithm http://www.unicode.org/reports/tr10/ is also
useful if you want to work with text.

[1]: http://www.unicode.org/reports/tr29/

--
akira