From: Marko Rauhamaa
Newsgroups: comp.lang.python
Subject: Re: Python 3.2 has some deadly infection
Date: Thu, 05 Jun 2014 19:52:22 +0300
Message-ID: <87ha3zti2h.fsf@elektro.pacujo.net>

Steven D'Aprano:

> Nevertheless, there are important abstractions that are written on top
> of the bytes layer, and in the Unix and Linux world, the most
> important abstraction is *text*. In the Unix world, text formats and
> text processing is much more common in user-space apps than binary
> processing.

That Linux text is not the same thing as Python's text. Conceptually,
Python text is a sequence of Unicode code points (think 32-bit
integers). Linux text is a sequence of 8-bit integers (bytes).

It is great that lots of computer-to-computer formats are encoded in
ASCII (~ UTF-8). However, nowhere in Linux is there a real abstraction
layer that processes Python-esque text.
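
The two views of "text" can be put side by side in Python 3. This is only a minimal sketch: the variable names are made up for illustration, and it assumes the Finnish phrase is stored with precomposed characters (NFC) and encoded as UTF-8.

```python
# Illustrative sketch: one phrase, seen as Python text and as UTF-8 bytes.
# Escapes are used so the literal is precomposed (NFC) regardless of editor.
text = "Hyv\u00e4\u00e4 y\u00f6t\u00e4"   # "Hyvää yötä", a Python str
data = text.encode("utf-8")               # what a pipe or file actually carries

code_points = [ord(ch) for ch in text]    # one integer per character
byte_values = list(data)                  # one integer (0..255) per byte

print(len(text), "code points:", code_points)
print(len(data), "bytes:", byte_values)
```

Each of the four non-ASCII letters occupies one position in the str but two in the byte string, so the two lengths differ by four.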

Case in point:

    $ env | grep UTF
    LANG=en_US.UTF-8
    $ od -c <<<"Hyvää yötä"     # "Good night" in Finnish
    0000000   H   y   v 303 244 303 244       y 303 266   t 303 244  \n
    0000017

The "od" utility is asked to display its input as characters. The
locale info gives a hint that all text data is in UTF-8. Yet what comes
out is bytes.

How about:

    $ wc -c <<<"Hyvää yötä"
    15
    $ tr 'ä' 'a' <<<"Hyvää yötä"
    Hyvaaaa ya�taa

(Here "tr" maps the two bytes of 'ä' separately, so 'ö', which shares
its first byte, gets mangled.)

Grep is smarter and honors the locale:

    $ grep v...y <<<"Hyvää yötä"
    Hyvää yötä

which is why you should always prefix "grep" with LC_ALL=C in your
scripts (makes it far faster, too).


Marko