From: Marko Rauhamaa <marko@pacujo.net>
Newsgroups: comp.lang.python
Subject: Re: Python 3.2 has some deadly infection
Date: Thu, 05 Jun 2014 19:52:22 +0300
Organization: A noiseless patient Spider
Lines: 45
Message-ID: <87ha3zti2h.fsf@elektro.pacujo.net>

Steven D'Aprano <steve+comp.lang.python@pearwood.info>:

> Nevertheless, there are important abstractions that are written on top
> of the bytes layer, and in the Unix and Linux world, the most
> important abstraction is *text*. In the Unix world, text formats and
> text processing is much more common in user-space apps than binary
> processing.

That Linux text is not the same thing as Python's text. Conceptually,
Python text is a sequence of Unicode code points (think 32-bit
integers). Linux text is a sequence of 8-bit integers.
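
To make the distinction concrete, here is a minimal Python 3 sketch (my
own illustration; the variable names are arbitrary) that looks at the
same Finnish phrase both ways:

```python
# "Hyvää yötä", written with escapes so the code points are unambiguous
text = "Hyv\u00e4\u00e4 y\u00f6t\u00e4"

# Python's view: one integer per character (a Unicode code point)
code_points = [ord(ch) for ch in text]

# The view of a file or a pipe: one integer per UTF-8 byte
octets = list(text.encode("utf-8"))

print(len(code_points), code_points)
print(len(octets), octets)
```

The two lists differ in length because each of the four non-ASCII
letters takes two bytes in UTF-8.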

It is great that lots of computer-to-computer formats are encoded in
ASCII (~ UTF-8). However, nowhere in Linux is there a real abstraction
layer that processes Python-esque text.

Case in point:

   $ env | grep UTF
   LANG=en_US.UTF-8
   $ od -c <<<"Hyvää yötä"     # "Good night" in Finnish
   0000000   H   y   v 303 244 303 244       y 303 266   t 303 244  \n
   0000017

The "od" utility is asked to display its input as characters. The locale
info gives a hint that all text data is in UTF-8. Yet what comes out is
bytes.
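
For comparison, a rough Python 3 imitation of that dump. The helper
od_c is hypothetical and only approximates the cell format of "od -c"
(printable ASCII as is, newline as \n, everything else in octal):

```python
def od_c(data):
    """Render each byte roughly the way `od -c` does (approximation)."""
    cells = []
    for b in data:
        if b == 0x0A:
            cells.append("\\n")            # newline shown as an escape
        elif 0x20 <= b < 0x7F:
            cells.append(chr(b))           # printable ASCII shown as is
        else:
            cells.append(format(b, "03o")) # anything else as 3-digit octal
    return cells

# The here-string appends a newline, so add one here as well
line = "Hyv\u00e4\u00e4 y\u00f6t\u00e4\n".encode("utf-8")
print(" ".join(od_c(line)))
```

Each "ä" shows up as the pair 303 244 and the "ö" as 303 266, just as
in the od output above.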

How about:

   $ wc -c <<<"Hyvää yötä"
   15
   $ tr 'ä' 'a' <<<"Hyvää yötä"
   Hyvaaaa ya�taa
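
The same results can be reproduced in Python 3 by working on bytes
instead of str. This is only a sketch: it assumes that tr maps byte to
byte and pads the shorter second set with its last character, which is
consistent with the output above but is my assumption, not something
taken from the tr source.

```python
data = "Hyv\u00e4\u00e4 y\u00f6t\u00e4\n".encode("utf-8")

# What "wc -c" counts: bytes, including the trailing newline
byte_count = len(data)

# Byte-wise translation: "ä" is the two bytes C3 A4, and the single
# "a" is padded to "aa" (assumed tr behavior), so both bytes map to "a"
set1 = "\u00e4".encode("utf-8")
set2 = b"a" * len(set1)
bytewise = data.translate(bytes.maketrans(set1, set2))

# Character-wise translation on Python text, for contrast
charwise = data.decode("utf-8").translate(str.maketrans("\u00e4", "a"))

print(byte_count)
print(bytewise)
print(charwise, end="")
```

In the byte-wise result the "ö" loses its lead byte C3 and keeps a
stray B6, which is the invalid byte the terminal renders as "�".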

Grep is smarter:

   $ grep v...y <<<"Hyvää yötä"
   Hyvää yötä

Its behavior thus depends on the locale, which is why you should always
prefix "grep" with LC_ALL=C in your scripts (that makes it far faster,
too).
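
Python's re module shows the same split, depending on whether the
pattern is applied to str or to bytes. This is an analogy using Python's
regex engine, not a model of grep's internals:

```python
import re

text = "Hyv\u00e4\u00e4 y\u00f6t\u00e4"
data = text.encode("utf-8")

# On str, "." matches one character: "ä", "ä", " " sit between v and y
char_match = re.search(r"v...y", text)

# On bytes, "." matches one byte: the three bytes after "v" are
# C3 A4 C3, and the next byte is A4, not "y", so nothing matches
byte_match = re.search(rb"v...y", data)

print(char_match.group(0) if char_match else None)
print(byte_match)
```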


Marko