Path: csiph.com!news.swapon.de!fu-berlin.de!uni-berlin.de!not-for-mail
From: Chris Angelico <rosuav@gmail.com>
Newsgroups: comp.lang.python
Subject: Re: How to waste computer memory?
Date: Sun, 20 Mar 2016 22:22:45 +1100
Lines: 42
Message-ID: <mailman.404.1458472974.12893.python-list@python.org>
References: <a2639027-c69c-46df-a7a5-45a677b9e01d@googlegroups.com> <265377f4-741d-4aa2-9338-239f56f8bc57@googlegroups.com> <mailman.302.1458284448.12893.python-list@python.org> <lf5y49gw5s9.fsf@ling.helsinki.fi> <mailman.327.1458313179.12893.python-list@python.org> <87twk3oli0.fsf@elektro.pacujo.net> <mailman.351.1458332168.12893.python-list@python.org> <87k2kzo5y5.fsf@elektro.pacujo.net> <mailman.353.1458335305.12893.python-list@python.org> <56ed0a71$0$1607$c3e8da3$5496439d@news.astraweb.com> <87lh5en79a.fsf@elektro.pacujo.net> <56ed68bb$0$1604$c3e8da3$5496439d@news.astraweb.com> <877fgylddm.fsf@elektro.pacujo.net> <56ed749e$0$1583$c3e8da3$5496439d@news.astraweb.com> <8737rmla4w.fsf@elektro.pacujo.net> <56ee2ebd$0$1597$c3e8da3$5496439d@news.astraweb.com> <12db8cba-8edf-4cd0-a91d-2f6b6634c9d3@googlegroups.com> <56ee8454$0$22142$c3e8da3$5496439d@news.astraweb.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
In-Reply-To: <56ee8454$0$22142$c3e8da3$5496439d@news.astraweb.com>
Precedence: list
Xref: csiph.com comp.lang.python:105299

On Sun, Mar 20, 2016 at 10:06 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> The Unicode standard does not, as far as I am aware, care how you represent
> code points in memory, only that there are 0x110000 of them, numbered from
> U+0000 to U+10FFFF. That's what I mean by abstract. The obvious
> implementation is to use 32-bit integers, where 0x00000000 represents code
> point U+0000, 0x00000001 represents U+0001, and so forth. This is
> essentially equivalent to UTF-16, but it's not mandated or specified by the
> Unicode standard, you could, if you choose, use something else.

(UTF-32)

Code points aren't defined in terms of *memory* at all; they are, by
definition, just integers in the range 0 to 0x10FFFF. If you choose to
represent those integers as little-endian 32-bit values, then yes, the
layout in memory will look like UTF-32LE, but that's because UTF-32LE
is defined in this extremely simple way. In fact, that's exactly how
the layers work - Unicode defines a mapping of characters to code
points, and then UTF-x defines a mapping of code points to bytes.
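
As a rough sketch of those two layers in Python 3 (the sample string and
the variable names are just mine, picked for illustration):

```python
import struct

# Three characters: U+0041, U+20AC, U+1F600 (arbitrary sample text).
text = "A\u20ac\U0001F600"

# Layer 1: characters -> code points, which are plain integers.
code_points = [ord(ch) for ch in text]

# Layer 2: code points -> bytes, via whichever UTF you pick.
utf32le = text.encode("utf-32-le")
utf8 = text.encode("utf-8")

# UTF-32LE is nothing more than each code point packed as a
# little-endian unsigned 32-bit integer.
manual = b"".join(struct.pack("<I", cp) for cp in code_points)

print([hex(cp) for cp in code_points])
print(utf32le == manual)
print(utf8.hex())
```

The same list of integers comes out of ord() no matter which encoding
you later choose; only the second step differs between UTF-8, UTF-16
and UTF-32.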

> On the other hand, I believe that the output of the UTF transformations is
> explicitly described in terms of 8-bit bytes and 16- or 32-bit words. For
> instance, the UTF-8 encoding of "A" has to be a single byte with value 0x41
> (decimal 65). It isn't that this is the most obvious implementation, it's
> that it can't be anything else and still be UTF-8.

Exactly. Aside from the way UTF-16 and UTF-32 have LE and BE variants,
there is only one bit pattern for any given sequence of code points in
a given UTF-x (so if you work with e.g. "UTF-16LE", there's only one). This is
no accident. Unlike some encodings, in which there's a "one most
obvious" way to encode things but then a number of other legal ways,
UTF-x can be compared for equality [1] using simple byte-for-byte
comparisons. This means you don't have to worry about someone sneaking
a magic character past your filter; if you're checking a UTF-8 stream
for the character U+003C LESS-THAN SIGN, the only byte value to look
for is 0x3C - the overlong sequence 0xC0 0xBC, despite mathematically
decoding to the number 0x3C, is explicitly forbidden.
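
You can poke at this from Python 3. A minimal sketch, assuming CPython's
built-in strict UTF-8 decoder (decodes_as_utf8 is just a helper name I
made up for the demo):

```python
def decodes_as_utf8(data):
    """Return True if data is valid UTF-8 under the strict decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

canonical = "<".encode("utf-8")   # the one legal encoding of U+003C
overlong = b"\xc0\xbc"            # two-byte pattern carrying the same bits

# Pull the payload bits out by hand: 110xxxxx 10yyyyyy -> xxxxxyyyyyy
payload = ((overlong[0] & 0x1F) << 6) | (overlong[1] & 0x3F)

print(canonical)                   # the single byte 0x3C
print(hex(payload))                # arithmetically, still 0x3c
print(decodes_as_utf8(canonical))  # accepted
print(decodes_as_utf8(overlong))   # rejected as invalid UTF-8
```

So a filter that scans the raw bytes for 0x3C is enough, provided
whatever consumes the stream afterwards is a conforming decoder that
rejects the overlong form instead of quietly accepting it.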

ChrisA

[1] Though not ordering - lexical sorting doesn't follow codepoint
order, and codepoint order won't always match byte order (it does in
UTF-8, but not in UTF-16 once surrogate pairs get involved). But
equality is easy.
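
To illustrate the footnote, here is a small Python 3 sketch. The two
code points are ones I picked because they straddle the surrogate
range; nothing else is special about them:

```python
a = "\uff61"       # U+FF61, inside the BMP
b = "\U00010000"   # U+10000, first code point outside the BMP

# Python 3 compares str objects by code point.
by_code_point = a < b

# UTF-16BE: U+10000 becomes the surrogate pair D800 DC00, whose bytes
# sort *below* FF61, so byte order disagrees with codepoint order.
a16, b16 = a.encode("utf-16-be"), b.encode("utf-16-be")
by_utf16_bytes = a16 < b16

# UTF-8: byte order agrees with codepoint order.
a8, b8 = a.encode("utf-8"), b.encode("utf-8")
by_utf8_bytes = a8 < b8

print(by_code_point, by_utf16_bytes, by_utf8_bytes)

# Equality, on the other hand, agrees everywhere.
print((a == b) == (a16 == b16) == (a8 == b8))
```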