MIME-Version: 1.0
In-Reply-To: <51f1e371$0$29971$c3e8da3$5496439d@news.astraweb.com>
References: <mailman.4618.1373613834.3114.python-list@python.org> <571a6dfe-fd66-42cf-92fc-8b97cbe6e9e4@googlegroups.com> <51DFDE65.5040001@Gmail.com> <CAN1F8qUFP3uX57HhiiUPaYqO3h_HiT8Q_YD=vCYky3EAWsdE7Q@mail.gmail.com> <mailman.4666.1373670835.3114.python-list@python.org> <4f1067f6-bc99-42ad-9166-37fb228b90e8@googlegroups.com> <mailman.5094.1374759404.3114.python-list@python.org> <51f14395$0$29971$c3e8da3$5496439d@news.astraweb.com> <mailman.5106.1374766576.3114.python-list@python.org> <51f15e03$0$29971$c3e8da3$5496439d@news.astraweb.com> <mailman.5121.1374785646.3114.python-list@python.org> <51f1e371$0$29971$c3e8da3$5496439d@news.astraweb.com>
From: Ian Kelly <ian.g.kelly@gmail.com>
Date: Thu, 25 Jul 2013 21:20:45 -0600
Subject: Re: RE Module Performance
To: Python <python-list@python.org>
Content-Type: text/plain; charset=ISO-8859-1
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.5129.1374808894.3114.python-list@python.org>
Lines: 24
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:51277

On Thu, Jul 25, 2013 at 8:48 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> UTF-8 uses a flexible representation on a character-by-character basis.
> When parsing UTF-8, one needs to look at EVERY character to decide how
> many bytes you need to read. In Python 3, the flexible representation is
> on a string-by-string basis: once Python has looked at the string header,
> it can tell whether the *entire* string takes 1, 2 or 4 bytes per
> character, and the string is then fixed-width. You can't do that with
> UTF-8.

UTF-8 does not use a flexible representation.  A codec that is
encoding a string in UTF-8 and examining a particular character does
not have any choice of how to encode that character; there is exactly
one sequence of bits that is the UTF-8 encoding for the character.
Further, for any given sequence of code points there is exactly one
sequence of bytes that is the UTF-8 encoding of those code points.  In
contrast, with the FSR there are as many as three different sequences
of bytes (one, two or four bytes per code point) that can encode a
given sequence of code points, with one of them (the shortest) being
canonical.  That's what makes it flexible.
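
To make the distinction concrete, here is a rough sketch in Python.  It
is only an illustration: the helper name fixed_width_encodings is made
up for this example, and the latin-1, utf-16-le and utf-32-le codecs
merely stand in for the 1-, 2- and 4-byte layouts; they are not
CPython's actual internal storage.  It also assumes the text contains
no lone surrogates.

```python
def fixed_width_encodings(text):
    """Return {bytes_per_code_point: encoded bytes} for every width that fits.

    Hypothetical helper: the codecs approximate fixed-width layouts and
    are not how CPython really stores strings.
    """
    highest = max(map(ord, text), default=0)
    result = {4: text.encode("utf-32-le")}
    if highest < 0x10000:
        # Within the BMP, UTF-16 uses exactly two bytes per code point.
        result[2] = text.encode("utf-16-le")
    if highest < 0x100:
        # Latin-1 maps code points 0..255 to one byte each.
        result[1] = text.encode("latin-1")
    return result


text = "abc"

# UTF-8: the codec has no choice; this is the only valid encoding.
utf8 = text.encode("utf-8")

# Fixed-width layouts: up to three byte sequences for the same code points.
candidates = fixed_width_encodings(text)
for width in sorted(candidates):
    print(width, len(candidates[width]), candidates[width])

# The shortest candidate plays the role of the canonical representation.
canonical = min(candidates.values(), key=len)
```

For "abc" all three widths are available; a string containing a code
point at or above U+0100 loses the 1-byte option, and one containing a
code point above U+FFFF is left with the 4-byte option only.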

Anyway, my point was just that Emacs is not a counter-example to jmf's
claim about implementing text editors, because UTF-8 is not what he
(or anybody else) is referring to when speaking of the FSR or
"something like the FSR".