Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder1.news.weretis.net!feeder.erje.net!1.eu.feeder.erje.net!newsfeed.xs4all.nl!newsfeed1a.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <55742e0e$0$12980$c3e8da3$5496439d@news.astraweb.com>
References: <555f440a$0$12990$c3e8da3$5496439d@news.astraweb.com> <mailman.222.1432309028.17265.python-list@python.org> <2212595.DFZ6OqehRn@PointedEars.de> <55607a1b$0$13011$c3e8da3$5496439d@news.astraweb.com> <2c4d029c-8ea5-465b-8adc-6c35185bd150@googlegroups.com> <2483375.eHyISxeWLQ@PointedEars.de> <55742e0e$0$12980$c3e8da3$5496439d@news.astraweb.com>
Date: Sun, 7 Jun 2015 22:08:06 +1000
Subject: Re: Ah Python, you have spoiled me for all other languages
From: Chris Angelico <rosuav@gmail.com>
Cc: "python-list@python.org" <python-list@python.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.243.1433678894.13271.python-list@python.org>
Lines: 59
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:92237

On Sun, Jun 7, 2015 at 9:42 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> My opinion is that a programming language like Python or ECMAScript should
> operate on *code points*. If we want to call them "characters" informally,
> that should be allowed, but whenever there is ambiguity we should remember
> we're dealing with code points. The implementation shouldn't matter:
> compliant Python interpreters might choose to use UTF-8 internally, or
> UTF-16, or UTF-32, or something else, and still agree on how many
> characters a string contains. Normalisation is still an issue, of course,
> but any decent Unicode implementation will include a way to normalise or
> denormalise strings.

If by "normalise" you mean the NF[K]C/NF[K]D composition and
decomposition, then yes, any decent Unicode library will provide that.
I'm not sure it's critical to string handling itself, though; and
Python defers the operation to the unicodedata module:

>>> import unicodedata
>>> s1 = "\N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}"
>>> s2 = "\N{LATIN SMALL LETTER A WITH ACUTE}"
>>> s1 == s2
False
>>> unicodedata.normalize("NFC", s1) == s2
True

It's a useful operation to be able to do, but I would never expect
that *string comparison* or other operations should automatically
normalize. (Unless you want to say that all strings are guaranteed to
be NFC/NFD normalized, such that s1 and s2 would actually be
identical, which I suppose is plausible. I'm not sure what the
advantage would be, though. And certainly you wouldn't want to
K-normalize strings automatically.)
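
To make the parenthetical concrete, here's a rough sketch of opt-in
normalized comparison next to what the compatibility (K) forms do. The
helper name nfc_equal is mine, purely for illustration; it isn't
anything Python provides.

```python
import unicodedata

def nfc_equal(a, b):
    """Hypothetical helper: compare two strings after NFC-normalizing both."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

s1 = "a\N{COMBINING ACUTE ACCENT}"
s2 = "\N{LATIN SMALL LETTER A WITH ACUTE}"

# Plain comparison works code point by code point, so these differ...
print(s1 == s2)
# ...while the explicit, opt-in comparison treats them as the same text.
print(nfc_equal(s1, s2))

# Compatibility normalization folds away distinctions that the
# canonical forms (NFC/NFD) leave alone.
ligature = "\N{LATIN SMALL LIGATURE FI}"
squared = "x\N{SUPERSCRIPT TWO}"
print(ascii(unicodedata.normalize("NFC", ligature)))   # unchanged
print(ascii(unicodedata.normalize("NFKC", ligature)))  # two plain letters
print(ascii(unicodedata.normalize("NFKC", squared)))   # superscript is lost
```

Once "x squared" has silently become "x2", there's no getting the
original back, which is why I wouldn't want the K forms applied behind
my back.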

> The question of graphemes (what "ordinary people" consider letters and
> characters, e.g. "ch" is two letters to an English speaker but one letter
> to a Czech speaker) should be left to libraries. It's a much harder problem
> to solve in the full general case, requires localisation, and is overkill
> for many string-processing tasks.

Yeah. The basic challenge to a beginning programmer, "reverse this
string", becomes rather tricky in the presence of natural language.

>>> s1 += "e"
>>> s1
'áe'
>>> s1[::-1]
'éa'

Oops.
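
For what it's worth, you can get a bit closer with nothing but
unicodedata. This is only a sketch, and the function name is made up:
it glues each combining mark (anything with a nonzero canonical
combining class) to the character before it, then reverses the
resulting chunks. It is *not* real grapheme cluster segmentation, so
things like ZWJ emoji sequences will still come out wrong.

```python
import unicodedata

def reverse_keeping_marks(s):
    """Reverse s, keeping combining marks attached to their base character.

    Hypothetical helper; handles only characters whose canonical
    combining class is nonzero, not full grapheme clusters.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            # Combining mark: attach it to the preceding cluster.
            clusters[-1] += ch
        else:
            # Base character (or a stray leading mark): start a new cluster.
            clusters.append(ch)
    return "".join(reversed(clusters))

s = "a\N{COMBINING ACUTE ACCENT}e"
print(ascii(s[::-1]))                    # accent migrates onto the "e"
print(ascii(reverse_keeping_marks(s)))   # accent stays with the "a"
```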

But hey. It's easier to understand what went wrong here than, say, if
you reverse the bytes in a UTF-8 stream. Or the code units in a UTF-16
stream. If you're lucky, those would give you instant errors... if
you're not, well, who knows.
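
Here's a small sketch of the "lucky" case in Python, where the decoder
is strict and complains straight away. The helper names are mine, and
other languages or more lenient decoders may well let the garbage
through instead.

```python
def reverse_utf8_bytes(s):
    """Encode to UTF-8 and reverse the raw bytes (deliberately wrong)."""
    return s.encode("utf-8")[::-1]

def reverse_utf16_units(s):
    """Encode to UTF-16-LE and reverse the 16-bit code units (also wrong)."""
    data = s.encode("utf-16-le")
    units = [data[i:i + 2] for i in range(0, len(data), 2)]
    return b"".join(reversed(units))

def decodes(blob, codec):
    """Return True if blob decodes cleanly with codec, else False."""
    try:
        blob.decode(codec)
    except UnicodeDecodeError as exc:
        print(codec, "refused:", exc.reason)
        return False
    return True

# Reversed UTF-8: a continuation byte now comes before its lead byte.
print(decodes(reverse_utf8_bytes("a\N{COMBINING ACUTE ACCENT}e"), "utf-8"))

# Reversed UTF-16: the surrogate pair for a non-BMP character is now
# low-then-high, which a strict decoder rejects.
print(decodes(reverse_utf16_units("a\N{GRINNING FACE}"), "utf-16-le"))
```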

ChrisA