Path: csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail
From: Chris Angelico <rosuav@gmail.com>
Newsgroups: comp.lang.python
Subject: Re: Pyhon 2.x or 3.x, which is faster?
Date: Thu, 10 Mar 2016 18:55:52 +1100
Lines: 30
Message-ID: <mailman.111.1457596555.15725.python-list@python.org>

On Thu, Mar 10, 2016 at 6:30 PM, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
>>  From what I've seen, a lot of software can't get [Unicode] right anyway.
>>
>
> Are you referring to PEP393 having taken notice of the RUE?

Even with PEP 393, there's no guarantee that a Python program will get
Unicode right. The bytes/text split in Python 3 is a huge help, but
proper handling of the entire Unicode range implies more than simply
being able to represent all characters (although that's a critical
prerequisite). There are design considerations with case folding (tip:
it's easiest and safest to be case sensitive), collation/sorting (tip:
it's impossible to be perfect unless you know which language is
involved), text directionality (you probably know that Arabic is
written right-to-left, but are you aware that there are also
characters with "weak" directionality, distinct from those with
"neutral" directionality?) and so on, plus a bunch of relatively
straightforward coding considerations (e.g. comparing two strings for
equality generally requires NFC/NFD normalization, and might require
NFKC/NFKD), which a number of programs still don't get right. PEP 393
actually isn't very much about correctness; a "wide build" of pre-3.3
Python has the correct behaviour, but is wasteful with memory. By
removing the temptation to conserve memory using UTF-16, PEP 393 did
improve correctness on Windows, but its main focus is on memory
efficiency (and thus performance, thanks to cache locality).
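
To make a few of those points concrete, here is a minimal sketch using
only the standard library's unicodedata module. The helper name
same_text is made up for illustration, and the sample strings are my
own picks. It shows normalization before comparison, the difference
between lower() and casefold(), and the bidirectional class of a few
characters ("L" and "AL" are strong, "EN" is weak, "WS" is neutral).

```python
import unicodedata


def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form.

    Hypothetical helper, not part of any library.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Raw comparison sees two different code point sequences.
print(composed == decomposed)
# After NFC (or NFD) normalization they compare equal.
print(same_text(composed, decomposed))

# Compatibility forms also fold things like ligatures.
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI
print(same_text(ligature, "fi"))           # canonical form only
print(same_text(ligature, "fi", "NFKC"))   # compatibility form

# Case-insensitive matching needs casefold(), not lower().
print("Stra\u00dfe".lower() == "strasse")
print("Stra\u00dfe".casefold() == "strasse")

# Bidirectional class: Latin letter, Arabic letter, digit, space.
for ch in ("a", "\u0627", "1", " "):
    print(repr(ch), unicodedata.bidirectional(ch))
```

None of this touches collation, which needs locale data (or something
like ICU) that the standard library does not carry in full.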

But hey. Just being able to represent all characters is probably about
95% of Unicode correctness. The rest is the little stuff.
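
For the memory side of PEP 393, a rough sketch, assuming CPython 3.3 or
later (sys.getsizeof is implementation-specific, so the exact byte
counts will vary and I'm not quoting any). Every string has the same
length in code points, but the per-character storage grows with the
widest character present.

```python
import sys

# Three strings of equal length whose widest character needs
# 1, 2 and 4 bytes respectively under PEP 393's flexible representation.
samples = {
    "ASCII": "a" * 1000,
    "BMP": "\u20ac" * 1000,          # EURO SIGN
    "astral": "\U0001F600" * 1000,   # GRINNING FACE, outside the BMP
}

for label, text in samples.items():
    # len() counts code points; getsizeof() reports the object size in bytes.
    print(label, len(text), sys.getsizeof(text))
```

On a pre-3.3 narrow build the astral string would have reported a
length of 2000, which is the correctness problem; on a wide build all
three would have cost four bytes per character, which is the waste.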

ChrisA