Path: csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail
From: Chris Angelico <rosuav@gmail.com>
Newsgroups: comp.lang.python
Subject: Re: The Cost of Dynamism (was Re: Pyhon 2.x or 3.x, which is faster?)
Date: Sun, 13 Mar 2016 00:40:56 +1100
Lines: 66
Message-ID: <mailman.39.1457790065.12893.python-list@python.org>
References: <mailman.8.1457732171.12893.python-list@python.org> <nbvqg5$3cm$1@dont-email.me> <mailman.23.1457749230.12893.python-list@python.org> <nc0vka$c5b$1@dont-email.me> <87oaajgahd.fsf@elektro.pacujo.net> <nc14on$u9c$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
In-Reply-To: <nc14on$u9c$1@dont-email.me>
Precedence: list
Xref: csiph.com comp.lang.python:104703

On Sun, Mar 13, 2016 at 12:18 AM, BartC <bc@freeuk.com> wrote:
> On 12/03/2016 12:13, Marko Rauhamaa wrote:
>>
>> BartC <bc@freeuk.com>:
>>
>>> If you're looking at fast processing of language source code (in a
>>> thread partly about efficiency), then you cannot ignore the fact that
>>> the vast majority of characters being processed are going to have
>>> ASCII codes.
>>
>>
>> I don't know why you would optimize for inputting program source code.
>> Text in general has left ASCII behind a long time ago. Just go to
>> Wikipedia and click on any of the other languages.
>>
>> Why, look at the *English* page on Hillary Clinton:
>>
>>     Hillary Diane Rodham Clinton /ˈhɪləri daɪˈæn ˈrɒdəm ˈklɪntən/ (born
>>     October 26, 1947) is an American politician.
>>     <URL: https://en.wikipedia.org/wiki/Hillary_Clinton>
>>
>> You couldn't get past the first sentence in ASCII.
>
>
> I saved that page locally as a .htm file in UTF-8 encoding. I ran a modified
> version of my benchmark, and it appeared that 99.7% of the bytes had ASCII
> codes. The other 0.3% presumably were multi-byte sequences, so that the
> actual proportion of Unicode characters would be even less.
>
> I then saved the Arabic version of the page, which visually, when rendered,
> consists of 99% Arabic script. But the .htm file was still 80% ASCII!
>
> So what were you saying about ASCII being practically obsolete ... ?

Now take the same file and save it as plain text. See how much smaller
it is. If you then take that text and embed it in a 10GB file
consisting of nothing but byte value 246, it will be plainly obvious
that ASCII is almost completely obsolete, and that we should optimize
our code for byte 246. Or maybe, all you've proven is that *the
framing around the text* is entirely ASCII, which makes sense, since
HTML is trying to be compatible with a wide range of messy encodings
(many of them eight-bit ASCII-compatible ones).
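
To make the framing point concrete, here is a minimal sketch that
measures the ASCII share of a page with and without its markup. The
sample page, the TextOnly class and the two helper functions are all
made up for illustration; this is not BartC's benchmark, and a real
saved Wikipedia page would also carry scripts, styles and attributes
that push the ASCII share even higher.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect only the text between tags, discarding the markup."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def ascii_byte_share(s):
    """Fraction of the UTF-8 encoded bytes that are below 0x80."""
    raw = s.encode("utf-8")
    return sum(b < 0x80 for b in raw) / len(raw)


def ascii_char_share(s):
    """Fraction of the code points that are below U+0080."""
    return sum(ord(c) < 0x80 for c in s) / len(s)


# Hypothetical sample: a tiny Arabic page wrapped in ASCII-only markup.
page = (
    '<html><head><title>مرحبا</title></head>'
    '<body><p class="lead">مرحبا بالعالم</p></body></html>'
)

parser = TextOnly()
parser.feed(page)
parser.close()
text = "".join(parser.parts)

print(f"whole page, ASCII bytes: {ascii_byte_share(page):.1%}")
print(f"text only,  ASCII bytes: {ascii_byte_share(text):.1%}")
print(f"text only,  ASCII chars: {ascii_char_share(text):.1%}")
```

Counting bytes of the whole file mostly measures the markup; counting
characters of the extracted text is closer to the question that was
actually being argued.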

The text itself may also consist primarily of ASCII characters, but
that's a separate point. In the Arabic version, that is far less
likely to be true (there'll still be a good number of ASCII characters
in it, as U+0020 SPACE is heavily used in Arabic text, but a far
smaller percentage). But neither of those says that ASCII is
"practically obsolete", any more than you could say that the numbers
from 1 to 10 become obsolete once a child learns to count further than
that. The ASCII characters are an important part of the Unicode set;
you can't ignore the rest of Unicode, but you certainly can't ignore
ASCII, and there'll be very few pieces of human-language text which
include no ASCII characters whatsoever. That's why UTF-8 is so
successful; Chinese text framed in HTML is often more compact in UTF-8
than in UTF-16, even though many of its characters fit in a single
UTF-16 code unit but need three bytes in UTF-8.
However, once again, we have a sharp distinction: semantically, you
support all Unicode characters equally, but then you optimize for the
common ones.
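
The size comparison can be checked with a short sketch. The sample
string and the markup around it are assumptions chosen for
illustration, and "utf-16-le" is used so that no byte order mark is
counted. How the totals come out for a real page depends on the ratio
of markup to text.

```python
def encoded_sizes(s):
    """Return (UTF-8 byte count, UTF-16 byte count) for the string s."""
    return len(s.encode("utf-8")), len(s.encode("utf-16-le"))


# Hypothetical sample: five BMP characters, each one UTF-16 code unit
# (two bytes) but three bytes in UTF-8.
text = "你好，世界"

# The same text wrapped in ASCII-only markup (also hypothetical).
framed = '<html><body><p class="greeting">' + text + "</p></body></html>"

u8_bare, u16_bare = encoded_sizes(text)
u8_framed, u16_framed = encoded_sizes(framed)

print(f"bare text:   UTF-8 {u8_bare} bytes, UTF-16 {u16_bare} bytes")
print(f"framed text: UTF-8 {u8_framed} bytes, UTF-16 {u16_framed} bytes")
```

On its own the Chinese text is smaller in UTF-16, but every ASCII
character of markup costs one byte in UTF-8 and two in UTF-16, so the
framing tips the total the other way.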

ChrisA