Path: csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail
From: Chris Angelico <rosuav@gmail.com>
Newsgroups: comp.lang.python
Subject: Re: The Cost of Dynamism (was Re: Pyhon 2.x or 3.x, which is faster?)
Date: Tue, 22 Mar 2016 02:45:59 +1100
Lines: 29
Message-ID: <mailman.443.1458575169.12893.python-list@python.org>
References: <mailman.8.1457732171.12893.python-list@python.org> <nbvqg5$3cm$1@dont-email.me> <mailman.23.1457749230.12893.python-list@python.org> <56e44258$0$1598$c3e8da3$5496439d@news.astraweb.com> <mailman.46.1457819135.12893.python-list@python.org> <8737rvxs89.fsf@elektro.pacujo.net> <nc2a7s$6dv$1@dont-email.me> <mailman.48.1457831488.12893.python-list@python.org> <nc4ffn$f6s$1@dont-email.me> <56e7483d$0$1608$c3e8da3$5496439d@news.astraweb.com> <nc7kjr$jj8$1@dont-email.me> <ncnhp9$tt7$1@dont-email.me> <mailman.426.1458526919.12893.python-list@python.org> <ncopi1$vgc$1@dont-email.me> <mailman.438.1458565186.12893.python-list@python.org> <56effbc1$0$1622$c3e8da3$5496439d@news.astraweb.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
In-Reply-To: <56effbc1$0$1622$c3e8da3$5496439d@news.astraweb.com>
Precedence: list
Xref: csiph.com comp.lang.python:105365

On Tue, Mar 22, 2016 at 12:48 AM, Steven D'Aprano <steve@pearwood.info> wrote:
> On Mon, 21 Mar 2016 11:59 pm, Chris Angelico wrote:
>
>> On Mon, Mar 21, 2016 at 11:34 PM, BartC <bc@freeuk.com> wrote:
>>> For Python I would have used a table of 0..255 functions, indexed by the
>>> ord() code of each character. So all 52 letter codes map to the same
>>> name-handling function. (No Dict is needed at this point.)
>>>
>>
>> Once again, you forget that there are not 256 characters - there are
>> 1114112. (Give or take.)
>
> Pardon me, do I understand you correctly? You're saying that the C parser is
> Unicode-aware and allows you to use Unicode in C source code? Because
> Bart's test is for a (simplified?) C tokeniser, and expecting his tokeniser
> to support character sets that C does not would be, well, Not Cricket, my
> good chap.

We nutted part of this out earlier in the thread; Python 3.x code is,
and any other modern language should be, defined to have Unicode
source. (And yes, MRAB, I'm aware that only a tiny fraction of
codepoints are defined; it's still a lot more than 256, and is going
to make for a far larger lookup table.) While you could plausibly
define that your source code consists only of printable ASCII
characters plus tab, LF and CR (i.e. code points 9, 10, 13 and
32-126), it is an extremely bad idea to declare that it has 256
possibilities - you're shackling your language to a parser
definition that includes both more and less than people will expect.
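
A rough sketch of one way to keep a table without pretending there are
only 256 characters: index the table for ASCII only, and ask the str
methods about everything else. All the names here (ASCII_TABLE,
classify, the handle_* functions) are made up for illustration; this
is not Bart's tokeniser, nor how CPython's own tokeniser does it.

```python
import string
import sys

# Hypothetical handlers; a real tokeniser would consume input here.
def handle_name(ch):
    return "name"

def handle_digit(ch):
    return "digit"

def handle_space(ch):
    return "space"

def handle_other(ch):
    return "other"

# The table covers ASCII only (128 entries), not the full code point range.
ASCII_TABLE = [handle_other] * 128
for c in string.ascii_letters + "_":
    ASCII_TABLE[ord(c)] = handle_name
for c in string.digits:
    ASCII_TABLE[ord(c)] = handle_digit
for c in " \t\r\n":
    ASCII_TABLE[ord(c)] = handle_space

def classify(ch):
    """Dispatch on one character: table lookup for ASCII, str methods otherwise."""
    cp = ord(ch)
    if cp < 128:
        return ASCII_TABLE[cp](ch)
    # Non-ASCII: consult Python's Unicode data instead of growing the table.
    if ch.isidentifier():
        return handle_name(ch)
    if ch.isspace():
        return handle_space(ch)
    return handle_other(ch)

# Size of the code point range a full table would have to cover.
CODE_POINTS = sys.maxunicode + 1
```

The fast path for ASCII stays a plain list index, and nothing needs a
table with CODE_POINTS entries.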

ChrisA