From: Chris Angelico
Newsgroups: comp.lang.python
Subject: Re: The Cost of Dynamism (was Re: Pyhon 2.x or 3.x, which is faster?)
Date: Sat, 12 Mar 2016 23:52:13 +1100

On Sat, Mar 12, 2016 at 10:50 PM, BartC wrote:
> On 12/03/2016 02:20, Chris Angelico wrote:
>>
>> On Sat, Mar 12, 2016 at 12:16 PM, BartC wrote:
>
>>> 'Switch' testing benchmark. The little program shown below reads a text
>>> file (I used the entire CPython C sources, 6MB), and counts the number of
>>> characters of each category in upper, lower, digit and other.
>>>
>>> (Note there are other ways to approach this task, but a proper 'lexer'
>>> usually does more than count. 'Switch' then becomes invaluable.)
>>
>> Are you assuming that the files are entirely ASCII? (They're not.) Or
>> are you simply declaring that all non-ASCII characters count as
>> "other"?
>
>> Once again, you cannot ignore Unicode and pretend that everything's
>> ASCII, or eight-bit characters, or something. Asking if a character is
>> upper/lower/digit/other is best done with the unicodedata module.
>
> If you're looking at fast processing of language source code (in a thread
> partly about efficiency), then you cannot ignore the fact that the vast
> majority of characters being processed are going to have ASCII codes.
>
> Language syntax could anyway stipulate that certain tokens can only consist
> of characters within the ASCII range.
>
> So I'm not ignoring Unicode, but being realistic.
>
> (My benchmark was anyway just demonstrating a possible use for 'switch' that
> more or less matched your own example!)

Generally, languages these days are built using ASCII tokens, because
they can be dependably typed on all keyboards. But there's no
requirement for that, and I understand there's a Chinese Python that
has all the language keywords translated. And identifiers can - and
most definitely SHOULD - be defined in terms of Unicode characters and
their types. So ultimately, the lexer needs to be Unicode-aware. But
in terms of efficiency, yes, you can't ignore that most files will be
all-ASCII.
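
As a rough illustration of that trade-off (this is not BartC's benchmark,
and not how CPython's own tokenizer works), a character classifier can
take a cheap ASCII path first and fall back to the unicodedata module for
everything else. The names classify and count_categories are made up for
this sketch:

```python
import unicodedata
from collections import Counter


def classify(ch):
    """Return 'upper', 'lower', 'digit' or 'other' for one character."""
    if ch < "\x80":
        # ASCII fast path: plain range checks, no database lookup.
        if "a" <= ch <= "z":
            return "lower"
        if "A" <= ch <= "Z":
            return "upper"
        if "0" <= ch <= "9":
            return "digit"
        return "other"
    # Non-ASCII: ask the Unicode Character Database for the general category.
    category = unicodedata.category(ch)
    if category == "Lu":
        return "upper"
    if category == "Ll":
        return "lower"
    if category == "Nd":
        return "digit"
    return "other"


def count_categories(text):
    """Count how many characters of the text fall into each category."""
    return Counter(classify(ch) for ch in text)


# Three ASCII characters, a space, then U+00E9, U+0416 and U+0663.
counts = count_categories("Ab1 \u00e9\u0416\u0663")
print(counts)
```

The ASCII branch handles the bulk of typical source text, while the
accented letter, the Cyrillic capital and the Arabic-Indic digit still
land in lower, upper and digit rather than being lumped into "other".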
And since 3.3, Python has had an optimization for all-ASCII strings
(the flexible string representation of PEP 393, which stores them at
one byte per character). So the performance question isn't ignored -
but it's an invisible optimization within a clearly-defined semantic,
namely that Python source code is a sequence of Unicode characters.

ChrisA
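
One hedged way to observe that optimization, assuming CPython 3.3 or
later, is to compare sys.getsizeof() for strings of equal length. The
exact byte counts are an implementation detail and differ between
versions, so this sketch only prints them; the variable names are
illustrative:

```python
import sys

# Same length, different widest code point (CPython-specific behaviour).
ascii_text = "x" * 1000             # every code point below 128
bmp_text = "\u4e2d" * 1000          # a CJK character inside the BMP
astral_text = "\U0001F600" * 1000   # a code point outside the BMP

for label, text in [
    ("ascii", ascii_text),
    ("bmp", bmp_text),
    ("astral", astral_text),
]:
    # len() is identical for all three; only the storage size changes.
    print(label, len(text), sys.getsizeof(text))
```

All three strings report the same length, which is the "clearly-defined
semantic" part: the storage width is chosen behind the scenes and never
changes what the program sees.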