From: Chris Angelico
Newsgroups: comp.lang.python
Subject: Re: The Cost of Dynamism (was Re: Pyhon 2.x or 3.x, which is faster?)
Date: Sun, 13 Mar 2016 00:40:56 +1100

On Sun, Mar 13, 2016 at 12:18 AM, BartC wrote:
> On 12/03/2016 12:13, Marko Rauhamaa wrote:
>>
>> BartC :
>>
>>> If you're looking at fast processing of language source code (in a
>>> thread partly about efficiency), then you cannot ignore the fact that
>>> the vast majority of characters being processed are going to have
>>> ASCII codes.
>>
>>
>> I don't know why you would optimize for inputting program source code.
>> Text in general has left ASCII behind a long time ago. Just go to
>> Wikipedia and click on any of the other languages.
>>
>> Why, look at the *English* page on Hillary Clinton:
>>
>>    Hillary Diane Rodham Clinton /ˈhɪləri daɪˈæn ˈrɒdəm ˈklɪntən/ (born
>>    October 26, 1947) is an American politician.
>>
>>
>> You couldn't get past the first sentence in ASCII.
>
>
> I saved that page locally as a .htm file in UTF-8 encoding. I ran a modified
> version of my benchmark, and it appeared that 99.7% of the bytes had ASCII
> codes. The other 0.3% presumably were multi-byte sequences, so that the
> actual proportion of Unicode characters would be even less.
>
> I then saved the Arabic version of the page, which visually, when rendered,
> consists of 99% Arabic script. But the .htm file was still 80% ASCII!
>
> So what were you saying about ASCII being practically obsolete ... ?

Now take the same file and save it as plain text. See how much smaller it is.
If you then take that text and embed it in a 10GB file consisting of
nothing but byte value 246, it will be plainly obvious that ASCII is
almost completely obsolete, and that we should optimize our code for
byte 246.

Or maybe, all you've proven is that *the framing around the text* is
entirely ASCII, which makes sense, since HTML is trying to be
compatible with a wide range of messy encodings (many of them
eight-bit ASCII-compatible ones). The text itself may also consist
primarily of ASCII characters, but that's a separate point. In the
Arabic version, that is far less likely to be true (there'll still be
a good number of ASCII characters in it, as U+0020 SPACE is heavily
used in Arabic text, but a far smaller percentage).

But neither of those says that ASCII is "practically obsolete", any
more than you could say that the numbers from 1 to 10 become obsolete
once a child learns to count further than that. The ASCII characters
are an important part of the Unicode set; you can't ignore the rest of
Unicode, but you certainly can't ignore ASCII, and there'll be very
few pieces of human-language text which include no ASCII characters
whatsoever.

That's why UTF-8 is so successful; even Chinese text is often more
compact in UTF-8 than in UTF-16 (despite many characters fitting into
a single UTF-16 code unit, but requiring three bytes in UTF-8), when
framed in HTML. However, once again, we have a sharp distinction:
semantically, you support all Unicode characters equally, but then you
optimize for the common ones.

ChrisA
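
For anyone who wants to poke at the numbers, here is a minimal sketch of the three measurements discussed above: the share of ASCII bytes in a UTF-8 file, the UTF-8 versus UTF-16 size of Chinese text with and without HTML framing, and the "pad it with byte 246" reductio. The sample sentence, the markup around it, and the helper names `ascii_fraction` and `encoded_sizes` are all made up for illustration; they are not taken from BartC's benchmark or from any real Wikipedia page, and a 10,000-byte pad stands in for the 10GB file.

```python
def ascii_fraction(data: bytes) -> float:
    """Return the share of bytes in `data` that are below 0x80 (ASCII)."""
    if not data:
        return 0.0
    return sum(b < 0x80 for b in data) / len(data)


def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 size, UTF-16 size) in bytes, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Hypothetical sample: a short Chinese sentence, bare and wrapped in markup.
bare = "你好，世界。这是一个测试。"
framed = '<div class="mw-body-content"><p lang="zh">' + bare + "</p></div>"

for label, text in (("bare", bare), ("framed", framed)):
    utf8_size, utf16_size = encoded_sizes(text)
    share = ascii_fraction(text.encode("utf-8"))
    print(f"{label}: utf-8={utf8_size} utf-16={utf16_size} ascii bytes={share:.0%}")

# The reductio: bury the same document in filler and ASCII "vanishes".
padded = framed.encode("utf-8") + bytes([246]) * 10_000
print(f"padded with byte 246: ascii bytes={ascii_fraction(padded):.1%}")
```

With this particular sample, the bare sentence is smaller in UTF-16 and the framed one is smaller in UTF-8, which is the framing effect in miniature. Real pages carry far more markup than this, so the exact percentages will differ.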