Date: Fri, 27 Feb 2015 00:24:03 +1100
Subject: Re: Newbie question about text encoding
From: Chris Angelico
Newsgroups: comp.lang.python

On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote:
> Wrote something up on why we should stop using ASCII:
> http://blog.languager.org/2015/02/universal-unicode.html

From that post:

"""
5.1 Gibberish

When going from the original 2-byte unicode (around version 3?)
to the one having supplemental planes, the unicode consortium added
blocks such as

* Egyptian hieroglyphs
* Cuneiform
* Shavian
* Deseret
* Mahjong
* Klingon

To me (a layman) it looks unprofessional – as though they are playing
games – that billions of computing devices, each having billions of
storage words should have their storage wasted on blocks such as
these.
"""

The shift from Unicode as a 16-bit code to having multiple planes came
in with Unicode 2.0, but the various blocks were assigned separately:

* Egyptian hieroglyphs: Unicode 5.2
* Cuneiform: Unicode 5.0
* Shavian: Unicode 4.0
* Deseret: Unicode 3.1
* Mahjong Tiles: Unicode 5.1
* Klingon: Not part of any current standard

However, I don't think historians will appreciate you calling all of
these "gibberish". To adequately describe and discuss old texts
without these Unicode blocks, we'd have to either do everything with
images, or craft some kind of reversible transliteration system and
have dedicated software to render the texts on screen. Instead, what
we have is a well-known and standardized system for transliterating
all of these into numbers (code points), and rendering them becomes a
simple matter of installing an appropriate font.

Also, how does assigning meanings to codepoints "waste storage"? As
soon as Unicode 2.0 hit and 16-bit code units stopped being
sufficient, everyone needed to allocate storage - either 32 bits per
character, or some other system - and the fact that some codepoints
were unassigned had absolutely no impact on that.

This is decidedly NOT unprofessional, and it's not wasteful either.

ChrisA
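
For anyone who wants to check the storage point for themselves, here is a minimal sketch in Python 3. The helper name encoded_sizes and the choice of sample characters are only for illustration, and it assumes U+50000 is still unassigned in the Unicode version you have (planes 4 through 13 had no assignments at the time of writing). It compares an assigned Deseret letter with an unassigned code point:

```python
def encoded_sizes(ch):
    """Return the number of bytes one character occupies in each encoding.

    The explicit little-endian codecs are used so that no byte-order mark
    is added to the count.
    """
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


# Illustrative sample characters (hypothetical selection, not from the post).
samples = {
    "U+0041 LATIN CAPITAL LETTER A": "\u0041",
    "U+10400 DESERET CAPITAL LETTER LONG I": "\U00010400",
    "U+50000 (assumed unassigned)": chr(0x50000),
}

for label, ch in samples.items():
    print(label, encoded_sizes(ch))
```

The two supplementary-plane entries should report the same sizes as each other, since the byte count depends only on the numeric value of the code point and the encoding, not on whether the standard has given that code point a meaning.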