From: Chris Angelico
Newsgroups: comp.lang.python
Subject: Re: Unicode normalisation [was Re: [beginner] What's wrong?]
Date: Sat, 9 Apr 2016 03:50:16 +1000

On Sat, Apr 9, 2016 at 3:44 AM, Marko Rauhamaa wrote:
> Unicode heroically and definitively solved the problems ASCII had posed
> but introduced a bag of new, trickier problems.
>
> (As for ligatures, I understand that there might be quite a bit of
> legacy software that dedicated code points and code pages for ligatures.
> Translating that legacy software to Unicode was made more
> straightforward by introducing analogous codepoints to Unicode. Unicode
> has quite many such codepoints: µ, K, Ω etc.)

More specifically, Unicode solved the problems that *codepages* had
posed. And one of the principles of its design was that every
character in every legacy encoding had a direct representation as a
Unicode codepoint, allowing bidirectional transcoding for
compatibility. Perhaps if Unicode had existed from the dawn of
computing, we'd have fewer characters; but backward compatibility is
way too important to let a narrow purity argument sway it.

ChrisA
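
For anyone who wants to poke at this: a minimal sketch of the round-trip property, assuming Python 3 and its bundled latin-1 and cp437 codecs. The helper name round_trips is made up for illustration and is not from the thread. The second half shows how the three compatibility characters quoted above behave under normalisation.

```python
import unicodedata


def round_trips(data: bytes, codec: str) -> bool:
    """Return True if decoding and then re-encoding reproduces the bytes exactly."""
    return data.decode(codec).encode(codec) == data


# Every byte value of these two legacy code pages maps to a codepoint and back.
all_bytes = bytes(range(256))
for codec in ("latin-1", "cp437"):
    print(codec, round_trips(all_bytes, codec))

# MICRO SIGN, KELVIN SIGN and OHM SIGN: compatibility characters that
# normalisation folds into the ordinary Greek or Latin letters.
for ch in "\u00b5\u212a\u2126":
    nfc = unicodedata.normalize("NFC", ch)
    nfkc = unicodedata.normalize("NFKC", ch)
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
        f"NFC -> U+{ord(nfc):04X}, NFKC -> U+{ord(nfkc):04X}"
    )
```

The Kelvin and ohm signs change even under NFC, whereas the micro sign only changes under the compatibility forms (NFKC/NFKD), so which form you pick matters if you need to transcode back to the legacy encoding afterwards.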