From: Chris Angelico
Cc: python-list@python.org
Newsgroups: comp.lang.python
Subject: Re: Ah Python, you have spoiled me for all other languages
Date: Sun, 7 Jun 2015 22:08:06 +1000

On Sun, Jun 7, 2015 at 9:42
PM, Steven D'Aprano wrote:
> My opinion is that a programming language like Python or ECMAScript should
> operate on *code points*. If we want to call them "characters" informally,
> that should be allowed, but whenever there is ambiguity we should remember
> we're dealing with code points. The implementation shouldn't matter:
> compliant Python interpreters might choose to use UTF-8 internally, or
> UTF-16, or UTF-32, or something else, and still agree on how many
> characters a string contains. Normalisation is still an issue, of course,
> but any decent Unicode implementation will include a way to normalise or
> denormalise strings.

If by "normalise" you mean the NF[K]C/NF[K]D composition and
decomposition, then yes, any decent Unicode library will provide that.
I'm not sure it's critical to string handling itself, though; and
Python defers the operation to the unicodedata module:

>>> import unicodedata
>>> s1 = "\N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}"
>>> s2 = "\N{LATIN SMALL LETTER A WITH ACUTE}"
>>> s1 == s2
False
>>> unicodedata.normalize("NFC", s1) == s2
True

It's a useful operation to be able to do, but I would never expect
that *string comparison* or other operations should automatically
normalize. (Unless you want to say that all strings are guaranteed to
be NFC/NFD normalized, such that s1 and s2 would actually be
identical, which I suppose is plausible. I'm not sure what the
advantage would be, though. And certainly you wouldn't want to
K-normalize strings automatically.)

> The question of graphemes (what "ordinary people" consider letters and
> characters, e.g. "ch" is two letters to an English speaker but one letter
> to a Czech speaker) should be left to libraries. It's a much harder problem
> to solve in the full general case, requires localisation, and is overkill
> for many string-processing tasks.

Yeah. The basic challenge to a beginning programmer, "reverse this
string", becomes rather tricky in the presence of natural language.

>>> s1 += "e"
>>> s1
'áe'
>>> s1[::-1]
'éa'

Oops. But hey. It's easier to understand what went wrong here than,
say, if you reverse the bytes in a UTF-8 stream. Or the code units in
a UTF-16 stream. If you're lucky, those would give you instant
errors... if you're not, well, who knows.

ChrisA
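
P.S. For what it's worth, here is a rough sketch of a reversal that at
least keeps combining marks attached to their base letters. It assumes
that "a base code point plus any following code points with a nonzero
canonical combining class" is a good enough stand-in for a grapheme,
which it is not in general (it knows nothing about locale-specific
letters such as Czech "ch", for one). The name reverse_combining_aware
is made up for this example; only unicodedata.combining comes from the
standard library.

```python
import unicodedata


def reverse_combining_aware(text):
    """Reverse text, keeping combining marks attached to their base.

    Simplification (an assumption of this sketch): a cluster is one code
    point plus any following code points whose canonical combining class
    is nonzero. This is not full grapheme segmentation.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            # Combining mark: glue it onto the preceding cluster.
            clusters[-1] += ch
        else:
            # Base code point (or a stray leading mark): start a new cluster.
            clusters.append(ch)
    return "".join(reversed(clusters))


s1 = "a\N{COMBINING ACUTE ACCENT}e"

# Naive code point reversal: the accent ends up after the "e".
naive = s1[::-1]

# Cluster-wise reversal: the accent stays with the "a".
careful = reverse_combining_aware(s1)

# Reversing the UTF-8 bytes instead puts a continuation byte where a
# lead byte is expected, so decoding fails for this particular string.
reversed_bytes = s1.encode("utf-8")[::-1]
try:
    reversed_bytes.decode("utf-8")
    byte_reversal_failed = False
except UnicodeDecodeError:
    byte_reversal_failed = True
```

For this s1, the naive slice yields "e\u0301a" while the cluster-wise
version yields "ea\u0301", and the byte-reversed data refuses to
decode, which is the "lucky" instant-error case.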