From: Chris Angelico
Newsgroups: comp.lang.python
Subject: Re: Ah Python, you have spoiled me for all other languages
Date: Mon, 8 Jun 2015 00:47:14 +1000

On Sun, Jun 7, 2015 at 11:24 PM, Steven D'Aprano wrote:
>> (Unless you want to say that all strings are guaranteed to
>> be NFC/NFD normalized, such that s1 and s2 would actually be
>> identical, which I suppose is plausible. I'm not sure what the
>> advantage would be, though. And certainly you wouldn't want to
>> K-normalize strings automatically.)
>
> I believe that filenames on Apple file systems (HFS+ if I remember
> correctly) are guaranteed to be both normalised and correctly encoded as
> UTF-8. If you could live in a purely Apple world, you'd have far fewer
> filename hassles.

Yep. Actually, there should be nothing stopping the next Linux file
system ("ext5" or whatever) from enforcing the same thing;
byte-oriented filename APIs would still work just fine, but you could
have some confidence that at least local file systems will normally be
decodable as UTF-8. Then the only time you'd have to worry about
encoding problems would be network or removable file systems - no
worrying about "what's the FS encoding", because it'll just be UTF-8.
(Hmm. Point of interest: What happens on a Mac if you network-mount
something that isn't Unicode? If the enforcement of UTF-8 and
normalization is done at the file system level, it's no different from
the current Linux situation, where basically anything goes.)

But that's file names, not strings in a program. I'm not sure that
mandating that strings be normalized is particularly useful, but on
the flip side, I'm not sure of any situation where it'd be majorly
problematic either. There are ambiguities in some encodings, and as
soon as you decode from them and re-encode, you've effectively folded
those ambiguities to some canonical form; if your language
automatically normalized strings, you'd just have the same effect of
folding. And then you could have encode methods that stipulate the
other form of normalization - say you NFD everything internally, you
could then have a method "a\u0301".encode("utf-8", combine=True) which
NFC normalizes prior to encoding (and would thus be C3 A1 instead of
61 CC 81). Are there any languages out there that work this way?

ChrisA
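
For anyone who wants to poke at this: Python's str.encode() has no
combine parameter, so the call above is purely imagined, but the effect
can be approximated with the standard unicodedata module. The sketch
below is a rough illustration only. The names encode_normalized and
is_utf8_filename are hypothetical helpers invented for this example, and
"NFD everything internally" is simulated by normalizing on the way out
rather than being anything the language does itself.

```python
import unicodedata


def encode_normalized(s, encoding="utf-8", combine=False):
    """Hypothetical stand-in for the imagined s.encode(..., combine=True).

    Normalizes to NFC when combine is true, otherwise to NFD, and then
    encodes. This is an assumption-laden sketch, not a real str method.
    """
    form = "NFC" if combine else "NFD"
    return unicodedata.normalize(form, s).encode(encoding)


def is_utf8_filename(raw):
    """Return True if a byte-string filename decodes cleanly as UTF-8.

    Hypothetical helper: this is the kind of check a file system that
    enforces UTF-8 names would make unnecessary for local files.
    """
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


s = "a\u0301"  # 'a' followed by COMBINING ACUTE ACCENT (already NFD)

decomposed = encode_normalized(s)              # bytes 61 CC 81
composed = encode_normalized(s, combine=True)  # bytes C3 A1

print(" ".join(f"{b:02X}" for b in decomposed))
print(" ".join(f"{b:02X}" for b in composed))

# A Latin-1 encoded name is not valid UTF-8; a UTF-8 encoded one is.
print(is_utf8_filename(b"caf\xc3\xa9.txt"))
print(is_utf8_filename(b"caf\xe9.txt"))
```

Starting from the precomposed "\u00e1" instead gives the same two byte
sequences, which is the folding effect described above: whichever form
came in, the normalization step decides what goes out.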