Date: Wed, 25 Feb 2015 02:33:30 +1100
Subject: Re: Newbie question about text encoding
From: Chris Angelico
Cc: "python-list@python.org"
Newsgroups: comp.lang.python

On Wed, Feb 25, 2015 at 2:24 AM, Laura Creighton wrote:
> Ah, yes, you are right about that. I see CP-1252 about 2 times every 10
> years, and latin1 every minute of my life, so I am biased to assume I
> know what I am seeing.

Fair enough.
CP-1252 is still a possibility, but the difference can be dealt with later.

> ChrisA, you come from an English speaking country, right?

Yes (Australia, to be specific).

> For those of us who come from countries whose language doesn't fit in
> ASCII, the notion of 'understand the data' doesn't work very well. We
> already understand the data -- it's a set of words in our native language.
> The hard part isn't understanding the data, but rather understanding how
> the hell Python could be so stupid as to not understand it. :) The
> notion that Python normally only understands the subset of the
> characters in your native language that English speakers use in their
> language is not the most obvious thing.

Also a reasonable baseline assumption; but the trouble is that if you
automatically assume that text is encoded in your favourite eight-bit
system, you're taking a huge risk. Now, you have a huge leg up on me,
in that you actually recognize the *words* in that piece of text. That
means you can have MUCH greater confidence in stating that it's Latin-1
than I can. But that's precisely what I mean by "understand the data".
If you, being a native French speaker, pick up a file written in (say)
Polish, and encoded in Latin-2, you'll recognize by the ASCII characters
that it's not French text, and probably you'd be able to spot that it
ought to be Latin-2 rather than Latin-1. That's understanding the data:
having more information than just the byte patterns. A computer can't
reliably do that (just look up the "Bush hid the facts" bug if you
don't believe me), but a human often can.

> And having taught countless European kids how to write their very first
> program in Python, I can tell you for certain that the sort of deep
> understanding of encoding methods is not what 10 year olds who just
> want to print out the names of their friends, and their favourite
> music titles, and their favourite musicians want to know.
> :)

Right, so you should be teaching them to use Python 3, and always
saving everything in UTF-8, and basically ignoring the whole mess of
eight-bit encodings :)

ChrisA
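
P.S. A rough Python 3 sketch of the points above, for anyone who wants to
poke at it. The sample word, the list of names, the file name and the
try_decode helper are all made up for illustration; none of them come from
the thread. It shows one byte string decoding "successfully" under the
wrong eight-bit encoding, the narrow range where CP-1252 and Latin-1
disagree, and the save-everything-as-UTF-8 habit.

```python
import os
import tempfile

# Polish text saved as Latin-2 (ISO 8859-2). The word is only an example.
raw = "Łódź".encode("iso-8859-2")


def try_decode(data, encoding):
    """Hypothetical helper: decoded text, or None if the bytes are invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_latin2 = try_decode(raw, "iso-8859-2")  # right guess: 'Łódź'
as_latin1 = try_decode(raw, "latin-1")     # wrong guess, no error: '£ód¼'
as_utf8 = try_decode(raw, "utf-8")         # wrong guess, rejected: None

# CP-1252 and Latin-1 only disagree on bytes 0x80-0x9F.
quoted = b"\x93hi\x94"
as_cp1252 = try_decode(quoted, "cp1252")     # curly quotes around 'hi'
as_l1_quoted = try_decode(quoted, "latin-1")  # invisible control characters

# The UTF-8 habit: name the encoding on every open(), both ways.
names = ["Zoë", "Łukasz", "Søren"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "friends.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(names))
    with open(path, encoding="utf-8") as f:
        round_trip = f.read().split("\n")
```

Latin-1 assigns a character to every possible byte, so decoding with it
never raises; that is why a wrong guess gives you garbage rather than an
error, and why it takes someone who knows the language to notice.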