From: Laura Creighton
To: Chris Angelico
Subject: Re: Newbie question about text encoding
In-Reply-To: Message from Chris Angelico of "Wed, 25 Feb 2015 02:10:42 +1100."
Date: Tue, 24 Feb 2015 16:24:00 +0100
Cc: "python-list@python.org", lac@openend.se
Newsgroups: comp.lang.python

In a message of Wed, 25 Feb 2015 02:10:42 +1100, Chris Angelico writes:
>On Wed, Feb 25, 2015 at 2:07 AM, Laura Creighton wrote:
>>>Can you be sure it's Latin-1? I'm not certain of that. In any case, I
>>>never advocate fixing encoding problems by "just do this and it'll all
>>>go away"; you have to understand your data before you can decode it.
>>>
>>>ChrisA
>>
>> I can, I speak French and I recognise the data. It's French place names,
>> places where sporting events are held. :)
>
>Ah, okay. :) But even with that level of confidence, you still have to
>pick between Latin-1 and CP-1252, which you can't tell based on this
>one snippet. Welcome to untagged encodings.
>
>ChrisA

Ah, yes, you are right about that. I see CP-1252 about 2 times every
10 years, and Latin-1 every minute of my life, so I am biased to assume
I know what I am seeing.

ChrisA, you come from an English-speaking country, right? For those of
us who come from countries whose language doesn't fit in ASCII, the
notion of 'understand the data' doesn't work very well. We already
understand the data -- it's a set of words in our native language.
The hard part isn't understanding the data, but rather understanding
how the hell Python could be so stupid as to not understand it. :)
The notion that Python normally only understands the subset of the
characters in your native language that English speakers use in their
language is not the most obvious thing.

And having taught countless European kids how to write their very first
program in Python, I can tell you for certain that a deep understanding
of encoding methods is not what 10-year-olds who just want to print out
the names of their friends, their favourite music titles, and their
favourite musicians want to acquire. :)

Laura
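
A minimal sketch of the Latin-1 versus CP-1252 ambiguity from the quoted exchange, assuming Python 3. The sample strings are hypothetical and are not the data discussed in the thread. It shows why typical French text decodes the same way under both encodings, and where the two actually diverge.

```python
# Hypothetical sample: a French place name, encoded as Latin-1 bytes.
place = "Besançon"
raw = place.encode("latin-1")

# Bytes 0xA0-0xFF map to the same characters in both encodings,
# so this snippet decodes identically either way and cannot tell
# us which encoding the sender used.
assert raw.decode("latin-1") == raw.decode("cp1252") == place

# The two encodings differ only in the range 0x80-0x9F.
# Byte 0x9C is the ligature "œ" in CP-1252, but an invisible
# C1 control character (U+009C) in Latin-1.
ambiguous = b"c\x9cur"
as_cp1252 = ambiguous.decode("cp1252")   # the French word "cœur"
as_latin1 = ambiguous.decode("latin-1")  # "c", U+009C, "ur"
```

Only a byte in the 0x80-0x9F range settles the question, which is why a single snippet of accented place names is not enough to choose between the two.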