MIME-Version: 1.0
In-Reply-To: <201502241524.t1OFO09k022270@fido.openend.se>
References: <aae131a7-29a1-4f79-ac16-d1e223616c51@googlegroups.com> <davea@davea.name> <54EC5FA4.6070703@davea.name> <201502241455.t1OEtffT016452@fido.openend.se> <CAPTjJmqT_VnXDRpuX_yRLzUtDzedZqUNx5Zhba+d6ZVD9+PNdg@mail.gmail.com> <201502241507.t1OF7aUm018883@fido.openend.se> <rosuav@gmail.com> <CAPTjJmpg+Ar-83fLPN5Pg3U5udLbkS0tBqF+aGQbiLrCVJ5aSw@mail.gmail.com> <201502241524.t1OFO09k022270@fido.openend.se>
Date: Wed, 25 Feb 2015 02:33:30 +1100
Subject: Re: Newbie question about text encoding
From: Chris Angelico <rosuav@gmail.com>
Cc: "python-list@python.org" <python-list@python.org>
Content-Type: text/plain; charset=UTF-8
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.19138.1424792018.18130.python-list@python.org>

On Wed, Feb 25, 2015 at 2:24 AM, Laura Creighton <lac@openend.se> wrote:
> Ah, yes, you are right about that.  I see CP-1252 about 2 times every 10
> years, and latin1 every minute of my life, so I am biased to assume I
> know what I am seeing.

Fair enough. CP-1252 is still a possibility, but the difference can be
dealt with later.
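
For anyone following along, here is a rough sketch of where the two
actually differ. The sample bytes are made up for illustration, not
taken from the file under discussion; the only disagreement is in the
0x80-0x9F range.

```python
# Hypothetical sample: bytes 0x93 and 0x94 wrapped around an ASCII word.
data = b"\x93quoted\x94"

as_cp1252 = data.decode("cp1252")    # curly quotes, U+201C and U+201D
as_latin1 = data.decode("latin-1")   # C1 control characters, U+0093 and U+0094

print(ascii(as_cp1252))
print(ascii(as_latin1))

# From 0xA0 upward (where the accented letters live) the two agree,
# which is why mixing them up so often goes unnoticed.
common = bytes(range(0xA0, 0x100))
assert common.decode("cp1252") == common.decode("latin-1")
```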

> ChrisA, you come from an English speaking country, right?

Yes (Australia, to be specific).

> For those of us who come from countries whose language doesn't fit in
> ASCII, the notion of 'understand the data' doesn't work very well.  We
> already understand the data -- it's a set of words in our native language.
> The hard part isn't understanding the data, but rather understanding how
> the hell Python could be so stupid as to not understand it. :)  The
> notion that Python normally only understands the subset of the
> characters in your native language that English speakers use in their
> language is not the most obvious thing.

Also a reasonable baseline assumption; but the trouble is that if you
automatically assume that text is encoded in your favourite eight-bit
system, you're taking a huge risk.
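
A minimal sketch of that risk, assuming the file was really saved as
UTF-8 and the reader guesses Latin-1. The name is invented for the
example. Latin-1 assigns a character to every possible byte, so a wrong
guess never raises an error; it just quietly produces garbage.

```python
text = "Zoë Müller"                # hypothetical name, not from the thread
raw = text.encode("utf-8")         # what a UTF-8 editor would write to disk

guessed = raw.decode("latin-1")    # never fails: every byte is valid Latin-1
print(guessed)                     # mojibake, and no exception to warn you

# UTF-8 is far pickier, so the opposite wrong guess tends to fail loudly.
latin1_bytes = text.encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
print(rejected)
```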

Now, you have a huge leg up on me, in that you actually recognize the
*words* in that piece of text. That means you can have MUCH greater
confidence in stating that it's Latin-1 than I can. But that's
precisely what I mean by "understand the data". If you, being a native
French speaker, pick up a file written in (say) Polish and encoded in
Latin-2, you'll recognize by the ASCII characters that it's not French
text, and probably you'd be able to spot that it ought to be Latin-2
rather than Latin-1. That's understanding the data, that's having more
information than just the byte patterns. A computer can't reliably do
that (just look up the "Bush hid the facts" bug if you don't believe
me), but a human often can.
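
To make that concrete, here is a small sketch using a Polish place name
I picked for the example (it is not from the original file). The bytes
are identical in both cases; only the assumed encoding changes, and
both decodes succeed without complaint.

```python
raw = "Łódź".encode("iso8859_2")     # hypothetical Latin-2 data

as_latin2 = raw.decode("iso8859_2")  # the intended reading
as_latin1 = raw.decode("latin-1")    # same bytes, wrong assumption

print(raw)
print(as_latin2)
print(as_latin1)
```

Nothing in the bytes themselves says which reading is right. A human who
knows that a pound sign and a fraction have no business in a place name
can tell; the decoder can't.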

> And having taught countless European kids how to write their very first
> program in Python, I can tell you for certain that the sort of deep
> understanding of encoding methods is not what 10 year olds who just
> want to print out the names of their friends, and their favourite
> music titles, and their favourite musicians want to know. :)

Right, so you should be teaching them to use Python 3, and always
saving everything in UTF-8, and basically ignoring the whole mess of
eight-bit encodings :)
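
Something along these lines is all a beginner needs to see. It's only a
sketch: the names and the file name are invented, and the temp directory
is just there so the example doesn't litter anyone's disk. The one habit
to teach is passing encoding="utf-8" every time a file is opened.

```python
import os
import tempfile

names = ["Zoë", "Łukasz", "Björk"]   # hypothetical list of friends

# Hypothetical file location for the example.
path = os.path.join(tempfile.mkdtemp(), "friends.txt")

# Say the encoding explicitly when writing...
with open(path, "w", encoding="utf-8") as f:
    for name in names:
        f.write(name + "\n")

# ...and again when reading, rather than trusting the platform default.
with open(path, encoding="utf-8") as f:
    read_back = [line.rstrip("\n") for line in f]

print(read_back)
```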

ChrisA