Path: csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail
From: Chris Angelico <rosuav@gmail.com>
Newsgroups: comp.lang.python
Subject: Re: Python 2.x or 3.x, which is faster?
Date: Thu, 10 Mar 2016 02:58:34 +1100
Lines: 58
Message-ID: <mailman.86.1457539118.15725.python-list@python.org>
References: <mailman.238.1457265255.20602.python-list@python.org> <nbjmv7$ad5$1@dont-email.me> <87d1r6iltx.fsf@elektro.pacujo.net> <nbjp1e$jhv$1@dont-email.me> <nbjrjm$m16$1@gioia.aioe.org> <nbjvas$h22$1@dont-email.me> <mailman.17.1457364684.10335.python-list@python.org> <nbkhei$dg6$1@dont-email.me> <mailman.43.1457377845.10335.python-list@python.org> <nbknir$avu$1@dont-email.me> <56de28a1$0$1604$c3e8da3$5496439d@news.astraweb.com> <nblae9$nl0$1@dont-email.me> <56de57b5$0$1590$c3e8da3$5496439d@news.astraweb.com> <nbml2q$l4n$1@dont-email.me> <56df6873$0$1588$c3e8da3$5496439d@news.astraweb.com> <nbnu1m$u1m$1@dont-email.me> <56df87f7$0$1620$c3e8da3$5496439d@news.astraweb.com> <nbpaaa$uic$1@dont-email.me> <mailman.77.1457532681.15725.python-list@python.org> <nbpcdd$71l$1@dont-email.me> <mailman.79.1457535287.15725.python-list@python.org> <56e0424b$0$1603$c3e8da3$5496439d@news.astraweb.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
In-Reply-To: <56e0424b$0$1603$c3e8da3$5496439d@news.astraweb.com>
Precedence: list
Xref: csiph.com comp.lang.python:104427

On Thu, Mar 10, 2016 at 2:33 AM, Steven D'Aprano <steve@pearwood.info> wrote:
> On Thu, 10 Mar 2016 01:54 am, Chris Angelico wrote:
>
>> I have a source of occasional text files that basically just dumps
>> stuff on me without any metadata, and I have to figure out (a) what
>> the encoding is, and (b) what language the text is in.
>
> https://pypi.python.org/pypi/chardet
>
>> then I have two levels of heuristics to try to guess a
>> most-likely encoding
>
> I'm curious, what do you do?

Collect subtitle files from random internet contributors and
determine whether they add to the existing corpus of material. The
first heuristic level is chardet, as mentioned; but with the specific
files that I'm processing, it has some semi-consistent errors, so I
scripted around that - e.g. "if chardet says ISO-8859-2, and these byte
patterns exist, it's probably actually codepage 1250". IIRC the second
level consists entirely of translating from an ISO-8859 encoding to the
nearest-equivalent Windows codepage.
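
A rough sketch of what those two levels could look like, not the
actual script. The "byte pattern" here is an assumption: bytes
0x80-0x9F are invisible C1 controls in ISO-8859-*, but printable in
the Windows codepages, so seeing any of them hints at the Windows
variant. The names refine_guess and NEAREST_WINDOWS are made up for
illustration, and the guess is passed in rather than taken from
chardet directly.

```python
import re

# Assumed pairing of ISO-8859 parts with their closest Windows codepage.
NEAREST_WINDOWS = {
    "iso-8859-1": "cp1252",
    "iso-8859-2": "cp1250",
    "iso-8859-5": "cp1251",
    "iso-8859-6": "cp1256",
    "iso-8859-7": "cp1253",
    "iso-8859-8": "cp1255",
    "iso-8859-9": "cp1254",
    "iso-8859-13": "cp1257",
}

# Bytes that are C1 control codes in ISO-8859-* but printable in cp125x.
C1_BYTES = re.compile(rb"[\x80-\x9f]")


def refine_guess(data: bytes, guess: str) -> str:
    """Second-guess a detector's answer (hypothetical heuristic)."""
    name = guess.lower()
    if name in NEAREST_WINDOWS and C1_BYTES.search(data):
        return NEAREST_WINDOWS[name]
    return name
```

For instance, refine_guess(b"\x8akoda", "ISO-8859-2") gives "cp1250",
under which the bytes decode to "Škoda"; without a byte in that range
the original guess is returned unchanged.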

> (I stress that trying to guess the character set or encoding from the text
> itself is a second-last ditch tactic, for when you really don't know and
> can't find out what the encoding is. The final, last-ditch tactic is to
> just say "bugger it, I'll pretend it's Latin-1" and get a mess of
> mojibake, but at least any ASCII characters will decode alright, and as an
> English speaker, that's all that's important to me :-)

What I do is attempt to guess, *and then hand it to the user*. I have
a little "cdless" script that does a chardet on a file, decodes
accordingly, and pipes the result into 'less' [1]. The most powerful
character encoding detection tool in my arsenal is 'less'.
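
Something along these lines would do the job; this is a minimal
sketch under assumptions, not the real cdless. It assumes the
third-party chardet package is installed and that 'less' is on PATH,
and it falls back to Latin-1 if chardet has no opinion. The function
names are invented for the example.

```python
import subprocess


def recode_for_pager(data: bytes, encoding: str) -> bytes:
    """Decode with the guessed encoding, then re-encode as UTF-8 for the pipe."""
    text = data.decode(encoding, errors="replace")
    return text.encode("utf-8")


def main(path: str) -> None:
    # Imported here so the helper above is usable without chardet installed.
    import chardet

    with open(path, "rb") as f:
        data = f.read()
    # chardet.detect returns a dict; "encoding" may be None for odd input.
    guess = chardet.detect(data)["encoding"] or "latin-1"
    # Pipes carry bytes, so less receives the UTF-8 encoded text on stdin.
    subprocess.run(["less"], input=recode_for_pager(data, guess))
```

Calling main(sys.argv[1]) from a small wrapper script turns it into a
command; the human looking at the pager output is still the final
judge of whether the guess was right.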

Pretending that text is Latin-1 is actually a pretty good start. If I
didn't have chardet, I'd be mainly using this:

https://github.com/Rosuav/shed/blob/master/charconv.py

With no args, this will take the beginning of the file (it tries to
get one paragraph of up to 1KB) and decode it using all the ISO-8859-*
encodings, displaying the results for human analysis. That's
surprisingly effective for a manual job. A large number of European
languages use a lot of ASCII letters and then each have their own
distinct non-ASCII characters in between; the only truly confusable
encodings are the ones whose alphabets are entirely non-ASCII (Cyrillic,
Arabic, Greek, Hebrew - ISO-8859-5 through -8), and mis-decoding one as another
usually results in complete nonsense (words with impossible
vowel/consonant combinations, for instance). It does take *linguistic*
analysis (as opposed to purely mathematical/charcode), but it isn't
too hard.
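
The no-args behaviour can be approximated like this. It's a sketch of
the idea rather than a copy of charconv.py: the paragraph detection
(first blank line within the first 1KB) is an assumption, as are the
function names, and bytes that a given encoding leaves undefined are
shown as U+FFFD instead of raising.

```python
# ISO-8859 parts known to Python's codecs; part 12 was never published.
ISO_8859_PARTS = [n for n in range(1, 17) if n != 12]


def first_paragraph(data: bytes, limit: int = 1024) -> bytes:
    """Take at most `limit` bytes, cut at the first blank line (LF only)."""
    return data[:limit].split(b"\n\n", 1)[0]


def candidate_decodings(data: bytes) -> dict:
    """Decode the sample under every ISO-8859-* encoding for eyeballing."""
    sample = first_paragraph(data)
    return {
        f"iso-8859-{n}": sample.decode(f"iso8859_{n}", errors="replace")
        for n in ISO_8859_PARTS
    }


def show(data: bytes) -> None:
    # One line per encoding; the reader picks the one that looks like language.
    for name, text in candidate_decodings(data).items():
        print(f"{name:>12}: {text}")
```

Feeding show() some Cyrillic text encoded as ISO-8859-5 prints one
readable line and fourteen lines of accented-letter soup, which is
exactly the contrast a human needs.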

ChrisA

[1] ... and since Unix pipes carry bytes, not text, this involves
encoding it as UTF-8. But that's an implementation detail between
cdless and less.