
Re: Pyhon 2.x or 3.x, which is faster?

From Chris Angelico <rosuav@gmail.com>
Newsgroups comp.lang.python
Subject Re: Pyhon 2.x or 3.x, which is faster?
Date 2016-03-10 02:58 +1100
Message-ID <mailman.86.1457539118.15725.python-list@python.org>
References (17 earlier) <nbpaaa$uic$1@dont-email.me> <mailman.77.1457532681.15725.python-list@python.org> <nbpcdd$71l$1@dont-email.me> <mailman.79.1457535287.15725.python-list@python.org> <56e0424b$0$1603$c3e8da3$5496439d@news.astraweb.com>



On Thu, Mar 10, 2016 at 2:33 AM, Steven D'Aprano <steve@pearwood.info> wrote:
> On Thu, 10 Mar 2016 01:54 am, Chris Angelico wrote:
>
>> I have a source of occasional text files that basically just dumps
>> stuff on me without any metadata, and I have to figure out (a) what
>> the encoding is, and (b) what language the text is in.
>
> https://pypi.python.org/pypi/chardet
>
>> then I have two levels of heuristics to try to guess a
>> most-likely encoding
>
> I'm curious, what do you do?

Collect subtitle files from random internet contributors and
determine whether they add to the existing corpus of material. The
first heuristic level is chardet, as mentioned; but with the specific
files that I'm processing, it makes some semi-consistent errors, so I
scripted around that, e.g. "if chardet says ISO-8859-2, and these byte
patterns exist, it's probably actually codepage 1250". IIRC the second
level consists entirely of translating from an ISO-8859 encoding to
the nearest-equivalent Windows codepage.
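
A minimal sketch of that kind of override, not the actual script. The
byte test used here (any byte in 0x80-0x9F, which ISO-8859-* reserves
for control codes but the Windows codepages use for printable
characters) and the names ISO_TO_WINDOWS, has_c1_bytes and refine_guess
are assumptions for illustration; the real byte patterns are not given
in this post.

```python
# Hypothetical post-processing of a detector's guess. The table and the
# byte test are illustrative assumptions, not the rules of the real script.

# Nearest Windows codepage for some ISO-8859 parts (matched by the
# script/language they cover, not by identical byte layout).
ISO_TO_WINDOWS = {
    "iso-8859-1": "cp1252",
    "iso-8859-2": "cp1250",
    "iso-8859-5": "cp1251",
    "iso-8859-6": "cp1256",
    "iso-8859-7": "cp1253",
    "iso-8859-8": "cp1255",
    "iso-8859-9": "cp1254",
    "iso-8859-13": "cp1257",
}


def has_c1_bytes(raw: bytes) -> bool:
    """Return True if any byte falls in 0x80-0x9F.

    ISO-8859-* assigns only control codes there, so real text containing
    such bytes is more likely to be in a Windows codepage.
    """
    return any(0x80 <= b <= 0x9F for b in raw)


def refine_guess(raw: bytes, guess: str) -> str:
    """Swap an ISO-8859 guess for its Windows counterpart when the bytes hint at it."""
    guess = guess.lower()
    windows = ISO_TO_WINDOWS.get(guess)
    if windows is not None and has_c1_bytes(raw):
        return windows
    return guess
```

For example, refine_guess(b"\x8akoda", "ISO-8859-2") returns "cp1250",
because 0x8A is a control code in ISO-8859-2 but a letter in codepage
1250, while b"\xa9koda" keeps the ISO-8859-2 guess.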

> (I stress that trying to guess the character set or encoding from the text
> itself is a second-last ditch tactic, for when you really don't know and
> can't find out what the encoding is. The final, last-ditch tactic is to
> just say "bugger it, I'll pretend it's Latin-1" and get a mess of
> mojibake, but at least any ASCII characters will decode all right, and as
> an English speaker, that's all that's important to me :-)
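
To illustrate why that last-ditch tactic never fails outright, here is
a small sketch (the function name latin1_fallback is made up for this
example). Latin-1 maps every byte value to the code point with the
same number, so decoding cannot raise; ASCII survives intact and
everything else may come out as mojibake.

```python
def latin1_fallback(raw: bytes) -> str:
    """Decode arbitrary bytes as Latin-1.

    Every byte 0x00-0xFF maps to the code point of the same value, so
    this cannot raise UnicodeDecodeError, whatever the input really is.
    """
    return raw.decode("latin-1")


# UTF-8 encoded "café" read as Latin-1: the ASCII part is fine,
# the two-byte sequence for "é" turns into two wrong characters.
print(latin1_fallback("café".encode("utf-8")))
```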

What I do is attempt to guess, *and then hand it to the user*. I have
a little "cdless" script that does a chardet on a file, decodes
accordingly, and pipes the result into 'less' [1]. The most powerful
character encoding detection tool in my arsenal is 'less'.
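
A rough sketch of what a script along those lines could look like,
not the real cdless. The function names are invented for this example,
the fallback to Latin-1 when chardet is missing or undecided is an
assumption, and the pager is simply assumed to be a 'less' found on
PATH.

```python
import subprocess


def guess_encoding(raw: bytes, default: str = "latin-1") -> str:
    """Ask chardet for an encoding if it is installed, else use `default`."""
    try:
        import chardet  # third-party package; optional in this sketch
    except ImportError:
        return default
    # chardet.detect() returns a dict; "encoding" can be None.
    return chardet.detect(raw)["encoding"] or default


def to_pager_bytes(raw: bytes, encoding: str) -> bytes:
    """Decode with the guessed encoding, then re-encode as UTF-8.

    Pipes carry bytes, so the decoded text has to be encoded again
    before it is handed to the pager.
    """
    return raw.decode(encoding, errors="replace").encode("utf-8")


def page_file(path: str) -> None:
    """Show a file in 'less' using the guessed encoding."""
    with open(path, "rb") as f:
        raw = f.read()
    data = to_pager_bytes(raw, guess_encoding(raw))
    subprocess.run(["less"], input=data, check=False)
```

Calling page_file("some-subtitles.srt") (a made-up file name) would
then open the decoded text in the pager, where a human can judge
whether the guess looks right.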

Pretending that text is Latin-1 is actually a pretty good start. If I
didn't have chardet, I'd be mainly using this:

https://github.com/Rosuav/shed/blob/master/charconv.py

With no args, this will take the beginning of the file (it tries to
get one paragraph of up to 1KB) and decode it using all the ISO-8859-*
encodings, displaying the results for human analysis. That's
surprisingly effective for a manual job. A large number of European
languages use a lot of ASCII letters and then each have their own
distinct non-ASCII characters in between; the only truly confusable
encodings are the ones whose letters are entirely non-ASCII (Cyrillic,
Arabic, Greek, Hebrew: ISO-8859-5 through ISO-8859-8), and mis-decoding
one as another
usually results in complete nonsense (words with impossible
vowel/consonant combinations, for instance). It does take *linguistic*
analysis (as opposed to purely mathematical/charcode), but it isn't
too hard.
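
The idea can be sketched in a few lines. This is not the linked
charconv.py, just an illustration under stated assumptions: the names
first_paragraph, decode_all and show are invented here, the paragraph
is taken to end at the first blank line within the first 1024 bytes,
and undecodable bytes are shown as replacement characters.

```python
# ISO-8859 parts that Python ships codecs for (there is no part 12).
ISO_8859_PARTS = [n for n in range(1, 17) if n != 12]


def first_paragraph(raw: bytes, limit: int = 1024) -> bytes:
    """Return the text up to the first blank line, capped at `limit` bytes."""
    head = raw[:limit].lstrip()
    for sep in (b"\r\n\r\n", b"\n\n"):
        head = head.split(sep, 1)[0]
    return head


def decode_all(raw: bytes) -> dict:
    """Decode the sample with every ISO-8859-* codec; keys are codec names."""
    sample = first_paragraph(raw)
    return {
        f"iso8859_{n}": sample.decode(f"iso8859_{n}", errors="replace")
        for n in ISO_8859_PARTS
    }


def show(raw: bytes) -> None:
    """Print one line per candidate encoding for a human to compare."""
    for name, text in decode_all(raw).items():
        print(f"{name:>10}: {text}")
```

Running show() on the raw bytes of a file prints fifteen candidate
readings; usually only one or two of them look like plausible words
in any language.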

ChrisA

[1] ... and since Unix pipes carry bytes, not text, this involves
encoding it as UTF-8. But that's an implementation detail between
cdless and less.



