From: Chris Angelico
Newsgroups: comp.lang.python
Subject: Re: Pyhon 2.x or 3.x, which is faster?
Date: Thu, 10 Mar 2016 02:58:34 +1100

On Thu, Mar 10, 2016 at 2:33 AM, Steven D'Aprano wrote:
> On Thu, 10 Mar 2016 01:54 am, Chris Angelico wrote:
>
>> I have a source of occasional text files that basically just dumps
>> stuff on me without any metadata, and I have to figure out (a) what
>> the encoding is, and (b) what language the text is in.
>
> https://pypi.python.org/pypi/chardet
>
>> then I have two levels of heuristics to try to guess a
>> most-likely encoding
>
> I'm curious, what do you do?

Collect subtitles files from random internet contributors and
determine whether they add to the existing corpus of material. The
first heuristic level is chardet, as mentioned; but with the specific
files that I'm processing, it has some semi-consistent errors, so I
scripted around that - eg "if chardet says ISO-8859-2, and these byte
patterns exist, it's probably actually codepage 1250". IIRC the second
level is entirely translating from an ISO-8859 to the
nearest-equivalent Windows codepage.

> (I stress that trying to guess the character set or encoding from the text
> itself is a second-last ditch tactic, for when you really don't know and
> can't find out what the encoding is.
> The final, last-ditch tactic is to
> just say "bugger it, I'll pretend it's Latin-1" and get a mess of
> mojibake, but at least any ASCII characters will decode alright, and as an
> English speaker, that's all that's important to me :-)

What I do is attempt to guess, *and then hand it to the user*. I have
a little "cdless" script that does a chardet on a file, decodes
accordingly, and pipes the result into 'less' [1]. The most powerful
character encoding detection tool in my arsenal is 'less'.

Pretending that text is Latin-1 is actually a pretty good start. If I
didn't have chardet, I'd be mainly using this:

https://github.com/Rosuav/shed/blob/master/charconv.py

With no args, this will take the beginning of the file (it tries to
get one paragraph of up to 1KB) and decode it using all the ISO-8859-*
encodings, displaying the results for human analysis. That's
surprisingly effective for a manual job. A large number of European
languages use a lot of ASCII letters and then each have their own
distinct non-ASCII characters in between; the only truly confusable
encodings are the ones that are entirely non-ASCII (Cyrillic, Arabic,
Greek, Hebrew - ISO-8859-5 through 8), and mis-decoding one as another
usually results in complete nonsense (words with impossible
vowel/consonant combinations, for instance). It does take *linguistic*
analysis (as opposed to purely mathematical/charcode), but it isn't
too hard.

ChrisA

[1] ... and since Unix pipes carry bytes, not text, this involves
encoding it as UTF-8. But that's an implementation detail between
cdless and less.
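
For anyone wanting to try the manual approach, here is a minimal sketch of the two ideas above using only the standard library. It is not the code from charconv.py or cdless; the function names, the NEAREST_WINDOWS table, and the "bytes in 0x80-0x9F" test are all assumptions made for illustration. The byte test relies on the fact that ISO-8859-* reserves 0x80-0x9F for control codes while the Windows codepages put printable characters there; the actual byte patterns checked in the post are not stated.

```python
import re

# ISO-8859 parts that Python ships codecs for (there is no ISO-8859-12).
ISO_8859_PARTS = [n for n in range(1, 17) if n != 12]

# Hypothetical "nearest-equivalent" table, keyed by lower-cased names of
# the form a detector such as chardet might report. Illustrative only.
NEAREST_WINDOWS = {
    "iso-8859-1": "cp1252",   # Western European
    "iso-8859-2": "cp1250",   # Central European
    "iso-8859-5": "cp1251",   # Cyrillic
    "iso-8859-6": "cp1256",   # Arabic
    "iso-8859-7": "cp1253",   # Greek
    "iso-8859-8": "cp1255",   # Hebrew
    "iso-8859-9": "cp1254",   # Turkish
    "iso-8859-13": "cp1257",  # Baltic
}


def first_paragraph(data: bytes, limit: int = 1024) -> bytes:
    """Return the first paragraph of data, capped at limit bytes."""
    head = data.lstrip(b"\r\n")[:limit]
    # A blank line (LF or CRLF line endings) ends the paragraph.
    return re.split(rb"\r?\n\r?\n", head, maxsplit=1)[0]


def decode_all(sample: bytes) -> dict:
    """Decode sample with every ISO-8859-* codec for a human to compare."""
    results = {}
    for part in ISO_8859_PARTS:
        # Some parts leave byte values undefined; errors="replace" shows
        # those as U+FFFD instead of raising UnicodeDecodeError.
        results[f"ISO-8859-{part}"] = sample.decode(
            f"iso8859_{part}", errors="replace"
        )
    return results


def prefer_windows(guess: str, data: bytes) -> str:
    """Swap an ISO-8859 guess for its Windows cousin if C1 bytes appear."""
    if any(0x80 <= b <= 0x9F for b in data):
        return NEAREST_WINDOWS.get(guess.lower(), guess)
    return guess
```

To eyeball a file, something like `for name, text in decode_all(first_paragraph(open(path, "rb").read())).items(): print(name, text)` would list every candidate decoding side by side, which is the part that needs a human who can recognise plausible words.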