From: Chris Angelico
Newsgroups: comp.lang.python
To: python-list@python.org
Subject: Re: how to detect the character encoding in a web page ?
Date: Thu, 6 Jun 2013 17:14:06 +1000

On Thu, Jun 6, 2013 at 4:22 PM, Nobody wrote:
> On Thu, 06 Jun 2013 03:55:11 +1000, Chris Angelico wrote:
>
>> The HTTP header is completely out of band. This is the best way to
>> transmit encoding information. Otherwise, you assume 7-bit ASCII and start
>> parsing. Once you find a meta tag, you stop parsing and go back to the
>> top, decoding in the new way.
>
> Provided that the meta tag indicates an ASCII-compatible encoding, and you
> haven't encountered any decode errors due to 8-bit characters, then
> there's no need to go back to the top.

Technically and conceptually, you go back to the start and re-parse.
Sure, you might optimize that if you can, but not every parser will,
hence it's advisable to put the content-type as early as possible.

>> "ASCII-compatible" covers a huge number of
>> encodings, so it's not actually much of a problem to do this.
>
> With slight modifications, you can also handle some
> almost-ASCII-compatible encodings such as shift-JIS.
>
> Personally, I'd start by assuming ISO-8859-1, keep track of which bytes
> have actually been seen, and only re-start parsing from the top if the
> encoding change actually affects the interpretation of any of those bytes.

Hrm, it'd be equally valid to guess UTF-8. But as long as you're
prepared to re-parse after finding the content-type, that's just a
choice of optimization and has no real impact.

ChrisA
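
For illustration, here is a rough Python sketch of the procedure discussed
above: trust the out-of-band HTTP header first, otherwise scan the raw bytes
for a meta tag, then decode the whole document from the top in whatever
encoding was found. The function names, the 1024-byte scan window and the
ISO-8859-1 default are assumptions made for this example, and the regexes
are a simplification rather than a real HTML parser.

```python
import re

# Hypothetical helpers for illustration only. The patterns are deliberately
# simple and will miss unusual markup that a real HTML parser would handle.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)
HEADER_CHARSET = re.compile(r"""charset\s*=\s*["']?([^\s;"']+)""", re.IGNORECASE)


def sniff_encoding(body, content_type=None, default="iso-8859-1"):
    """Guess the encoding of body (bytes), preferring out-of-band information."""
    # 1. Out of band: a charset in the HTTP Content-Type header wins.
    if content_type:
        match = HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower()
    # 2. In band: search the raw bytes near the top for a meta tag. Matching
    #    on bytes is the "assume ASCII and start parsing" step, since the tag
    #    looks the same in any ASCII-compatible encoding.
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Nothing declared: fall back to a guess (assumed default).
    return default


def decode_page(body, content_type=None):
    """Decode the whole page from byte 0 once the encoding is known."""
    encoding = sniff_encoding(body, content_type)
    try:
        # "Going back to the top": the full body is decoded again from the start.
        return body.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong label: ISO-8859-1 maps every byte, so it cannot fail.
        return body.decode("iso-8859-1")


if __name__ == "__main__":
    page = b'<html><head><meta charset="utf-8"><title>caf\xc3\xa9</title></head></html>'
    print(sniff_encoding(page))
    print(decode_page(page))
```

Because the scan only looks at the first 1024 bytes (an arbitrary limit chosen
here), a page that declares its charset late would fall through to the
default, which is one more reason to put the content-type as early as possible.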