From: Chris Angelico
Newsgroups: comp.lang.python
Subject: Re: how to detect the character encoding in a web page ?
Date: Thu, 6 Jun 2013 03:55:11 +1000

On Thu, Jun 6, 2013 at 1:14 AM, iMath wrote:
> On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
>> how to detect the character encoding in a web page ?
>>
>> such as this page
>>
>> http://python.org/
>
> by the way, we cannot get the character encoding programmatically from
> the meta data without knowing the character encoding ahead!

The rules for web pages are (massively oversimplified):

1) HTTP header
2) ASCII-compatible encoding and meta tag

The HTTP header is completely out of band. This is the best way to
transmit encoding information. Otherwise, you assume 7-bit ASCII and
start parsing. Once you find a meta tag, you stop parsing and go back
to the top, decoding in the new way. "ASCII-compatible" covers a huge
number of encodings, so it's not actually much of a problem to do
this.

ChrisA
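
The two rules above can be sketched in Python roughly as follows. This is a minimal illustration and not a full implementation of what browsers do. The function names, the regular expression, the 1024-byte scan window and the "ascii" fallback are all assumptions made for the example. It takes the Content-Type header value and the raw body bytes as arguments, so it does no network access itself.

```python
import re
from email.message import Message

# Hypothetical pattern: finds charset=... inside a <meta ...> tag.
# It runs on raw bytes, which is the "assume ASCII and start parsing" step.
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def charset_from_header(content_type):
    """Rule 1: charset parameter of a Content-Type header value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased, None if absent


def detect_encoding(content_type, body, scan_bytes=1024, fallback="ascii"):
    """Return an encoding name: HTTP header first, then the meta tag."""
    header_charset = charset_from_header(content_type)
    if header_charset:
        return header_charset
    # Rule 2: scan the start of the body as bytes. The window size is an
    # assumption for this sketch.
    match = META_CHARSET_RE.search(body[:scan_bytes])
    if match:
        return match.group(1).decode("ascii").lower()
    return fallback


def decode_page(content_type, body):
    """Go back to the top and decode the whole body in the detected way."""
    encoding = detect_encoding(content_type, body)
    return body.decode(encoding, errors="replace")
```

With urllib.request, the first argument would typically come from response.headers.get("Content-Type") and the second from response.read(). An encoding name Python does not recognise raises LookupError in decode_page, which a real program would want to handle.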