Date: Mon, 24 Dec 2012 13:16:16 +0100
Subject: Re: how to detect the character encoding in a web page ?
From: Kwpolska
To: python-list@python.org
Newsgroups: comp.lang.python

On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller wrote:
> $ wget -q -O - http://python.org/ | chardetect.py
> stdin: ISO-8859-2 with confidence 0.803579722043
> $

And it sucks, because it uses magic instead of reading the HTML tags.
The RIGHT thing to do for websites is to detect the meta charset
definition, which is

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

or

    <meta charset="utf-8">

The second one is for HTML5 websites, and both may require case
conversion and handling the useless ` /` at the end. But if somebody is
using HTML5, you are pretty much guaranteed to get UTF-8.

In today’s world, the proper assumption to make is “UTF-8 or GTFO”,
because nobody in their right mind would use anything else today.

-- 
Kwpolska
stop html mail | always bottom-post
www.asciiribbon.org | www.netmeister.org/news/learn2quote.html
GPG KEY: 5EAAEA16
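
For anyone who wants to try the meta-tag approach, here is a minimal sketch using only the standard library's `html.parser`. The names `MetaCharsetParser` and `sniff_meta_charset` are made up for this example, and looking only at the first 4096 bytes and falling back to UTF-8 are my assumptions, not rules from anywhere. It ignores the HTTP `Content-Type` header, which a real client should probably check first.

```python
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Remembers the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, which takes care
        # of the case conversion; self-closing tags end up here too.
        if tag != "meta" or self.charset is not None:
            return
        attrs = {name: (value or "") for name, value in attrs}
        if "charset" in attrs:
            # HTML5 form: <meta charset="...">
            self.charset = attrs["charset"].strip().lower() or None
        elif attrs.get("http-equiv", "").lower() == "content-type":
            # Older form: content="text/html; charset=..."
            for part in attrs.get("content", "").split(";"):
                key, _, value = part.strip().partition("=")
                if key.strip().lower() == "charset" and value.strip():
                    self.charset = value.strip().strip("\"'").lower()


def sniff_meta_charset(raw, default="utf-8"):
    """Return the charset named in a meta tag, or `default` if none is found."""
    parser = MetaCharsetParser()
    # Latin-1 maps every byte to a character, so this decode cannot fail,
    # and the ASCII markup we care about comes through unchanged.
    parser.feed(raw[:4096].decode("latin-1"))
    return parser.charset or default
```

Usage would be something like `sniff_meta_charset(urllib.request.urlopen(url).read())`, then decoding the page with whatever name comes back. Statistical guessing like chardet's is probably still a reasonable last resort when a page declares nothing at all and is not valid UTF-8.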