Subject: Re: how to detect the character encoding in a web page ?
From: Kurt Mueller
Date: Mon, 24 Dec 2012 09:34:16 +0100
Newsgroups: comp.lang.python

On 24.12.2012 at 04:03, iMath wrote:
> but how to let python do it for you ?
> such as these 2 pages
> http://python.org/
> http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx
> how to detect the character encoding in these 2 pages by python ?

If you have the HTML source, let chardetect.py (from the chardet
package) make an educated guess for you.

http://pypi.python.org/pypi/chardet

Example:
$ wget -q -O - http://python.org/ | chardetect.py
stdin: ISO-8859-2 with confidence 0.803579722043
$
$ wget -q -O - 'http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx' | chardetect.py
stdin: utf-8 with confidence 0.87625
$

Regards,
-- 
kurt.alfred.mueller@gmail.com
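
P.S. Since the question was how to do this from Python itself: the same guess can probably be obtained by calling the chardet module directly rather than the chardetect.py script. What follows is only a minimal sketch, assuming the third-party chardet package is installed. The helper name guess_encoding and the file-name arguments are my own illustration and not part of chardet.

```python
# Sketch only: assumes the third-party chardet package is installed
# (pip install chardet). guess_encoding is a hypothetical helper name.
import sys

try:
    import chardet
except ImportError:  # keep the sketch importable even without chardet
    chardet = None


def guess_encoding(raw):
    """Return (encoding, confidence) for a bytes object.

    Returns (None, 0.0) when chardet is not available.
    """
    if chardet is None:
        return None, 0.0
    result = chardet.detect(raw)  # dict with 'encoding' and 'confidence'
    return result["encoding"], result["confidence"]


if __name__ == "__main__":
    # Each argument is a file holding a saved page,
    # e.g. one written by: wget -q -O page.html http://python.org/
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:  # read bytes, do not decode
            encoding, confidence = guess_encoding(handle.read())
        print("%s: %s with confidence %s" % (path, encoding, confidence))
```

As with chardetect.py, the result is a statistical guess and not a guarantee, so check the confidence value before relying on it.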