
how to detect the character encoding in a web page ?

Started by	iMath <redstone-cold@163.com>
First post	2012-12-23 16:34 -0800
Last post	2013-01-07 01:23 -0800
Articles	14 — 10 participants


  how to detect the character encoding  in a web page ? iMath <redstone-cold@163.com> - 2012-12-23 16:34 -0800
    Re: how to detect the character encoding in a web page ? Chris Angelico <rosuav@gmail.com> - 2012-12-24 12:23 +1100
    Re: how to detect the character encoding  in a web page ? Hans Mulder <hansmu@xs4all.nl> - 2012-12-24 02:30 +0100
    Re: how to detect the character encoding  in a web page ? iMath <redstone-cold@163.com> - 2012-12-23 18:57 -0800
    Re: how to detect the character encoding  in a web page ? iMath <redstone-cold@163.com> - 2012-12-23 19:03 -0800
    Re: how to detect the character encoding  in a web page ? iMath <redstone-cold@163.com> - 2012-12-23 19:03 -0800
      Re: how to detect the character encoding  in a web page ? Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2012-12-24 09:34 +0100
      Re: how to detect the character encoding in a web page ? Kwpolska <kwpolska@gmail.com> - 2012-12-24 13:16 +0100
        Re: how to detect the character encoding in a web page ? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-12-24 13:50 +0000
          Re: how to detect the character encoding in a web page ? Alister <alister.ware@ntlworld.com> - 2012-12-24 16:27 +0000
            Re: how to detect the character encoding in a web page ? Roy Smith <roy@panix.com> - 2012-12-24 11:46 -0500
              Re: how to detect the character encoding in a web page ? albert@spenarnc.xs4all.nl (Albert van der Horst) - 2013-01-14 12:50 +0000
    Re: how to detect the character encoding  in a web page ? python培训 <51mmj.com@gmail.com> - 2012-12-28 06:30 -0800
    Re: how to detect the character encoding  in a web page ? iMath <redstone-cold@163.com> - 2013-01-07 01:23 -0800

#35421 — how to detect the character encoding in a web page ?

From	iMath <redstone-cold@163.com>
Date	2012-12-23 16:34 -0800
Subject	how to detect the character encoding in a web page ?
Message-ID	<c15bad9a-a7f7-456e-8dc5-b1af67fbdd44@googlegroups.com>

How can I detect the character encoding of a web page,
such as this one?

http://python.org/


#35424 — Re: how to detect the character encoding in a web page ?

From	Chris Angelico <rosuav@gmail.com>
Date	2012-12-24 12:23 +1100
Subject	Re: how to detect the character encoding in a web page ?
Message-ID	<mailman.1231.1356312219.29569.python-list@python.org>
In reply to	#35421

On Mon, Dec 24, 2012 at 11:34 AM, iMath <redstone-cold@163.com> wrote:
> how to detect the character encoding  in a web page ?
> such as this page
>
> http://python.org/

You read part-way into the page, where you find this:

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

That tells you that the character set is UTF-8.

ChrisA


#35426

From	Hans Mulder <hansmu@xs4all.nl>
Date	2012-12-24 02:30 +0100
Message-ID	<50d7b047$0$6963$e4fe514c@news2.news.xs4all.nl>
In reply to	#35421

On 24/12/12 01:34:47, iMath wrote:
> how to detect the character encoding  in a web page ?

That depends on the site: different sites indicate
their encoding differently.

> such as this page:  http://python.org/

If you download that page and look at the HTML code, you'll find a line:

  <meta http-equiv="content-type" content="text/html; charset=utf-8" />

So it's encoded as utf-8.

Other sites declare their charset in the Content-Type HTTP header line.
And then there are sites relying on the default.  And sites that get
it wrong, and send data in a different encoding from what they declare.
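A minimal Python 3 sketch of the header case, under stated assumptions:
it uses the standard library's email.message.Message to pull the charset
parameter out of a Content-Type value, and the function names are made up
for illustration. It returns None when nothing is declared, which is the
"relying on the default" case.

```python
import urllib.request
from email.message import Message


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() lower-cases the name and returns None if absent.
    return msg.get_content_charset()


def declared_charset(url):
    """Fetch a URL and report the charset its HTTP header declares, if any."""
    with urllib.request.urlopen(url, timeout=30) as response:
        # response.headers is an email.message.Message subclass.
        return response.headers.get_content_charset()
```

Calling declared_charset('http://python.org/') needs network access; the
result says only what the server claims, not what the bytes really are.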

Welcome to the real world,

-- HansM


#35432

From	iMath <redstone-cold@163.com>
Date	2012-12-23 18:57 -0800
Message-ID	<212044ed-396f-4b2d-acec-8832e31723ad@googlegroups.com>
In reply to	#35421

On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding  in a web page ?
> 
> such as this page 
> 
> 
> 
> http://python.org/

But how can I get Python to do it for me?

Take this page, for example:

http://python.org/

How can I detect the character encoding of this web page with Python?


#35433

From	iMath <redstone-cold@163.com>
Date	2012-12-23 19:03 -0800
Message-ID	<10a96dbc-40e2-43ee-acb9-88ebafec7bd5@googlegroups.com>
In reply to	#35421

On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding  in a web page ?
> 
> such as this page 
> 
> 
> 
> http://python.org/

But how can I get Python to do it for me?

Take these 2 pages, for example:

http://python.org/
http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx

How can I detect the character encoding of these 2 pages with Python?


#35434

From	iMath <redstone-cold@163.com>
Date	2012-12-23 19:03 -0800
Message-ID	<2324928c-32de-4f9d-8ff1-5db6dcf5543a@googlegroups.com>
In reply to	#35421

On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding  in a web page ?
> 
> such as this page 
> 
> 
> 
> http://python.org/

But how can I get Python to do it for me?

Take these 2 pages, for example:

http://python.org/
http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx

How can I detect the character encoding of these 2 pages with Python?


#35447

From	Kurt Mueller <kurt.alfred.mueller@gmail.com>
Date	2012-12-24 09:34 +0100
Message-ID	<mailman.1245.1356338098.29569.python-list@python.org>
In reply to	#35434

On 24.12.2012 at 04:03, iMath wrote:
> but how to let python do it for you ? 
> such as these 2 pages 
> http://python.org/ 
> http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx
> how to  detect the character encoding in these 2 pages  by python ?


If you have the HTML code, let
chardetect.py
make an educated guess for you.

http://pypi.python.org/pypi/chardet

Example:
$ wget -q -O - http://python.org/ | chardetect.py 
stdin: ISO-8859-2 with confidence 0.803579722043
$ 

$ wget -q -O - 'http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx' | chardetect.py 
stdin: utf-8 with confidence 0.87625
$ 


Grüessli
-- 
kurt.alfred.mueller@gmail.com


#35455 — Re: how to detect the character encoding in a web page ?

From	Kwpolska <kwpolska@gmail.com>
Date	2012-12-24 13:16 +0100
Subject	Re: how to detect the character encoding in a web page ?
Message-ID	<mailman.1253.1356351379.29569.python-list@python.org>
In reply to	#35434

On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
<kurt.alfred.mueller@gmail.com> wrote:
> $ wget -q -O - http://python.org/ | chardetect.py
> stdin: ISO-8859-2 with confidence 0.803579722043
> $

And it sucks, because it uses magic instead of reading the HTML tags.
The RIGHT thing to do for websites is to detect the meta charset
definition, which is

    <meta http-equiv="content-type" content="text/html; charset=utf-8">

or

    <meta charset="utf-8">

The second one is for HTML5 websites. Both need case-insensitive
matching and may carry the useless ` /` at the end.  But if somebody is
using HTML5, you are pretty much guaranteed to get UTF-8.
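A rough Python 3 sketch of that meta lookup, assuming a regular expression
is good enough (a real HTML parser would be more robust). The names
META_CHARSET and charset_from_meta are hypothetical, and the 1024-byte
limit follows HTML5's requirement that the declaration sit within the
first 1024 bytes of the document.

```python
import re

# Matches both <meta charset="..."> and the http-equiv form, where
# "charset=" appears inside the content attribute. Case-insensitive.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def charset_from_meta(raw, limit=1024):
    """Return the charset named in a <meta> tag near the top of raw, or None."""
    match = META_CHARSET.search(raw[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

The search runs on bytes because the encoding is not known yet; encoding
names are plain ASCII, so that is safe for the name itself.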

In today’s world, the proper assumption to make is “UTF-8 or GTFO”.
Because nobody in their right mind would use something else today.

-- 
Kwpolska <http://kwpolska.tk>
stop html mail      | always bottom-post
www.asciiribbon.org | www.netmeister.org/news/learn2quote.html
GPG KEY: 5EAAEA16


#35457 — Re: how to detect the character encoding in a web page ?

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-12-24 13:50 +0000
Subject	Re: how to detect the character encoding in a web page ?
Message-ID	<50d85daf$0$29967$c3e8da3$5496439d@news.astraweb.com>
In reply to	#35455

On Mon, 24 Dec 2012 13:16:16 +0100, Kwpolska wrote:

> On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
> <kurt.alfred.mueller@gmail.com> wrote:
>> $ wget -q -O - http://python.org/ | chardetect.py
>> stdin: ISO-8859-2 with confidence 0.803579722043
>> $
> 
> And it sucks, because it uses magic, and not reading the HTML tags. The
> RIGHT thing to do for websites is detect the meta charset definition,
> which is
> 
>     <meta http-equiv="content-type" content="text/html; charset=utf-8">
> 
> or
> 
>     <meta charset="utf-8">
> 
> The second one for HTML5 websites, and both may require case conversion
> and the useless ` /` at the end.  But if somebody is using HTML5, you
> are pretty much guaranteed to get UTF-8.
> 
> In today’s world, the proper assumption to make is “UTF-8 or GTFO”.
> Because nobody in the right mind would use something else today.

Alas, there are many, many, many, MANY websites that are created by 
people who are *not* in their right mind. To say nothing of 15 year old 
websites that use a legacy encoding. And to support those, you may need 
to guess the encoding, and for that, chardetect.py is the solution.


-- 
Steven


#35466 — Re: how to detect the character encoding in a web page ?

From	Alister <alister.ware@ntlworld.com>
Date	2012-12-24 16:27 +0000
Subject	Re: how to detect the character encoding in a web page ?
Message-ID	<rn%Bs.693798$nB6.605938@fx21.am4>
In reply to	#35457

On Mon, 24 Dec 2012 13:50:39 +0000, Steven D'Aprano wrote:

> On Mon, 24 Dec 2012 13:16:16 +0100, Kwpolska wrote:
> 
>> On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
>> <kurt.alfred.mueller@gmail.com> wrote:
>>> $ wget -q -O - http://python.org/ | chardetect.py
>>> stdin: ISO-8859-2 with confidence 0.803579722043
>>> $
>> 
>> And it sucks, because it uses magic, and not reading the HTML tags. The
>> RIGHT thing to do for websites is detect the meta charset definition,
>> which is
>> 
>>     <meta http-equiv="content-type" content="text/html; charset=utf-8">
>> 
>> or
>> 
>>     <meta charset="utf-8">
>> 
>> The second one for HTML5 websites, and both may require case conversion
>> and the useless ` /` at the end.  But if somebody is using HTML5, you
>> are pretty much guaranteed to get UTF-8.
>> 
>> In today’s world, the proper assumption to make is “UTF-8 or GTFO”.
>> Because nobody in the right mind would use something else today.
> 
> Alas, there are many, many, many, MANY websites that are created by
> people who are *not* in their right mind. To say nothing of 15 year old
> websites that use a legacy encoding. And to support those, you may need
> to guess the encoding, and for that, chardetect.py is the solution.

Indeed, due to the poor quality of most websites, it is not possible to
be 100% accurate for all sites.

Personally, I would start by checking the doc type and then the metadata,
as these should be quick and correct. I then use chardetect only if these
fail to provide any result.
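One way to sketch that order in Python 3, as an illustration only: the
function name choose_encoding and its declared parameter are hypothetical,
chardet is treated as an optional third-party dependency, and the final
UTF-8 default is an assumption, not something the thread prescribes.

```python
import codecs

try:
    import chardet  # third-party; only needed for the guessing step
except ImportError:
    chardet = None


def choose_encoding(raw, declared=()):
    """Pick an encoding: declared names first, a chardet guess second.

    declared holds charset names taken from the HTTP header and the
    <meta> tag, in order of preference; entries may be None.
    """
    for name in declared:
        if not name:
            continue
        try:
            # Accept only names Python actually has a codec for.
            return codecs.lookup(name).name
        except LookupError:
            continue
    if chardet is not None:
        guess = chardet.detect(raw)["encoding"]  # may be None
        if guess:
            return guess
    return "utf-8"  # assumed last-resort default
```

The declared names are checked without touching the page bytes, so the
slower statistical guess runs only when no usable declaration exists.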


-- 
I have found little that is good about human beings.  In my experience
most of them are trash.
		-- Sigmund Freud


#35470 — Re: how to detect the character encoding in a web page ?

From	Roy Smith <roy@panix.com>
Date	2012-12-24 11:46 -0500
Subject	Re: how to detect the character encoding in a web page ?
Message-ID	<roy-DF05DA.11460324122012@news.panix.com>
In reply to	#35466

In article <rn%Bs.693798$nB6.605938@fx21.am4>,
 Alister <alister.ware@ntlworld.com> wrote:

> Indeed due to the poor quality of most websites it is not possible to be 
> 100% accurate for all sites.
> 
> personally I would start by checking the doc type & then the meta data as 
> these should be quick & correct, I then use chardetect only if these 
> fail to provide any result.

I agree that checking the metadata is the right thing to do.  But, I 
wouldn't go so far as to assume it will always be correct.  There's a 
lot of crap out there with perfectly formed metadata which just happens 
to be wrong.

Although it pains me greatly to quote Ronald Reagan as a source of 
wisdom, I have to admit he got it right with "Trust, but verify".  It's 
the only way to survive in the Unicode world.  Write defensive code.  
Wrap try blocks around calls that might raise exceptions if the external 
data is borked w/r/t what the metadata claims it should be.
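A small Python 3 sketch of that defensive style, assuming the declared
encoding has already been extracted elsewhere. The function name
decode_defensively and the fallback choice are hypothetical.

```python
def decode_defensively(raw, declared, fallback="utf-8"):
    """Decode raw with the declared encoding, falling back if it lies.

    Returns (text, encoding_used) so the caller can see what happened.
    """
    try:
        # Trust the metadata first.
        return raw.decode(declared or fallback), declared or fallback
    except (LookupError, UnicodeDecodeError):
        # LookupError: unknown encoding name.
        # UnicodeDecodeError: the bytes do not match the claim.
        # errors="replace" guarantees this second attempt cannot raise.
        return raw.decode(fallback, errors="replace"), fallback
```

A successful decode is not proof that the metadata is right, since many
single-byte encodings accept any byte sequence; the try block catches only
the cases where the claim is provably wrong.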


#36782 — Re: how to detect the character encoding in a web page ?

From	albert@spenarnc.xs4all.nl (Albert van der Horst)
Date	2013-01-14 12:50 +0000
Subject	Re: how to detect the character encoding in a web page ?
Message-ID	<50f3ff0f$0$6347$e4fe514c@dreader35.news.xs4all.nl>
In reply to	#35470

In article <roy-DF05DA.11460324122012@news.panix.com>,
Roy Smith  <roy@panix.com> wrote:
>In article <rn%Bs.693798$nB6.605938@fx21.am4>,
> Alister <alister.ware@ntlworld.com> wrote:
>
>> Indeed due to the poor quality of most websites it is not possible to be
>> 100% accurate for all sites.
>>
>> personally I would start by checking the doc type & then the meta data as
>> these should be quick & correct, I then use chardetect only if these
>> fail to provide any result.
>
>I agree that checking the metadata is the right thing to do.  But, I
>wouldn't go so far as to assume it will always be correct.  There's a
>lot of crap out there with perfectly formed metadata which just happens
>to be wrong.
>
>Although it pains me greatly to quote Ronald Reagan as a source of
>wisdom, I have to admit he got it right with "Trust, but verify".  It's

Not surprisingly, as an actor, Reagan was as good as his script.
This one he got from Stalin.

>the only way to survive in the unicode world.  Write defensive code.
>Wrap try blocks around calls that might raise exceptions if the external
>data is borked w/r/t what the metadata claims it should be.

The way to go, of course.

Groetjes Albert
-- 
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst


#35694

From	python培训 <51mmj.com@gmail.com>
Date	2012-12-28 06:30 -0800
Message-ID	<5af3055a-c460-477e-90ef-72bbc0612a25@googlegroups.com>
In reply to	#35421

On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding  in a web page ?
> 
> such as this page 
> 
> 
> 
> http://python.org/

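First, install chardet.

import urllib.request  # the original post used Python 2's urllib2
import chardet

url = 'http://python.org/'  # the original read the URL from a variable named `line`
# Fetch the page's HTML as raw bytes
html_1 = urllib.request.urlopen(url, timeout=120).read()
# Let chardet guess; 'encoding' may be None if it cannot decide
bianma = chardet.detect(html_1)['encoding']
if bianma and bianma.lower() == 'utf-8':
    html = html_1
else:
    # Decode with the guess (GB2312 if there is none), then re-encode as UTF-8
    html = html_1.decode(bianma or 'gb2312', 'ignore').encode('utf-8')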


#36331

From	iMath <redstone-cold@163.com>
Date	2013-01-07 01:23 -0800
Message-ID	<570913d9-d0f6-4ad4-9400-194d008a8384@googlegroups.com>
In reply to	#35421

On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding  in a web page ?
> 
> such as this page 
> 
> 
> 
> http://python.org/

So far, chardet seems to be the only way to have Python do it automatically.
