how to detect the character encoding in a web page ?

Started by: iMath <redstone-cold@163.com>
First post: 2012-12-23 16:34 -0800
Last post: 2013-01-07 01:23 -0800
Articles: 14, 10 participants


Contents

  how to detect the character encoding  in a web page ? iMath <redstone-cold@163.com> - 2012-12-23 16:34 -0800
    Re: how to detect the character encoding in a web page ? Chris Angelico <rosuav@gmail.com> - 2012-12-24 12:23 +1100
    Re: how to detect the character encoding  in a web page ? Hans Mulder <hansmu@xs4all.nl> - 2012-12-24 02:30 +0100
    Re: how to detect the character encoding  in a web page ? iMath <redstone-cold@163.com> - 2012-12-23 18:57 -0800
    Re: how to detect the character encoding  in a web page ? iMath <redstone-cold@163.com> - 2012-12-23 19:03 -0800
    Re: how to detect the character encoding  in a web page ? iMath <redstone-cold@163.com> - 2012-12-23 19:03 -0800
      Re: how to detect the character encoding  in a web page ? Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2012-12-24 09:34 +0100
      Re: how to detect the character encoding in a web page ? Kwpolska <kwpolska@gmail.com> - 2012-12-24 13:16 +0100
        Re: how to detect the character encoding in a web page ? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-12-24 13:50 +0000
          Re: how to detect the character encoding in a web page ? Alister <alister.ware@ntlworld.com> - 2012-12-24 16:27 +0000
            Re: how to detect the character encoding in a web page ? Roy Smith <roy@panix.com> - 2012-12-24 11:46 -0500
              Re: how to detect the character encoding in a web page ? albert@spenarnc.xs4all.nl (Albert van der Horst) - 2013-01-14 12:50 +0000
    Re: how to detect the character encoding  in a web page ? python培训 <51mmj.com@gmail.com> - 2012-12-28 06:30 -0800
    Re: how to detect the character encoding  in a web page ? iMath <redstone-cold@163.com> - 2013-01-07 01:23 -0800

#35421 — how to detect the character encoding in a web page ?

From: iMath <redstone-cold@163.com>
Date: 2012-12-23 16:34 -0800
Subject: how to detect the character encoding in a web page ?
Message-ID: <c15bad9a-a7f7-456e-8dc5-b1af67fbdd44@googlegroups.com>
How can I detect the character encoding of a web page, such as this one?

http://python.org/



#35424 — Re: how to detect the character encoding in a web page ?

From: Chris Angelico <rosuav@gmail.com>
Date: 2012-12-24 12:23 +1100
Subject: Re: how to detect the character encoding in a web page ?
Message-ID: <mailman.1231.1356312219.29569.python-list@python.org>
In reply to: #35421
On Mon, Dec 24, 2012 at 11:34 AM, iMath <redstone-cold@163.com> wrote:
> how to detect the character encoding  in a web page ?
> such as this page
>
> http://python.org/

You read part-way into the page, where you find this:

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

That tells you that the character set is UTF-8.

ChrisA



#35426

From: Hans Mulder <hansmu@xs4all.nl>
Date: 2012-12-24 02:30 +0100
Message-ID: <50d7b047$0$6963$e4fe514c@news2.news.xs4all.nl>
In reply to: #35421
On 24/12/12 01:34:47, iMath wrote:
> how to detect the character encoding  in a web page ?

That depends on the site: different sites indicate
their encoding differently.

> such as this page:  http://python.org/

If you download that page and look at the HTML code, you'll find a line:

  <meta http-equiv="content-type" content="text/html; charset=utf-8" />

So it's encoded as utf-8.

Other sites declare their charset in the Content-Type HTTP header line.
And then there are sites relying on the default.  And sites that get
it wrong, and send data in a different encoding from what they declare.
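
For the HTTP-header case, a minimal Python sketch could look like this. It only parses a Content-Type value that has already been fetched; the function name charset_from_content_type is made up for illustration, and the parsing is left to the standard library's email.message.Message.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the name and returns None
    # when the header carries no charset parameter.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

With urllib.request on Python 3, the response's headers object is itself a Message subclass, so response.headers.get_content_charset() should give the same answer for a live page.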


Welcome to the real world,

-- HansM



#35432

From: iMath <redstone-cold@163.com>
Date: 2012-12-23 18:57 -0800
Message-ID: <212044ed-396f-4b2d-acec-8832e31723ad@googlegroups.com>
In reply to: #35421
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding  in a web page ?
> such as this page
>
> http://python.org/

But how can I have Python do it for me?

For example, how can Python detect the character encoding of this page?

http://python.org/



#35434

From: iMath <redstone-cold@163.com>
Date: 2012-12-23 19:03 -0800
Message-ID: <2324928c-32de-4f9d-8ff1-5db6dcf5543a@googlegroups.com>
In reply to: #35421
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding  in a web page ?
> such as this page
>
> http://python.org/

But how can I have Python do it for me?

For example, how can Python detect the character encoding of these 2 pages?

http://python.org/ 
http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx



#35447

From: Kurt Mueller <kurt.alfred.mueller@gmail.com>
Date: 2012-12-24 09:34 +0100
Message-ID: <mailman.1245.1356338098.29569.python-list@python.org>
In reply to: #35434
On 24.12.2012 at 04:03, iMath wrote:
> but how to let python do it for you ? 
> such as these 2 pages 
> http://python.org/ 
> http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx
> how to  detect the character encoding in these 2 pages  by python ?


If you have the HTML code, let chardetect.py make an educated guess
for you.

http://pypi.python.org/pypi/chardet

Example:
$ wget -q -O - http://python.org/ | chardetect.py 
stdin: ISO-8859-2 with confidence 0.803579722043
$ 

$ wget -q -O - 'http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx' | chardetect.py 
stdin: utf-8 with confidence 0.87625
$ 


Greetings
-- 
kurt.alfred.mueller@gmail.com



#35455 — Re: how to detect the character encoding in a web page ?

From: Kwpolska <kwpolska@gmail.com>
Date: 2012-12-24 13:16 +0100
Subject: Re: how to detect the character encoding in a web page ?
Message-ID: <mailman.1253.1356351379.29569.python-list@python.org>
In reply to: #35434
On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
<kurt.alfred.mueller@gmail.com> wrote:
> $ wget -q -O - http://python.org/ | chardetect.py
> stdin: ISO-8859-2 with confidence 0.803579722043
> $

And it sucks, because it relies on magic rather than reading the HTML
tags.  The RIGHT thing to do for websites is to detect the meta charset
definition, which is

    <meta http-equiv="content-type" content="text/html; charset=utf-8">

or

    <meta charset="utf-8">

The second one is for HTML5 websites, and both may require case
conversion and handling of the useless ` /` at the end.  But if somebody
is using HTML5, you are pretty much guaranteed to get UTF-8.

In today’s world, the proper assumption to make is “UTF-8 or GTFO”,
because nobody in their right mind would use something else today.
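
Reading the meta charset definition could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the regular expression is a simplification and not a full HTML parser, and the names META_CHARSET, charset_from_meta and scan_limit are invented here. It covers both tag forms shown above and ignores case.

```python
import re

# Matches charset=... inside a <meta ...> tag, quoted or not, in any case.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def charset_from_meta(html_bytes, scan_limit=4096):
    """Return the charset declared in a <meta> tag near the top, or None."""
    # Only the start of the page is scanned; the declaration belongs in <head>.
    match = META_CHARSET.search(html_bytes[:scan_limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


old_style = b'<meta http-equiv="content-type" content="text/html; charset=utf-8" />'
print(charset_from_meta(old_style))
print(charset_from_meta(b'<META CHARSET="UTF-8">'))
```

The search runs on raw bytes on purpose: the declaration has to be found before the page can be decoded at all.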

-- 
Kwpolska <http://kwpolska.tk>
stop html mail      | always bottom-post
www.asciiribbon.org | www.netmeister.org/news/learn2quote.html
GPG KEY: 5EAAEA16



#35457 — Re: how to detect the character encoding in a web page ?

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2012-12-24 13:50 +0000
Subject: Re: how to detect the character encoding in a web page ?
Message-ID: <50d85daf$0$29967$c3e8da3$5496439d@news.astraweb.com>
In reply to: #35455
On Mon, 24 Dec 2012 13:16:16 +0100, Kwpolska wrote:

> On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
> <kurt.alfred.mueller@gmail.com> wrote:
>> $ wget -q -O - http://python.org/ | chardetect.py
>> stdin: ISO-8859-2 with confidence 0.803579722043
>> $
> 
> And it sucks, because it uses magic, and not reading the HTML tags. The
> RIGHT thing to do for websites is detect the meta charset definition,
> which is
> 
>     <meta http-equiv="content-type" content="text/html; charset=utf-8">
> 
> or
> 
>     <meta charset="utf-8">
> 
> The second one for HTML5 websites, and both may require case conversion
> and the useless ` /` at the end.  But if somebody is using HTML5, you
> are pretty much guaranteed to get UTF-8.
> 
> In today’s world, the proper assumption to make is “UTF-8 or GTFO”.
> Because nobody in the right mind would use something else today.

Alas, there are many, many, many, MANY websites that are created by 
people who are *not* in their right mind. To say nothing of 15 year old 
websites that use a legacy encoding. And to support those, you may need 
to guess the encoding, and for that, chardetect.py is the solution.


-- 
Steven



#35466 — Re: how to detect the character encoding in a web page ?

From: Alister <alister.ware@ntlworld.com>
Date: 2012-12-24 16:27 +0000
Subject: Re: how to detect the character encoding in a web page ?
Message-ID: <rn%Bs.693798$nB6.605938@fx21.am4>
In reply to: #35457
On Mon, 24 Dec 2012 13:50:39 +0000, Steven D'Aprano wrote:

> On Mon, 24 Dec 2012 13:16:16 +0100, Kwpolska wrote:
> 
>> On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
>> <kurt.alfred.mueller@gmail.com> wrote:
>>> $ wget -q -O - http://python.org/ | chardetect.py
>>> stdin: ISO-8859-2 with confidence 0.803579722043
>>> $
>> 
>> And it sucks, because it uses magic, and not reading the HTML tags. The
>> RIGHT thing to do for websites is detect the meta charset definition,
>> which is
>> 
>>     <meta http-equiv="content-type" content="text/html; charset=utf-8">
>> 
>> or
>> 
>>     <meta charset="utf-8">
>> 
>> The second one for HTML5 websites, and both may require case conversion
>> and the useless ` /` at the end.  But if somebody is using HTML5, you
>> are pretty much guaranteed to get UTF-8.
>> 
>> In today’s world, the proper assumption to make is “UTF-8 or GTFO”.
>> Because nobody in the right mind would use something else today.
> 
> Alas, there are many, many, many, MANY websites that are created by
> people who are *not* in their right mind. To say nothing of 15 year old
> websites that use a legacy encoding. And to support those, you may need
> to guess the encoding, and for that, chardetect.py is the solution.

Indeed, due to the poor quality of most websites, it is not possible to
be 100% accurate for all sites.

Personally, I would start by checking the doc type and then the metadata,
as these should be quick and correct; I then use chardetect only if these
fail to provide any result.
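
That order (declared values first, guessing only as a last resort) might be sketched like this. The function name first_known_encoding and the guess callback are hypothetical; the candidates are assumed to be whatever the HTTP header and the meta tag declared, and the guess callback is where a detector such as chardet could be plugged in.

```python
import codecs


def first_known_encoding(candidates, guess=None):
    """Return the first candidate Python has a codec for, else the guess."""
    for name in candidates:
        if not name:
            continue  # this source declared nothing (None or empty string)
        try:
            return codecs.lookup(name).name  # normalised codec name
        except LookupError:
            continue  # declared name is not a codec Python knows
    # Nothing usable was declared: fall back to the guesser, if one was given.
    return guess() if guess is not None else None


print(first_known_encoding([None, "UTF-8"]))
print(first_known_encoding(["no-such-charset"], guess=lambda: "gb2312"))
```

Passing the guesser as a callable means the expensive statistical detection only runs when the cheap checks come up empty.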


-- 
I have found little that is good about human beings.  In my experience
most of them are trash.
		-- Sigmund Freud



#35470 — Re: how to detect the character encoding in a web page ?

From: Roy Smith <roy@panix.com>
Date: 2012-12-24 11:46 -0500
Subject: Re: how to detect the character encoding in a web page ?
Message-ID: <roy-DF05DA.11460324122012@news.panix.com>
In reply to: #35466
In article <rn%Bs.693798$nB6.605938@fx21.am4>,
 Alister <alister.ware@ntlworld.com> wrote:

> Indeed due to the poor quality of most websites it is not possible to be 
> 100% accurate for all sites.
> 
> personally I would start by checking the doc type & then the meta data as 
> these should be quick & correct, I then use chardetect only if these 
> fail to provide any result.

I agree that checking the metadata is the right thing to do.  But, I 
wouldn't go so far as to assume it will always be correct.  There's a 
lot of crap out there with perfectly formed metadata which just happens 
to be wrong.

Although it pains me greatly to quote Ronald Reagan as a source of 
wisdom, I have to admit he got it right with "Trust, but verify".  It's 
the only way to survive in the unicode world.  Write defensive code.  
Wrap try blocks around calls that might raise exceptions if the external 
data is borked w/r/t what the metadata claims it should be.
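
A minimal sketch of that kind of defensive decoding, assuming the declared encoding has already been extracted from the header or the meta tag. The function name decode_defensively and the fallback list ("utf-8", then "cp1252") are illustrative choices, not something prescribed in this thread.

```python
def decode_defensively(data, declared, fallbacks=("utf-8", "cp1252")):
    """Try the declared encoding first, then the fallbacks; never raise."""
    for name in (declared, *fallbacks):
        if not name:
            continue
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding: try the next candidate
    # Last resort: keep going, replacing undecodable bytes with U+FFFD.
    return data.decode("utf-8", errors="replace"), "utf-8 (with replacements)"


# The page claims UTF-8 but was actually saved as cp1252.
text, used = decode_defensively("café".encode("cp1252"), "utf-8")
print(text, used)
```

Returning the encoding that actually worked, alongside the text, makes it easy to log pages whose metadata turned out to be wrong.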



#36782 — Re: how to detect the character encoding in a web page ?

From: albert@spenarnc.xs4all.nl (Albert van der Horst)
Date: 2013-01-14 12:50 +0000
Subject: Re: how to detect the character encoding in a web page ?
Message-ID: <50f3ff0f$0$6347$e4fe514c@dreader35.news.xs4all.nl>
In reply to: #35470
In article <roy-DF05DA.11460324122012@news.panix.com>,
Roy Smith  <roy@panix.com> wrote:
>In article <rn%Bs.693798$nB6.605938@fx21.am4>,
> Alister <alister.ware@ntlworld.com> wrote:
>
>> Indeed due to the poor quality of most websites it is not possible to be
>> 100% accurate for all sites.
>>
>> personally I would start by checking the doc type & then the meta data as
>> these should be quick & correct, I then use chardetect only if these
>> fail to provide any result.
>
>I agree that checking the metadata is the right thing to do.  But, I
>wouldn't go so far as to assume it will always be correct.  There's a
>lot of crap out there with perfectly formed metadata which just happens
>to be wrong.
>
>Although it pains me greatly to quote Ronald Reagan as a source of
>wisdom, I have to admit he got it right with "Trust, but verify".  It's

Not surprisingly, as an actor, Reagan was as good as his script.
This one he got from Stalin.

>the only way to survive in the unicode world.  Write defensive code.
>Wrap try blocks around calls that might raise exceptions if the external
>data is borked w/r/t what the metadata claims it should be.

The way to go, of course.

Greetings, Albert
-- 
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst



#35694

From: python培训 <51mmj.com@gmail.com>
Date: 2012-12-28 06:30 -0800
Message-ID: <5af3055a-c460-477e-90ef-72bbc0612a25@googlegroups.com>
In reply to: #35421
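On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding  in a web page ?
> such as this page
>
> http://python.org/

First, install chardet. Then:


import urllib.request
import chardet

# `line` holds the URL to fetch (assumed here to be one line read from a list of URLs)
# Fetch the page's raw HTML bytes
html_1 = urllib.request.urlopen(line, timeout=120).read()
# Let chardet guess the encoding
mychar = chardet.detect(html_1)
bianma = mychar['encoding']  # "bianma" is pinyin for "encoding"; may be None
if bianma and bianma.lower() == 'utf-8':
    # Already UTF-8: keep the bytes as they are
    html = html_1
else:
    # Decode with the detected encoding (gb2312 if nothing was detected),
    # then re-encode as UTF-8
    html = html_1.decode(bianma or 'gb2312', 'ignore').encode('utf-8')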



#36331

From: iMath <redstone-cold@163.com>
Date: 2013-01-07 01:23 -0800
Message-ID: <570913d9-d0f6-4ad4-9400-194d008a8384@googlegroups.com>
In reply to: #35421
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding  in a web page ?
> such as this page
>
> http://python.org/

So far, chardet may be the only way to have Python do it automatically.


