Groups > comp.lang.python > #35421 > unrolled thread
| Started by | iMath <redstone-cold@163.com> |
|---|---|
| First post | 2012-12-23 16:34 -0800 |
| Last post | 2013-01-07 01:23 -0800 |
| Articles | 14 — 10 participants |
how to detect the character encoding in a web page ? iMath <redstone-cold@163.com> - 2012-12-23 16:34 -0800
Re: how to detect the character encoding in a web page ? Chris Angelico <rosuav@gmail.com> - 2012-12-24 12:23 +1100
Re: how to detect the character encoding in a web page ? Hans Mulder <hansmu@xs4all.nl> - 2012-12-24 02:30 +0100
Re: how to detect the character encoding in a web page ? iMath <redstone-cold@163.com> - 2012-12-23 18:57 -0800
Re: how to detect the character encoding in a web page ? iMath <redstone-cold@163.com> - 2012-12-23 19:03 -0800
Re: how to detect the character encoding in a web page ? iMath <redstone-cold@163.com> - 2012-12-23 19:03 -0800
Re: how to detect the character encoding in a web page ? Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2012-12-24 09:34 +0100
Re: how to detect the character encoding in a web page ? Kwpolska <kwpolska@gmail.com> - 2012-12-24 13:16 +0100
Re: how to detect the character encoding in a web page ? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-12-24 13:50 +0000
Re: how to detect the character encoding in a web page ? Alister <alister.ware@ntlworld.com> - 2012-12-24 16:27 +0000
Re: how to detect the character encoding in a web page ? Roy Smith <roy@panix.com> - 2012-12-24 11:46 -0500
Re: how to detect the character encoding in a web page ? albert@spenarnc.xs4all.nl (Albert van der Horst) - 2013-01-14 12:50 +0000
Re: how to detect the character encoding in a web page ? python培训 <51mmj.com@gmail.com> - 2012-12-28 06:30 -0800
Re: how to detect the character encoding in a web page ? iMath <redstone-cold@163.com> - 2013-01-07 01:23 -0800
| From | iMath <redstone-cold@163.com> |
|---|---|
| Date | 2012-12-23 16:34 -0800 |
| Subject | how to detect the character encoding in a web page ? |
| Message-ID | <c15bad9a-a7f7-456e-8dc5-b1af67fbdd44@googlegroups.com> |
how to detect the character encoding in a web page ? such as this page http://python.org/
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-12-24 12:23 +1100 |
| Subject | Re: how to detect the character encoding in a web page ? |
| Message-ID | <mailman.1231.1356312219.29569.python-list@python.org> |
| In reply to | #35421 |
On Mon, Dec 24, 2012 at 11:34 AM, iMath <redstone-cold@163.com> wrote:
> how to detect the character encoding in a web page ?
> such as this page
>
> http://python.org/

You read part-way into the page, where you find this:

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

That tells you that the character set is UTF-8.

ChrisA
| From | Hans Mulder <hansmu@xs4all.nl> |
|---|---|
| Date | 2012-12-24 02:30 +0100 |
| Message-ID | <50d7b047$0$6963$e4fe514c@news2.news.xs4all.nl> |
| In reply to | #35421 |
On 24/12/12 01:34:47, iMath wrote:
> how to detect the character encoding in a web page ?

That depends on the site: different sites indicate their encoding differently.

> such as this page: http://python.org/

If you download that page and look at the HTML code, you'll find a line:

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

So it's encoded as utf-8.

Other sites declare their charset in the Content-Type HTTP header line. And then there are sites relying on the default. And sites that get it wrong, and send data in a different encoding from what they declare.

Welcome to the real world,

-- HansM
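For sites that declare the charset in the Content-Type HTTP header, the parameter can be pulled out with the standard library alone. This is only a minimal Python 3 sketch, not something posted in the thread; the helper name `charset_from_content_type` is made up for illustration, and it works on a header string so that no network access is needed.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset()


if __name__ == "__main__":
    print(charset_from_content_type("text/html; charset=UTF-8"))
    print(charset_from_content_type("text/html"))
```

On a live response from `urllib.request.urlopen`, the equivalent call should be `response.headers.get_content_charset()`, since the headers object is built on the same `Message` class. A missing charset means the server left the decision to the page itself or to a default.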
| From | iMath <redstone-cold@163.com> |
|---|---|
| Date | 2012-12-23 18:57 -0800 |
| Message-ID | <212044ed-396f-4b2d-acec-8832e31723ad@googlegroups.com> |
| In reply to | #35421 |
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
> http://python.org/

But how can I let Python do it for me? For example, how can the character encoding of this page be detected from Python?

http://python.org/
| From | iMath <redstone-cold@163.com> |
|---|---|
| Date | 2012-12-23 19:03 -0800 |
| Message-ID | <10a96dbc-40e2-43ee-acb9-88ebafec7bd5@googlegroups.com> |
| In reply to | #35421 |
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
> http://python.org/

But how can I let Python do it for me? For example, how can the character encoding of these 2 pages be detected from Python?

http://python.org/
http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx
| From | iMath <redstone-cold@163.com> |
|---|---|
| Date | 2012-12-23 19:03 -0800 |
| Message-ID | <2324928c-32de-4f9d-8ff1-5db6dcf5543a@googlegroups.com> |
| In reply to | #35421 |
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
> http://python.org/

But how can I let Python do it for me? For example, how can the character encoding of these 2 pages be detected from Python?

http://python.org/
http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx
| From | Kurt Mueller <kurt.alfred.mueller@gmail.com> |
|---|---|
| Date | 2012-12-24 09:34 +0100 |
| Message-ID | <mailman.1245.1356338098.29569.python-list@python.org> |
| In reply to | #35434 |
On 24.12.2012 at 04:03, iMath wrote:
> but how to let python do it for you ?
> such as these 2 pages
> http://python.org/
> http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx
> how to detect the character encoding in these 2 pages by python ?

If you have the HTML code, let chardetect.py make an educated guess for you.

http://pypi.python.org/pypi/chardet

Example:

$ wget -q -O - http://python.org/ | chardetect.py
stdin: ISO-8859-2 with confidence 0.803579722043
$
$ wget -q -O - 'http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx' | chardetect.py
stdin: utf-8 with confidence 0.87625
$

Grüessli
--
kurt.alfred.mueller@gmail.com
| From | Kwpolska <kwpolska@gmail.com> |
|---|---|
| Date | 2012-12-24 13:16 +0100 |
| Subject | Re: how to detect the character encoding in a web page ? |
| Message-ID | <mailman.1253.1356351379.29569.python-list@python.org> |
| In reply to | #35434 |
On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
<kurt.alfred.mueller@gmail.com> wrote:
> $ wget -q -O - http://python.org/ | chardetect.py
> stdin: ISO-8859-2 with confidence 0.803579722043
> $
And it sucks, because it uses magic, and not reading the HTML tags.
The RIGHT thing to do for websites is detect the meta charset
definition, which is
<meta http-equiv="content-type" content="text/html; charset=utf-8">
or
<meta charset="utf-8">
The second one for HTML5 websites, and both may require case
conversion and the useless ` /` at the end. But if somebody is using
HTML5, you are pretty much guaranteed to get UTF-8.
In today’s world, the proper assumption to make is “UTF-8 or GTFO”.
Because nobody in the right mind would use something else today.
--
Kwpolska <http://kwpolska.tk>
stop html mail | always bottom-post
www.asciiribbon.org | www.netmeister.org/news/learn2quote.html
GPG KEY: 5EAAEA16
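The meta-tag lookup described in this post can be sketched in Python 3 roughly as follows. This is an illustration under assumptions rather than a full HTML parser: the name `charset_from_meta`, the regular expression, and the 4096-byte scan limit are all choices made for this example. Matching is case-insensitive and does not care about a trailing ` /`, which covers both tag forms quoted above.

```python
import re

# Matches charset=... inside a <meta ...> tag, in either the http-equiv form
# or the HTML5 <meta charset="..."> form; quotes around the value are optional.
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)",
    re.IGNORECASE,
)


def charset_from_meta(html_bytes, scan_limit=4096):
    """Return the lowercased charset declared in a meta tag, or None."""
    # The declaration is expected near the top, so only the head of the
    # document is scanned (the limit is an assumption of this sketch).
    match = META_CHARSET.search(html_bytes[:scan_limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


if __name__ == "__main__":
    page = b'<meta http-equiv="content-type" content="text/html; charset=utf-8" />'
    print(charset_from_meta(page))
```

The function works on raw bytes because the encoding is not yet known at this point; the tag itself is ASCII in any ASCII-compatible encoding.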
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-12-24 13:50 +0000 |
| Subject | Re: how to detect the character encoding in a web page ? |
| Message-ID | <50d85daf$0$29967$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #35455 |
On Mon, 24 Dec 2012 13:16:16 +0100, Kwpolska wrote:

> On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
> <kurt.alfred.mueller@gmail.com> wrote:
>> $ wget -q -O - http://python.org/ | chardetect.py stdin: ISO-8859-2
>> with confidence 0.803579722043 $
>
> And it sucks, because it uses magic, and not reading the HTML tags. The
> RIGHT thing to do for websites is detect the meta charset definition,
> which is
>
> <meta http-equiv="content-type" content="text/html; charset=utf-8">
>
> or
>
> <meta charset="utf-8">
>
> The second one for HTML5 websites, and both may require case conversion
> and the useless ` /` at the end. But if somebody is using HTML5, you
> are pretty much guaranteed to get UTF-8.
>
> In today’s world, the proper assumption to make is “UTF-8 or GTFO”.
> Because nobody in the right mind would use something else today.

Alas, there are many, many, many, MANY websites that are created by people who are *not* in their right mind. To say nothing of 15 year old websites that use a legacy encoding. And to support those, you may need to guess the encoding, and for that, chardetect.py is the solution.

--
Steven
| From | Alister <alister.ware@ntlworld.com> |
|---|---|
| Date | 2012-12-24 16:27 +0000 |
| Subject | Re: how to detect the character encoding in a web page ? |
| Message-ID | <rn%Bs.693798$nB6.605938@fx21.am4> |
| In reply to | #35457 |
On Mon, 24 Dec 2012 13:50:39 +0000, Steven D'Aprano wrote:

> On Mon, 24 Dec 2012 13:16:16 +0100, Kwpolska wrote:
>
>> On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
>> <kurt.alfred.mueller@gmail.com> wrote:
>>> $ wget -q -O - http://python.org/ | chardetect.py stdin: ISO-8859-2
>>> with confidence 0.803579722043 $
>>
>> And it sucks, because it uses magic, and not reading the HTML tags. The
>> RIGHT thing to do for websites is detect the meta charset definition,
>> which is
>>
>> <meta http-equiv="content-type" content="text/html; charset=utf-8">
>>
>> or
>>
>> <meta charset="utf-8">
>>
>> The second one for HTML5 websites, and both may require case conversion
>> and the useless ` /` at the end. But if somebody is using HTML5, you
>> are pretty much guaranteed to get UTF-8.
>>
>> In today’s world, the proper assumption to make is “UTF-8 or GTFO”.
>> Because nobody in the right mind would use something else today.
>
> Alas, there are many, many, many, MANY websites that are created by
> people who are *not* in their right mind. To say nothing of 15 year old
> websites that use a legacy encoding. And to support those, you may need
> to guess the encoding, and for that, chardetect.py is the solution.

Indeed, due to the poor quality of most websites, it is not possible to be 100% accurate for all sites.

Personally, I would start by checking the doc type and then the metadata, as these should be quick and correct; I then use chardetect only if these fail to provide any result.

--
I have found little that is good about human beings. In my experience most of them are trash.
-- Sigmund Freud
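The order of checks suggested here (declared information first, guessing only as a last resort) might look like the Python 3 sketch below. The function name `pick_encoding` and its parameters are hypothetical. The `guess` argument is assumed to be a callable shaped like `chardet.detect`, which takes bytes and returns a dict with an `"encoding"` key; it is passed in so the sketch runs even without chardet installed.

```python
def pick_encoding(header_charset, meta_charset, data=b"", guess=None):
    """Pick an encoding name and report where it came from.

    Declared values are tried first (HTTP header, then meta tag);
    the guesser is consulted only when neither is available.
    """
    for candidate in (header_charset, meta_charset):
        if candidate:
            return candidate.lower(), "declared"
    if guess is not None:
        guessed = guess(data).get("encoding")
        if guessed:
            return guessed.lower(), "guessed"
    return None, "unknown"


if __name__ == "__main__":
    # Stand-in for chardet.detect so the example has no third-party dependency.
    def fake_guess(raw):
        return {"encoding": "ISO-8859-2", "confidence": 0.8}

    print(pick_encoding(None, "UTF-8"))
    print(pick_encoding(None, None, b"...", guess=fake_guess))
```

Returning the source alongside the name lets the caller treat a guessed encoding with more suspicion than a declared one.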
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2012-12-24 11:46 -0500 |
| Subject | Re: how to detect the character encoding in a web page ? |
| Message-ID | <roy-DF05DA.11460324122012@news.panix.com> |
| In reply to | #35466 |
In article <rn%Bs.693798$nB6.605938@fx21.am4>,
Alister <alister.ware@ntlworld.com> wrote:

> Indeed due to the poor quality of most websites it is not possible to be
> 100% accurate for all sites.
>
> personally I would start by checking the doc type & then the meta data as
> these should be quick & correct, I then use chardectect only if these
> fail to provide any result.

I agree that checking the metadata is the right thing to do. But, I wouldn't go so far as to assume it will always be correct. There's a lot of crap out there with perfectly formed metadata which just happens to be wrong.

Although it pains me greatly to quote Ronald Reagan as a source of wisdom, I have to admit he got it right with "Trust, but verify". It's the only way to survive in the unicode world. Write defensive code. Wrap try blocks around calls that might raise exceptions if the external data is borked w/r/t what the metadata claims it should be.
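One way to read "trust, but verify" in code is to try the declared encoding strictly and catch the failure instead of assuming the metadata is right. The Python 3 sketch below is only an illustration; the name `decode_with_fallback` and the fallback list are assumptions, not something from this thread.

```python
def decode_with_fallback(data, declared, fallbacks=("utf-8", "cp1252")):
    """Decode bytes, trusting the declared encoding only if it really works.

    Returns (text, encoding_used).
    """
    for encoding in [declared, *fallbacks]:
        if not encoding:
            continue
        try:
            # Strict decoding raises if the bytes do not fit the encoding.
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the metadata lied about the bytes.
            # LookupError: the declared name is not a codec Python knows.
            continue
    # latin-1 maps every byte to a character, so this last resort cannot fail,
    # although the result may be mojibake.
    return data.decode("latin-1"), "latin-1"


if __name__ == "__main__":
    raw = "café".encode("cp1252")  # bytes that are not valid UTF-8
    print(decode_with_fallback(raw, "utf-8"))
```

A successful strict decode does not prove the declaration was right, since many single-byte encodings accept any input, but a failure does prove it was wrong.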
| From | albert@spenarnc.xs4all.nl (Albert van der Horst) |
|---|---|
| Date | 2013-01-14 12:50 +0000 |
| Subject | Re: how to detect the character encoding in a web page ? |
| Message-ID | <50f3ff0f$0$6347$e4fe514c@dreader35.news.xs4all.nl> |
| In reply to | #35470 |
In article <roy-DF05DA.11460324122012@news.panix.com>,
Roy Smith <roy@panix.com> wrote:
>In article <rn%Bs.693798$nB6.605938@fx21.am4>,
> Alister <alister.ware@ntlworld.com> wrote:
>
>> Indeed due to the poor quality of most websites it is not possible to be
>> 100% accurate for all sites.
>>
>> personally I would start by checking the doc type & then the meta data as
>> these should be quick & correct, I then use chardectect only if these
>> fail to provide any result.
>
>I agree that checking the metadata is the right thing to do. But, I
>wouldn't go so far as to assume it will always be correct. There's a
>lot of crap out there with perfectly formed metadata which just happens
>to be wrong.
>
>Although it pains me greatly to quote Ronald Reagan as a source of
>wisdom, I have to admit he got it right with "Trust, but verify". It's

Not surprisingly, as an actor, Reagan was as good as his script. This one he got from Stalin.

>the only way to survive in the unicode world. Write defensive code.
>Wrap try blocks around calls that might raise exceptions if the external
>data is borked w/r/t what the metadata claims it should be.

The way to go, of course.

Groetjes Albert

--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
| From | python培训 <51mmj.com@gmail.com> |
|---|---|
| Date | 2012-12-28 06:30 -0800 |
| Message-ID | <5af3055a-c460-477e-90ef-72bbc0612a25@googlegroups.com> |
| In reply to | #35421 |
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
>
>
> http://python.org/
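First install chardet, then:

import urllib2  # Python 2; on Python 3 use urllib.request instead
import chardet

# Fetch the page HTML (`line` holds the URL of the page)
html_1 = urllib2.urlopen(line, timeout=120).read()
# print html_1
mychar = chardet.detect(html_1)
# print mychar
bianma = mychar['encoding']  # detected encoding name; may be None
if bianma and bianma.lower() == 'utf-8':
    # Already UTF-8: keep the bytes unchanged
    html = html_1
else:
    # Anything else is assumed to be GB2312 and re-encoded as UTF-8
    html = html_1.decode('gb2312', 'ignore').encode('utf-8')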
| From | iMath <redstone-cold@163.com> |
|---|---|
| Date | 2013-01-07 01:23 -0800 |
| Message-ID | <570913d9-d0f6-4ad4-9400-194d008a8384@googlegroups.com> |
| In reply to | #35421 |
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
> http://python.org/

Up to now, maybe chardet is the only way to let Python do it automatically.