| References | <c15bad9a-a7f7-456e-8dc5-b1af67fbdd44@googlegroups.com> <2324928c-32de-4f9d-8ff1-5db6dcf5543a@googlegroups.com> <5C06B25F-066B-421E-9849-2E1B2EAFFEBE@gmail.com> |
|---|---|
| Date | 2012-12-24 13:16 +0100 |
| Subject | Re: how to detect the character encoding in a web page ? |
| From | Kwpolska <kwpolska@gmail.com> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.1253.1356351379.29569.python-list@python.org> |
On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
<kurt.alfred.mueller@gmail.com> wrote:
> $ wget -q -O - http://python.org/ | chardetect.py
> stdin: ISO-8859-2 with confidence 0.803579722043
> $
And it sucks, because it uses magic instead of reading the HTML tags.
The RIGHT thing to do for websites is to detect the meta charset
declaration, which is either
<meta http-equiv="content-type" content="text/html; charset=utf-8">
or
<meta charset="utf-8">
The second one is for HTML5 websites. For both, you may need to match
case-insensitively and allow for the useless ` /` at the end. But if
somebody is using HTML5, you are pretty much guaranteed to get UTF-8.
In today’s world, the proper assumption to make is “UTF-8 or GTFO”,
because nobody in their right mind would use something else today.
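
A rough sketch of that approach, using only the standard library's
html.parser (which already lowercases tag and attribute names and
copes with the trailing ` /`). The names MetaCharsetParser and
detect_meta_charset, the 4096-byte scan limit and the UTF-8 fallback
are my own assumptions for illustration, not anything standardized:

```python
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Remember the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us, and
        # self-closing tags (<meta ... />) end up here as well.
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        # HTML5 form: <meta charset="utf-8">
        if attrs.get("charset"):
            self.charset = attrs["charset"].strip().lower()
            return
        # Older form: <meta http-equiv="content-type" content="...; charset=...">
        http_equiv = (attrs.get("http-equiv") or "").lower()
        if http_equiv != "content-type":
            return
        for part in (attrs.get("content") or "").split(";"):
            name, _, value = part.strip().partition("=")
            value = value.strip().strip("\"'")
            if name.strip().lower() == "charset" and value:
                self.charset = value.lower()
                return


def detect_meta_charset(raw, default="utf-8"):
    """Return the charset named in the page's meta tags, or `default`.

    `raw` is the undecoded page as bytes. Only the first 4096 bytes are
    scanned (an arbitrary limit chosen for this sketch).
    """
    # latin-1 maps every byte to a character, so decoding cannot fail,
    # and the ASCII markup we care about comes through unchanged.
    parser = MetaCharsetParser()
    parser.feed(raw[:4096].decode("latin-1"))
    return parser.charset or default
```

Usage would be something like detect_meta_charset(page_bytes), then
page_bytes.decode() with whatever it returns. If no meta tag is found
you get the default back, which is "UTF-8 or GTFO" in code form.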
--
Kwpolska <http://kwpolska.tk>
stop html mail | always bottom-post
www.asciiribbon.org | www.netmeister.org/news/learn2quote.html
GPG KEY: 5EAAEA16