| From | Roy Smith <roy@panix.com> |
|---|---|
| Newsgroups | comp.lang.python |
| Subject | Re: how to detect the character encoding in a web page ? |
| Date | 2012-12-24 11:46 -0500 |
| Organization | PANIX Public Access Internet and UNIX, NYC |
| Message-ID | <roy-DF05DA.11460324122012@news.panix.com> |
| References | (1 earlier) <2324928c-32de-4f9d-8ff1-5db6dcf5543a@googlegroups.com> <5C06B25F-066B-421E-9849-2E1B2EAFFEBE@gmail.com> <mailman.1253.1356351379.29569.python-list@python.org> <50d85daf$0$29967$c3e8da3$5496439d@news.astraweb.com> <rn%Bs.693798$nB6.605938@fx21.am4> |
In article <rn%Bs.693798$nB6.605938@fx21.am4>, Alister <alister.ware@ntlworld.com> wrote:

> Indeed due to the poor quality of most websites it is not possible to be
> 100% accurate for all sites.
>
> personally I would start by checking the doc type & then the meta data as
> these should be quick & correct, I then use chardetect only if these
> fail to provide any result.

I agree that checking the metadata is the right thing to do. But I wouldn't go so far as to assume it will always be correct. There's a lot of crap out there with perfectly formed metadata which just happens to be wrong.

Although it pains me greatly to quote Ronald Reagan as a source of wisdom, I have to admit he got it right with "Trust, but verify".

It's the only way to survive in the Unicode world. Write defensive code. Wrap try blocks around calls that might raise exceptions if the external data is borked w/r/t what the metadata claims it should be.
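As a rough illustration of the "trust, but verify" advice, here is a minimal sketch using only the standard library. The function name `decode_page`, its parameters, and the fallback list are assumptions made up for this example, not anything from the thread; a detector such as chardet could be slotted in as one more fallback.

```python
def decode_page(raw, declared=None, fallbacks=("utf-8", "cp1252")):
    """Decode page bytes, trusting the declared encoding only if it works.

    Returns a (text, encoding_used) tuple. All names here are hypothetical.
    """
    candidates = []
    if declared:
        candidates.append(declared)  # trust the metadata first...
    candidates.extend(fallbacks)     # ...but keep alternatives ready

    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            # The bytes are not valid in the encoding the metadata claimed.
            continue
        except LookupError:
            # The metadata names an encoding Python does not know.
            continue

    # Last resort: never raise, mark undecodable bytes with U+FFFD instead.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


if __name__ == "__main__":
    page = "café".encode("utf-8")
    # The (made-up) metadata claims ASCII, which cannot decode the bytes.
    print(decode_page(page, declared="ascii"))
```

Note that this only catches metadata that is wrong enough to raise. Bytes that decode cleanly under the wrong encoding (UTF-8 text declared as Latin-1, say) come back as mojibake without any exception, so a content-based check is still worth having.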