Re: how to detect the character encoding in a web page ?

From Roy Smith <roy@panix.com>
Newsgroups comp.lang.python
Subject Re: how to detect the character encoding in a web page ?
Date 2012-12-24 11:46 -0500
Organization PANIX Public Access Internet and UNIX, NYC
Message-ID <roy-DF05DA.11460324122012@news.panix.com>
References (1 earlier) <2324928c-32de-4f9d-8ff1-5db6dcf5543a@googlegroups.com> <5C06B25F-066B-421E-9849-2E1B2EAFFEBE@gmail.com> <mailman.1253.1356351379.29569.python-list@python.org> <50d85daf$0$29967$c3e8da3$5496439d@news.astraweb.com> <rn%Bs.693798$nB6.605938@fx21.am4>



In article <rn%Bs.693798$nB6.605938@fx21.am4>,
 Alister <alister.ware@ntlworld.com> wrote:

> Indeed due to the poor quality of most websites it is not possible to be 
> 100% accurate for all sites.
> 
> Personally I would start by checking the doc type & then the meta data, as 
> these should be quick & correct; I then use chardet only if these 
> fail to provide any result.

I agree that checking the metadata is the right thing to do.  But I 
wouldn't go so far as to assume it will always be correct.  There's a 
lot of crap out there with perfectly formed metadata which just happens 
to be wrong.

Although it pains me greatly to quote Ronald Reagan as a source of 
wisdom, I have to admit he got it right with "Trust, but verify".  It's 
the only way to survive in the Unicode world.  Write defensive code.  
Wrap try blocks around calls that might raise exceptions if the external 
data is borked w/r/t what the metadata claims it should be.
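
As a rough illustration of the "trust, but verify" idea, here is a minimal
sketch in Python.  The function name decode_page, its parameters, and the
choice of fallback encodings are all made up for this example; they are not
from any library or from the thread.  It tries the encoding the metadata
claims first, and only moves on if decoding actually fails.

```python
def decode_page(raw, declared=None, fallbacks=("utf-8", "cp1252")):
    """Decode raw bytes, trusting the declared encoding only if it works.

    Returns (text, encoding_used).  `declared` is whatever the page's
    metadata claims; `fallbacks` is an assumed, arbitrary list of guesses.
    """
    candidates = ([declared] if declared else []) + list(fallbacks)
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            # The bytes are not valid in the encoding we were told to use.
            continue
        except LookupError:
            # The metadata names an encoding Python does not know about.
            continue
    # Last resort: latin-1 maps every byte to a code point, so this
    # cannot raise, although the result may be garbage.
    return raw.decode("latin-1"), "latin-1"


if __name__ == "__main__":
    # Bytes that are really latin-1, with metadata wrongly claiming UTF-8.
    body = "caf\u00e9".encode("latin-1")
    text, used = decode_page(body, declared="utf-8")
    print(used, text)
```

Note that a decode which succeeds is not proof that the encoding was right;
many byte strings are valid in several encodings.  This only catches the
cases where the claimed encoding is flatly impossible for the data.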
