Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #47193

Re: how to detect the character encoding in a web page ?

From Nobody <nobody@nowhere.com>
Subject Re: how to detect the character encoding in a web page ?
Date 2013-06-06 07:22 +0100
Message-Id <pan.2013.06.06.06.22.36.896000@nowhere.com>
Newsgroups comp.lang.python
References <c15bad9a-a7f7-456e-8dc5-b1af67fbdd44@googlegroups.com> <29a6b839-1e3d-42ba-acf3-a58a5fcb9f5c@googlegroups.com> <mailman.2752.1370454914.3114.python-list@python.org>
Organization Zen Internet

Show all headers | View raw


On Thu, 06 Jun 2013 03:55:11 +1000, Chris Angelico wrote:

> The HTTP header is completely out of band. This is the best way to
> transmit encoding information. Otherwise, you assume 7-bit ASCII and start
> parsing. Once you find a meta tag, you stop parsing and go back to the
> top, decoding in the new way.

Provided that the meta tag indicates an ASCII-compatible encoding, and you
haven't encountered any decode errors due to 8-bit characters, then
there's no need to go back to the top.

> "ASCII-compatible" covers a huge number of
> encodings, so it's not actually much of a problem to do this.

With slight modifications, you can also handle some
almost-ASCII-compatible encodings such as shift-JIS.

Personally, I'd start by assuming ISO-8859-1, keep track of which bytes
have actually been seen, and only re-start parsing from the top if the
encoding change actually affects the interpretation of any of those bytes.

And if the encoding isn't even remotely ASCII-compatible, you aren't going
to be able to recognise the meta tag in the first place. But I don't think
I've ever seen a web page encoded in UTF-16 or EBCDIC.

Tools like chardet are meant for the situation where either no encoding is
specified or the specified encoding can't be trusted (which is rather
common; why else would web browsers have a menu to allow the user to
select the encoding?).

Back to comp.lang.python | Previous | NextPrevious in thread | Next in thread | Find similar | Unroll thread


Thread

Re: how to detect the character encoding  in a web page ? iMath <redstone-cold@163.com> - 2013-06-05 08:14 -0700
  Re: how to detect the character encoding in a web page ? Chris Angelico <rosuav@gmail.com> - 2013-06-06 03:55 +1000
    Re: how to detect the character encoding in a web page ? Nobody <nobody@nowhere.com> - 2013-06-06 07:22 +0100
      Re: how to detect the character encoding in a web page ? Chris Angelico <rosuav@gmail.com> - 2013-06-06 17:14 +1000

csiph-web