


Re: how to detect the character encoding in a web page ?

References <c15bad9a-a7f7-456e-8dc5-b1af67fbdd44@googlegroups.com> <29a6b839-1e3d-42ba-acf3-a58a5fcb9f5c@googlegroups.com> <mailman.2752.1370454914.3114.python-list@python.org> <pan.2013.06.06.06.22.36.896000@nowhere.com>
Date 2013-06-06 17:14 +1000
Subject Re: how to detect the character encoding in a web page ?
From Chris Angelico <rosuav@gmail.com>
Newsgroups comp.lang.python
Message-ID <mailman.2787.1370502849.3114.python-list@python.org>



On Thu, Jun 6, 2013 at 4:22 PM, Nobody <nobody@nowhere.com> wrote:
> On Thu, 06 Jun 2013 03:55:11 +1000, Chris Angelico wrote:
>
>> The HTTP header is completely out of band. This is the best way to
>> transmit encoding information. Otherwise, you assume 7-bit ASCII and start
>> parsing. Once you find a meta tag, you stop parsing and go back to the
>> top, decoding in the new way.
>
> Provided that the meta tag indicates an ASCII-compatible encoding, and you
> haven't encountered any decode errors due to 8-bit characters, then
> there's no need to go back to the top.

Technically and conceptually, you go back to the start and re-parse.
Sure, you might optimize that if you can, but not every parser will,
so it's advisable to put the content-type declaration as early in the
document as possible.
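
A minimal sketch of that "find the declaration, then decode again from
the start" approach, in Python. The names decode_page, META_CHARSET and
the 1024-byte scan window are illustrative assumptions, not taken from
any particular parser, and the regex is a rough stand-in for real HTML
parsing. It relies on the markup being readable as ASCII, so it only
covers ASCII-compatible encodings:

```python
import re

# Rough pattern for charset=... inside a <meta> tag. It runs on raw bytes,
# which works because the markup is plain ASCII in any ASCII-compatible
# encoding. (Illustrative only; a real parser is more careful.)
META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def decode_page(raw, header_charset=None, fallback="iso-8859-1"):
    """Return (text, encoding_used) for the raw bytes of an HTML page.

    Priority: HTTP header charset (out of band), then a meta tag found
    near the top of the document, then the fallback guess.
    """
    candidates = []
    if header_charset:
        candidates.append(header_charset)
    # Only look near the top; a declaration buried deep in the page is
    # exactly the case the advice above warns about.
    match = META_CHARSET.search(raw[:1024])
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates.append(fallback)

    for encoding in candidates:
        try:
            # Always decode the whole document from byte 0 ("re-parse").
            return raw.decode(encoding, errors="replace"), encoding
        except LookupError:
            # Unknown codec name: try the next candidate.
            continue
    raise LookupError("no usable encoding among %r" % (candidates,))


page = '<html><head><meta charset="utf-8"><title>caf\u00e9</title>'.encode("utf-8")
text, used = decode_page(page)
print(used, text)
```

Decoding the whole byte string again is the simple, unoptimized form of
the re-parse; skipping it when nothing would change is the optimization
being discussed.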

>> "ASCII-compatible" covers a huge number of
>> encodings, so it's not actually much of a problem to do this.
>
> With slight modifications, you can also handle some
> almost-ASCII-compatible encodings such as shift-JIS.
>
> Personally, I'd start by assuming ISO-8859-1, keep track of which bytes
> have actually been seen, and only re-start parsing from the top if the
> encoding change actually affects the interpretation of any of those bytes.

Hrm, it'd be equally valid to guess UTF-8. But as long as you're
prepared to re-parse after finding the content-type, that's just a
choice of optimization and has no real impact.
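
One way to sketch the "only restart if it matters" optimization in
Python: compare how the bytes seen so far decode under the initial
guess and under the declared encoding. The function name needs_reparse
is hypothetical, and comparing decoded strings is a simplification of
tracking individual byte values; it also assumes both codec names are
known to Python (an unknown name would raise LookupError).

```python
def needs_reparse(seen, assumed, declared):
    """Return True if the bytes seen so far would be interpreted
    differently under the declared encoding than under the initial guess.
    """
    try:
        return seen.decode(assumed) != seen.decode(declared)
    except UnicodeDecodeError:
        # One of the two encodings cannot decode these bytes at all,
        # so the earlier interpretation cannot be trusted.
        return True


# Pure-ASCII markup reads the same either way, so no restart is needed.
print(needs_reparse(b"<html><head>", "iso-8859-1", "utf-8"))

# Non-ASCII bytes seen before the declaration force a restart.
print(needs_reparse("caf\u00e9".encode("utf-8"), "iso-8859-1", "utf-8"))
```

With this check, guessing ISO-8859-1 or UTF-8 first gives the same final
result; the guess only changes how often the restart is triggered.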

ChrisA



