Re: how to detect the character encoding in a web page ?

Started by	iMath <redstone-cold@163.com>
First post	2013-06-05 08:14 -0700
Last post	2013-06-06 17:14 +1000
Articles	4 — 3 participants

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: how to detect the character encoding  in a web page ? iMath <redstone-cold@163.com> - 2013-06-05 08:14 -0700
    Re: how to detect the character encoding in a web page ? Chris Angelico <rosuav@gmail.com> - 2013-06-06 03:55 +1000
      Re: how to detect the character encoding in a web page ? Nobody <nobody@nowhere.com> - 2013-06-06 07:22 +0100
        Re: how to detect the character encoding in a web page ? Chris Angelico <rosuav@gmail.com> - 2013-06-06 17:14 +1000

#47089 — Re: how to detect the character encoding in a web page ?

From	iMath <redstone-cold@163.com>
Date	2013-06-05 08:14 -0700
Subject	Re: how to detect the character encoding in a web page ?
Message-ID	<29a6b839-1e3d-42ba-acf3-a58a5fcb9f5c@googlegroups.com>

On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding  in a web page ?
> 
> such as this page 
> 
> 
> 
> http://python.org/

By the way, we cannot get the character encoding programmatically from the metadata without knowing the character encoding ahead of time!

#47124 — Re: how to detect the character encoding in a web page ?

From	Chris Angelico <rosuav@gmail.com>
Date	2013-06-06 03:55 +1000
Subject	Re: how to detect the character encoding in a web page ?
Message-ID	<mailman.2752.1370454914.3114.python-list@python.org>
In reply to	#47089

On Thu, Jun 6, 2013 at 1:14 AM, iMath <redstone-cold@163.com> wrote:
> On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
>> how to detect the character encoding  in a web page ?
>>
>> such as this page
>>
>>
>>
>> http://python.org/
>
> By the way, we cannot get the character encoding programmatically from the metadata without knowing the character encoding ahead of time!

The rules for web pages are (massively oversimplified):

1) HTTP header
2) ASCII-compatible encoding and meta tag

The HTTP header is completely out of band. This is the best way to
transmit encoding information. Otherwise, you assume 7-bit ASCII and
start parsing. Once you find a meta tag, you stop parsing and go back
to the top, decoding in the new way. "ASCII-compatible" covers a huge
number of encodings, so it's not actually much of a problem to do
this.
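
The two rules above could be sketched roughly as follows. This is only an
illustration under stated assumptions: the function names are hypothetical,
the regular expression is a loose stand-in for a real meta tag parser, and
limiting the scan to the first 1024 bytes is an arbitrary choice. Because the
meta tag itself is plain ASCII, it can be searched for in the raw bytes before
anything has been decoded, and the whole body is then decoded from the top.

```python
import re
from email.message import Message

# Loose approximation of a <meta> charset declaration (an assumption,
# not the full HTML prescan algorithm).
META_CHARSET = re.compile(
    rb'''<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)''', re.IGNORECASE
)


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def detect_encoding(body, content_type=None, default="ascii"):
    """Pick an encoding for raw page bytes: HTTP header first, then meta tag."""
    # 1) The HTTP header is out of band, so it wins when it names a charset.
    charset = charset_from_header(content_type)
    if charset:
        return charset
    # 2) Otherwise look for an ASCII meta tag near the top of the raw bytes.
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return default


def decode_page(body, content_type=None):
    """Decode the whole body from the top using the detected encoding."""
    encoding = detect_encoding(body, content_type)
    return body.decode(encoding, errors="replace")
```

For example, decode_page(raw_bytes, "text/html; charset=UTF-8") would trust the
header, while decode_page(raw_bytes) would fall back to the meta tag and then
to ASCII.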

ChrisA

#47193 — Re: how to detect the character encoding in a web page ?

From	Nobody <nobody@nowhere.com>
Date	2013-06-06 07:22 +0100
Subject	Re: how to detect the character encoding in a web page ?
Message-ID	<pan.2013.06.06.06.22.36.896000@nowhere.com>
In reply to	#47124

On Thu, 06 Jun 2013 03:55:11 +1000, Chris Angelico wrote:

> The HTTP header is completely out of band. This is the best way to
> transmit encoding information. Otherwise, you assume 7-bit ASCII and start
> parsing. Once you find a meta tag, you stop parsing and go back to the
> top, decoding in the new way.

Provided that the meta tag indicates an ASCII-compatible encoding, and you
haven't encountered any decode errors due to 8-bit characters, then
there's no need to go back to the top.

> "ASCII-compatible" covers a huge number of
> encodings, so it's not actually much of a problem to do this.

With slight modifications, you can also handle some
almost-ASCII-compatible encodings such as shift-JIS.

Personally, I'd start by assuming ISO-8859-1, keep track of which bytes
have actually been seen, and only re-start parsing from the top if the
encoding change actually affects the interpretation of any of those bytes.
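
A minimal sketch of that bookkeeping, assuming the parser records the set of
byte values it has consumed so far. The name needs_reparse is hypothetical,
and comparing one byte at a time is a simplification: any byte that cannot be
decoded on its own under the new encoding (a multi-byte lead byte, for
instance) is conservatively treated as "changed".

```python
def needs_reparse(seen_bytes, new_encoding, assumed="iso-8859-1"):
    """Return True if any byte value seen so far would be interpreted
    differently under new_encoding than under the assumed encoding."""
    for value in seen_bytes:
        raw = bytes([value])
        try:
            if raw.decode(new_encoding) != raw.decode(assumed):
                return True
        except UnicodeDecodeError:
            # The byte is not a complete character by itself in one of the
            # encodings, so play safe and restart from the top.
            return True
    return False


# Bytes consumed before the meta tag was found (pure ASCII in this case).
prefix = b'<html><head><meta charset="utf-8">'
seen = set(prefix)
print(needs_reparse(seen, "utf-8"))
```

If everything before the meta tag was plain ASCII and the declared encoding is
ASCII-compatible, the check comes back False and parsing can simply continue.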

And if the encoding isn't even remotely ASCII-compatible, you aren't going
to be able to recognise the meta tag in the first place. But I don't think
I've ever seen a web page encoded in UTF-16 or EBCDIC.

Tools like chardet are meant for the situation where either no encoding is
specified or the specified encoding can't be trusted (which is rather
common; why else would web browsers have a menu to allow the user to
select the encoding?).
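
To be clear, the following is not how chardet works (it relies on statistical
analysis of the bytes). It is a much cruder standard-library stand-in for the
same situation, shown only to illustrate guessing when no declaration can be
trusted: try a list of candidate encodings in order and keep the first one
that decodes without error. The function name and the candidate list are
assumptions.

```python
def guess_encoding(body, candidates=("utf-8", "iso-8859-1")):
    """Return the first candidate that decodes body strictly, else None.

    Order matters: iso-8859-1 accepts every byte sequence, so it only
    makes sense as the last resort.
    """
    for name in candidates:
        try:
            body.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None
```

A successful decode only shows that the bytes are valid in that encoding, not
that the guess is right, which is why a dedicated detector or a user-facing
override is still useful.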

#47196 — Re: how to detect the character encoding in a web page ?

From	Chris Angelico <rosuav@gmail.com>
Date	2013-06-06 17:14 +1000
Subject	Re: how to detect the character encoding in a web page ?
Message-ID	<mailman.2787.1370502849.3114.python-list@python.org>
In reply to	#47193

On Thu, Jun 6, 2013 at 4:22 PM, Nobody <nobody@nowhere.com> wrote:
> On Thu, 06 Jun 2013 03:55:11 +1000, Chris Angelico wrote:
>
>> The HTTP header is completely out of band. This is the best way to
>> transmit encoding information. Otherwise, you assume 7-bit ASCII and start
>> parsing. Once you find a meta tag, you stop parsing and go back to the
>> top, decoding in the new way.
>
> Provided that the meta tag indicates an ASCII-compatible encoding, and you
> haven't encountered any decode errors due to 8-bit characters, then
> there's no need to go back to the top.

Technically and conceptually, you go back to the start and re-parse.
Sure, you might optimize that if you can, but not every parser will,
hence it's advisable to put the content-type as early as possible.

>> "ASCII-compatible" covers a huge number of
>> encodings, so it's not actually much of a problem to do this.
>
> With slight modifications, you can also handle some
> almost-ASCII-compatible encodings such as shift-JIS.
>
> Personally, I'd start by assuming ISO-8859-1, keep track of which bytes
> have actually been seen, and only re-start parsing from the top if the
> encoding change actually affects the interpretation of any of those bytes.

Hrm, it'd be equally valid to guess UTF-8. But as long as you're
prepared to re-parse after finding the content-type, that's just a
choice of optimization and has no real impact.

ChrisA
