Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #47124

Re: how to detect the character encoding in a web page ?

References <c15bad9a-a7f7-456e-8dc5-b1af67fbdd44@googlegroups.com> <29a6b839-1e3d-42ba-acf3-a58a5fcb9f5c@googlegroups.com>
Date 2013-06-06 03:55 +1000
Subject Re: how to detect the character encoding in a web page ?
From Chris Angelico <rosuav@gmail.com>
Newsgroups comp.lang.python
Message-ID <mailman.2752.1370454914.3114.python-list@python.org> (permalink)

Show all headers | View raw


On Thu, Jun 6, 2013 at 1:14 AM, iMath <redstone-cold@163.com> wrote:
> 在 2012年12月24日星期一UTC+8上午8时34分47秒,iMath写道:
>> how to detect the character encoding  in a web page ?
>>
>> such as this page
>>
>>
>>
>> http://python.org/
>
> by the way  ,we cannot get character encoding programmatically from the mate data without knowing the  character encoding  ahead !

The rules for web pages are (massively oversimplified):

1) HTTP header
2) ASCII-compatible encoding and meta tag

The HTTP header is completely out of band. This is the best way to
transmit encoding information. Otherwise, you assume 7-bit ASCII and
start parsing. Once you find a meta tag, you stop parsing and go back
to the top, decoding in the new way. "ASCII-compatible" covers a huge
number of encodings, so it's not actually much of a problem to do
this.

ChrisA

Back to comp.lang.python | Previous | NextPrevious in thread | Next in thread | Find similar | Unroll thread


Thread

Re: how to detect the character encoding  in a web page ? iMath <redstone-cold@163.com> - 2013-06-05 08:14 -0700
  Re: how to detect the character encoding in a web page ? Chris Angelico <rosuav@gmail.com> - 2013-06-06 03:55 +1000
    Re: how to detect the character encoding in a web page ? Nobody <nobody@nowhere.com> - 2013-06-06 07:22 +0100
      Re: how to detect the character encoding in a web page ? Chris Angelico <rosuav@gmail.com> - 2013-06-06 17:14 +1000

csiph-web