| References | <c15bad9a-a7f7-456e-8dc5-b1af67fbdd44@googlegroups.com> <29a6b839-1e3d-42ba-acf3-a58a5fcb9f5c@googlegroups.com> |
|---|---|
| Date | 2013-06-06 03:55 +1000 |
| Subject | Re: how to detect the character encoding in a web page ? |
| From | Chris Angelico <rosuav@gmail.com> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.2752.1370454914.3114.python-list@python.org> |
On Thu, Jun 6, 2013 at 1:14 AM, iMath <redstone-cold@163.com> wrote:
> On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
>> how to detect the character encoding in a web page ?
>>
>> such as this page
>>
>> http://python.org/
>
> by the way, we cannot get the character encoding programmatically from the meta data without knowing the character encoding ahead!

The rules for web pages are (massively oversimplified):

1) HTTP header
2) ASCII-compatible encoding and meta tag

The HTTP header is completely out of band. This is the best way to transmit encoding information. Otherwise, you assume 7-bit ASCII and start parsing. Once you find a meta tag, you stop parsing and go back to the top, decoding in the new way. "ASCII-compatible" covers a huge number of encodings, so it's not actually much of a problem to do this.

ChrisA
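
The two rules above can be sketched in Python roughly as follows. This is a minimal illustration, not what browsers actually do: the function names, the 1024-byte scan limit, and the regular expression are all assumptions made for the example, and the regex is a crude stand-in for a real HTML parser. It takes the Content-Type header value and the raw body bytes, so it works with whatever HTTP client fetched the page.

```python
import re
from email.message import Message

# Simplified pattern (an assumption for this sketch): find "charset=..." inside
# a <meta ...> tag. Works on raw bytes because the markup itself is ASCII in
# any ASCII-compatible encoding.
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9._:-]+)""", re.IGNORECASE
)


def charset_from_header(content_type):
    """Rule 1: take the charset parameter from the HTTP Content-Type header."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased name, or None if absent


def charset_from_meta(body, scan_limit=1024):
    """Rule 2: scan the first bytes as ASCII-compatible and look for a meta tag."""
    match = META_CHARSET_RE.search(body[:scan_limit])
    return match.group(1).decode("ascii").lower() if match else None


def detect_encoding(content_type, body, default="ascii"):
    """The HTTP header wins; otherwise fall back to the meta tag, then a default."""
    return charset_from_header(content_type) or charset_from_meta(body) or default


if __name__ == "__main__":
    page = b'<html><head><meta charset="utf-8"><title>caf\xc3\xa9</title></head></html>'
    encoding = detect_encoding("text/html", page)
    # "Go back to the top" and decode the whole document with the detected encoding.
    print(encoding, page.decode(encoding))
```

With a real response, the first argument would be the value of the Content-Type header as reported by the HTTP library in use. Pages that declare nothing in either place still need a guess or a statistical detector, which this sketch does not attempt.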
Re: how to detect the character encoding in a web page ? iMath <redstone-cold@163.com> - 2013-06-05 08:14 -0700
Re: how to detect the character encoding in a web page ? Chris Angelico <rosuav@gmail.com> - 2013-06-06 03:55 +1000
Re: how to detect the character encoding in a web page ? Nobody <nobody@nowhere.com> - 2013-06-06 07:22 +0100
Re: how to detect the character encoding in a web page ? Chris Angelico <rosuav@gmail.com> - 2013-06-06 17:14 +1000