Groups > comp.lang.python > #47089 > unrolled thread
| Started by | iMath <redstone-cold@163.com> |
|---|---|
| First post | 2013-06-05 08:14 -0700 |
| Last post | 2013-06-06 17:14 +1000 |
| Articles | 4 — 3 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: how to detect the character encoding in a web page ? iMath <redstone-cold@163.com> - 2013-06-05 08:14 -0700
Re: how to detect the character encoding in a web page ? Chris Angelico <rosuav@gmail.com> - 2013-06-06 03:55 +1000
Re: how to detect the character encoding in a web page ? Nobody <nobody@nowhere.com> - 2013-06-06 07:22 +0100
Re: how to detect the character encoding in a web page ? Chris Angelico <rosuav@gmail.com> - 2013-06-06 17:14 +1000
| From | iMath <redstone-cold@163.com> |
|---|---|
| Date | 2013-06-05 08:14 -0700 |
| Subject | Re: how to detect the character encoding in a web page ? |
| Message-ID | <29a6b839-1e3d-42ba-acf3-a58a5fcb9f5c@googlegroups.com> |
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
> http://python.org/

By the way, we cannot get the character encoding programmatically from the metadata without knowing the character encoding ahead of time!
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-06-06 03:55 +1000 |
| Subject | Re: how to detect the character encoding in a web page ? |
| Message-ID | <mailman.2752.1370454914.3114.python-list@python.org> |
| In reply to | #47089 |
On Thu, Jun 6, 2013 at 1:14 AM, iMath <redstone-cold@163.com> wrote:
> On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
>> how to detect the character encoding in a web page ?
>>
>> such as this page
>>
>> http://python.org/
>
> by the way, we cannot get character encoding programmatically from the metadata without knowing the character encoding ahead !

The rules for web pages are (massively oversimplified):

1) HTTP header
2) ASCII-compatible encoding and meta tag

The HTTP header is completely out of band. This is the best way to transmit encoding information. Otherwise, you assume 7-bit ASCII and start parsing. Once you find a meta tag, you stop parsing and go back to the top, decoding in the new way. "ASCII-compatible" covers a huge number of encodings, so it's not actually much of a problem to do this.

ChrisA
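The two rules above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the function names, the regular expressions, and the 1024-byte scan limit are hypothetical choices made for this sketch, and the meta scan is far simpler than what real browsers do.

```python
import re

# Hypothetical patterns for this sketch; real-world parsing has many more cases.
_CHARSET_IN_HEADER = re.compile(r'charset\s*=\s*["\']?([^\s;"\']+)', re.IGNORECASE)
_CHARSET_IN_META = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE
)


def charset_from_header(content_type):
    """Rule 1: return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    match = _CHARSET_IN_HEADER.search(content_type)
    return match.group(1).lower() if match else None


def charset_from_meta(body, scan_limit=1024):
    """Rule 2: treat the first bytes as ASCII-compatible and look for a meta charset."""
    match = _CHARSET_IN_META.search(body[:scan_limit])
    return match.group(1).decode("ascii").lower() if match else None


def detect_declared_encoding(content_type, body):
    """Prefer the out-of-band HTTP header; fall back to the in-band meta tag."""
    return charset_from_header(content_type) or charset_from_meta(body)
```

The header is consulted first because it does not require decoding any of the body. Scanning raw bytes for the meta tag works only because the tag itself is written in ASCII characters.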
| From | Nobody <nobody@nowhere.com> |
|---|---|
| Date | 2013-06-06 07:22 +0100 |
| Subject | Re: how to detect the character encoding in a web page ? |
| Message-ID | <pan.2013.06.06.06.22.36.896000@nowhere.com> |
| In reply to | #47124 |
On Thu, 06 Jun 2013 03:55:11 +1000, Chris Angelico wrote:

> The HTTP header is completely out of band. This is the best way to
> transmit encoding information. Otherwise, you assume 7-bit ASCII and start
> parsing. Once you find a meta tag, you stop parsing and go back to the
> top, decoding in the new way.

Provided that the meta tag indicates an ASCII-compatible encoding, and you haven't encountered any decode errors due to 8-bit characters, then there's no need to go back to the top.

> "ASCII-compatible" covers a huge number of
> encodings, so it's not actually much of a problem to do this.

With slight modifications, you can also handle some almost-ASCII-compatible encodings such as shift-JIS.

Personally, I'd start by assuming ISO-8859-1, keep track of which bytes have actually been seen, and only re-start parsing from the top if the encoding change actually affects the interpretation of any of those bytes.

And if the encoding isn't even remotely ASCII-compatible, you aren't going to be able to recognise the meta tag in the first place. But I don't think I've ever seen a web page encoded in UTF-16 or EBCDIC.

Tools like chardet are meant for the situation where either no encoding is specified or the specified encoding can't be trusted (which is rather common; why else would web browsers have a menu to allow the user to select the encoding?).
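The byte-tracking idea above might look something like the following Python sketch. The function name `needs_reparse` is hypothetical, and the check is deliberately conservative: it compares each byte value in isolation, so any byte that cannot be decoded alone in the new encoding (for example, part of a multi-byte sequence) is treated as a reason to restart.

```python
def needs_reparse(seen_bytes, new_encoding, assumed="iso-8859-1"):
    """Return True if any byte seen so far could decode differently.

    seen_bytes   -- the raw bytes consumed before the meta tag was found
    new_encoding -- the encoding named by the meta tag
    assumed      -- the encoding the parser started with (an assumption here)
    """
    for value in set(seen_bytes):
        raw = bytes([value])
        try:
            if raw.decode(new_encoding) != raw.decode(assumed):
                return True
        except UnicodeDecodeError:
            # The byte is not valid on its own in one of the encodings,
            # so its interpretation may change: restart to be safe.
            return True
    return False
```

With this check, a page whose head contains only ASCII bytes never triggers a restart when the meta tag names UTF-8, while a stray 8-bit byte before the tag does.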
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-06-06 17:14 +1000 |
| Subject | Re: how to detect the character encoding in a web page ? |
| Message-ID | <mailman.2787.1370502849.3114.python-list@python.org> |
| In reply to | #47193 |
On Thu, Jun 6, 2013 at 4:22 PM, Nobody <nobody@nowhere.com> wrote:
> On Thu, 06 Jun 2013 03:55:11 +1000, Chris Angelico wrote:
>
>> The HTTP header is completely out of band. This is the best way to
>> transmit encoding information. Otherwise, you assume 7-bit ASCII and start
>> parsing. Once you find a meta tag, you stop parsing and go back to the
>> top, decoding in the new way.
>
> Provided that the meta tag indicates an ASCII-compatible encoding, and you
> haven't encountered any decode errors due to 8-bit characters, then
> there's no need to go back to the top.

Technically and conceptually, you go back to the start and re-parse. Sure, you might optimize that if you can, but not every parser will, hence it's advisable to put the content-type as early as possible.

>> "ASCII-compatible" covers a huge number of
>> encodings, so it's not actually much of a problem to do this.
>
> With slight modifications, you can also handle some
> almost-ASCII-compatible encodings such as shift-JIS.
>
> Personally, I'd start by assuming ISO-8859-1, keep track of which bytes
> have actually been seen, and only re-start parsing from the top if the
> encoding change actually affects the interpretation of any of those bytes.

Hrm, it'd be equally valid to guess UTF-8. But as long as you're prepared to re-parse after finding the content-type, that's just a choice of optimization and has no real impact.

ChrisA
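As a rough Python illustration of "guess, then re-parse": the sketch below decodes under an initial guess, looks for a meta charset, and decodes again from the start only if the declared encoding differs. The name `decode_with_reparse`, the regular expression, and the use of `errors="replace"` are assumptions made for this example, not anything prescribed in the thread.

```python
import codecs
import re

# Hypothetical, simplified pattern for a meta charset declaration.
_META_CHARSET = re.compile(
    r'<meta[^>]*?charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE
)


def decode_with_reparse(body, guess="utf-8"):
    """Decode under a guess, then redo the decode if a meta tag disagrees."""
    # First pass: undecodable bytes become U+FFFD so the scan can continue.
    text = body.decode(guess, errors="replace")
    match = _META_CHARSET.search(text)
    if match is None:
        return text, guess
    declared = match.group(1)
    try:
        same = codecs.lookup(declared).name == codecs.lookup(guess).name
    except LookupError:
        # Unknown encoding label: keep the first-pass result.
        return text, guess
    if same:
        return text, guess
    # Conceptually "go back to the top" and decode everything again.
    return body.decode(declared, errors="replace"), declared.lower()
```

Whether the initial guess is UTF-8 or ISO-8859-1, the final text is the same once the declaration is found; the guess only affects how often the second decode happens.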