| Path | csiph.com!usenet.pasdenom.info!gegeweb.org!de-l.enfer-du-nord.net!feeder1.enfer-du-nord.net!feeds.phibee-telecom.net!dedekind.zen.co.uk!zen.net.uk!hamilton.zen.co.uk!reader01.nrc01.news.zen.net.uk.POSTED!not-for-mail |
|---|---|
| From | Nobody <nobody@nowhere.com> |
| Subject | Re: latin1 and cp1252 inconsistent? |
| Date | Sat, 17 Nov 2012 00:33:14 +0000 |
| User-Agent | Pan/0.14.2 (This is not a psychotic episode. It's a cleansing moment of clarity.) |
| Message-Id | <pan.2012.11.17.00.33.13.539000@nowhere.com> |
| Newsgroups | comp.lang.python |
| References | <f063ebaf-89ee-4558-a762-0241efa39dcc@googlegroups.com> |
| MIME-Version | 1.0 |
| Content-Type | text/plain; charset=UTF-8 |
| Content-Transfer-Encoding | 8bit |
| Lines | 28 |
| Organization | Zen Internet |
| NNTP-Posting-Host | 26512efb.news.zen.co.uk |
| X-Trace | DXC=el`EB1gEaVHeI3m:Khi9dJa0UP_O8AJoL=dR0\ckLKG@WeZ<[7LZNRF_\XEijnVSgMM2Z^cWRFGAK3SW@Ok;2`dO |
| X-Complaints-To | abuse@zen.co.uk |
| Xref | csiph.com comp.lang.python:33461 |
On Fri, 16 Nov 2012 13:44:03 -0800, buck wrote:

> When a user agent [browser] would otherwise use a character encoding given
> in the first column [ISO-8859-1, aka latin1] of the following table to
> either convert content to Unicode characters or convert Unicode characters
> to bytes, it must instead use the encoding given in the cell in the second
> column of the same row [windows-1252, aka cp1252].

It goes on to say:

  The requirement to treat certain encodings as other encodings according to the table above is a willful violation of the W3C Character Model specification, motivated by a desire for compatibility with legacy content. [CHARMOD]

IOW: Microsoft's "embrace, extend, extinguish" strategy has been too successful and now we have to deal with it. If HTML content is tagged as using ISO-8859-1, it's more likely that it's actually Windows-1252 content generated by someone who doesn't know the difference.

Given that the only differences between the two are for code points which are in the C1 range (0x80-0x9F), which should never occur in HTML, parsing ISO-8859-1 as Windows-1252 should be harmless.

If you need to support either, you can parse it as ISO-8859-1 then explicitly convert C1 codes to their Windows-1252 equivalents as a post-processing step, e.g. using the .translate() method.
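The post-processing step described above could be sketched roughly as follows in Python 3. This is only an illustration, not the poster's code: the names `build_c1_to_cp1252_table`, `C1_TO_CP1252` and `decode_latin1_as_cp1252` are made up for the example. It also assumes that the five byte values Python's `cp1252` codec leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) should simply stay as C1 control characters.

```python
def build_c1_to_cp1252_table():
    """Map each C1 code point (U+0080..U+009F) to its Windows-1252 character.

    Hypothetical helper for illustration. Byte values that Python's cp1252
    codec does not define are left out, so translate() keeps them unchanged.
    """
    table = {}
    for byte in range(0x80, 0xA0):
        try:
            # Decode the single byte as cp1252 to find its intended character.
            table[byte] = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            # Undefined in cp1252: keep the C1 control character as-is.
            pass
    return table


# Built once; str.translate() accepts a dict of code point -> replacement string.
C1_TO_CP1252 = build_c1_to_cp1252_table()


def decode_latin1_as_cp1252(data):
    """Decode bytes as ISO-8859-1, then remap C1 codes to Windows-1252."""
    text = data.decode("latin-1")  # never fails: every byte maps to a code point
    return text.translate(C1_TO_CP1252)


if __name__ == "__main__":
    sample = b"\x93quoted\x94 caf\xe9"
    print(decode_latin1_as_cp1252(sample))
```

Decoding as ISO-8859-1 first means the decode step cannot raise, which is the practical difference from calling `.decode("cp1252")` directly on input that contains one of the undefined byte values.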
latin1 and cp1252 inconsistent? buck@yelp.com - 2012-11-16 13:44 -0800
Re: latin1 and cp1252 inconsistent? Ian Kelly <ian.g.kelly@gmail.com> - 2012-11-16 15:33 -0700
Re: latin1 and cp1252 inconsistent? buck@yelp.com - 2012-11-16 15:27 -0800
Re: latin1 and cp1252 inconsistent? Dave Angel <d@davea.name> - 2012-11-16 19:05 -0500
Re: latin1 and cp1252 inconsistent? Ian Kelly <ian.g.kelly@gmail.com> - 2012-11-16 17:20 -0700
Re: latin1 and cp1252 inconsistent? Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-11-18 01:48 -0500
Re: latin1 and cp1252 inconsistent? buck@yelp.com - 2012-11-16 15:27 -0800
Re: latin1 and cp1252 inconsistent? Nobody <nobody@nowhere.com> - 2012-11-17 00:33 +0000
Re: latin1 and cp1252 inconsistent? Ian Kelly <ian.g.kelly@gmail.com> - 2012-11-16 18:08 -0700
Re: latin1 and cp1252 inconsistent? buck@yelp.com - 2012-11-17 08:56 -0800
Re: latin1 and cp1252 inconsistent? Ian Kelly <ian.g.kelly@gmail.com> - 2012-11-17 11:08 -0700
Re: latin1 and cp1252 inconsistent? Ian Kelly <ian.g.kelly@gmail.com> - 2012-11-17 11:13 -0700
Re: latin1 and cp1252 inconsistent? Nobody <nobody@nowhere.com> - 2012-11-17 19:15 +0000