To: python-list@python.org
From: Terry Reedy <tjreedy@udel.edu>
Subject: Re: How do I display unicode value stored in a string variable using ord()
Date: Sun, 19 Aug 2012 17:59:43 -0400
Newsgroups: comp.lang.python
Message-ID: <mailman.3529.1345413613.4697.python-list@python.org>

On 8/19/2012 2:11 PM, wxjmfauth@gmail.com wrote:

> Well, it seems some software producers know what they
> are doing.
>
>>>> '€'.encode('cp1252')
> b'\x80'
>>>> '€'.encode('mac-roman')
> b'\xdb'
>>>> '€'.encode('iso-8859-1')
> Traceback (most recent call last):
>    File "<eta last command>", line 1, in <module>
> UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac'
> in position 0: ordinal not in range(256)

Yes, Python lets you choose your byte encoding from those and a hundred
others. I believe all the codecs are now tested in both directions. It
was not an easy task.
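
To make "both directions" concrete, here is a minimal sketch of a round trip through a few codecs. The helper name round_trip is made up for illustration, and utf-8 is added to the three codecs from the quoted session; this is not how the actual codec test suite is written.

```python
# Round-trip the euro sign through a few of Python's byte codecs.
# round_trip is a hypothetical helper, not part of any library.
def round_trip(text, codec):
    """Encode text with codec, check that decoding restores it, return the bytes."""
    data = text.encode(codec)
    assert data.decode(codec) == text  # decoding must give back the original
    return data

results = {}
for codec in ("cp1252", "mac-roman", "utf-8", "iso-8859-1"):
    try:
        results[codec] = round_trip("\u20ac", codec)  # U+20AC is the euro sign
    except UnicodeEncodeError:
        results[codec] = None  # this codec has no euro sign

for codec, data in results.items():
    print(codec, data)
```

Latin-1 ends up as None here, matching the traceback in the quoted session.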

As to the examples: Latin-1 dates to 1985 and before, and the 1988
version was published as a standard in 1992.
https://en.wikipedia.org/wiki/Latin-1
"The name euro was officially adopted on 16 December 1995."
https://en.wikipedia.org/wiki/Euro
No wonder Latin-1 does not contain the Euro sign. International
standards organizations' standards are relatively fixed. (The Unicode
Consortium will not even correct misspelled character names.) Instead,
new standards with a new number are adopted.

For better or worse, private mappings are more flexible. In its Mac
mapping Apple "replaced the generic currency sign ¤ with the euro sign
€". (See Latin-1 reference.) Great if you use Euros, not so great if you
were using the previous sign for something else.
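
A small sketch of what that replacement means in practice, assuming Python's mac-roman codec follows the updated Apple mapping (the quoted session suggests it does). The variable names are illustrative only.

```python
# A byte written when 0xDB still meant the generic currency sign.
old_byte = b"\xdb"

# With the updated mapping, the same byte is shown as the euro sign.
shown_as = old_byte.decode("mac-roman")
print(repr(shown_as))

# Try to write the generic currency sign (U+00A4) with the current mapping.
try:
    currency_bytes = "\u00a4".encode("mac-roman")
except UnicodeEncodeError:
    currency_bytes = None  # no code left for the old sign
print(currency_bytes)
```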

Microsoft reassigned a code it did not need (0x80, per the session
above) to the Euro in Windows cp1252.
https://en.wikipedia.org/wiki/Windows-1252
"It is very common to mislabel Windows-1252 text with the charset label
ISO-8859-1. A common result was that all the quotes and apostrophes
(produced by "smart quotes" in Microsoft software) were replaced with
question marks or boxes on non-Windows operating systems, making text
difficult to read. Most modern web browsers and e-mail clients treat the
MIME charset ISO-8859-1 as Windows-1252 in order to accommodate such
mislabeling. This is now standard behavior in the draft HTML 5
specification, which requires that documents advertised as ISO-8859-1
actually be parsed with the Windows-1252 encoding.[1]"
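
The mislabeling is easy to reproduce. A minimal sketch, with a made-up sample string: bytes written as cp1252 but decoded as ISO-8859-1 never raise an error, they just come out as invisible C1 control characters where the smart quotes and the Euro sign should be.

```python
# Sample text with smart quotes and a euro sign (made up for illustration).
text = "\u201csmart quotes\u201d cost 5 \u20ac"
data = text.encode("cp1252")

# Decoding with the wrong label succeeds silently, because Latin-1 maps
# every byte value to some code point.
as_latin1 = data.decode("iso-8859-1")

# Decoding with the encoding actually used restores the text.
as_cp1252 = data.decode("cp1252")

print(repr(as_latin1))
print(repr(as_cp1252))
```

The bytes 0x93, 0x94 and 0x80 become the control characters U+0093, U+0094 and U+0080 under the Latin-1 label, which is what a viewer then renders as question marks or boxes.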

Lots of fun. Too bad Microsoft won't push utf-8 so we can all
communicate text with much less chance of ambiguity.

-- 
Terry Jan Reedy