Newsgroups: comp.lang.python
Date: Wed, 21 Nov 2012 03:24:01 -0800 (PST)
In-Reply-To: <mailman.115.1353452627.29569.python-list@python.org>
References: <CADNxFdMV857vQkHy9+kfF=dA5hOX+aNNNqXLMtiJzT_deXk66A@mail.gmail.com> <mailman.115.1353452627.29569.python-list@python.org>
Subject: Re: Encoding conundrum
From: danielk <danielkleinad@gmail.com>
To: comp.lang.python@googlegroups.com
Cc: Python <python-list@python.org>
Message-ID: <mailman.141.1353497045.29569.python-list@python.org>

On Tuesday, November 20, 2012 6:03:47 PM UTC-5, Ian wrote:
> On Tue, Nov 20, 2012 at 2:49 PM, Daniel Klein <danielkleinad@gmail.com> wrote:
> > With the assistance of this group I am understanding unicode encoding issues
> > much better; especially when handling special characters that are outside of
> > the ASCII range. I've got my application working perfectly now :-)
> >
> > However, I am still confused as to why I can only use one specific encoding.
> >
> > I've done some research and it appears that I should be able to use any of
> > the following codecs with codepoints '\xfc' (chr(252)) '\xfd' (chr(253)) and
> > '\xfe' (chr(254)) :
>
> These refer to the characters with *Unicode* codepoints 252, 253, and 254:
>
> >>> unicodedata.name('\xfc')
> 'LATIN SMALL LETTER U WITH DIAERESIS'
> >>> unicodedata.name('\xfd')
> 'LATIN SMALL LETTER Y WITH ACUTE'
> >>> unicodedata.name('\xfe')
> 'LATIN SMALL LETTER THORN'
>
> > ISO-8859-1   [ note that I'm using this codec on my Linux box ]
>
> For ISO 8859-1, these characters happen to exist and even correspond
> to the same ordinals: 252, 253, and 254 (this is by design); so there
> is no problem encoding them, and the resulting bytes even happen to
> match the codepoints of the characters.
>
> > cp1252
>
> cp1252 is designed after ISO 8859-1 and also has those same three characters:
>
> >>> for char in b'\xfc\xfd\xfe'.decode('cp1252'):
> ...     print(unicodedata.name(char))
> ...
> LATIN SMALL LETTER U WITH DIAERESIS
> LATIN SMALL LETTER Y WITH ACUTE
> LATIN SMALL LETTER THORN
>
> > latin1
>
> Latin-1 is just another name for ISO 8859-1.
>
> > utf-8
>
> UTF-8 is a *multi-byte* encoding.  It can encode any Unicode
> characters, so you can represent those three characters in UTF-8, but
> with a different (and longer) byte sequence:
>
> >>> print('\xfc\xfd\xfe'.encode('utf8'))
> b'\xc3\xbc\xc3\xbd\xc3\xbe'
>
> > cp437
>
> cp437 is another 8-bit encoding, but it maps entirely different
> characters to those three bytes:
>
> >>> for char in b'\xfc\xfd\xfe'.decode('cp437'):
> ...     print(unicodedata.name(char))
> ...
> SUPERSCRIPT LATIN SMALL LETTER N
> SUPERSCRIPT TWO
> BLACK SQUARE
>
> As it happens, the character at codepoint 252 (that's LATIN SMALL
> LETTER U WITH DIAERESIS) does exist in cp437.  It maps to the byte
> 0x81:
>
> >>> '\xfc'.encode('cp437')
> b'\x81'
>
> The other two Unicode characters, at codepoints 253 and 254, do not
> exist at all in cp437 and cannot be encoded.
>
> > If I'm not mistaken, all of these codecs can handle the complete 8bit
> > character set.
>
> There is no "complete 8bit character set".  cp1252, Latin1, and cp437
> are all 8-bit character sets, but they're *different* 8-bit character
> sets with only partial overlap.
>
> > However, on Windows 7, I am only able to use 'cp437' to display (print) data
> > with those characters in Python. If I use any other encoding, Windows laughs
> > at me with this error message:
> >
> >   File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
> >     return codecs.charmap_encode(input,self.errors,encoding_map)[0]
> > UnicodeEncodeError: 'charmap' codec can't encode character '\xfd' in
> > position 3: character maps to <undefined>
>
> It would be helpful to see the code you're running that causes this error.

I'm using subprocess.Popen to run a process that sends a list of codepoints to the calling Python program. The list is sent to stdout as a string. Here is a simple example that encodes the string "Dead^Parrot", where (for this example) I'm using '^' to represent chr(254):

encoded_string = '[68,101,97,100,254,80,97,114,114,111,116]'

This in turn is handled in __repr__ with:

return bytes((eval(encoded_string))).decode('cp437')

I get the aforementioned 'error' if I use any other encoding.
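
A minimal sketch of that decoding step, assuming Python 3. Here `ast.literal_eval` is swapped in for `eval` purely as an illustration (it only accepts Python literals); it is not what the line above does. The sketch also shows that decoding itself works with more than one codec, and only the resulting characters differ:

```python
import ast

encoded_string = '[68,101,97,100,254,80,97,114,114,111,116]'

# Assumption: the child process only ever sends a literal list of ints,
# so literal_eval is enough and avoids executing arbitrary text.
byte_values = ast.literal_eval(encoded_string)
raw = bytes(byte_values)

as_cp437 = raw.decode('cp437')      # byte 254 becomes U+25A0 BLACK SQUARE
as_latin1 = raw.decode('latin-1')   # byte 254 becomes U+00FE LATIN SMALL LETTER THORN

print(ascii(as_cp437))
print(ascii(as_latin1))
```

Both decodes succeed. The strings differ only in the character that byte 254 turned into, and that character is what later has to be encoded again when the string is printed.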

> > Furthermore I get this from IDLE:
> >
> > >>> import locale
> > >>> locale.getdefaultlocale()
> > ('en_US', 'cp1252')
> >
> > I also get 'cp1252' when running the same script from a Windows command
> > prompt.
> >
> > So there is a contradiction between the error message and the default
> > encoding.
>
> If you're printing to stdout, it's going to use the encoding
> associated with stdout, which does not necessarily have anything to do
> with the default locale.  Use this to determine what character set you
> need to be working in if you want your data to be printable:
>
> >>> import sys
> >>> sys.stdout.encoding
> 'cp437'

Hmmm. So THAT'S why I am only able to use 'cp437'. I had (mistakenly) thought that I could just indicate whatever encoding I wanted, as long as the codec supported it.
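
One way to keep print from raising, sketched under the assumption that replacing unencodable characters with '?' is acceptable. `printable` is a hypothetical helper name used only for this illustration; it is not part of the standard library:

```python
import sys

def printable(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'."""
    # Fall back to ASCII if stdout has no encoding (e.g. some redirected streams).
    encoding = encoding or sys.stdout.encoding or 'ascii'
    return text.encode(encoding, errors='replace').decode(encoding)

# U+00FE is not in cp437, so it would be shown as '?' on a cp437 console.
print(printable('Dead\xfeParrot'))
```

The decode step can use whatever codec matches the incoming bytes; it is only the final trip through stdout that is limited to the console's character set.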

> > Why am I restricted from using just that one codec? Is this a Windows or
> > Python restriction? Please enlighten me.
>
> In Linux, your terminal encoding is probably either UTF-8 or Latin-1,
> and either way it has no problems encoding that data for output.  In a
> Windows cmd terminal, the default terminal encoding is cp437, which
> can't support two of the three characters you mentioned above.

It may not be able to encode those two characters but it is able to decode them. That seems rather inconsistent (and contradictory) to me.
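
The behaviour in question can be reproduced with a minimal sketch, assuming Python 3 and its standard cp437 codec. Decoding starts from *bytes* 253 and 254, while encoding starts from the *characters* at codepoints 253 and 254, so the two directions are not looking up the same things:

```python
import unicodedata

raw = b'\xfd\xfe'

# Decoding: cp437 assigns some character to every byte value, so this succeeds.
decoded = raw.decode('cp437')
names = [unicodedata.name(ch) for ch in decoded]
print(names)

# Encoding: this asks cp437 for the characters U+00FD and U+00FE,
# which are not among the characters that cp437 contains.
try:
    '\xfd\xfe'.encode('cp437')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
print(encode_failed)
```

The characters that came out of the decode step do encode back to the original bytes with cp437; it is only the Latin-1 characters with the same ordinals that fail.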