
Re: Encoding conundrum

Started by	Ian Kelly <ian.g.kelly@gmail.com>
First post	2012-11-20 16:03 -0700
Last post	2012-11-21 03:24 -0800
Articles	5 — 4 participants


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: Encoding conundrum Ian Kelly <ian.g.kelly@gmail.com> - 2012-11-20 16:03 -0700
    Re: Encoding conundrum danielk <danielkleinad@gmail.com> - 2012-11-21 03:24 -0800
      Re: Encoding conundrum Nobody <nobody@nowhere.com> - 2012-11-21 12:18 +0000
      Re: Encoding conundrum Dave Angel <d@davea.name> - 2012-11-21 08:02 -0500
    Re: Encoding conundrum danielk <danielkleinad@gmail.com> - 2012-11-21 03:24 -0800

#33682 — Re: Encoding conundrum

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-11-20 16:03 -0700
Subject	Re: Encoding conundrum
Message-ID	<mailman.115.1353452627.29569.python-list@python.org>

On Tue, Nov 20, 2012 at 2:49 PM, Daniel Klein <danielkleinad@gmail.com> wrote:
> With the assistance of this group I am understanding unicode encoding issues
> much better; especially when handling special characters that are outside of
> the ASCII range. I've got my application working perfectly now :-)
>
> However, I am still confused as to why I can only use one specific encoding.
>
> I've done some research and it appears that I should be able to use any of
> the following codecs with codepoints '\xfc' (chr(252)) '\xfd' (chr(253)) and
> '\xfe' (chr(254)) :

These refer to the characters with *Unicode* codepoints 252, 253, and 254:

>>> import unicodedata
>>> unicodedata.name('\xfc')
'LATIN SMALL LETTER U WITH DIAERESIS'
>>> unicodedata.name('\xfd')
'LATIN SMALL LETTER Y WITH ACUTE'
>>> unicodedata.name('\xfe')
'LATIN SMALL LETTER THORN'

> ISO-8859-1   [ note that I'm using this codec on my Linux box ]

For ISO 8859-1, these characters happen to exist and even correspond
to the same ordinals: 252, 253, and 254 (this is by design); so there
is no problem encoding them, and the resulting bytes even happen to
match the codepoints of the characters.

> cp1252

cp1252 is designed after ISO 8859-1 and also has those same three characters:

>>> for char in b'\xfc\xfd\xfe'.decode('cp1252'):
...     print(unicodedata.name(char))
...
LATIN SMALL LETTER U WITH DIAERESIS
LATIN SMALL LETTER Y WITH ACUTE
LATIN SMALL LETTER THORN

> latin1

Latin-1 is just another name for ISO 8859-1.

> utf-8

UTF-8 is a *multi-byte* encoding.  It can encode any Unicode
characters, so you can represent those three characters in UTF-8, but
with a different (and longer) byte sequence:

>>> print('\xfc\xfd\xfe'.encode('utf8'))
b'\xc3\xbc\xc3\xbd\xc3\xbe'

> cp437

cp437 is another 8-bit encoding, but it maps entirely different
characters to those three bytes:

>>> for char in b'\xfc\xfd\xfe'.decode('cp437'):
...     print(unicodedata.name(char))
...
SUPERSCRIPT LATIN SMALL LETTER N
SUPERSCRIPT TWO
BLACK SQUARE

As it happens, the character at codepoint 252 (that's LATIN SMALL
LETTER U WITH DIAERESIS) does exist in cp437.  It maps to the byte
0x81:

>>> '\xfc'.encode('cp437')
b'\x81'

The other two Unicode characters, at codepoints 253 and 254, do not
exist at all in cp437 and cannot be encoded.

> If I'm not mistaken, all of these codecs can handle the complete 8bit
> character set.

There is no "complete 8bit character set".  cp1252, Latin1, and cp437
are all 8-bit character sets, but they're *different* 8-bit character
sets with only partial overlap.
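
To make the partial overlap concrete, here is a minimal sketch that collects the characters each single-byte codec can produce and compares the sets. The helper name decodable_chars is made up for this illustration and is not part of any library:

```python
def decodable_chars(encoding):
    """Collect every character that some single byte decodes to."""
    chars = set()
    for value in range(256):
        try:
            chars.add(bytes([value]).decode(encoding))
        except UnicodeDecodeError:
            # Some code pages leave a few byte values unassigned.
            pass
    return chars

latin1 = decodable_chars('latin-1')
cp1252 = decodable_chars('cp1252')
cp437 = decodable_chars('cp437')

ascii_chars = {chr(i) for i in range(128)}

# ASCII is common to all three character sets.
print(ascii_chars <= (latin1 & cp1252 & cp437))

# Of the three characters under discussion, only the first is in cp437.
print('\xfc' in cp437, '\xfd' in cp437, '\xfe' in cp437)

# A few characters that cp437 has and Latin-1 lacks.
print([ascii(c) for c in sorted(cp437 - latin1)[:5]])
```

The set differences (cp437 - latin1, latin1 - cp437) are where the encoding errors come from.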

> However, on Windows 7, I am only able to use 'cp437' to display (print) data
> with those characters in Python. If I use any other encoding, Windows laughs
> at me with this error message:
>
>   File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
>     return codecs.charmap_encode(input,self.errors,encoding_map)[0]
> UnicodeEncodeError: 'charmap' codec can't encode character '\xfd' in
> position 3: character maps to <undefined>

It would be helpful to see the code you're running that causes this error.

> Furthermore I get this from IDLE:
>
> >>> import locale
> >>> locale.getdefaultlocale()
> ('en_US', 'cp1252')
>
> I also get 'cp1252' when running the same script from a Windows command
> prompt.
>
> So there is a contradiction between the error message and the default
> encoding.

If you're printing to stdout, it's going to use the encoding
associated with stdout, which does not necessarily have anything to do
with the default locale.  Use this to determine what character set you
need to be working in if you want your data to be printable:

>>> import sys
>>> sys.stdout.encoding
'cp437'

> Why am I restricted from using just that one codec? Is this a Windows or
> Python restriction? Please enlighten me.

In Linux, your terminal encoding is probably either UTF-8 or Latin-1,
and either way it has no problems encoding that data for output.  In a
Windows cmd terminal, the default terminal encoding is cp437, which
can't support two of the three characters you mentioned above.
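
If you would rather degrade gracefully than hit UnicodeEncodeError, one option is to let the codec escape whatever the terminal encoding lacks. A rough sketch, assuming a hypothetical helper called safe_print (not a standard function):

```python
import sys

def safe_print(text, stream=None):
    """Write text, escaping characters the stream's encoding cannot represent."""
    if stream is None:
        stream = sys.stdout
    encoding = stream.encoding or 'ascii'
    # backslashreplace turns unencodable characters into \xNN or \uNNNN escapes.
    data = text.encode(encoding, errors='backslashreplace')
    # Decoding again yields a string that is safe to write to this stream.
    stream.write(data.decode(encoding) + '\n')

safe_print('\xfc\xfd\xfe')
```

On a cp437 console this should show the first character as is and the other two as escapes; on a UTF-8 terminal all three print normally.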


#33717

From	danielk <danielkleinad@gmail.com>
Date	2012-11-21 03:24 -0800
Message-ID	<6cce1b50-aa37-4077-89d5-34e57de2193e@googlegroups.com>
In reply to	#33682

On Tuesday, November 20, 2012 6:03:47 PM UTC-5, Ian wrote:
> On Tue, Nov 20, 2012 at 2:49 PM, Daniel Klein <danielkleinad@gmail.com> wrote:
> > With the assistance of this group I am understanding unicode encoding issues
> > much better; especially when handling special characters that are outside of
> > the ASCII range. I've got my application working perfectly now :-)
> >
> > However, I am still confused as to why I can only use one specific encoding.
> >
> > I've done some research and it appears that I should be able to use any of
> > the following codecs with codepoints '\xfc' (chr(252)) '\xfd' (chr(253)) and
> > '\xfe' (chr(254)) :
>
> These refer to the characters with *Unicode* codepoints 252, 253, and 254:
>
> >>> import unicodedata
> >>> unicodedata.name('\xfc')
> 'LATIN SMALL LETTER U WITH DIAERESIS'
> >>> unicodedata.name('\xfd')
> 'LATIN SMALL LETTER Y WITH ACUTE'
> >>> unicodedata.name('\xfe')
> 'LATIN SMALL LETTER THORN'
>
> > ISO-8859-1   [ note that I'm using this codec on my Linux box ]
>
> For ISO 8859-1, these characters happen to exist and even correspond
> to the same ordinals: 252, 253, and 254 (this is by design); so there
> is no problem encoding them, and the resulting bytes even happen to
> match the codepoints of the characters.
>
> > cp1252
>
> cp1252 is designed after ISO 8859-1 and also has those same three characters:
>
> >>> for char in b'\xfc\xfd\xfe'.decode('cp1252'):
> ...     print(unicodedata.name(char))
> ...
> LATIN SMALL LETTER U WITH DIAERESIS
> LATIN SMALL LETTER Y WITH ACUTE
> LATIN SMALL LETTER THORN
>
> > latin1
>
> Latin-1 is just another name for ISO 8859-1.
>
> > utf-8
>
> UTF-8 is a *multi-byte* encoding.  It can encode any Unicode
> characters, so you can represent those three characters in UTF-8, but
> with a different (and longer) byte sequence:
>
> >>> print('\xfc\xfd\xfe'.encode('utf8'))
> b'\xc3\xbc\xc3\xbd\xc3\xbe'
>
> > cp437
>
> cp437 is another 8-bit encoding, but it maps entirely different
> characters to those three bytes:
>
> >>> for char in b'\xfc\xfd\xfe'.decode('cp437'):
> ...     print(unicodedata.name(char))
> ...
> SUPERSCRIPT LATIN SMALL LETTER N
> SUPERSCRIPT TWO
> BLACK SQUARE
>
> As it happens, the character at codepoint 252 (that's LATIN SMALL
> LETTER U WITH DIAERESIS) does exist in cp437.  It maps to the byte
> 0x81:
>
> >>> '\xfc'.encode('cp437')
> b'\x81'
>
> The other two Unicode characters, at codepoints 253 and 254, do not
> exist at all in cp437 and cannot be encoded.
>
> > If I'm not mistaken, all of these codecs can handle the complete 8bit
> > character set.
>
> There is no "complete 8bit character set".  cp1252, Latin1, and cp437
> are all 8-bit character sets, but they're *different* 8-bit character
> sets with only partial overlap.
>
> > However, on Windows 7, I am only able to use 'cp437' to display (print) data
> > with those characters in Python. If I use any other encoding, Windows laughs
> > at me with this error message:
> >
> >   File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
> >     return codecs.charmap_encode(input,self.errors,encoding_map)[0]
> > UnicodeEncodeError: 'charmap' codec can't encode character '\xfd' in
> > position 3: character maps to <undefined>
>
> It would be helpful to see the code you're running that causes this error.

I'm using subprocess.Popen to run a process that sends a list of codepoints to the calling Python program. The list is sent to stdout as a string.  Here is a simple example that encodes the string "Dead^Parrot", where (for this example) I'm using '^' to represent chr(254) :

encoded_string = '[68,101,97,100,254,80,97,114,114,111,116]'

This in turn is handled in __repr__ with:

return bytes((eval(encoded_string))).decode('cp437')

I get the aforementioned 'error' if I use any other encoding.
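
As an aside, and only as a sketch: since the list that arrives on the pipe looks like a JSON array, json.loads can stand in for eval here, which avoids executing whatever the child process sends. The same snippet also shows why decoding with Latin-1 leads to the error once the result is printed on a cp437 console (the variable names are just for illustration):

```python
import json

encoded_string = '[68,101,97,100,254,80,97,114,114,111,116]'

# Parse the list of integers without evaluating arbitrary code.
raw = bytes(json.loads(encoded_string))

as_cp437 = raw.decode('cp437')      # byte 254 becomes BLACK SQUARE (U+25A0)
as_latin1 = raw.decode('latin-1')   # byte 254 becomes LATIN SMALL LETTER THORN

print(ascii(as_cp437))
print(ascii(as_latin1))

# A cp437 console can represent the first string but not the second.
as_cp437.encode('cp437')
try:
    as_latin1.encode('cp437')
except UnicodeEncodeError as exc:
    print(exc)
```

In other words, the decode step succeeds with either codec; it is the later encode for the console that fails.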

> > Furthermore I get this from IDLE:
> >
> > >>> import locale
> > >>> locale.getdefaultlocale()
> > ('en_US', 'cp1252')
> >
> > I also get 'cp1252' when running the same script from a Windows command
> > prompt.
> >
> > So there is a contradiction between the error message and the default
> > encoding.
>
> If you're printing to stdout, it's going to use the encoding
> associated with stdout, which does not necessarily have anything to do
> with the default locale.  Use this to determine what character set you
> need to be working in if you want your data to be printable:
>
> >>> import sys
> >>> sys.stdout.encoding
> 'cp437'

Hmmm. So THAT'S why I am only able to use 'cp437'. I had (mistakenly) thought that I could just indicate whatever encoding I wanted, as long as the codec supported it.

> > Why am I restricted from using just that one codec? Is this a Windows or
> > Python restriction? Please enlighten me.
>
> In Linux, your terminal encoding is probably either UTF-8 or Latin-1,
> and either way it has no problems encoding that data for output.  In a
> Windows cmd terminal, the default terminal encoding is cp437, which
> can't support two of the three characters you mentioned above.

It may not be able to encode those two characters but it is able to decode them.  That seems rather inconsistent (and contradictory) to me.


#33725

From	Nobody <nobody@nowhere.com>
Date	2012-11-21 12:18 +0000
Message-ID	<pan.2012.11.21.12.18.23.439000@nowhere.com>
In reply to	#33717

On Wed, 21 Nov 2012 03:24:01 -0800, danielk wrote:

>> >>> import sys
>> >>> sys.stdout.encoding
>> 'cp437'
>
> Hmmm. So THAT'S why I am only able to use 'cp437'. I had (mistakenly)
> thought that I could just indicate whatever encoding I wanted, as long as
> the codec supported it.

sys.stdout.encoding determines how Python converts unicode characters
written to sys.stdout to bytes.

If you want the correct characters to be shown, this has to match the
encoding which the console window uses to convert those bytes back to
unicode characters.

You can tell Python to use whichever encoding you want, but often you only
get to control one side of the equation, in which case there's only one
"right" answer.
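
Here is a small sketch of what happens when the two sides disagree. It assumes, purely for illustration, that Python writes cp1252 bytes while the console interprets them as cp437:

```python
text = '\xfc'                    # LATIN SMALL LETTER U WITH DIAERESIS

sent = text.encode('cp1252')     # the byte Python would write
shown = sent.decode('cp437')     # the character a cp437 console would display

print(ascii(sent), ascii(shown))

# When both sides use the same encoding, the character survives the round trip.
round_trip = text.encode('cp437').decode('cp437')
print(round_trip == text)
```

No exception is raised in the mismatched case; the console simply shows a different character.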


#33731

From	Dave Angel <d@davea.name>
Date	2012-11-21 08:02 -0500
Message-ID	<mailman.149.1353502982.29569.python-list@python.org>
In reply to	#33717

On 11/21/2012 06:24 AM, danielk wrote:
> On Tuesday, November 20, 2012 6:03:47 PM UTC-5, Ian wrote:
>>> <snip>
>>
>> In Linux, your terminal encoding is probably either UTF-8 or Latin-1,
>>
>> and either way it has no problems encoding that data for output.  In a
>>
>> Windows cmd terminal, the default terminal encoding is cp437, which
>>
>> can't support two of the three characters you mentioned above.
> It may not be able to encode those two characters but it is able to decode them.    That seems rather inconsistent (and contradictory) to me.

You encode characters (code points), but you never decode them.  You
decode bytes.  In some cases and in some encodings, the number (ord) of
the two happens to be the same, e.g. for ASCII characters.  The same is
true of latin1, where the first 256 code points map exactly to the byte
values.

But to pick utf8 for example, which I use almost exclusively on Linux,
the character chr(255) is a lowercase y with a diaeresis accent.

>>> chr(255)
'ÿ'
>>> unicodedata.name(chr(255))
'LATIN SMALL LETTER Y WITH DIAERESIS'

>>> chr(255).encode()
b'\xc3\xbf'
>>> len(chr(255).encode())
2

It takes 2 bytes to encode that character.  (Since there are 1112064
possible characters, most of them take more than one byte to encode in
utf-8.  I believe the size can range up to 4 bytes.)  But naturally, the
first byte of those 2 cannot be one that's valid by itself as an encoded
character, or it'd be impossible to pick apart (decode) a byte string
starting with that one.

So, there is no character which can be encoded to a single byte 0xc3.
The same holds for every byte value from 0x80 up, for example 0xfd:

>>> bytes([253])
b'\xfd'
>>> bytes([253]).decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfd in position 0:
invalid start byte

http://encyclopedia.thefreedictionary.com/UTF-8

has a description of the encoding rules.  Note they're really just
arithmetic, rather than arbitrary.  Ranges of characters encode to
various numbers of bytes.  The main rules are that characters below 0x80
are unchanged, and no valid character encoding is a prefix to any other
valid character encoding.
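
The range boundaries are easy to check from the interpreter. A quick sketch, where the sample code points are just the edges of each length range:

```python
# Code points at the edges of the 1-, 2-, 3- and 4-byte ranges.
samples = [0x7f, 0x80, 0x7ff, 0x800, 0xffff, 0x10000, 0x10ffff]

for codepoint in samples:
    encoded = chr(codepoint).encode('utf-8')
    print(hex(codepoint), len(encoded), encoded)

# Encoded length for each sample, in the same order.
lengths = [len(chr(codepoint).encode('utf-8')) for codepoint in samples]
```

Code points below 0x80 come out as the same single byte they have in ASCII, and the length steps up at 0x80, 0x800 and 0x10000.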

Contrast that with cp437, where the particular 256 valid characters were
chosen based only on their usefulness, and many of them have code points
above 255.  Consequently, there must be many characters with code points
below 256 which cannot be encoded.

-- 

DaveA


