From: danielk
Newsgroups: comp.lang.python
Subject: Re: Encoding conundrum
Date: Wed, 21 Nov 2012 03:24:01 -0800 (PST)

On Tuesday, November 20, 2012 6:03:47 PM UTC-5, Ian wrote:
> On Tue, Nov 20, 2012 at 2:49 PM, Daniel Klein wrote:
> > With the assistance of this group I am understanding unicode encoding issues
> > much better; especially when handling special characters that are outside of
> > the ASCII range.
> > I've got my application working perfectly now :-)
> >
> > However, I am still confused as to why I can only use one specific encoding.
> >
> > I've done some research and it appears that I should be able to use any of
> > the following codecs with codepoints '\xfc' (chr(252)) '\xfd' (chr(253)) and
> > '\xfe' (chr(254)) :
>
> These refer to the characters with *Unicode* codepoints 252, 253, and 254:
>
> >>> unicodedata.name('\xfc')
> 'LATIN SMALL LETTER U WITH DIAERESIS'
> >>> unicodedata.name('\xfd')
> 'LATIN SMALL LETTER Y WITH ACUTE'
> >>> unicodedata.name('\xfe')
> 'LATIN SMALL LETTER THORN'
>
> > ISO-8859-1 [ note that I'm using this codec on my Linux box ]
>
> For ISO 8859-1, these characters happen to exist and even correspond
> to the same ordinals: 252, 253, and 254 (this is by design); so there
> is no problem encoding them, and the resulting bytes even happen to
> match the codepoints of the characters.
>
> > cp1252
>
> cp1252 is designed after ISO 8859-1 and also has those same three characters:
>
> >>> for char in b'\xfc\xfd\xfe'.decode('cp1252'):
> ...     print(unicodedata.name(char))
> ...
> LATIN SMALL LETTER U WITH DIAERESIS
> LATIN SMALL LETTER Y WITH ACUTE
> LATIN SMALL LETTER THORN
>
> > latin1
>
> Latin-1 is just another name for ISO 8859-1.
>
> > utf-8
>
> UTF-8 is a *multi-byte* encoding.
> It can encode any Unicode
> characters, so you can represent those three characters in UTF-8, but
> with a different (and longer) byte sequence:
>
> >>> print('\xfc\xfd\xfd'.encode('utf8'))
> b'\xc3\xbc\xc3\xbd\xc3\xbd'
>
> > cp437
>
> cp437 is another 8-bit encoding, but it maps entirely different
> characters to those three bytes:
>
> >>> for char in b'\xfc\xfd\xfe'.decode('cp437'):
> ...     print(unicodedata.name(char))
> ...
> SUPERSCRIPT LATIN SMALL LETTER N
> SUPERSCRIPT TWO
> BLACK SQUARE
>
> As it happens, the character at codepoint 252 (that's LATIN SMALL
> LETTER U WITH DIAERESIS) does exist in cp437.  It maps to the byte
> 0x81:
>
> >>> '\xfc'.encode('cp437')
> b'\x81'
>
> The other two Unicode characters, at codepoints 253 and 254, do not
> exist at all in cp437 and cannot be encoded.
>
> > If I'm not mistaken, all of these codecs can handle the complete 8bit
> > character set.
>
> There is no "complete 8bit character set".  cp1252, Latin1, and cp437
> are all 8-bit character sets, but they're *different* 8-bit character
> sets with only partial overlap.
>
> > However, on Windows 7, I am only able to use 'cp437' to display (print) data
> > with those characters in Python. If I use any other encoding, Windows laughs
> > at me with this error message:
> >
> >   File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
> >     return codecs.charmap_encode(input,self.errors,encoding_map)[0]
> > UnicodeEncodeError: 'charmap' codec can't encode character '\xfd' in
> > position 3: character maps to <undefined>
>
> It would be helpful to see the code you're running that causes this error.

I'm using subprocess.Popen to run a process that sends a list of codepoints to the calling Python program.
The list is sent to stdout as a string. Here is a simple example that encodes the string "Dead^Parrot", where (for this example) I'm using '^' to represent chr(254) :

  encoded_string = '[68,101,97,100,254,80,97,114,114,111,116]'

This in turn is handled in __repr__ with:

  return bytes((eval(encoded_string))).decode('cp437')

I get the aforementioned 'error' if I use any other encoding.

> > Furthermore I get this from IDLE:
> >
> > >>> import locale
> > >>> locale.getdefaultlocale()
> > ('en_US', 'cp1252')
> >
> > I also get 'cp1252' when running the same script from a Windows command
> > prompt.
> >
> > So there is a contradiction between the error message and the default
> > encoding.
>
> If you're printing to stdout, it's going to use the encoding
> associated with stdout, which does not necessarily have anything to do
> with the default locale.  Use this to determine what character set you
> need to be working in if you want your data to be printable:
>
> >>> import sys
> >>> sys.stdout.encoding
> 'cp437'

Hmmm. So THAT'S why I am only able to use 'cp437'. I had (mistakenly) thought that I could just indicate whatever encoding I wanted, as long as the codec supported it.

> > Why am I restricted from using just that one codec? Is this a Windows or
> > Python restriction? Please enlighten me.
>
> In Linux, your terminal encoding is probably either UTF-8 or Latin-1,
> and either way it has no problems encoding that data for output.  In a
> Windows cmd terminal, the default terminal encoding is cp437, which
> can't support two of the three characters you mentioned above.

It may not be able to encode those two characters but it is able to decode them. That seems rather inconsistent (and contradictory) to me.
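
A minimal sketch of why decoding can succeed where printing fails, assuming Python 3 and a console whose encoding is cp437 as reported by sys.stdout.encoding in the quoted reply. The helper name can_encode is hypothetical and used only for illustration. Decoding starts from bytes and encoding starts from Unicode characters, so the two directions are looking up different things:

```python
import unicodedata

# The same byte list as in the post; 254 stands in for the '^' separator.
raw = bytes([68, 101, 97, 100, 254, 80, 97, 114, 114, 111, 116])


def can_encode(text, codec):
    """Return True if every character of text exists in codec (illustrative helper)."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Decoding: both of these codecs assign *some* character to byte 254,
# but not the same one.
as_cp437 = raw.decode('cp437')     # byte 254 becomes U+25A0
as_latin1 = raw.decode('latin-1')  # byte 254 becomes U+00FE

print(unicodedata.name(as_cp437[4]))   # BLACK SQUARE
print(unicodedata.name(as_latin1[4]))  # LATIN SMALL LETTER THORN

# Printing re-encodes the resulting str with the console's codec.
# BLACK SQUARE has a byte in cp437; LATIN SMALL LETTER THORN does not.
print(can_encode(as_cp437, 'cp437'))   # True
print(can_encode(as_latin1, 'cp437'))  # False
```

Under these assumptions, decoding with 'cp437' only appears to be the one codec that works: it is the one whose decoded characters can all be encoded again for a cp437 console. Decoding with another codec yields different characters, and it is those characters that the console codec then fails to encode.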