

Groups > comp.lang.python > #41261 > unrolled thread

Unicode

Started by: Thomas Heller <theller@ctypes.org>
First post: 2013-03-15 11:46 +0100
Last post: 2013-03-15 11:02 +0000
Articles: 4 (3 participants)



Contents

  Unicode Thomas Heller <theller@ctypes.org> - 2013-03-15 11:46 +0100
    Re: Unicode Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-03-15 10:58 +0000
      Re: Unicode Thomas Heller <theller@ctypes.org> - 2013-03-15 12:43 +0100
    Re: Unicode Duncan Booth <duncan.booth@invalid.invalid> - 2013-03-15 11:02 +0000

#41261 — Unicode

From: Thomas Heller <theller@ctypes.org>
Date: 2013-03-15 11:46 +0100
Subject: Unicode
Message-ID: <aqgcesFio46U1@mid.individual.net>

I thought I understood Unicode (somewhat, at least), but this seems
not to be the case.

I expected the following code to print 'µm' two times to the console:

<code>
# -*- coding: cp850 -*-

a = u"µm"
b = u"\u03bcm"

print(a)
print(b)
</code>

But what I get is this:

<output>
µm
Traceback (most recent call last):
   File "x.py", line 7, in <module>
     print(b)
   File "C:\Python33-64\lib\encodings\cp850.py", line 19, in encode
     return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u03bc' in 
position 0: character maps to <undefined>
</output>

Using (German) Windows, command prompt, codepage 850.

The same happens with Python 2.7.  What am I doing wrong?

Thanks,
Thomas



#41262

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-03-15 10:58 +0000
Message-ID: <5142feca$0$29965$c3e8da3$5496439d@news.astraweb.com>
In reply to: #41261

On Fri, 15 Mar 2013 11:46:36 +0100, Thomas Heller wrote:

> I thought I understand unicode (somewhat, at least), but this seems not
> to be the case.
> 
> I expected the following code to print 'µm' two times to the console:
> 
> <code>
> # -*- coding: cp850 -*-
> 
> a = u"µm"
> b = u"\u03bcm"
> 
> print(a)
> print(b)
> </code>
> 
> But what I get is this:
> 
> <output>
> µm
> Traceback (most recent call last):
>    File "x.py", line 7, in <module>
>      print(b)
>    File "C:\Python33-64\lib\encodings\cp850.py", line 19, in encode
>      return codecs.charmap_encode(input,self.errors,encoding_map)[0]
> UnicodeEncodeError: 'charmap' codec can't encode character '\u03bc' in
> position 0: character maps to <undefined>
> </output>
> 
> Using (german) windows, command prompt, codepage 850.
> 
> The same happens with Python 2.7.  What am I doing wrong?


That's because the two strings are not the same.

You can isolate the error by noting that the second one only raises an 
exception when you try to print it. That suggests that the problem is 
that it contains a character which is not defined in your terminal's 
codepage. So let's inspect the strings more carefully:


py> a = u"µm"
py> b = u"\u03bcm"
py> a == b
False
py> ord(a[0]), ord(b[0])
(181, 956)
py> import unicodedata
py> unicodedata.name(a[0])
'MICRO SIGN'
py> unicodedata.name(b[0])
'GREEK SMALL LETTER MU'

Does codepage 850 include Greek Small Letter Mu? The evidence suggests it 
does not.
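
One way to check this directly rather than infer it from the traceback is to try encoding each character with the cp850 codec. This is only a minimal sketch for Python 3; the helper name `can_encode` is made up for illustration and is not part of any library.

```python
import unicodedata

MICRO = "\u00b5"  # MICRO SIGN, the character in string a
MU = "\u03bc"     # GREEK SMALL LETTER MU, the character in string b


def can_encode(ch, encoding):
    """Return True if ch can be represented in the given encoding (hypothetical helper)."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Report, for each character, its Unicode name and whether cp850 has a slot for it.
for ch in (MICRO, MU):
    print(unicodedata.name(ch), can_encode(ch, "cp850"))
```

If the codec tables match the traceback above, MICRO SIGN should be reported as encodable and GREEK SMALL LETTER MU as not.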

If you can, you should set the terminal's encoding to UTF-8. That will 
avoid this sort of problem.



-- 
Steven



#41265

From: Thomas Heller <theller@ctypes.org>
Date: 2013-03-15 12:43 +0100
Message-ID: <aqgfq4FjfglU1@mid.individual.net>
In reply to: #41262

Am 15.03.2013 11:58, schrieb Steven D'Aprano:
> On Fri, 15 Mar 2013 11:46:36 +0100, Thomas Heller wrote:
[Windows: Problems with unicode output to console]

> You can isolate the error by noting that the second one only raises an
> exception when you try to print it. That suggests that the problem is
> that it contains a character which is not defined in your terminal's
> codepage. So let's inspect the strings more carefully:
>
>
> py> a = u"µm"
> py> b = u"\u03bcm"
> py> a == b
> False
> py> ord(a[0]), ord(b[0])
> (181, 956)
> py> import unicodedata
> py> unicodedata.name(a[0])
> 'MICRO SIGN'
> py> unicodedata.name(b[0])
> 'GREEK SMALL LETTER MU'
>
> Does codepage 850 include Greek Small Letter Mu? The evidence suggests it
> does not.
>
> If you can, you should set the terminal's encoding to UTF-8. That will
> avoid this sort of problem.

Thanks for the clarification.

For the archives: setting the console codepage to 65001 (UTF-8) and the
font to Lucida Console helps.
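
Where changing the console codepage is not an option, a possible fallback (not something proposed in this thread) is to escape whatever the console encoding cannot represent before printing, so that print() does not raise. A rough sketch for Python 3; `safe_for_console` is a hypothetical helper name, and the encoding is passed in explicitly here rather than read from the real console.

```python
def safe_for_console(text, encoding):
    """Replace characters the encoding cannot represent with \\uXXXX escapes.

    Hypothetical helper: round-trips the text through the target encoding
    using the standard 'backslashreplace' error handler.
    """
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


a = "\u00b5m"  # MICRO SIGN + m, representable in cp850
b = "\u03bcm"  # GREEK SMALL LETTER MU + m, not representable in cp850

print(ascii(safe_for_console(a, "cp850")))
print(ascii(safe_for_console(b, "cp850")))
```

The results are shown through ascii() so that this snippet itself does not depend on the console encoding. The escaped form is ugly, but it makes the offending code point visible instead of aborting the script.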

Thomas



#41263

From: Duncan Booth <duncan.booth@invalid.invalid>
Date: 2013-03-15 11:02 +0000
Message-ID: <XnsA18470556F274duncanbooth@127.0.0.1>
In reply to: #41261

Thomas Heller <theller@ctypes.org> wrote:

><output>
> µm
> Traceback (most recent call last):
>    File "x.py", line 7, in <module>
>      print(b)
>    File "C:\Python33-64\lib\encodings\cp850.py", line 19, in encode
>      return codecs.charmap_encode(input,self.errors,encoding_map)[0]
> UnicodeEncodeError: 'charmap' codec can't encode character '\u03bc' in 
> position 0: character maps to <undefined>
></output>
> 
> Using (german) windows, command prompt, codepage 850.
> 
> The same happens with Python 2.7.  What am I doing wrong?
> 

They are different characters:

>>> repr(a)
"u'\\xb5m'"
>>> repr(b)
"u'\\u03bcm'"

a contains Unicode MICRO SIGN (U+00B5); b contains GREEK SMALL LETTER MU
(U+03BC).
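
If the two spellings should be treated as equivalent when comparing text, Unicode compatibility normalization is one option, since MICRO SIGN has a compatibility mapping to GREEK SMALL LETTER MU. A small sketch for Python 3; `same_after_nfkc` is a made-up helper name, and whether NFKC is appropriate depends on the application because it folds many other characters too.

```python
import unicodedata

a = "\u00b5m"  # MICRO SIGN + m
b = "\u03bcm"  # GREEK SMALL LETTER MU + m


def same_after_nfkc(x, y):
    """Compare two strings after NFKC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFKC", x) == unicodedata.normalize("NFKC", y)


print(a == b)                 # plain comparison of the raw code points
print(same_after_nfkc(a, b))  # comparison after compatibility normalization
```

Note that normalizing this way produces the Greek letter, which is the one cp850 cannot encode, so this helps with comparison but not with printing to a cp850 console.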

-- 
Duncan Booth http://kupuguy.blogspot.com
