Right solution to unicode error?

Started by: Anders <aschneiderman@asha.org>
First post: 2012-11-07 14:17 -0800
Last post: 2012-11-08 21:30 -0600
Articles: 20 on this page of 23 (9 participants)



Contents

  Right solution to unicode error? Anders <aschneiderman@asha.org> - 2012-11-07 14:17 -0800
    RE: Right solution to unicode error? "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-11-07 23:07 +0000
    Re: Right solution to unicode error? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-11-07 23:27 +0000
    Re: Right solution to unicode error? Andrew Berg <bahamutzero8825@gmail.com> - 2012-11-07 17:51 -0600
    Re: Right solution to unicode error? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-11-07 23:53 +0000
      Re: Right solution to unicode error? Hans Mulder <hansmu@xs4all.nl> - 2012-11-08 12:40 +0100
    Re: Right solution to unicode error? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-11-08 00:44 +0000
    Re: Right solution to unicode error? wxjmfauth@gmail.com - 2012-11-08 03:01 -0800
    RE: Right solution to unicode error? Anders Schneiderman <ASchneiderman@asha.org> - 2012-11-08 09:00 -0500
    Re: Right solution to unicode error? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-11-08 14:06 +0000
      Re: Right solution to unicode error? wxjmfauth@gmail.com - 2012-11-08 07:05 -0800
        Re: Right solution to unicode error? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-11-08 18:32 +0000
          Re: Right solution to unicode error? wxjmfauth@gmail.com - 2012-11-08 11:30 -0800
          Re: Right solution to unicode error? wxjmfauth@gmail.com - 2012-11-08 11:30 -0800
        Re: Right solution to unicode error? Ian Kelly <ian.g.kelly@gmail.com> - 2012-11-08 11:48 -0700
          Re: Right solution to unicode error? wxjmfauth@gmail.com - 2012-11-08 11:54 -0800
            Re: Right solution to unicode error? Ian Kelly <ian.g.kelly@gmail.com> - 2012-11-08 13:41 -0700
              Re: Right solution to unicode error? wxjmfauth@gmail.com - 2012-11-09 02:06 -0800
            RE: Right solution to unicode error? "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-11-08 20:54 +0000
            Re: Right solution to unicode error? Ian Kelly <ian.g.kelly@gmail.com> - 2012-11-08 14:07 -0700
            Re: Right solution to unicode error? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-11-08 21:37 +0000
          Re: Right solution to unicode error? wxjmfauth@gmail.com - 2012-11-08 11:54 -0800
    Re: Right solution to unicode error? Andrew Berg <bahamutzero8825@gmail.com> - 2012-11-08 21:30 -0600



#32908 — Right solution to unicode error?

From: Anders <aschneiderman@asha.org>
Date: 2012-11-07 14:17 -0800
Subject: Right solution to unicode error?
Message-ID: <09a3d20b-5871-47f4-9218-df119698e405@m4g2000yqf.googlegroups.com>

I've run into a Unicode error, and despite doing some googling, I
can't figure out the right way to fix it. I have a Python 2.6 script
that reads my Outlook 2010 task list. I'm able to read the tasks from
Outlook and store them as a list of objects without a hitch.  But when
I try to print the tasks' subjects, one of the tasks is generating an
error:

Traceback (most recent call last):
  File "outlook_tasks.py", line 66, in <module>
    my_tasks.dump_today_tasks()
  File "C:\Users\Anders\code\Task List\tasks.py", line 29, in dump_today_tasks
    print task.subject
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 42: ordinal not in range(128)

(where task.subject  was previously assigned the value of
task.Subject, aka the Subject property of an Outlook 2010 TaskItem)

From what I understand from reading online, the error is telling me
that the subject line  contains an en dash and that Python is trying
to convert to ascii and failing (as it should).

Here's where I'm getting stuck.  In the code above I was just printing
the subject so I can see whether the script is working properly.
Ultimately what I want to do is parse the tasks I'm interested in and
then create an HTML file containing those tasks.  Given that, what's
the best way to fix this problem?

BTW, if there's a clear description of the best solution for this
particular problem – i.e., where I want to ultimately display the
results as HTML – please feel free to refer me to the link. I tried
reading a number of docs on the web but still feel pretty lost.

Thanks,
Anders



#32912

From: "Prasad, Ramit" <ramit.prasad@jpmorgan.com>
Date: 2012-11-07 23:07 +0000
Message-ID: <mailman.3400.1352329734.27098.python-list@python.org>
In reply to: #32908

Anders wrote:
> 
> I've run into a Unicode error, and despite doing some googling, I
> can't figure out the right way to fix it. I have a Python 2.6 script
> that reads my Outlook 2010 task list. I'm able to read the tasks from
> Outlook and store them as a list of objects without a hitch.  But when
> I try to print the tasks' subjects, one of the tasks is generating an
> error:
> 
> Traceback (most recent call last):
>   File "outlook_tasks.py", line 66, in <module>
>     my_tasks.dump_today_tasks()
>   File "C:\Users\Anders\code\Task List\tasks.py", line 29, in
> dump_today_tasks
>     print task.subject
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in
> position 42: ordinal not in range(128)
> 
> (where task.subject  was previously assigned the value of
> task.Subject, aka the Subject property of an Outlook 2010 TaskItem)
> 
> From what I understand from reading online, the error is telling me
> that the subject line  contains an en dash and that Python is trying
> to convert to ascii and failing (as it should).
> 
> Here's where I'm getting stuck.  In the code above I was just printing
> the subject so I can see whether the script is working properly.
> Ultimately what I want to do is parse the tasks I'm interested in and
> then create an HTML file containing those tasks.  Given that, what's
> the best way to fix this problem?
> 
> BTW, if there's a clear description of the best solution for this
> particular problem - i.e., where I want to ultimately display the
> results as HTML - please feel free to refer me to the link. I tried
> reading a number of docs on the web but still feel pretty lost.
> 

You can always encode in a non-ASCII codec.
`print task.subject.encode(<encoding>)` where <encoding> is something that
supports the characters you want, e.g. utf-8 or cp1252 (latin1 would not
work here, since it has no en dash).

The list of built-in codecs can be found at:
http://docs.python.org/library/codecs.html#standard-encodings
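
A rough sketch of what "pick a codec that supports the characters" means in practice. It is written in Python 3 syntax (on Python 2.6 the literal would be u'...' and print would be a statement). The sample `subject` and the helper name `try_encode` are assumptions made up for illustration; they are not from the original script.

```python
# Hypothetical subject line containing an en dash (U+2013).
subject = "Budget review \u2013 Q4"


def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


# Try a few standard codecs and show which ones can hold the en dash.
for name in ("ascii", "latin-1", "cp1252", "utf-8"):
    print(name, try_encode(subject, name))
```

Of these four, only cp1252 and utf-8 have a representation for the en dash; whether the resulting bytes then display correctly still depends on what the terminal expects.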


~Ramit






#32917

From: Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date: 2012-11-07 23:27 +0000
Message-ID: <mailman.3406.1352330840.27098.python-list@python.org>
In reply to: #32908

On 7 November 2012 22:17, Anders <aschneiderman@asha.org> wrote:
>
> Traceback (most recent call last):
>   File "outlook_tasks.py", line 66, in <module>
>     my_tasks.dump_today_tasks()
>   File "C:\Users\Anders\code\Task List\tasks.py", line 29, in
> dump_today_tasks
>     print task.subject
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in
> position 42: ordinal not in range(128)
>
> Here's where I'm getting stuck.  In the code above I was just printing
> the subject so I can see whether the script is working properly.
> Ultimately what I want to do is parse the tasks I'm interested in and
> then create an HTML file containing those tasks.  Given that, what's
> the best way to fix this problem?

Are you using cmd.exe (standard Windows terminal)? If so, it does not
support unicode and Python is telling you that it cannot encode the
string in a way that can be understood by your terminal. You can try
using chcp to set the code page to something that works with your
script.

If you are only printing it for debugging purposes you can just print
the repr() of the string which will be ascii and will come out fine in
your terminal. If you want to write it to an HTML file you should
encode the string with whatever encoding (probably utf-8) you use in
the html file. If you really just want your script to be able to print
unicode characters then you need to use something other than cmd.exe
(such as IDLE).
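
A minimal sketch of both suggestions, assuming Python 3 syntax and made-up task subjects. The helper names `render_tasks_html` and `write_tasks_html` are hypothetical. In Python 3, `ascii()` gives the escaped, ASCII-only form that `repr()` gives for unicode strings in Python 2.

```python
import html
import io
import os
import tempfile


def render_tasks_html(subjects):
    """Build a small HTML page (as text) listing the given subjects."""
    items = "\n".join("  <li>{}</li>".format(html.escape(s)) for s in subjects)
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8"><title>Tasks</title></head>\n'
        "<body><ul>\n" + items + "\n</ul></body></html>\n"
    )


def write_tasks_html(path, subjects):
    """Write the page, encoding to UTF-8 once, at the file boundary."""
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(render_tasks_html(subjects))


# Hypothetical subjects; the first contains an en dash (U+2013).
subjects = ["Budget review \u2013 Q4", "Fish & chips"]

# ASCII-only debugging output that any terminal can display.
print(ascii(subjects[0]))

out_path = os.path.join(tempfile.mkdtemp(), "tasks.html")
write_tasks_html(out_path, subjects)
```

The charset declared in the page matches the encoding used to write the file, which is the point of the advice above: the text stays unicode inside the program and is encoded only when it leaves it.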


Oscar



#32920

From: Andrew Berg <bahamutzero8825@gmail.com>
Date: 2012-11-07 17:51 -0600
Message-ID: <mailman.3408.1352332281.27098.python-list@python.org>
In reply to: #32908

On 2012.11.07 17:27, Oscar Benjamin wrote:
> Are you using cmd.exe (standard Windows terminal)? If so, it does not
> support unicode
Actually, it does. Code page 65001 is UTF-8. I know that doesn't help
the OP since Python versions below 3.3 don't support cp65001, but I
think it's important to point out that the Windows command line system
(it is not unique to cmd) does in fact support Unicode.
-- 
CPython 3.3.0 | Windows NT 6.1.7601.17835



#32921

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2012-11-07 23:53 +0000
Message-ID: <509af48d$0$29980$c3e8da3$5496439d@news.astraweb.com>
In reply to: #32908

On Wed, 07 Nov 2012 14:17:42 -0800, Anders wrote:

> I've run into a Unicode error, and despite doing some googling, I can't
> figure out the right way to fix it. I have a Python 2.6 script that
> reads my Outlook 2010 task list. I'm able to read the tasks from Outlook
> and store them as a list of objects without a hitch.  But when I try to
> print the tasks' subjects, one of the tasks is generating an error:
> 
> Traceback (most recent call last):
>   File "outlook_tasks.py", line 66, in <module>
>     my_tasks.dump_today_tasks()
>   File "C:\Users\Anders\code\Task List\tasks.py", line 29, in
> dump_today_tasks
>     print task.subject
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in
> position 42: ordinal not in range(128)


This error confuses me. Is that an exact copy and paste of the error, or 
have you edited it or reconstructed it? Because it seems to me that if 
task.subject is a unicode string, as it appears to be, calling print on 
it should succeed:

py> s = u'ABC\u2013DEF'
py> print s
ABC–DEF

What does type(task.subject) return?


-- 
Steven



#32941

From: Hans Mulder <hansmu@xs4all.nl>
Date: 2012-11-08 12:40 +0100
Message-ID: <509b9a1b$0$6841$e4fe514c@news2.news.xs4all.nl>
In reply to: #32921

On 8/11/12 00:53:49, Steven D'Aprano wrote:
> This error confuses me. Is that an exact copy and paste of the error, or 
> have you edited it or reconstructed it? Because it seems to me that if 
> task.subject is a unicode string, as it appears to be, calling print on 
> it should succeed:
> 
> py> s = u'ABC\u2013DEF'
> py> print s
> ABC–DEF

That would depend on whether python thinks sys.stdout can
handle UTF8.  For example, on my MacOS X box:

$ python2.6 -c 'print u"abc\u2013def"'
abc–def
$ python2.6 -c 'print u"abc\u2013def"' | cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in
position 3: ordinal not in range(128)

This is because Python knows that my terminal is capable
of handling UTF-8, but it has no idea whether the program at
the other end of a pipe has that ability, so it falls back
to ASCII whenever sys.stdout goes to a pipe.

Apparently the OP has a terminal that doesn't handle UTF8,
or one that Python doesn't know about.
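
One way to cope with an output stream whose encoding Python could not determine is to choose a fallback explicitly. The sketch below is only an illustration in Python 3 syntax: the helper names `pick_output_encoding` and `encode_for` are hypothetical, and using utf-8 as the fallback is an assumption, not something the thread prescribes.

```python
import sys


def pick_output_encoding(stream, fallback="utf-8"):
    """Return the stream's declared encoding, or `fallback` if it has none."""
    return getattr(stream, "encoding", None) or fallback


def encode_for(stream, text, fallback="utf-8"):
    """Encode text for the stream, substituting '?' for unsupported characters."""
    return text.encode(pick_output_encoding(stream, fallback), "replace")


# Shows what the interpreter believes about the current stdout.
print(pick_output_encoding(sys.stdout))
```

Setting the PYTHONIOENCODING environment variable before starting Python is another way to tell the interpreter which encoding to use for its standard streams.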


Hope this helps,

-- HansM



#32927

From: Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date: 2012-11-08 00:44 +0000
Message-ID: <mailman.3415.1352335468.27098.python-list@python.org>
In reply to: #32908

On 7 November 2012 23:51, Andrew Berg <bahamutzero8825@gmail.com> wrote:
> On 2012.11.07 17:27, Oscar Benjamin wrote:
>> Are you using cmd.exe (standard Windows terminal)? If so, it does not
>> support unicode
> Actually, it does. Code page 65001 is UTF-8. I know that doesn't help
> the OP since Python versions below 3.3 don't support cp65001, but I
> think it's important to point out that the Windows command line system
> (it is not unique to cmd) does in fact support Unicode.

I have tried to use code page 65001 and it didn't work for me, even
though I used a version of Python (possibly 3.3 alpha) that claimed to
support it. It turned out that there were other Windows-related
problems with using the code page, so I had to do something like

chcp 65001 && python myscript.py && chcp 2521

(It was important for all those commands to be on the same line.) I'm
not on Windows right now and I can't remember all the details, but I
seem to remember that even with that awkwardness and changing the font
it still didn't actually work.

If you know how to make it work, I'd be interested to know.


Oscar



#32940

From: wxjmfauth@gmail.com
Date: 2012-11-08 03:01 -0800
Message-ID: <b2e373bd-7a62-415d-ba18-9d834bb4821b@googlegroups.com>
In reply to: #32908

On Wednesday, 7 November 2012 23:17:42 UTC+1, Anders wrote:
> I've run into a Unicode error, and despite doing some googling, I
> can't figure out the right way to fix it. I have a Python 2.6 script
> that reads my Outlook 2010 task list. I'm able to read the tasks from
> Outlook and store them as a list of objects without a hitch.  But when
> I try to print the tasks' subjects, one of the tasks is generating an
> error:
>
> Traceback (most recent call last):
>   File "outlook_tasks.py", line 66, in <module>
>     my_tasks.dump_today_tasks()
>   File "C:\Users\Anders\code\Task List\tasks.py", line 29, in
> dump_today_tasks
>     print task.subject
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in
> position 42: ordinal not in range(128)
>
> (where task.subject  was previously assigned the value of
> task.Subject, aka the Subject property of an Outlook 2010 TaskItem)
>
> From what I understand from reading online, the error is telling me
> that the subject line  contains an en dash and that Python is trying
> to convert to ascii and failing (as it should).
>
> Here's where I'm getting stuck.  In the code above I was just printing
> the subject so I can see whether the script is working properly.
> Ultimately what I want to do is parse the tasks I'm interested in and
> then create an HTML file containing those tasks.  Given that, what's
> the best way to fix this problem?
>
> BTW, if there's a clear description of the best solution for this
> particular problem – i.e., where I want to ultimately display the
> results as HTML – please feel free to refer me to the link. I tried
> reading a number of docs on the web but still feel pretty lost.
>
> Thanks,
> Anders

----------


The problem is not on the Python side or specific
to Python. It is on the side of the "coding of
characters".

1) Unicode is an abstract entity; it has to be encoded
for the system/device that will host it.
Using Python:
<unicode>.encode(host_coding)

2) The host_coding scheme may not contain the
character (glyph/grapheme) corresponding to the
"unicode character". In that case, there are two possible
solutions: "ignore" it or "replace" it with a
substitution character.
Using Python:
<unicode>.encode(host_coding, "ignore")
<unicode>.encode(host_coding, "replace")

3) Detecting the host_coding is the most difficult
task. Either you have to hard-code it or you
can expect Python to find it via sys.stdout.encoding.

4) Due to the nature of unicode, this is the only
way to do it correctly.

Expectedly failing and non-failing examples follow.
Mainly Py3, but it doesn't matter. Note: in Py3, encoding
creates a byte string, which has to be
decoded to produce a native (unicode) string, here
with cp1252.


Py2

>>> u'éléphant\u2013abc'.encode('ascii')

Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    u'éléphant\u2013abc'.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
>>> print(u'éléphant\u2013abc'.encode('cp1252'))
éléphant–abc
>>> 

Py3

>>> 'éléphant\u2013abc'.encode('ascii')
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in
position 0: ordinal not in range(128)
>>> 'éléphant\u2013abc'.encode('ascii', 'ignore')
b'lphantabc'
>>> 'éléphant\u2013abc'.encode('ascii', 'replace')
b'?l?phant?abc'
>>> 'éléphant\u2013abc'.encode('ascii', 'ignore').decode('cp1252')
'lphantabc'
>>> 'éléphant\u2013abc'.encode('ascii', 'replace').decode('cp1252')
'?l?phant?abc'
>>> 
>>> 'éléphant\u2013abc'.encode('cp1252').decode('cp1252')
'éléphant–abc'

>>> sys.stdout.encoding
'cp1252'
>>> 'éléphant\u2013abc'.encode(sys.stdout.encoding).decode('cp1252')
'éléphant–abc'

etc
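
Since the original question is about producing HTML, one more error handler may be worth a look next to "ignore" and "replace": "xmlcharrefreplace", which substitutes numeric character references. The sketch below is only an illustration in Python 3 syntax; the sample `text` is made up.

```python
# Hypothetical sample text with an e-acute (U+00E9) and an en dash (U+2013).
text = "caf\u00e9 \u2013 menu"

# Drop anything ASCII cannot represent.
ignored = text.encode("ascii", "ignore")

# Substitute a question mark for each unsupported character.
replaced = text.encode("ascii", "replace")

# Substitute numeric character references, which an HTML page can display.
html_safe = text.encode("ascii", "xmlcharrefreplace")

print(ignored, replaced, html_safe)
```

The last form stays pure ASCII on disk while a browser still renders the original characters, so it is an alternative to writing the file as UTF-8.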

jmf



#32950

From: Anders Schneiderman <ASchneiderman@asha.org>
Date: 2012-11-08 09:00 -0500
Message-ID: <mailman.3435.1352383315.27098.python-list@python.org>
In reply to: #32908

Thanks, Oscar and Ramit! This is exactly what I was looking for.

Anders 


> -----Original Message-----
> From: Oscar Benjamin [mailto:oscar.j.benjamin@gmail.com]
> Sent: Wednesday, November 07, 2012 6:27 PM
> To: Anders Schneiderman
> Cc: python-list@python.org
> Subject: Re: Right solution to unicode error?
> 
> On 7 November 2012 22:17, Anders <aschneiderman@asha.org> wrote:
> >
> > Traceback (most recent call last):
> >   File "outlook_tasks.py", line 66, in <module>
> >     my_tasks.dump_today_tasks()
> >   File "C:\Users\Anders\code\Task List\tasks.py", line 29, in
> > dump_today_tasks
> >     print task.subject
> > UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in
> > position 42: ordinal not in range(128)
> >
> > Here's where I'm getting stuck.  In the code above I was just printing
> > the subject so I can see whether the script is working properly.
> > Ultimately what I want to do is parse the tasks I'm interested in and
> > then create an HTML file containing those tasks.  Given that, what's
> > the best way to fix this problem?
> 
> Are you using cmd.exe (standard Windows terminal)? If so, it does not
> support unicode and Python is telling you that it cannot encode the string in a
> way that can be understood by your terminal. You can try using chcp to set
> the code page to something that works with your script.
> 
> If you are only printing it for debugging purposes you can just print the repr()
> of the string which will be ascii and will come out fine in your terminal. If you
> want to write it to a html file you should encode the string with whatever
> encoding (probably utf-8) you use in the html file. If you really just want your
> script to be able to print unicode characters then you need to use something
> other than cmd.exe (such as IDLE).
> 
> 
> Oscar



#32951

From: Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date: 2012-11-08 14:06 +0000
Message-ID: <mailman.3436.1352383603.27098.python-list@python.org>
In reply to: #32908

On 8 November 2012 00:44, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
> On 7 November 2012 23:51, Andrew Berg <bahamutzero8825@gmail.com> wrote:
>> On 2012.11.07 17:27, Oscar Benjamin wrote:
>>> Are you using cmd.exe (standard Windows terminal)? If so, it does not
>>> support unicode
>> Actually, it does. Code page 65001 is UTF-8. I know that doesn't help
>> the OP since Python versions below 3.3 don't support cp65001, but I
>> think it's important to point out that the Windows command line system
>> (it is not unique to cmd) does in fact support Unicode.
>
> I have tried to use code page 65001 and it didn't work for me even if
> I did use a version of Python (possibly 3.3 alpha) that claimed to
> support it.

I stand corrected. I've just checked and codepage 65001 does work in
cmd.exe (on this machine):

O:\>Q:\tools\Python33\python -c print('abc\u2013def')
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "Q:\tools\Python33\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in
position 3: character maps to
 <undefined>

O:\>chcp 65001
Active code page: 65001

O:\>Q:\tools\Python33\python -c print('abc\u2013def')
abc-def


O:\>Q:\tools\Python33\python -c print('\u03b1')
α

It would be a lot better though if it just worked straight away
without me needing to set the code page (like the terminal in every
other OS I use).


Oscar



#32955

From: wxjmfauth@gmail.com
Date: 2012-11-08 07:05 -0800
Message-ID: <65910cea-f145-409c-a579-9f0cda499546@googlegroups.com>
In reply to: #32951

On Thursday, 8 November 2012 15:07:23 UTC+1, Oscar Benjamin wrote:
> On 8 November 2012 00:44, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
> > On 7 November 2012 23:51, Andrew Berg <bahamutzero8825@gmail.com> wrote:
> >> On 2012.11.07 17:27, Oscar Benjamin wrote:
> >>> Are you using cmd.exe (standard Windows terminal)? If so, it does not
> >>> support unicode
> >> Actually, it does. Code page 65001 is UTF-8. I know that doesn't help
> >> the OP since Python versions below 3.3 don't support cp65001, but I
> >> think it's important to point out that the Windows command line system
> >> (it is not unique to cmd) does in fact support Unicode.
> >
> > I have tried to use code page 65001 and it didn't work for me even if
> > I did use a version of Python (possibly 3.3 alpha) that claimed to
> > support it.
>
> I stand corrected. I've just checked and codepage 65001 does work in
> cmd.exe (on this machine):
>
> O:\>Q:\tools\Python33\python -c print('abc\u2013def')
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "Q:\tools\Python33\lib\encodings\cp850.py", line 19, in encode
>     return codecs.charmap_encode(input,self.errors,encoding_map)[0]
> UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in
> position 3: character maps to
>  <undefined>
>
> O:\>chcp 65001
> Active code page: 65001
>
> O:\>Q:\tools\Python33\python -c print('abc\u2013def')
> abc-def
>
>
> O:\>Q:\tools\Python33\python -c print('\u03b1')
> α
>
> It would be a lot better though if it just worked straight away
> without me needing to set the code page (like the terminal in every
> other OS I use).
>
>
> Oscar

----------

It *WORKS* straight away. The problem is that
people do not wish to use unicode correctly
(e.g. Mulder's example).
Read points 1) and 4) in my previous post.

Unicode and, in general, the coding of characters
have nothing to do with OSes or programming languages.

jmf



#32970

From: Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date: 2012-11-08 18:32 +0000
Message-ID: <mailman.3457.1352399533.27098.python-list@python.org>
In reply to: #32955

On 8 November 2012 15:05,  <wxjmfauth@gmail.com> wrote:
> On Thursday, 8 November 2012 15:07:23 UTC+1, Oscar Benjamin wrote:
>> On 8 November 2012 00:44, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
>> > On 7 November 2012 23:51, Andrew Berg <bahamutzero8825@gmail.com> wrote:
>> >> On 2012.11.07 17:27, Oscar Benjamin wrote:
>>
>> >>> Are you using cmd.exe (standard Windows terminal)? If so, it does not
>> >>> support unicode
>>
>> >> Actually, it does. Code page 65001 is UTF-8. I know that doesn't help
>> >> the OP since Python versions below 3.3 don't support cp65001, but I
>> >> think it's important to point out that the Windows command line system
>> >> (it is not unique to cmd) does in fact support Unicode.
>>
>> > I have tried to use code page 65001 and it didn't work for me even if
>> > I did use a version of Python (possibly 3.3 alpha) that claimed to
>> > support it.
>>
>> I stand corrected. I've just checked and codepage 65001 does work in
>> cmd.exe (on this machine):
>>
>> O:\>chcp 65001
>> Active code page: 65001
>>
>> O:\>Q:\tools\Python33\python -c print('abc\u2013def')
>> abc-def
>>
>> O:\>Q:\tools\Python33\python -c print('\u03b1')
>> α
>>
>> It would be a lot better though if it just worked straight away
>> without me needing to set the code page (like the terminal in every
>> other OS I use).
>
> It *WORKS* straight away. The problem is that
> people do not wish to use unicode correctly
> (eg. Mulder's example).
> Read the point 1) and 4) in my previous post.
>
> Unicode and in general the coding of the characters
> have nothing to do with the os's or programming languages.

I don't know what you mean that it works "straight away".

The default code page on my machine is cp850.

O:\>chcp
Active code page: 850

cp850 doesn't understand utf-8. It just prints garbage:

O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
╬▒

Using the correct encoding doesn't help:

O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode('cp850'))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "Q:\tools\Python33\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\u03b1' in
position 0: character maps to
 <undefined>

O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.encoding))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "Q:\tools\Python33\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\u03b1' in
position 0: character maps to
 <undefined>
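
The two failures above can be reproduced without a Windows console. This is a small sketch in Python 3 syntax; the variable names are made up for illustration. It decodes the UTF-8 bytes of alpha the way a cp850 terminal would interpret them, then checks whether cp850 has an alpha at all.

```python
alpha = "\u03b1"  # GREEK SMALL LETTER ALPHA

# The two bytes that UTF-8 uses for alpha.
utf8_bytes = alpha.encode("utf-8")

# What a cp850 terminal makes of those two bytes: two unrelated glyphs.
as_seen_in_cp850 = utf8_bytes.decode("cp850")

# cp850 has no slot for alpha, so encoding directly to it fails.
try:
    alpha.encode("cp850")
    cp850_has_alpha = True
except UnicodeEncodeError:
    cp850_has_alpha = False

print(repr(utf8_bytes), ascii(as_seen_in_cp850), cp850_has_alpha)
```

So the garbage comes from a mismatch between the bytes written and the code page in use, while the traceback comes from the character being absent from the code page.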

If I want the other characters to work I need to change the code page:

O:\>chcp 65001
Active code page: 65001

O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
α

O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.encoding))"
α


Oscar



#32974

From: wxjmfauth@gmail.com
Date: 2012-11-08 11:30 -0800
Message-ID: <08b2c7a7-a5df-45cb-a1b8-1aebe01d46e7@googlegroups.com>
In reply to: #32970

On Thursday, 8 November 2012 19:32:14 UTC+1, Oscar Benjamin wrote:
> On 8 November 2012 15:05,  <wxjmfauth@gmail.com> wrote:
> > On Thursday, 8 November 2012 15:07:23 UTC+1, Oscar Benjamin wrote:
> >> On 8 November 2012 00:44, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
> >> > On 7 November 2012 23:51, Andrew Berg <bahamutzero8825@gmail.com> wrote:
> >> >> On 2012.11.07 17:27, Oscar Benjamin wrote:
> >>
> >> >>> Are you using cmd.exe (standard Windows terminal)? If so, it does not
> >> >>> support unicode
> >>
> >> >> Actually, it does. Code page 65001 is UTF-8. I know that doesn't help
> >> >> the OP since Python versions below 3.3 don't support cp65001, but I
> >> >> think it's important to point out that the Windows command line system
> >> >> (it is not unique to cmd) does in fact support Unicode.
> >>
> >> > I have tried to use code page 65001 and it didn't work for me even if
> >> > I did use a version of Python (possibly 3.3 alpha) that claimed to
> >> > support it.
> >>
> >> I stand corrected. I've just checked and codepage 65001 does work in
> >> cmd.exe (on this machine):
> >>
> >> O:\>chcp 65001
> >> Active code page: 65001
> >>
> >> O:\>Q:\tools\Python33\python -c print('abc\u2013def')
> >> abc-def
> >>
> >> O:\>Q:\tools\Python33\python -c print('\u03b1')
> >> α
> >>
> >> It would be a lot better though if it just worked straight away
> >> without me needing to set the code page (like the terminal in every
> >> other OS I use).
> >
> > It *WORKS* straight away. The problem is that
> > people do not wish to use unicode correctly
> > (eg. Mulder's example).
> > Read the point 1) and 4) in my previous post.
> >
> > Unicode and in general the coding of the characters
> > have nothing to do with the os's or programming languages.
>
> I don't know what you mean that it works "straight away".
>
> The default code page on my machine is cp850.
>
> O:\>chcp
> Active code page: 850
>
> cp850 doesn't understand utf-8. It just prints garbage:
>
> O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
> ╬▒
>
> Using the correct encoding doesn't help:
>
> O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode('cp850'))"
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "Q:\tools\Python33\lib\encodings\cp850.py", line 12, in encode
>     return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode character '\u03b1' in
> position 0: character maps to
>  <undefined>
>
> O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.encoding))"
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "Q:\tools\Python33\lib\encodings\cp850.py", line 12, in encode
>     return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode character '\u03b1' in
> position 0: character maps to
>  <undefined>
>
> If I want the other characters to work I need to change the code page:
>
> O:\>chcp 65001
> Active code page: 65001
>
> O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
> α
>
> O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.encoding))"
> α
>
>
> Oscar

You are confusing two things: the coding of the
characters and the set of characters (glyphs/graphemes)
of a coding scheme.

It is always possible to safely encode a unicode string, but
the target coding may not contain the character.
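
To make the "target coding may not contain the character" point concrete, here is a small sketch in Python 3 syntax. The helper name `unencodable` and the sample string are made up for illustration. It lists which characters of a string a given codec has no slot for.

```python
def unencodable(text, encoding):
    """Return the distinct characters of `text` that `encoding` cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


# Hypothetical sample: ASCII letters, an en dash and a Greek alpha.
sample = "abc\u2013def\u03b1"

for name in ("cp850", "cp1252", "utf-8"):
    # ascii() keeps the report printable on any terminal.
    print(name, ascii(unencodable(sample, name)))
```

For this sample, cp850 lacks both the en dash and the alpha, cp1252 lacks only the alpha, and utf-8 lacks neither.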

Take a look at the output of this "special" interactive
interpreter, where the host coding (sys.stdout.encoding)
can be changed on the fly.


>>> s = 'éléphant\u2013abc需'
>>> sys.stdout.encoding
'<unicode>'
>>> s
'éléphant–abc需'
>>> 
>>> sys.stdout.encoding = 'cp1252'
>>> s.encode('cp1252')
'éléphant–abc需'
>>> sys.stdout.encoding = 'cp850'
>>> s.encode('cp850')
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
  File "C:\Python32\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013'
in position 8: character maps to <undefined>
>>> # but
>>> s.encode('cp850', 'replace')
'éléphant?abcé??'
>>> 
>>> sys.stdout.encoding = 'utf-8'
>>> s
'éléphant–abc需'
>>> s.encode('utf-8')
'éléphant–abc需'
>>> 
>>> sys.stdout.encoding = 'utf-16-le'  <<<<<<<<<
>>> s
' é l é p h a n t  a b c é S ¬ '
>>> s.encode('utf-16-le')
'éléphant–abc需'

<<<<<<<<<<< some cheating here due to the mail system; it really looks like this.

jmf



#32975

From: wxjmfauth@gmail.com
Date: 2012-11-08 11:30 -0800
Message-ID: <mailman.3461.1352403047.27098.python-list@python.org>
In reply to: #32970

On Thursday, 8 November 2012 19:32:14 UTC+1, Oscar Benjamin wrote:
> On 8 November 2012 15:05,  <wxjmfauth@gmail.com> wrote:
> 
> > Le jeudi 8 novembre 2012 15:07:23 UTC+1, Oscar Benjamin a écrit :
> 
> >> On 8 November 2012 00:44, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
> 
> >> > On 7 November 2012 23:51, Andrew Berg <bahamutzero8825@gmail.com> wrote:
> 
> >> >> On 2012.11.07 17:27, Oscar Benjamin wrote:
> 
> >>
> 
> >> >>> Are you using cmd.exe (standard Windows terminal)? If so, it does not
> 
> >> >>> support unicode
> 
> >>
> 
> >> >> Actually, it does. Code page 65001 is UTF-8. I know that doesn't help
> 
> >> >> the OP since Python versions below 3.3 don't support cp65001, but I
> 
> >> >> think it's important to point out that the Windows command line system
> 
> >> >> (it is not unique to cmd) does in fact support Unicode.
> 
> >>
> 
> >> > I have tried to use code page 65001 and it didn't work for me even if
> 
> >> > I did use a version of Python (possibly 3.3 alpha) that claimed to
> 
> >> > support it.
> 
> >>
> 
> >> I stand corrected. I've just checked and codepage 65001 does work in
> 
> >> cmd.exe (on this machine):
> 
> >>
> 
> >> O:\>chcp 65001
> 
> >> Active code page: 65001
> 
> >>
> 
> >> O:\>Q:\tools\Python33\python -c print('abc\u2013def')
> 
> >> abc-def
> 
> >>
> 
> >> O:\>Q:\tools\Python33\python -c print('\u03b1')
> 
> >> α
> 
> >>
> 
> >> It would be a lot better though if it just worked straight away
> 
> >> without me needing to set the code page (like the terminal in every
> 
> >> other OS I use).
> 
> >
> 
> > It *WORKS* straight away. The problem is that
> 
> > people do not wish to use unicode correctly
> 
> > (eg. Mulder's example).
> 
> > Read the point 1) and 4) in my previous post.
> 
> >
> 
> > Unicode and in general the coding of the characters
> 
> > have nothing to do with the os's or programming languages.
> 
> 
> 
> I don't know what you mean that it works "straight away".
> 
> 
> 
> The default code page on my machine is cp850.
> 
> 
> 
> O:\>chcp
> 
> Active code page: 850
> 
> 
> 
> cp850 doesn't understand utf-8. It just prints garbage:
> 
> 
> 
> O:\>Q:\tools\Python33\python -c "import sys;
> 
> sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
> 
> ╬▒
> 
> 
> 
> Using the correct encoding doesn't help:
> 
> 
> 
> O:\>Q:\tools\Python33\python -c "import sys;
> 
> sys.stdout.buffer.write('\u03b1\n'.encode('cp850'))"
> 
> Traceback (most recent call last):
> 
>   File "<string>", line 1, in <module>
> 
>   File "Q:\tools\Python33\lib\encodings\cp850.py", line 12, in encode
> 
>     return codecs.charmap_encode(input,errors,encoding_map)
> 
> UnicodeEncodeError: 'charmap' codec can't encode character '\u03b1' in
> 
> position 0: character maps to
> 
>  <undefined>
> 
> 
> 
> O:\>Q:\tools\Python33\python -c "import sys;
> 
> sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.en
> 
> coding))"
> 
> Traceback (most recent call last):
> 
>   File "<string>", line 1, in <module>
> 
>   File "Q:\tools\Python33\lib\encodings\cp850.py", line 12, in encode
> 
>     return codecs.charmap_encode(input,errors,encoding_map)
> 
> UnicodeEncodeError: 'charmap' codec can't encode character '\u03b1' in
> 
> position 0: character maps to
> 
>  <undefined>
> 
> 
> 
> If I want the other characters to work I need to change the code page:
> 
> 
> 
> O:\>chcp 65001
> 
> Active code page: 65001
> 
> 
> 
> O:\>Q:\tools\Python33\python -c "import sys;
> 
> sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
> 
> α
> 
> 
> 
> O:\>Q:\tools\Python33\python -c "import sys;
> 
> sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.en
> 
> coding))"
> 
> α
> 
> 
> 
> 
> 
> Oscar

You are confusing two things. The coding of the
characters and the set of the characters (glyphes/graphemes)
of a coding scheme.

It is always possible to encode safely an unicode, but
the target coding may not contain the character.

Take a look at the output of this "special" interactive
interpreter" where the host coding (sys.stdout.encoding)
can be change on the fly.


>>> s = 'éléphant\u2013abc需'
>>> sys.stdout.encoding
'<unicode>'
>>> s
'éléphant–abc需'
>>> 
>>> sys.stdout.encoding = 'cp1252'
>>> s.encode('cp1252')
'éléphant–abc需'
>>> sys.stdout.encoding = 'cp850'
>>> s.encode('cp850')
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
  File "C:\Python32\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013'
in position 8: character maps to <undefined>
>>> # but
>>> s.encode('cp850', 'replace')
'éléphant?abcé??'
>>> 
>>> sys.stdout.encoding = 'utf-8'
>>> s
'éléphant–abc需'
>>> s.encode('utf-8')
'éléphant–abc需'
>>> 
>>> sys.stdout.encoding = 'utf-16-le'  <<<<<<<<<
>>> s
' é l é p h a n t  a b c é S ¬ '
>>> s.encode('utf-16-le')
'éléphant–abc需'

<<<<<<<<<<< some cheating here do to the mail system, it really looks like this.

jmf

[toc] | [prev] | [next] | [standalone]


#32972

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2012-11-08 11:48 -0700
Message-ID: <mailman.3459.1352400535.27098.python-list@python.org>
In reply to: #32955
On Thu, Nov 8, 2012 at 11:32 AM, Oscar Benjamin
<oscar.j.benjamin@gmail.com> wrote:
> If I want the other characters to work I need to change the code page:
>
> O:\>chcp 65001
> Active code page: 65001
>
> O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
> α
>
> O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.encoding))"
> α

I find that I also need to change the font.  With the default font,
printing '\u2013' gives me:

–

The only alternative font option I have in Windows XP is Lucida
Console, which at least works correctly, although it seems to be
lacking a lot of glyphs.



#32976

From: wxjmfauth@gmail.com
Date: 2012-11-08 11:54 -0800
Message-ID: <a0073458-3b60-4c19-909d-c3d6dda7dccc@googlegroups.com>
In reply to: #32972
On Thursday, 8 November 2012 19:49:24 UTC+1, Ian wrote:
> On Thu, Nov 8, 2012 at 11:32 AM, Oscar Benjamin
> <oscar.j.benjamin@gmail.com> wrote:
> > If I want the other characters to work I need to change the code page:
> >
> > O:\>chcp 65001
> > Active code page: 65001
> >
> > O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
> > α
> >
> > O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.encoding))"
> > α
>
> I find that I also need to change the font.  With the default font,
> printing '\u2013' gives me:
>
> –
>
> The only alternative font option I have in Windows XP is Lucida
> Console, which at least works correctly, although it seems to be
> lacking a lot of glyphs.

--------

The font has nothing to do with this.
You are "simply" encoding your "unicode" wrongly.

>>> '\u2013'
'–'
>>> '\u2013'.encode('utf-8')
b'\xe2\x80\x93'
>>> '\u2013'.encode('utf-8').decode('cp1252')
'–'

jmf



#32980

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2012-11-08 13:41 -0700
Message-ID: <mailman.3465.1352407330.27098.python-list@python.org>
In reply to: #32976
On Thu, Nov 8, 2012 at 12:54 PM,  <wxjmfauth@gmail.com> wrote:
> The font has nothing to do with this.
> You are "simply" encoding your "unicode" wrongly.
>
> >>> '\u2013'
> '–'
> >>> '\u2013'.encode('utf-8')
> b'\xe2\x80\x93'
> >>> '\u2013'.encode('utf-8').decode('cp1252')
> '–'

No, it seriously is the font.  This is what I get using the default
("Raster") font:

C:\>chcp 65001
Active code page: 65001

C:\>c:\python33\python
Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600
32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> '\u2013'
'–'
>>> import sys
>>> sys.stdout.buffer.write('\u2013\n'.encode('utf-8'))
–
4

I should note here that the characters copied and pasted do not
correspond to the glyphs actually displayed in my terminal window.  In
the terminal window I actually see:

ΓÇô

If I change the font to Lucida Console and run the *exact same code*,
I get this:

C:\>chcp 65001
Active code page: 65001

C:\>c:\python33\python
Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600
32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> '\u2013'
'–'

>>> import sys
>>> sys.stdout.buffer.write('\u2013\n'.encode('utf-8'))
–
4

Why is the font important?  I have no idea.  Blame Microsoft.



#33007

From: wxjmfauth@gmail.com
Date: 2012-11-09 02:06 -0800
Message-ID: <65d2286f-78dc-4eb8-945c-d15fb41a8232@googlegroups.com>
In reply to: #32980
On Thursday, 8 November 2012 21:42:58 UTC+1, Ian wrote:
> On Thu, Nov 8, 2012 at 12:54 PM,  <wxjmfauth@gmail.com> wrote:
> > The font has nothing to do with this.
> > You are "simply" encoding your "unicode" wrongly.
> >
> > >>> '\u2013'
> > '–'
> > >>> '\u2013'.encode('utf-8')
> > b'\xe2\x80\x93'
> > >>> '\u2013'.encode('utf-8').decode('cp1252')
> > '–'
>
> No, it seriously is the font.  This is what I get using the default
> ("Raster") font:
>
> C:\>chcp 65001
> Active code page: 65001
>
> C:\>c:\python33\python
> Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600
> 32 bit (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> '\u2013'
> '–'
> >>> import sys
> >>> sys.stdout.buffer.write('\u2013\n'.encode('utf-8'))
> –
> 4
>
> I should note here that the characters copied and pasted do not
> correspond to the glyphs actually displayed in my terminal window.  In
> the terminal window I actually see:
>
> ΓÇô
>
> If I change the font to Lucida Console and run the *exact same code*,
> I get this:
>
> C:\>chcp 65001
> Active code page: 65001
>
> C:\>c:\python33\python
> Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600
> 32 bit (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> '\u2013'
> '–'
>
> >>> import sys
> >>> sys.stdout.buffer.write('\u2013\n'.encode('utf-8'))
> –
> 4
>
> Why is the font important?  I have no idea.  Blame Microsoft.

---------

If you get something like 'ΓÇô', which in Unicode
nomenclature is:
>>> import unicodedata as ud
>>> for c in 'ΓÇô':
...     ud.name(c)
...     
'GREEK CAPITAL LETTER GAMMA'
'LATIN CAPITAL LETTER C WITH CEDILLA'
'LATIN SMALL LETTER O WITH CIRCUMFLEX'

it is a sign that "cp437" is involved somewhere.

>>> '\u2013'.encode('utf-8').decode('cp437')
'ΓÇô'

That is on Windows 7. I do not remember ever having a
"character encoding" issue on XP.
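
The same check can be automated. The following is only a sketch: the function name explain_mojibake and the list of candidate code pages are assumptions chosen for illustration, not anything from the standard library. It reports which candidates turn the UTF-8 bytes of the intended character into the text that was actually seen.

```python
def explain_mojibake(seen, intended,
                     candidates=('cp437', 'cp850', 'cp1252', 'latin-1')):
    """Hypothetical helper: list the code pages that decode the UTF-8 bytes
    of `intended` into the garbled text `seen`."""
    raw = intended.encode('utf-8')
    matches = []
    for codec in candidates:
        try:
            if raw.decode(codec) == seen:
                matches.append(codec)
        except UnicodeDecodeError:
            # Some code pages leave a few byte values undefined.
            continue
    return matches

# Only cp437 maps b'\xe2\x80\x93' to GAMMA, C WITH CEDILLA, O WITH CIRCUMFLEX.
print(explain_mojibake('ΓÇô', '\u2013'))
```

Several code pages can give the same result for a given character, so a single match is a hint rather than proof.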

jmf



#32981

From: "Prasad, Ramit" <ramit.prasad@jpmorgan.com>
Date: 2012-11-08 20:54 +0000
Message-ID: <mailman.3466.1352408089.27098.python-list@python.org>
In reply to: #32976
wxjmfauth@gmail.com wrote:
> 
> On Thursday, 8 November 2012 19:49:24 UTC+1, Ian wrote:
> > On Thu, Nov 8, 2012 at 11:32 AM, Oscar Benjamin
> >
> > <oscar.j.benjamin@gmail.com> wrote:
> >
> > > If I want the other characters to work I need to change the code page:
> > >
> > > O:\>chcp 65001
> > > Active code page: 65001
> > >
> > > O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
> > > α
> > >
> > > O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.encoding))"
> > > α
> >
> > I find that I also need to change the font.  With the default font,
> >
> > printing '\u2013' gives me:
> > –
> >
> > The only alternative font option I have in Windows XP is Lucida
> > Console, which at least works correctly, although it seems to be
> > lacking a lot of glyphs.
> 
> --------
> 
> The font has nothing to do with this.
> You are "simply" encoding your "unicode" wrongly.
> 


Why would the font not matter? Unicode is the abstract definition
of all characters, right? From that we map each abstract
character to a code page/character set, which gives it a concrete
value. From that code page we then display the concrete value
using the font. If the font does not have a glyph for a specific
character (or has a different glyph), then that is a font problem
and is not related to encoding.

Unicode->code page->font
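
The first two stages of that chain can be inspected from Python; the font stage cannot. A small sketch follows, in which the function name describe and the choice of code pages are assumptions for illustration. It shows the abstract character (code point and name) and the bytes it maps to in each encoding, with None where the code page does not contain the character.

```python
import unicodedata

def describe(ch, codecs=('utf-8', 'cp1252', 'cp850', 'cp437')):
    """Hypothetical helper: code point, Unicode name and per-encoding bytes."""
    info = {'code point': 'U+%04X' % ord(ch), 'name': unicodedata.name(ch)}
    for codec in codecs:
        try:
            info[codec] = ch.encode(codec)
        except UnicodeEncodeError:
            info[codec] = None  # this code page has no such character
    return info

for ch in ('\u2013', '\u03b1'):
    print(describe(ch))
```

A None entry corresponds to the UnicodeEncodeError seen earlier in the thread; a wrong glyph on screen with a valid byte sequence points further down the chain.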


> >>> '\u2013'
> '–'
> >>> '\u2013'.encode('utf-8')
> b'\xe2\x80\x93'
> >>> '\u2013'.encode('utf-8').decode('cp1252')
> '–'
> 

This is a mismatched translation between code pages; it is not
font related, but is instead one abstraction "level" up.




#32982

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2012-11-08 14:07 -0700
Message-ID: <mailman.3467.1352408866.27098.python-list@python.org>
In reply to: #32976
On Thu, Nov 8, 2012 at 1:54 PM, Prasad, Ramit <ramit.prasad@jpmorgan.com> wrote:
> Why would the font not matter? Unicode is the abstract definition
> of all characters, right? From that we map each abstract
> character to a code page/character set, which gives it a concrete
> value. From that code page we then display the concrete value
> using the font. If the font does not have a glyph for a specific
> character (or has a different glyph), then that is a font problem
> and is not related to encoding.

Usually though when the font is missing a glyph for a Unicode
character, you just get a missing glyph symbol, such as an empty
rectangle.  For some reason when using the default font, cmd seemingly
ignores the active code page, skips decoding the characters, and tries
to print the individual bytes as if using code page 437.
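
That behaviour can be imitated in Python without a Windows console. The sketch below rests on the assumption that the terminal simply decodes the output bytes with whatever code page it ends up using; the function name shown_by_terminal is invented for the illustration.

```python
def shown_by_terminal(raw, code_page):
    """Hypothetical model: decode output bytes as a terminal using code_page would."""
    return raw.decode(code_page, errors='replace')

raw = '\u2013'.encode('utf-8')               # what Python writes: b'\xe2\x80\x93'

as_utf8 = shown_by_terminal(raw, 'utf-8')    # active code page 65001 honoured
as_cp437 = shown_by_terminal(raw, 'cp437')   # each byte shown as a cp437 character

print(ascii(as_utf8))
print(ascii(as_cp437))

# The garbling is reversible as long as no byte was lost on the way.
repaired = as_cp437.encode('cp437').decode('utf-8')
print(ascii(repaired))
```

The bytes are identical in both cases; only the interpretation on the display side differs, which matches the observation that the same code gives different results under the two fonts.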
