
helping with unicode

Started by	"self.python" <howmuchistoday@gmail.com>
First post	2012-07-02 17:49 -0700
Last post	2012-07-02 21:39 -0400
Articles	5 — 4 participants


  helping with unicode "self.python" <howmuchistoday@gmail.com> - 2012-07-02 17:49 -0700
    Re: helping with unicode Andrew Berg <bahamutzero8825@gmail.com> - 2012-07-02 20:14 -0500
    Re: helping with unicode MRAB <python@mrabarnett.plus.com> - 2012-07-03 02:21 +0100
    Re: helping with unicode Terry Reedy <tjreedy@udel.edu> - 2012-07-02 21:39 -0400
    Re: helping with unicode Terry Reedy <tjreedy@udel.edu> - 2012-07-02 21:39 -0400

#24793 — helping with unicode

From	"self.python" <howmuchistoday@gmail.com>
Date	2012-07-02 17:49 -0700
Subject	helping with unicode
Message-ID	<56e3cafd-ec4f-4ae4-ad6c-685f2d991403@googlegroups.com>

It's a simple source-viewing program.

The target website is encoded in utf-8,
so I read it and print the decoded text.

--------------------------------------------------------------
#-*-coding:utf8-*-
import urllib2

rf=urllib2.urlopen(r"http://gall.dcinside.com/list.php?id=programming")

print rf.read().decode('utf-8')

raw_input()
---------------------------------------------------------------

It works fine in the Python shell,

but when I save it as the file "wrong.py" and run it,
an error is raised.

----------------------------------------------------------------
Traceback (most recent call last):
  File "C:wrong.py", line 8, in <module>
    print rf.read().decode('utf-8')
UnicodeEncodeError: 'cp949' codec can't encode character u'\u1368' in position 55122: illegal multibyte sequence
---------------------------------------------------------------------

cp949 is the default codec of sys.stdout and cmd.exe,
but I have no idea why it doesn't work.
Printing without decode('utf-8') works fine in IDLE, but in cmd it prints broken characters (the ASCII portion is still fine; the problem is only with the Korean).

The question may look silly :(
but I want to know what the problem is, or how to print the strings unbroken.

thanks for reading.


#24795

From	Andrew Berg <bahamutzero8825@gmail.com>
Date	2012-07-02 20:14 -0500
Message-ID	<mailman.1723.1341278107.4697.python-list@python.org>
In reply to	#24793

On 7/2/2012 7:49 PM, self.python wrote:
> ----------------------------------------------------------------
> Traceback (most recent call last):
>   File "C:wrong.py", line 8, in <module>
>     print rf.read().decode('utf-8')
> UnicodeEncodeError: 'cp949' codec can't encode character u'\u1368' in position 55122: illegal multibyte sequence
> ---------------------------------------------------------------------
> 
> cp949 is the basic codec of sys.stdout and cmd.exe  
> but I have no idea why it doesn't works.
> printing without decode('utf-8') works fine on IDLE but on cmd, it print broken characters(Ascii portion is still fine, problem is only about the Korean)
Your terminal can't display those characters. You could try using other
code pages with chcp (a CLI utility that is part of Windows). IDLE is a
GUI, so it does not have to work with code pages.

Python 3.3 supports cp65001 (which is the equivalent of UTF-8 for
Windows terminals), but unfortunately, previous versions do not.
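
To check which encoding Python has picked up for the terminal, and to reproduce the failure without a Korean console, something like the following may help. It is only a sketch in Python 3 syntax (an assumption; the script in the thread is 2.x), and `write_to_fake_console` is a hypothetical helper that stands in for the terminal with an in-memory stream:

```python
import io
import sys

# The encoding Python believes the terminal uses (cp949 on the poster's machine).
print(sys.stdout.encoding)


def write_to_fake_console(text, encoding, errors="strict"):
    """Write text to an in-memory stream that encodes the way a console stream would."""
    raw = io.BytesIO()
    console = io.TextIOWrapper(raw, encoding=encoding, errors=errors)
    console.write(text)
    console.flush()
    return raw.getvalue()


sample = "abc \u1368"  # ASCII plus the character named in the traceback

try:
    write_to_fake_console(sample, "cp949")
except UnicodeEncodeError as exc:
    print("cp949 stream fails:", exc.reason)

# A UTF-8 stream accepts the same text without complaint.
print(write_to_fake_console(sample, "utf-8"))
```

The point of the fake stream is only that the error comes from the output encoding, not from the download or the decode step.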
-- 
CPython 3.3.0a4 | Windows NT 6.1.7601.17803


#24797

From	MRAB <python@mrabarnett.plus.com>
Date	2012-07-03 02:21 +0100
Message-ID	<mailman.1724.1341278482.4697.python-list@python.org>
In reply to	#24793

On 03/07/2012 01:49, self.python wrote:
> it's a simple source view program.
>
> the codec of the target website is utf-8
> so I read it and print the decoded
>
> --------------------------------------------------------------
> #-*-coding:utf8-*-
> import urllib2
>
> rf=urllib2.urlopen(r"http://gall.dcinside.com/list.php?id=programming")
>
> print rf.read().decode('utf-8')
>
> raw_input()
> ---------------------------------------------------------------
>
> It works fine on python shell
>
> but when I make the file "wrong.py" and run it,
> Error rises.
>
> ----------------------------------------------------------------
> Traceback (most recent call last):
>    File "C:wrong.py", line 8, in <module>
>      print rf.read().decode('utf-8')
> UnicodeEncodeError: 'cp949' codec can't encode character u'\u1368' in position 55122: illegal multibyte sequence
> ---------------------------------------------------------------------
>
> cp949 is the basic codec of sys.stdout and cmd.exe
> but I have no idea why it doesn't works.
> printing without decode('utf-8') works fine on IDLE but on cmd, it print broken characters(Ascii portion is still fine, problem is only about the Korean)
>
> the question may look silly:(
> but I want to know what is the problem or how to print the not broken strings.
>
> thanks for reading.
>
The encoding of your console is 'cp949', so when you try to print the
Unicode string, Python tries to encode it as 'cp949'.

Unfortunately, the character (actually, when talking about Unicode the
correct term is 'codepoint') u'\u1368' cannot be encoded into 'cp949'
because that codepoint does not exist in that encoding, in the same way
that ASCII doesn't have Korean characters.

So what is that codepoint?

 >>> import unicodedata
 >>> unicodedata.name(u'\u1368')
'ETHIOPIC PARAGRAPH SEPARATOR'

Apparently 'cp949', which is for the Korean language, doesn't support
Ethiopic codepoints. Somehow that doesn't surprise me! :-)
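
The same check can be done directly against the codec. A minimal sketch, written in Python 3 syntax (an assumption, since the original script is 2.x); `can_encode` is a hypothetical helper name:

```python
import unicodedata


def can_encode(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


korean = "\ud55c\uae00"   # the two Hangul syllables of the word "Hangul"
ethiopic = "\u1368"       # the codepoint from the traceback

print(unicodedata.name(ethiopic))
print(can_encode(korean, "cp949"))    # Korean text is fine
print(can_encode(ethiopic, "cp949"))  # the Ethiopic codepoint is not
```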


#24799

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-07-02 21:39 -0400
Message-ID	<mailman.1726.1341279573.4697.python-list@python.org>
In reply to	#24793

On 7/2/2012 8:49 PM, self.python wrote:
> it's a simple source view program.
>
> the codec of the target website is utf-8
> so I read it and print the decoded

which re-encodes before printing

> --------------------------------------------------------------
> #-*-coding:utf8-*-
> import urllib2
>
> rf=urllib2.urlopen(r"http://gall.dcinside.com/list.php?id=programming")
>
> print rf.read().decode('utf-8')
>
> raw_input()
> ---------------------------------------------------------------
>
> It works fine on python shell

Do you mean the Windows Command Prompt shell?
>
> but when I make the file "wrong.py" and run it,
> Error rises.
>
> ----------------------------------------------------------------
> Traceback (most recent call last):
>    File "C:wrong.py", line 8, in <module>
>      print rf.read().decode('utf-8')
> UnicodeEncodeError: 'cp949' codec can't encode character u'\u1368' in position 55122: illegal multibyte sequence
> ---------------------------------------------------------------------
>
> cp949 is the basic codec of sys.stdout and cmd.exe
> but I have no idea why it doesn't works.

cp949 is Microsoft's extension of the EUC-KR multibyte encoding; its mapping is given at
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP949.TXT
u'\u1368' is not in the mapping. There is no reason the utf-8 site would
restrict itself to the cp949 subset.

Perhaps it prints in the interpreter because 2.x uses errors = 'replace'
rather than 'strict' (as in 3.x).

Try print rf.read().decode('utf-8').encode('cp949', errors = 'replace')
Non-cp949 chars will print as '?'.
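
For comparison, here is roughly what a few of the standard error handlers produce for one encodable and one unencodable character. This is a sketch in Python 3 syntax (an assumption; in 2.x the string would be a u'' literal), and the sample text is made up for illustration:

```python
# One Hangul syllable that cp949 can encode, followed by the Ethiopic
# character from the traceback, which it cannot.
text = "\uac00\u1368"

for handler in ("replace", "backslashreplace", "xmlcharrefreplace"):
    encoded = text.encode("cp949", errors=handler)
    print(handler, encoded)
```

'replace' loses the character entirely, while the other two keep enough information to recover it later, which may matter if the output is saved rather than just looked at.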

> printing without decode('utf-8') works fine on IDLE

because IDLE encodes to utf-8, and x.decode('utf-8').encode('utf-8') == x

> but on cmd, it print broken characters

Printing utf-8 encoded bytes as if they were cp949 encoded bytes is pretty hilarious
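
A small sketch of both effects, in Python 3 syntax (an assumption; the thread's code is 2.x). The sample word is made up for illustration, and errors='replace' is used only so that the wrong decode does not raise:

```python
original = "\ud55c\uae00"            # two Hangul syllables
utf8_bytes = original.encode("utf-8")

# Round trip through the same codec gives back the same text,
# which is why a UTF-8 display shows the raw bytes correctly.
round_trip = utf8_bytes.decode("utf-8")
print(round_trip == original)

# Reading the same bytes as cp949, as a cp949 console does, gives garbage.
garbled = utf8_bytes.decode("cp949", errors="replace")
print(repr(garbled))
```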
>
> the question may look silly:(
> but I want to know what is the problem or how to print the not broken strings.
>
> thanks for reading.


-- 
Terry Jan Reedy

