From: MRAB
Date: Tue, 03 Jul 2012 02:21:23 +0100
Subject: Re: helping with unicode
Newsgroups: comp.lang.python

On 03/07/2012 01:49, self.python wrote:
> it's a simple source view program.
>
> the codec of the target website is utf-8
> so I read it and print the decoded
>
> --------------------------------------------------------------
> #-*-coding:utf8-*-
> import urllib2
>
> rf=urllib2.urlopen(r"http://gall.dcinside.com/list.php?id=programming")
>
> print rf.read().decode('utf-8')
>
> raw_input()
> ---------------------------------------------------------------
>
> It works fine on python shell
>
> but when I make the file "wrong.py" and run it,
> Error rises.
>
> ----------------------------------------------------------------
> Traceback (most recent call last):
>   File "C:wrong.py", line 8, in <module>
>     print rf.read().decode('utf-8')
> UnicodeEncodeError: 'cp949' codec can't encode character u'\u1368' in position 5
> 5122: illegal multibyte sequence
> ---------------------------------------------------------------------
>
> cp949 is the basic codec of sys.stdout and cmd.exe
> but I have no idea why it doesn't works.
> printing without decode('utf-8') works fine on IDLE but on cmd, it print broken characters(Ascii portion is still fine, problem is only about the Korean)
>
> the question may look silly:(
> but I want to know what is the problem or how to print the not broken strings.
>
> thanks for reading.
>

The encoding of your console is 'cp949', so when you try to print the
Unicode string, Python tries to encode it as 'cp949'.

Unfortunately, the character (actually, when talking about Unicode the
correct term is 'codepoint') u'\u1368' cannot be encoded into 'cp949'
because that codepoint does not exist in that encoding, in the same way
that ASCII doesn't have Korean characters.

So what is that codepoint?

>>> import unicodedata
>>> unicodedata.name(u'\u1368')
'ETHIOPIC PARAGRAPH SEPARATOR'

Apparently 'cp949', which is for the Korean language, doesn't support
Ethiopic codepoints. Somehow that doesn't surprise me! :-)
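
To find out which codepoints in a string the console encoding cannot
represent, and to print the string anyway, something like the following
might do. It is only a sketch: the helper names unencodable and
to_console_bytes and the sample string page are made up for illustration,
and the 'replace' error handler simply substitutes '?' for anything
'cp949' lacks.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import unicodedata


def unencodable(text, encoding):
    """Return the distinct codepoints in text that encoding cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in bad:
                bad.append(ch)
    return bad


def to_console_bytes(text, encoding="cp949"):
    """Encode text for the console, putting '?' in place of unsupported codepoints."""
    return text.encode(encoding, "replace")


# Hypothetical sample standing in for the decoded page: ASCII, the
# Ethiopic codepoint from the traceback, and two Hangul syllables.
page = "id=programming \u1368 \ud55c\uae00"

# Report each offending codepoint by number and Unicode name.
for ch in unencodable(page, "cp949"):
    print("U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>")))

# The encoded bytes no longer raise UnicodeEncodeError when written out.
print(repr(to_console_bytes(page)))
```

In the original script the same idea would presumably amount to
encoding explicitly before printing, e.g.
rf.read().decode('utf-8').encode('cp949', 'replace'), so that the Korean
text comes out intact and only the unsupported codepoints turn into '?'.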