From: Nikos
Newsgroups: comp.lang.python
Subject: Re: Unicode issue with Python v3.3
Date: Thu, 11 Apr 2013 09:55:18 -0700 (PDT)

On Thursday, 11 April 2013 1:45:22 PM UTC+3, Cameron Simpson wrote:
> On 10Apr2013 21:50, nagia.retsina@gmail.com wrote:
> | Firstly, thank you for taking a look at the code.
> | The doctype is coming from the attempt of the script metrites.py to open and read the 'index.html' file.
> | But I don't know how to open it as a byte file instead of a text file.
>
> I think you've got it backwards. It looks like metrites.py has
> opened the file as bytes instead of as text (probably utf8, but
> that remains to be seen). Because it has opened it in binary mode
> you're getting bytes when you read from the file.
>
> Can you show the relevant code that opens the file and reads from
> it, and the print statement that is putting it back out?
>
> You probably need to ensure that metrites.py is opening it as text,
> with the correct encoding. Note that the encoding has nothing to
> do with your _output_. It is the encoding of the data in the file
> you are reading, and that is dictated by the editor used to make
> the file.

This works in the shell, but doesn't work on my website:

$ cat utf8.txt
υλικό!Πρόκειται γ
$ python3
Python 3.2.3 (default, Oct 19 2012, 20:10:41)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> data = open('utf8.txt').read()
>>> print(data)
υλικό!Πρόκειται γ
>>> print(data.encode('utf-8'))
b'\xcf\x85\xce\xbb\xce\xb9\xce\xba\xcf\x8c!\xce\xa0\xcf\x81\xcf\x8c\xce\xba\xce\xb5\xce\xb9\xcf\x84\xce\xb1\xce\xb9 \xce\xb3\n'

See, the last line is what I am getting on my website. If I remove the encode('utf-8') part in metrites.py, the webpage will not show anything at all...
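
To make the binary-versus-text distinction from the quoted reply concrete, here is a minimal sketch. It is not the code from metrites.py (that code is not shown in this thread). The helper names read_as_bytes, read_as_text and emit are made up for illustration, and the sample bytes are copied from the session above. Whether writing encoded bytes to the binary output stream fixes the website is only an assumption; it depends on how the web server runs the script.

```python
import io
import os
import tempfile

# The UTF-8 bytes shown by print(data.encode('utf-8')) in the session above.
SAMPLE_BYTES = (
    b'\xcf\x85\xce\xbb\xce\xb9\xce\xba\xcf\x8c!\xce\xa0\xcf\x81\xcf\x8c\xce\xba'
    b'\xce\xb5\xce\xb9\xcf\x84\xce\xb1\xce\xb9 \xce\xb3\n'
)


def read_as_bytes(path):
    """Binary mode: return the raw bytes stored on disk (hypothetical helper)."""
    with open(path, "rb") as f:
        return f.read()


def read_as_text(path, encoding="utf-8"):
    """Text mode: decode the file using the encoding it was saved in."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


def emit(text, binary_stream, encoding="utf-8"):
    """Write text to a byte stream (for example sys.stdout.buffer) as encoded bytes."""
    binary_stream.write(text.encode(encoding))


# Create a throwaway stand-in for utf8.txt so the sketch is self-contained.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(SAMPLE_BYTES)
    path = tmp.name

raw = read_as_bytes(path)    # bytes object
text = read_as_text(path)    # str object
os.remove(path)

# print() on a bytes object shows its b'...' repr, which matches the
# b'\xcf\x85...' output seen on the website.
print(repr(raw)[:2])
print(type(raw).__name__, type(text).__name__)

# Encoded bytes written to a binary stream come out as real UTF-8, not a repr.
out = io.BytesIO()
emit(text, out)
```

In a script that prints to the browser, emit(text, sys.stdout.buffer) would take the place of print(text.encode('utf-8')), so the bytes themselves are sent rather than their repr.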