Newsgroups: comp.lang.python
From: nagia.retsina@gmail.com
Subject: Re: Unicode issue with Python v3.3
Date: Fri, 12 Apr 2013 21:50:45 -0700 (PDT)

On Saturday, 13 April 2013 4:41:57 AM UTC+3, Cameron Simpson wrote:
> On 11Apr2013 09:55, Nikos wrote:
> | On Thursday, 11 April 2013 1:45:22 PM UTC+3, Cameron Simpson wrote:
> | > On 10Apr2013 21:50, nagia.retsina@gmail.com wrote:
> | > | the doctype is coming from the attempt of script metrites.py to open and read the 'index.html' file.
> | > | But i don't know how to try to open it as a byte file instead of a text file.
>
> Lele Gaifax showed one way:
>
>   from codecs import open
>   with open('index.html', encoding='utf-8') as f:
>       content = f.read()
>
> But a plain open() should also do:
>
>   with open('index.html') as f:
>       content = f.read()
>
> if you're not taking tight control of the file encoding.
>
> The point here is to get _text_ (i.e. str) data from the file, not bytes.
>
> If the text turns out to be incorrectly decoded (i.e. incorrectly
> reading the file bytes and assembling them into text strings) because
> the default encoding is wrong, then you may need to reach for Lele's
> more verbose open() example to select the correct encoding.
>
> But first ignore that and get text (str) instead of bytes.
> If you're already getting text from the file, something later is
> making bytes and handing it to print().
>
> Another approach to try is to use
>   sys.stdout.write()
> instead of
>   print()
>
> The print() function will take _anything_ and write text of some form.
> The write() function will throw an exception if it gets the wrong type of data.
>
> If sys.stdout is opened in binary mode then write() will require
> bytes as data; strings will need to be explicitly turned into bytes
> via .encode() in order to not raise an exception.
>
> If sys.stdout is open in text mode, write() will require str data.
> The sys.stdout file itself will transcribe to bytes for you.
>
> If you take that route, at least you will not have confusion about
> str versus bytes.
>
> For an HTML output page I would advocate arranging that sys.stdout
> is in text mode; that way you can do the natural thing and .write()
> str data and lovely UTF-8 bytes will come out the other end.
>
> If the above test (using .write() instead of print()) shows it to
> be in binary mode we can fix that. But you need to find out.
>
> You will want access to the error messages from the CGI environment;
> do you have access to the web server's error_log? You can tail that
> in a terminal while you reload the page to see what's going on.
>
> | This works in the shell, but doesn't work on my website:
> |
> | $ cat utf8.txt
> | υλικό!Πρόκειται γ
>
> Ok, so your terminal is using UTF-8 as its output coding. (And so
> is your mail posting program, since we see it unmangled on my screen
> here.)
>
> | $ python3
> | Python 3.2.3 (default, Oct 19 2012, 20:10:41)
> | [GCC 4.6.3] on linux2
> | Type "help", "copyright", "credits" or "license" for more information.
> | >>> data = open('utf8.txt').read()
> | >>> print(data)
> | υλικό!Πρόκειται γ
>
> Likewise.
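
For anyone following along, here is a minimal sketch of the write() behaviour described above. It uses in-memory streams as stand-ins for sys.stdout (an assumption, so that nothing is actually printed), and the helper name try_write is made up for the illustration.

```python
import io

# In-memory stand-ins for sys.stdout (assumption: keeps the demo free of side effects).
raw = io.BytesIO()                                # behaves like a binary-mode stream
text = io.TextIOWrapper(raw, encoding="utf-8")    # behaves like a text-mode stream


def try_write(stream, data):
    """Hypothetical helper: return None on success, else the exception type name."""
    try:
        stream.write(data)
        return None
    except TypeError as exc:
        return type(exc).__name__


text_accepts_str = try_write(text, "υλικό")       # str into a text stream: accepted
text_rejects_bytes = try_write(text, b"abc")      # bytes into a text stream: TypeError
text.flush()                                      # push the encoded text down to raw

binary_rejects_str = try_write(raw, "abc")        # str into a binary stream: TypeError
binary_accepts_bytes = try_write(raw, "γ".encode("utf-8"))  # explicit .encode(): accepted

encoded = raw.getvalue()                          # the UTF-8 bytes that reached the "output"
```

Unlike print(), write() refuses the wrong type outright, so the exception itself tells you which mode the stream is in.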
>
> However, in an exciting twist, I seem to recall that Python invoked
> interactively with a terminal as output will have the default terminal
> encoding in place on sys.stdout, producing what you expect. _However_,
> for python invoked in a batch environment where stdout is not a terminal
> (such as in the CGI environment producing your web page), that is
> _not_ necessarily the case.
>
> | >>> print(data.encode('utf-8'))
> | b'\xcf\x85\xce\xbb\xce\xb9\xce\xba\xcf\x8c!\xce\xa0\xcf\x81\xcf\x8c\xce\xba\xce\xb5\xce\xb9\xcf\x84\xce\xb1\xce\xb9 \xce\xb3\n'
> |
> | See, the last line is what I am getting on my website.
>
> The above line takes your Unicode text in "data" and transcribes
> it to bytes using UTF-8 as the encoding. And print() is then receiving
> that bytes object and printing its str() representation as "b'....'".
> That str is itself unicode, and when print passes it to sys.stdout,
> _that_ transcribes the unicode "b'...'" string as bytes to your
> terminal. Using UTF-8 based on the previous examples above, but
> since all those characters are in the bottom 127 code range the
> byte sequence will be the same if it uses ASCII or ISO8859-1 or
> almost anything else:-)
>
> As you can see, there's a lot of encoding/decoding going on behind
> the scenes even in this superficially simple example.
>
> | If i remove
> | the encode('utf-8') part in metrites.py, the webpage will not show
> | anything at all...
>
> Ah, but data will be being output. The print() function _will_ be
> writing "data" out in some form. I suggest you remove the .encode()
> and then examine the _source_ text of the web page, not its visible
> form.
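
The b'...' effect explained above can be reproduced without a web server. This is only a sketch: the sample string is the first word of utf8.txt from the thread, and io.StringIO stands in for a text-mode sys.stdout (an assumption made so the output can be inspected).

```python
import io

data = "υλικό!"                    # sample text, first word of utf8.txt in the thread
encoded = data.encode("utf-8")     # a bytes object, as produced in metrites.py

buf = io.StringIO()                # stand-in for a text-mode sys.stdout
print(encoded, file=buf)           # print() writes str(encoded), i.e. the b'...' form
print(data, file=buf)              # printing the str itself keeps the Greek text

shown_for_bytes, shown_for_str = buf.getvalue().splitlines()
only_ascii = all(ord(ch) < 128 for ch in shown_for_bytes)
```

Here shown_for_bytes is the escaped b'\xcf\x85...' text, made only of ASCII characters, which is why it looks the same whatever encoding the output stream then applies.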
>
> So: remove .encode(), reload the web page, "view page source"
> (depends on your browser; it is Ctrl-U in Firefox, Cmd-U in Firefox
> on a Mac).
>
> I think a lot of the issue you have in this thread is that your
> page is too complex. Make another page to do the same thing, and
> start with nothing. Add stuff to it a single item at a time until
> the page behaves incorrectly. Then you will know the exact item of
> code that introduced the issue. And then that single item can be
> examined in detail for the decode/encode issues.
>
> The other issue in the thread is that people losing patience get
> snarky. Respond only to the technical content. If a message is only
> snarky, _ignore_ it. People like the last word; let them have it
> and you won't get sidetracked into arguments.
>
> Cheers,
> -- 
> Cameron Simpson
>
> PCs are like a submarine, it will work fine till you open Windows. - zollie101

First of all, thank you very much Cameron for your detailed help and the effort you put into writing to me.

It seems another issue had happened without my knowledge: I was uploading stuff to /root/public_html/cgi-bin instead of /home/nikos/public_html/cgi-bin.

I realized that when I deliberately introduced an error into the metrites.py script and still got the same page.

Okay, after that was corrected, I tried the plain solution and got this response back from the shell:

Traceback (most recent call last):
  File "metrites.py", line 213, in <module>
    htmldata = f.read()
  File "/root/.local/lib/python2.7/lib/python3.3/encodings/iso8859_7.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0xae in position 47: character maps to <undefined>

Then I switched to:

with open('/home/nikos/www/' + page, encoding='utf-8') as f:
    htmldata = f.read()

and I got no error at all, just a clean run *from the shell*!

But I get an internal server error when I try to run the webpage from the browser (Chrome).

So, can you please tell me where I can find the Apache error log, so that I can post it here?

Is the Apache error_log always better than running 'python3 metrites.py', because even if the Python script has no error, Apache will also show more web-related things?

Thank you Cameron.
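
P.S. The traceback above is consistent with a UTF-8 file being decoded as ISO-8859-7, where byte 0xAE has no character assigned. Below is a small sketch that reproduces the same kind of failure and the fix. The page content and the file name are made up for the test, and the failing codec is named explicitly here, whereas in the real script it was presumably picked up from the default locale.

```python
import os
import tempfile

html = "<p>Καλή μέρα</p>"      # hypothetical page content; 'ή' is 0xCE 0xAE in UTF-8
raw = html.encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.html")   # hypothetical file name
    with open(path, "wb") as f:
        f.write(raw)

    # Decoding the UTF-8 bytes as ISO-8859-7 fails on the 0xAE byte.
    try:
        with open(path, encoding="iso8859_7") as f:
            f.read()
        wrong_codec_error = None
    except UnicodeDecodeError as exc:
        wrong_codec_error = exc

    # Naming the file's real encoding decodes it cleanly.
    with open(path, encoding="utf-8") as f:
        htmldata = f.read()
```

Passing encoding='utf-8' explicitly also makes the script independent of whatever locale the web server happens to run it under.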