From: Dominique Ramaekers
Date: Sun, 17 Aug 2014 11:40:34 +0200
To: python-list@python.org
Newsgroups: comp.lang.python
Subject: Re: Unicode in cgi-script with apache2
References: <53f05ed9$0$30003$c3e8da3$5496439d@news.astraweb.com>

Wow, everybody keeps chewing on this problem. As a bonus, I've
reconfigured my server to run some tests.

http://cloudserver.ramaekers-stassart.be/test.html
=> is the file I want to read. Going to this URL displays the file...
http://cloudserver.ramaekers-stassart.be/cgi-python/encoding1
=> is the cgi-script of this test
http://cloudserver.ramaekers-stassart.be/wsgi
=> is the wsgi solution (but for now it just says 'Hello world'...)

----------------This configuration-----------------------------

dominique@cloudserver:/var/www/cgi-python$ cat /etc/default/locale
LANG="en_US.UTF-8"
LANGUAGE="en_US:"

dominique@cloudserver:/var/www/cgi-python$ cat /etc/apache2/sites-enabled/000-default.conf
ServerAdmin dominique@ramaekers-stassart.be
WSGIScriptAlias /wsgi /var/www/wsgi/application
Order allow,deny
Allow from all
DocumentRoot /var/www/html
ScriptAlias /cgi-python /var/www/cgi-python/
Options ExecCGI
SetHandler cgi-script
ErrorLog ${APACHE_LOG_DIR}/error.log
CustomLog ${APACHE_LOG_DIR}/access.log combined

dominique@cloudserver:/var/www/cgi-python$ cat encoding1
#!/usr/bin/env python3
print("Content-Type: text/html")
print("Cache-Control: no-cache, must-revalidate") # HTTP/1.1
print("Expires: Sat, 26 Jul 1997 05:00:00 GMT") # Date in the past
print("")
f = open("/var/www/html/test.html", "r")
for line in f:
    print(line,end='')

dominique@cloudserver:/var/www/cgi-python$ cat ../html/test.html
Testing my cgi...

Ok, Testing my cgi... Lets try some characters: é ë ü

dominique@cloudserver:/var/www/cgi-python$ file ../html/test.html
../html/test.html: HTML document, UTF-8 Unicode text

---------Start test----------------------

In the browser:
http://cloudserver.ramaekers-stassart.be/test.html
=> page displays OK (try it yourself...)

In the terminal: => all goes well...

dominique@cloudserver:/var/www/cgi-python$ ./encoding1
Content-Type: text/html
Cache-Control: no-cache, must-revalidate
Expires: Sat, 26 Jul 1997 05:00:00 GMT

Testing my cgi...

Ok, Testing my cgi... Lets try some characters: é ë ü

In the browser (Firefox):
http://cloudserver.ramaekers-stassart.be/cgi-python/encoding1
=> gives a blank page!

The error log says:

root@cloudserver:~# cat /var/log/apache2/error.log | tail -n 6
[Sun Aug 17 11:09:21.102003 2014] [cgi:error] [pid 32146] [client 84.194.120.161:36707] AH01215: Traceback (most recent call last):
[Sun Aug 17 11:09:21.102129 2014] [cgi:error] [pid 32146] [client 84.194.120.161:36707] AH01215: File "/var/www/cgi-python/encoding1", line 7, in <module>
[Sun Aug 17 11:09:21.102149 2014] [cgi:error] [pid 32146] [client 84.194.120.161:36707] AH01215: for line in f:
[Sun Aug 17 11:09:21.102201 2014] [cgi:error] [pid 32146] [client 84.194.120.161:36707] AH01215: File "/usr/lib/python3.4/encodings/ascii.py", line 26, in decode
[Sun Aug 17 11:09:21.102243 2014] [cgi:error] [pid 32146] [client 84.194.120.161:36707] AH01215: return codecs.ascii_decode(input, self.errors)[0]
[Sun Aug 17 11:09:21.102318 2014] [cgi:error] [pid 32146] [client 84.194.120.161:36707] AH01215: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 162: ordinal not in range(128)

--------------Conclusion-----------------------------

With my current configuration, the bug is reproduced!

-------------------Test 2: new configuration-----------------------------

I change the line
f = open("/var/www/html/test.html", "r")
into
f = open("/var/www/html/test.html", "r", encoding="utf-8")
and save the script as encoding2.

In the terminal: => all OK
In the browser: => blank page!!!
Error log in Apache:

root@cloudserver:~# cat /var/log/apache2/error.log | tail -n 4
[Sun Aug 17 11:13:47.372353 2014] [cgi:error] [pid 32147] [client 84.194.120.161:36711] AH01215: Traceback (most recent call last):
[Sun Aug 17 11:13:47.372461 2014] [cgi:error] [pid 32147] [client 84.194.120.161:36711] AH01215: File "/var/www/cgi-python/encoding2", line 8, in <module>
[Sun Aug 17 11:13:47.372483 2014] [cgi:error] [pid 32147] [client 84.194.120.161:36711] AH01215: print(line,end='')
[Sun Aug 17 11:13:47.372572 2014] [cgi:error] [pid 32147] [client 84.194.120.161:36711] AH01215: UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 51: ordinal not in range(128)

---------Conclusion------------------

Steven was right. It was a read error => with the encoding2 script the
file is read as UTF-8. Though I find it strange: the file is in UTF-8
and Python 3 has UTF-8 as standard...

But reading the file is fixed. Now the writing is still broken...

Here are some of the tests hinted at before:

Tip from Steven => getting the encoding:

dominique@cloudserver:/var/www/cgi-python$ cat readencoding
#!/usr/bin/env python3
import sys
print("Content-Type: text/html")
print("")
print(sys.getfilesystemencoding())

Gives in the terminal: utf-8
Gives in the browser: ascii

Found the problem! Now, why does Apache start Python in ASCII?

Putting these lines in my Apache config:

AddDefaultCharset UTF-8
SetEnv PYTHONIOENCODING utf-8

Cleared my browser cache... No change. I removed these lines again.

If someone wants me to try more things, just post them. I'll try to
process them all.

I don't want to change the code. I want Apache and Python 3 to work in
UTF-8 and not in ASCII. Fixing it in my code seems to me like a dirty
fix...

For now I'm going on with wsgi and hope I don't get the same problem
(but now I think I will :( ....)

Regards

On 17-08-14 at 09:50, Steven D'Aprano wrote:
....
>
> I think you've got it. I've been assuming the problem was on *writing* the
> line.
> That's because the OP was insistent that the line failing was
>
> [quoting Dominique]
> The problem is, when python 'prints' to the apache interface, it
> translates the string to ascii.
>
>
> but if you read the traceback, you're right, the problem is *reading* the
> file, not printing:
>
> [Sat Aug 16 23:12:42.158326 2014] [cgi:error] [pid 29327] [client
> 119.63.193.196:11110] AH01215: Traceback (most recent call last):
> [Sat Aug 16 23:12:42.158451 2014] [cgi:error] [pid 29327] [client
> 119.63.193.196:11110] AH01215: File "/var/www/cgi-python/index.html",
> line 12, in <module>
> [Sat Aug 16 23:12:42.158473 2014] [cgi:error] [pid 29327] [client
> 119.63.193.196:11110] AH01215: for line in f:
....
>
>> I wonder if specifying the binary data parameter and / or utf-8 encoding
>> when opening the file might help.
> We don't really know what encoding the index.html file is encoded in. It
> might be Latin-1, or cp-1252, or some other legacy encoding. But let's
> assume it's UTF-8.
>
> So why is Dominique's script reading it in ASCII? That's the key question. I
> have a sinking feeling that Apache may be running Python as a subprocess
> with the C locale, maybe. I don't know enough about cgi to be more than
> just guessing.
>
> Dominique, if you write:
>
> f = open("/var/www/cgi-data/index.html", "r", encoding='utf-8')
>
> the problem should go away (assuming index.html is valid UTF-8). If it
> doesn't, there's a very strange bug somewhere.
>
> Please try that, and see if it fixes the problem, or if the error goes to a
> different line.
.....
>
>> f = open( "/var/www/cgi-data/index.html", "r", encoding="utf-8" )
> That's the bunny!
>
> If you just want to hide the problem without fixing the underlying cause,
> add an argument errors="replace", which is ugly but at least lets you move
> on:
>
> py> b = "Hello ë ü world".encode('utf-8')
> py> print(b.decode('ascii', errors='replace'))
> Hello �� �� world
>
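
For anyone following the thread, here is a minimal sketch of the explicit-encoding workaround discussed above. It is not the server-side fix asked for in this post, only a way to make the script independent of whatever locale Apache hands to it. The function names report_encodings and emit_utf8 are made up for illustration, and the "charset=utf-8" parameter on the Content-Type header is an addition that the original encoding1 script does not have. The idea: report which encodings Python picked up (on the Python 3.4 used here, open() without encoding= falls back to the locale's preferred encoding), then decode the file as UTF-8 on the way in and write UTF-8 bytes to the binary stream on the way out.

```python
#!/usr/bin/env python3
import locale
import sys


def report_encodings():
    """Return the encodings Python picked up from its environment."""
    return {
        # What open() uses in text mode when no encoding= is passed.
        "open_default": locale.getpreferredencoding(False),
        # What the readencoding test script in this thread prints.
        "filesystem": sys.getfilesystemencoding(),
        # What print() encodes with; None if stdout has no such attribute.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


def emit_utf8(path, out):
    """Send a UTF-8 text file to a binary stream, preceded by a CGI header.

    `out` must be a binary stream such as sys.stdout.buffer.
    """
    # The header is plain ASCII; the charset parameter (an assumption,
    # not in the original script) tells the browser how to decode the body.
    out.write(b"Content-Type: text/html; charset=utf-8\n\n")
    # Decode explicitly instead of relying on the locale default.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            # Encode explicitly instead of relying on sys.stdout.encoding.
            out.write(line.encode("utf-8"))
    out.flush()
```

In a CGI script this would be called as emit_utf8("/var/www/html/test.html", sys.stdout.buffer). Writing the result of report_encodings() to sys.stderr should put the values in the Apache error log, next to the tracebacks shown above.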