
Unicode in cgi-script with apache2

Started by	Dominique Ramaekers <dominique@ramaekers-stassart.be>
First post	2014-08-15 20:10 +0200
Last post	2014-08-17 01:08 -0700
Articles	20 on this page of 23 — 9 participants


  Unicode in cgi-script with apache2 Dominique Ramaekers <dominique@ramaekers-stassart.be> - 2014-08-15 20:10 +0200
    Re: Unicode in cgi-script with apache2 alister <alister.nospam.ware@ntlworld.com> - 2014-08-15 19:27 +0000
      Re: Unicode in cgi-script with apache2 Dominique Ramaekers <dominique@ramaekers-stassart.be> - 2014-08-17 00:36 +0200
        Re: Unicode in cgi-script with apache2 Denis McMahon <denismfmcmahon@gmail.com> - 2014-08-17 02:50 +0000
          Re: Unicode in cgi-script with apache2 Dominique Ramaekers <dominique@ramaekers-stassart.be> - 2014-08-17 07:32 +0200
          Re: Unicode in cgi-script with apache2 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-08-17 17:50 +1000
            Re: Unicode in cgi-script with apache2 Dominique Ramaekers <dominique@ramaekers-stassart.be> - 2014-08-17 11:40 +0200
            Re: Unicode in cgi-script with apache2 wxjmfauth@gmail.com - 2014-08-17 03:05 -0700
            Re: Unicode in cgi-script with apache2 Peter Otten <__peter__@web.de> - 2014-08-17 13:04 +0200
            Re: Unicode in cgi-script with apache2 Dominique Ramaekers <dominique@ramaekers-stassart.be> - 2014-08-17 13:34 +0200
            Re: Unicode in cgi-script with apache2 Dominique Ramaekers <dominique@ramaekers-stassart.be> - 2014-08-17 14:02 +0200
              Re: Unicode in cgi-script with apache2 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-08-17 23:00 +1000
                Re: Unicode in cgi-script with apache2 wxjmfauth@gmail.com - 2014-08-17 08:56 -0700
            Re: Unicode in cgi-script with apache2 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-08-17 13:35 +0100
              Re: Unicode in cgi-script with apache2 Tony the Tiger <tony@tiger.invalid> - 2014-08-18 04:39 +0000
            Re: Unicode in cgi-script with apache2 Peter Otten <__peter__@web.de> - 2014-08-17 15:12 +0200
            Re: Unicode in cgi-script with apache2 Peter Otten <__peter__@web.de> - 2014-08-17 16:06 +0200
        Re: Unicode in cgi-script with apache2 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-08-17 15:54 +1000
    Re: Unicode in cgi-script with apache2 John Gordon <gordon@panix.com> - 2014-08-15 19:32 +0000
      Re: Unicode in cgi-script with apache2 Dominique Ramaekers <dominique@ramaekers-stassart.be> - 2014-08-17 00:39 +0200
    Re: Unicode in cgi-script with apache2 Denis McMahon <denismfmcmahon@gmail.com> - 2014-08-16 16:40 +0000
      Re: Unicode in cgi-script with apache2 Dominique Ramaekers <dominique@ramaekers-stassart.be> - 2014-08-17 00:57 +0200
    Re: Unicode in cgi-script with apache2 wxjmfauth@gmail.com - 2014-08-17 01:08 -0700


#76382 — Unicode in cgi-script with apache2

From	Dominique Ramaekers <dominique@ramaekers-stassart.be>
Date	2014-08-15 20:10 +0200
Subject	Unicode in cgi-script with apache2
Message-ID	<mailman.13038.1408130249.18130.python-list@python.org>

Hi,

I've got a little script:

#!/usr/bin/env python3
print("Content-Type: text/html")
print("Cache-Control: no-cache, must-revalidate")    # HTTP/1.1
print("Expires: Sat, 26 Jul 1997 05:00:00 GMT") # Date in the past
print("")
f = open("/var/www/cgi-data/index.html", "r")
for line in f:
     print(line,end='')

If I run the script in the terminal, it nicely prints the webpage 
'index.html'.

If I access the script through a web browser, Apache gives an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
1791: ordinal not in range(128)

I've spent a whole afternoon reading forums and blogs, but I haven't
found a solution.

Can anyone help me?

Greetings,

Dominique.


#76383

From	alister <alister.nospam.ware@ntlworld.com>
Date	2014-08-15 19:27 +0000
Message-ID	<satHv.195207$ze2.61877@fx28.am4>
In reply to	#76382

On Fri, 15 Aug 2014 20:10:25 +0200, Dominique Ramaekers wrote:

> Hi,
> 
> I've got a little script:
> 
> #!/usr/bin/env python3
> print("Content-Type: text/html")
> print("Cache-Control: no-cache, must-revalidate")    # HTTP/1.1
> print("Expires: Sat, 26 Jul 1997 05:00:00 GMT") # Date in the past
> print("")
> f = open("/var/www/cgi-data/index.html", "r")
> for line in f:
>      print(line,end='')
> 
> If I run the script in the terminal, it nicely prints the webpage
> 'index.html'.
> 
> If I access the script through a web browser, Apache gives an error:
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
> 1791: ordinal not in range(128)
> 
> I've spent a whole afternoon reading forums and blogs, but I haven't
> found a solution.
> 
> Can anyone help me?
> 
> Greetings,
> 
> Dominique.

1) This is not the way to get Python to generate a web page. If you don't
want to use an existing framework (for example, if you are doing this as
an educational exercise), I suggest you google WSGI.

2) You need to encode your output strings into a format the Apache/HTML
protocols can support; UTF-8 is probably best here.
Change your print call to
print(line.encode('utf'),end='') 


3) Ignore any subsequent advice from JMF; even when he is trying to help,
he is invariably wrong.
 

-- 
Freedom's just another word for nothing left to lose.
		-- Kris Kristofferson, "Me and Bobby McGee"


#76409

From	Dominique Ramaekers <dominique@ramaekers-stassart.be>
Date	2014-08-17 00:36 +0200
Message-ID	<mailman.13054.1408229123.18130.python-list@python.org>
In reply to	#76383

I found my problem; I'll describe it in more detail at the bottom of this
message...

But first...

Thanks Alister for the tips:
1) This evening, I researched WSGI. I found that WSGI is more
advanced than CGI, and I also think WSGI is more the Python way. I'm an
amateur playing around with my imagination on a small virtual server
(online cloudserver.ramaekers-stassart.be). I'm trying to build
something rather specific. I also like to keep things as basic as
possible. My first thought was not to use a framework, because with
a framework I wouldn't really know what the code is doing; a
framework would be a black box to me. But after inspecting WSGI, I
decided not to make it more difficult for myself than it has to be. I
will work with a framework, and I think I'll take my chances on Falcon
(for its speed and small size, and it doesn't seem too difficult)... There
are a lot of frameworks, so if someone wants to point me to another
framework, I'm open to suggestions...

2) Your tip to use 'encode' did not solve the problem and created a new
one. My lines were encapsulated in quotes and I got a lot of \b's and
\n's... and I still got the same error.

3) I didn't get the message from JMF, so...
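
The quotes and escape sequences described in point 2 can be reproduced in isolation. In Python 3, print() applies str() to a bytes object, which yields its repr rather than the encoded text. The sketch below is only an illustration: the sample line is made up, and io.BytesIO stands in for the binary stream (sys.stdout.buffer) that a CGI script would write bytes to.

```python
import io

line = "caf\u00e9 \u00eb \u00fc\n"   # hypothetical line from index.html
encoded = line.encode("utf-8")

# print(encoded) would show str(encoded): the repr, with the b prefix,
# surrounding quotes and escaped bytes, not the text itself.
as_printed = str(encoded)

# Writing the bytes to a binary stream keeps them intact. In a CGI script
# the binary stream would be sys.stdout.buffer; BytesIO stands in for it.
binary_out = io.BytesIO()
binary_out.write(encoded)

print(as_printed)
print(binary_out.getvalue() == encoded)
```

So encoding by hand only helps if the result goes to a binary stream; passing bytes to print() changes what is displayed without fixing the encoding problem.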

What seems to be the problem:
My script was OK. I know this because in the terminal I got my expected
output. Python 3 uses UTF-8 encoding as a standard. The problem is, when
Python 'prints' to the Apache interface, it translates the string to
ASCII. (Why, I never found an answer.) Somewhere in the middle of my
index.html file, there are letters like ë and ü. When Python tries to
translate these, it throws an error. If I delete these letters from
the file, the script works perfectly in a browser! In Python 2.7 the
script can easily be tweaked so the translation to ASCII isn't done, but
in Python 3 it's a real pain in the a... I've read about people who
managed to force Python 3 to 'print' to Apache in UTF-8, but none of
their solutions worked for me.
I think the Python developers don't want to focus on Python +
Apache + CGI (I think it only happens with Apache and not with other
HTTP servers). I don't think they do this intentionally, but I guess they
assume that if you use Python to make a web application, you also use
mod_wsgi or mod_python (in Apache)...
So I'll use WSGI. It's a little more work but it seems really neat...

grtz


#76413

From	Denis McMahon <denismfmcmahon@gmail.com>
Date	2014-08-17 02:50 +0000
Message-ID	<lsp5ab$sjv$1@dont-email.me>
In reply to	#76409

On Sun, 17 Aug 2014 00:36:14 +0200, Dominique Ramaekers wrote:

> What seems to be the problem:
> My Script was ok. I know this because in the terminal I got my expected
> output. Python3 uses UTF-8 coding as a standard. The problem is, when
> python 'prints' to the apache interface, it translates the string to
> ascii. (Why, I never found an answer).

Is the apache server running on a linux or a windows platform?

The problem may not be python, it may be the underlying OS. I wonder if 
apache is spawning a process for python though, and if so whether it is 
in some way constraining the character set available to stdout of the 
spawned process.

From your other message, the error appears to be a python error on 
reading the input file. For some reason python seems to be trying to 
interpret the file it is reading as ascii.

I wonder if specifying the binary data parameter and / or utf-8 encoding 
when opening the file might help.

eg:

f = open( "/var/www/cgi-data/index.html", "rb" )
f = open( "/var/www/cgi-data/index.html", "rb", encoding="utf-8" )
f = open( "/var/www/cgi-data/index.html", "r", encoding="utf-8" )

I've managed to drill down a bit further into the problem:

print() goes to sys.stdout

This is part of what the docs say about sys.stdout:

"""
The character encoding is platform-dependent. Under Windows, if the 
stream is interactive (that is, if its isatty() method returns True), the 
console codepage is used, otherwise the ANSI code page. Under other 
platforms, the locale encoding is used (see 
locale.getpreferredencoding()).

Under all platforms though, you can override this value by setting the 
PYTHONIOENCODING environment variable before starting Python.
"""

At this point, details of the OS become very significant. If your server 
is running on a windows platform you may need to figure out how to make 
apache set the PYTHONIOENCODING environment variable to "utf-8" (or 
whatever else is appropriate) before calling the python script.

I believe that the following line in your httpd.conf may have the 
required effect.

SetEnv PYTHONIOENCODING utf-8

Of course, if the file is not encoded as utf-8, but rather something 
else, then use that as the encoding in the above suggestions. If the 
server is not running windows, then I'm not sure where the problem might 
be.
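
One way to see what the spawned process actually inherits is to have the CGI script report its own settings. The following is a minimal diagnostic sketch, not a fix; the function name encoding_report is made up for illustration, and the output is kept ASCII-only so the report cannot itself trip over the encoding it is trying to inspect.

```python
import locale
import os
import sys

def encoding_report():
    """Collect the settings that decide how Python decodes and encodes text."""
    return {
        "locale.getpreferredencoding": locale.getpreferredencoding(False),
        "sys.stdout.encoding": getattr(sys.stdout, "encoding", None),
        "sys.getfilesystemencoding": sys.getfilesystemencoding(),
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
    }

if __name__ == "__main__":
    # Plain-text CGI response; repr() keeps every value printable.
    print("Content-Type: text/plain")
    print("")
    for name, value in sorted(encoding_report().items()):
        print("{}: {!r}".format(name, value))
```

Running the same script once from a shell and once through the web server shows whether the two environments differ, and whether a SetEnv directive actually reaches the script.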

-- 
Denis McMahon, denismfmcmahon@gmail.com


#76414

From	Dominique Ramaekers <dominique@ramaekers-stassart.be>
Date	2014-08-17 07:32 +0200
Message-ID	<mailman.13058.1408253857.18130.python-list@python.org>
In reply to	#76413

* My system is a Linux box.

* I've tried using encoding="utf-8". It didn't fix things.

* That print() uses sys.stdout would explain why using sys.stdout directly
isn't any better.

* My locale and the system-wide locale are UTF-8. Using SetEnv
PYTHONIOENCODING utf-8 didn't fix things.

* The file is encoded in UTF-8...

I can't speak for anybody else, but in my search I don't believe I
read about anyone who had the problem on a Windows system. They
all used Linux (various flavors) or OS X... This is the first
time I've encountered a situation where Windows is better at encoding
issues :P +1 for Microsoft...

I think that Apache (*nix versions) doesn't tell Python that it accepts
UTF-8. Or Python doesn't listen properly... Maybe I should file a bug
report with both projects?



#76416

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-08-17 17:50 +1000
Message-ID	<53f05ed9$0$30003$c3e8da3$5496439d@news.astraweb.com>
In reply to	#76413

Denis McMahon wrote:

> From your other message, the error appears to be a python error on
> reading the input file. For some reason python seems to be trying to
> interpret the file it is reading as ascii.

Oh!!! /facepalm

I think you've got it. I've been assuming the problem was on *writing* the
line. That's because the OP was insistent that the line failing was

    [quoting Dominique]
    The problem is, when python 'prints' to the apache interface, it
    translates the string to ascii.

but if you read the traceback, you're right, the problem is *reading* the
file, not printing:

[Sat Aug 16 23:12:42.158326 2014] [cgi:error] [pid 29327] [client 119.63.193.196:11110] AH01215: Traceback (most recent call last):
[Sat Aug 16 23:12:42.158451 2014] [cgi:error] [pid 29327] [client 119.63.193.196:11110] AH01215:   File "/var/www/cgi-python/index.html", line 12, in <module>
[Sat Aug 16 23:12:42.158473 2014] [cgi:error] [pid 29327] [client 119.63.193.196:11110] AH01215:     for line in f:
That's the line which is failing, reading the file. Which is then *decoded*.
Files contain bytes, which have to be decoded into text, and the decode is
assuming ASCII:

[Sat Aug 16 23:12:42.158526 2014] [cgi:error] [pid 29327] [client 119.63.193.196:11110] AH01215:   File "/usr/lib/python3.4/encodings/ascii.py", line 26, in decode
[Sat Aug 16 23:12:42.158569 2014] [cgi:error] [pid 29327] [client 119.63.193.196:11110] AH01215:     return codecs.ascii_decode(input, self.errors)[0]
[Sat Aug 16 23:12:42.158663 2014] [cgi:error] [pid 29327] [client 119.63.193.196:11110] AH01215: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1791: ordinal not in range(128)

> I wonder if specifying the binary data parameter and / or utf-8 encoding
> when opening the file might help.

We don't really know what encoding the index.html file is encoded in. It
might be Latin-1, or cp-1252, or some other legacy encoding. But let's
assume it's UTF-8.

So why is Dominique's script reading it in ASCII? That's the key question. I
have a sinking feeling that Apache may be running Python as a subprocess
with the C locale, maybe. I don't know enough about cgi to be more than
just guessing.

Dominique, if you write:

f = open("/var/www/cgi-data/index.html", "r", encoding='utf-8')

the problem should go away (assuming index.html is valid UTF-8). If it
doesn't, there's a very strange bug somewhere.

Please try that, and see if it fixes the problem, or if the error goes to a
different line.
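
The difference between the failing read and the suggested one can be shown without Apache at all. This is a small sketch under stated assumptions: a temporary file with made-up content stands in for index.html, and encoding="ascii" is passed explicitly to imitate what open() does when the process locale reports ASCII.

```python
import os
import tempfile

text = "<p>\u00e9 \u00eb \u00fc</p>\n"   # hypothetical page content

# Write a small UTF-8 file to stand in for index.html.
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as tmp:
    tmp.write(text.encode("utf-8"))
    path = tmp.name

try:
    # Imitates an ASCII locale: without an encoding argument, open() falls
    # back to the locale's preferred encoding.
    try:
        with open(path, "r", encoding="ascii") as f:
            f.read()
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

    # Naming the encoding makes the read independent of the environment.
    with open(path, "r", encoding="utf-8") as f:
        decoded = f.read()
finally:
    os.remove(path)

print(ascii_failed, decoded == text)
```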

> eg:
> 
> f = open( "/var/www/cgi-data/index.html", "rb" )

No, you don't want that, since then reading the file will return bytes, not
text. Although I suppose the OP might just commit to using bytes
everywhere. Yuck.

> f = open( "/var/www/cgi-data/index.html", "rb", encoding="utf-8" )

That makes no sense. If you're reading in binary mode, there's no encoding.
Every byte represents itself.
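
Python rejects that combination outright, which is easy to confirm. A brief sketch follows; os.devnull is used only because it is a path that always exists, so the only thing that can go wrong is the argument check.

```python
import os

# Binary mode plus an encoding is refused by open() with a ValueError.
try:
    open(os.devnull, "rb", encoding="utf-8")
    rejected = False
except ValueError as exc:
    rejected = True
    reason = str(exc)

# In binary mode you get bytes back; decoding is a separate, explicit step.
raw = "\u00e9".encode("utf-8")
decoded = raw.decode("utf-8")

print(rejected, len(raw), len(decoded))
```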

> f = open( "/var/www/cgi-data/index.html", "r", encoding="utf-8" )

That's the bunny!

If you just want to hide the problem without fixing the underlying cause,
add an argument errors="replace", which is ugly but at least lets you move
on:

py> b = "Hello ë ü world".encode('utf-8')
py> print(b.decode('ascii', errors='replace'))
Hello �� �� world
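
For comparison, here is a sketch of how a few standard error handlers treat the same bytes. Note one assumption about versions: "backslashreplace" is only accepted for decoding from Python 3.5 onwards, so that line would not work on the 3.4 install discussed in this thread. ascii() is used for display so the demonstration does not depend on the terminal's encoding.

```python
b = "Hello \u00eb \u00fc world".encode("utf-8")

# Each undecodable byte becomes U+FFFD, the replacement character.
replaced = b.decode("ascii", errors="replace")

# Undecodable bytes are dropped silently.
ignored = b.decode("ascii", errors="ignore")

# Undecodable bytes are kept as visible \xNN escapes (decoding: Python 3.5+).
escaped = b.decode("ascii", errors="backslashreplace")

for result in (replaced, ignored, escaped):
    print(ascii(result))
```

All three hide the symptom in different ways; none of them recovers the original characters, which only decoding with the right codec can do.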

-- 
Steven


#76418

From	Dominique Ramaekers <dominique@ramaekers-stassart.be>
Date	2014-08-17 11:40 +0200
Message-ID	<mailman.13061.1408268785.18130.python-list@python.org>
In reply to	#76416

Wow, everybody keeps chewing on this problem. As a bonus, I've 
reconfigured my server to do some testing.
http://cloudserver.ramaekers-stassart.be/test.html => is the file I want 
to read. Going to this URL displays the file...
http://cloudserver.ramaekers-stassart.be/cgi-python/encoding1 => is the 
CGI script for this test
http://cloudserver.ramaekers-stassart.be/wsgi => is the WSGI solution 
(but for now it just says 'Hello world'...)

----------------This configuration-----------------------------

dominique@cloudserver:/var/www/cgi-python$ cat /etc/default/locale
LANG="en_US.UTF-8"
LANGUAGE="en_US:"

dominique@cloudserver:/var/www/cgi-python$ cat 
/etc/apache2/sites-enabled/000-default.conf
<VirtualHost *:80>

     ServerAdmin dominique@ramaekers-stassart.be
     WSGIScriptAlias /wsgi /var/www/wsgi/application

     <Directory /var/www/wsgi>
             Order allow,deny
             Allow from all
         </Directory>

     DocumentRoot /var/www/html

     ScriptAlias /cgi-python /var/www/cgi-python/
     <Directory /var/www/cgi-python>
             Options ExecCGI
             SetHandler cgi-script
         </Directory>

     ErrorLog ${APACHE_LOG_DIR}/error.log
     CustomLog ${APACHE_LOG_DIR}/access.log combined

</VirtualHost>

dominique@cloudserver:/var/www/cgi-python$ cat encoding1
#!/usr/bin/env python3
print("Content-Type: text/html")
print("Cache-Control: no-cache, must-revalidate")    # HTTP/1.1
print("Expires: Sat, 26 Jul 1997 05:00:00 GMT") # Date in the past
print("")
f = open("/var/www/html/test.html", "r")
for line in f:
     print(line,end='')

dominique@cloudserver:/var/www/cgi-python$ cat ../html/test.html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Testing my cgi...</title>
</head>
<body>
<p>Ok, Testing my cgi... Lets try some characters: é ë ü</p>
</body>
</html>

dominique@cloudserver:/var/www/cgi-python$ file ../html/test.html
../html/test.html: HTML document, UTF-8 Unicode text

---------Start test----------------------
In the browser: http://cloudserver.ramaekers-stassart.be/test.html => page 
displays OK (try it yourself...)

In the terminal: => all goes well....
dominique@cloudserver:/var/www/cgi-python$ ./encoding1
Content-Type: text/html
Cache-Control: no-cache, must-revalidate
Expires: Sat, 26 Jul 1997 05:00:00 GMT

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Testing my cgi...</title>
</head>
<body>
<p>Ok, Testing my cgi... Lets try some characters: é ë ü</p>
</body>
</html>

In the browser (firefox):
http://cloudserver.ramaekers-stassart.be/cgi-python/encoding1 => gives a 
blank page!

The error log says:
root@cloudserver:~# cat /var/log/apache2/error.log | tail -n 6
[Sun Aug 17 11:09:21.102003 2014] [cgi:error] [pid 32146] [client 84.194.120.161:36707] AH01215: Traceback (most recent call last):
[Sun Aug 17 11:09:21.102129 2014] [cgi:error] [pid 32146] [client 84.194.120.161:36707] AH01215:   File "/var/www/cgi-python/encoding1", line 7, in <module>
[Sun Aug 17 11:09:21.102149 2014] [cgi:error] [pid 32146] [client 84.194.120.161:36707] AH01215:     for line in f:
[Sun Aug 17 11:09:21.102201 2014] [cgi:error] [pid 32146] [client 84.194.120.161:36707] AH01215:   File "/usr/lib/python3.4/encodings/ascii.py", line 26, in decode
[Sun Aug 17 11:09:21.102243 2014] [cgi:error] [pid 32146] [client 84.194.120.161:36707] AH01215:     return codecs.ascii_decode(input, self.errors)[0]
[Sun Aug 17 11:09:21.102318 2014] [cgi:error] [pid 32146] [client 84.194.120.161:36707] AH01215: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 162: ordinal not in range(128)

--------------Conclusion-----------------------------
In my current configuration, the bug is reproduced!!!

-------------------Test 2: new configuration-----------------------------
I change the line f = open("/var/www/html/test.html", "r") into f = 
open("/var/www/html/test.html", "r", encoding="utf-8") and save the 
script as encoding2

In the terminal: => All ok

In the browser: => blank page!!!

Error log in apache:
root@cloudserver:~# cat /var/log/apache2/error.log | tail -n 4
[Sun Aug 17 11:13:47.372353 2014] [cgi:error] [pid 32147] [client 84.194.120.161:36711] AH01215: Traceback (most recent call last):
[Sun Aug 17 11:13:47.372461 2014] [cgi:error] [pid 32147] [client 84.194.120.161:36711] AH01215:   File "/var/www/cgi-python/encoding2", line 8, in <module>
[Sun Aug 17 11:13:47.372483 2014] [cgi:error] [pid 32147] [client 84.194.120.161:36711] AH01215:     print(line,end='')
[Sun Aug 17 11:13:47.372572 2014] [cgi:error] [pid 32147] [client 84.194.120.161:36711] AH01215: UnicodeEncodeError: 'ascii' codec can't encode character '\\xe9' in position 51: ordinal not in range(128)

---------Conclusion------------------
Steven was right. It was a read error => with the encoding2 script the file 
is read as UTF-8. Though I find it strange: the file is in UTF-8 and 
Python 3 has UTF-8 as its standard..... But reading the file is fixed.

Now the writing is still broken....

Here are some tests hinted before:

Tip from Steven => getting the encoding:
dominique@cloudserver:/var/www/cgi-python$ cat readencoding
#!/usr/bin/env python3
import sys
print("Content-Type: text/html")
print("")
print(sys.getfilesystemencoding())

Gives in the terminal: utf-8
Gives in the browser: ascii

Found the problem!!!!!

Now, why does Apache start Python in ASCII????

I put these lines in my Apache config:
AddDefaultCharset UTF-8
SetEnv PYTHONIOENCODING utf-8

Cleared my browser cache... No change.....

I removed these lines....

If someone wants me to try more things, just post them. I'll try to 
process them all. I don't want to change the code. I want Apache + Python 3 
to work in UTF-8 and not in ASCII. Fixing it in my code seems to me like 
a dirty fix...

For now I'm going on with WSGI and hope I don't get the same problem 
(but now I think I will :( ....)
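
For reference, the code-level workaround for the remaining write error is short. This is a sketch, not the environment-level fix asked for above: it wraps a binary stream in a text layer that always encodes as UTF-8, so the result no longer depends on the locale the web server passes down. The names utf8_text_stream and emit_page are made up for illustration, and io.BytesIO stands in for sys.stdout.buffer so the sketch can run anywhere.

```python
import io

def utf8_text_stream(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")

def emit_page(out, body):
    """Write a minimal CGI response: headers, a blank line, then the body."""
    out.write("Content-Type: text/html; charset=utf-8\n\n")
    out.write(body)
    out.flush()

# BytesIO stands in for sys.stdout.buffer, the byte stream under sys.stdout.
demo_buffer = io.BytesIO()
demo_out = utf8_text_stream(demo_buffer)
emit_page(demo_out, "<p>\u00e9 \u00eb \u00fc</p>\n")
print(demo_buffer.getvalue())
```

In a real CGI script the wrapper would be created once with utf8_text_stream(sys.stdout.buffer) and used in place of print(). On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") achieves the same effect, but that method does not exist in the Python 3.4 used here.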

Grtz


#76419

From	wxjmfauth@gmail.com
Date	2014-08-17 03:05 -0700
Message-ID	<406363a3-5616-477c-86c0-71e101bca5bb@googlegroups.com>
In reply to	#76416

On Sunday 17 August 2014 09:50:48 UTC+2, Steven D'Aprano wrote:
> py> b = "Hello ë ü world".encode('utf-8')
> py> print(b.decode('ascii', errors='replace'))
> Hello �� �� world

=========

No. You are approaching the problem the wrong way. This is
a typical situation where the resulting code will work
correctly, but only as "works for me" code.

The mistake is that, this way, you are producing code
that is not suited to the "system" that will host your
string.

In the present case, you are already assuming, prior to
any string manipulation, that the output should be UTF-8.

D:\>c:\python32\python
Python 3.2.5 (default, May 15 2013, 23:06:03) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> b = "Hello ë ü world".encode('utf-8')
>>> b
b'Hello \xc3\xab \xc3\xbc world'
>>> b.decode('ascii', 'replace')
'Hello \ufffd\ufffd \ufffd\ufffd world'
>>> print(b.decode('ascii', 'replace'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python32\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 6-7: character maps to <undefined>
>>>


The proper way is to "prepare" your string prior to any
further manipulation (see my previous comment about
processes).

I'm explicitly using the code page cp850 and the
euro sign.

>>> u = "Hello ë ü world \u20ac\u20ac\u20ac"
>>> newu = u.encode('cp850', 'replace').decode('cp850')
>>> print(newu)
Hello ë ü world ???
>>> type(newu)
<class 'str'>
>>>

The replacement character now belongs to the set of
characters that can be displayed.
It will never fail.
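
The encode-then-decode round trip shown in the session above can be wrapped in a small helper. This is only a sketch of that idea; the function name displayable is made up, and the target encoding is passed in by the caller rather than detected.

```python
def displayable(text, encoding):
    """Replace characters that `encoding` cannot represent with '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)

u = "Hello \u00eb \u00fc world \u20ac\u20ac\u20ac"
cleaned = displayable(u, "cp850")

# ascii() keeps the demonstration independent of the terminal's encoding.
print(ascii(cleaned))
```

The result is still a str, but every character in it is guaranteed to be encodable in the target code page. The price is that unrepresentable characters (here the euro signs) are lost.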


You can mimic the same behaviour with a web browser.

Create an HTML file in UTF-8 containing characters
that do not belong to ISO-8859-1.
Display that file and change the browser's encoding
to ISO-8859-1.
You will see that the browser "re-encodes" the source with
a replacement character and only then redisplays it. This is the
same process I gave above.

The key point is detecting, if possible, the coding scheme
that should be used.

>>> import sys
>>> sys.stdout.encoding
'cp850'
>>>

My example is not Windows-specific. On a gb**** Chinese
BSD or a KOI8 Russian Linux, the problem is identical.

jmf


#76421

From	Peter Otten <__peter__@web.de>
Date	2014-08-17 13:04 +0200
Message-ID	<mailman.13062.1408273509.18130.python-list@python.org>
In reply to	#76416

Dominique Ramaekers wrote:

> Putting the lines in my apache config:
> AddDefaultCharset UTF-8
> SetEnv PYTHONIOENCODING utf-8
> 
> Cleared my browser cache... No change.....

Did you restart Apache?


#76422

From	Dominique Ramaekers <dominique@ramaekers-stassart.be>
Date	2014-08-17 13:34 +0200
Message-ID	<mailman.13063.1408275277.18130.python-list@python.org>
In reply to	#76416

Yes, a full restart, not just a reload. I also put the lines in the 
<VirtualHost> section as well as in the main apache2.conf....


#76423

From	Dominique Ramaekers <dominique@ramaekers-stassart.be>
Date	2014-08-17 14:02 +0200
Message-ID	<mailman.13064.1408276955.18130.python-list@python.org>
In reply to	#76416

As I suspected, if I check the encoding in use under WSGI I get:
ANSI_X3.4-1968

I found you can declare the encoding of the script with a special comment: 
# -*- coding: utf-8 -*-

Now I don't get an error, but my special characters still don't display well.
The script:
# -*- coding: utf-8 -*-
import sys
def application(environ, start_response):
     status = '200 OK'
     output = 'Hello World! é ü à ũ'
     #output = sys.getfilesystemencoding() #1

     response_headers = [('Content-type', 'text/plain'),
                         ('Content-Length', str(len(output)))]
     start_response(status, response_headers)

     return [output]

Gives in the browser as output:

Hello World! Ã© Ã¼ Ã  Å©

And if I check the encoding with the python script (uncommenting line 
#1), I still get ANSI_X3.4-1968

This is really getting on my nerves.


Op 17-08-14 om 13:04 schreef Peter Otten:
> Dominique Ramaekers wrote:
>
>> Putting the lines in my apache config:
>> AddDefaultCharset UTF-8
>> SetEnv PYTHONIOENCODING utf-8
>>
>> Cleared my brower-cache... No change.....
> Did you restart the apache?
>
>

#76426

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-08-17 23:00 +1000
Message-ID	<53f0a787$0$29991$c3e8da3$5496439d@news.astraweb.com>
In reply to	#76423

Dominique Ramaekers wrote:

> As I suspected, if I check the used encoding in wsgi I get:
> ANSI_X3.4-1968

That's another name for ASCII.

> I found you can define the coding of the script with a special comment:
> # -*- coding: utf-8 -*-

Be careful. That just tells Python what encoding the source code file is in.
It is not used by print(), or reading/writing files, just when the compiler
reads the source code.

> Now I don't get an error but my special chars still doesn't display well.
> The script:
> # -*- coding: utf-8 -*-
> import sys
> def application(environ, start_response):
>      status = '200 OK'
>      output = 'Hello World! é ü à ũ'
>      #output = sys.getfilesystemencoding() #1
> 
>      response_headers = [('Content-type', 'text/plain'),
>                          ('Content-Length', str(len(output)))]
>      start_response(status, response_headers)
> 
>      return [output]
> 
> Gives in the browser as output:
> 
> Hello World! Ã© Ã¼ Ã  Å©

That looks like ordinary mojibake. Your Python script takes the text
string 'Hello World! é ü à ũ', which in UTF-8 gives you bytes:

py> 'Hello World! é ü à ũ'.encode('utf-8')
b'Hello World! \xc3\xa9 \xc3\xbc \xc3\xa0 \xc5\xa9'

Decoding back using latin-1 gives:

py> 'Hello World! é ü à ũ'.encode('utf-8').decode('latin1')
'Hello World! Ã© Ã¼ Ã\xa0 Å©'

which appears to be exactly what you have. Why Latin-1 instead of ASCII?
Because the bytes have to be displayed as *something*, and Latin-1,
sometimes called "extended ASCII", is a common fallback when no charset
is declared.
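
The round trip above can be sketched as a pair of helpers. This is only an
illustration: mangle() and repair() are made-up names, and repair() only
works when the text was garbled in exactly this way (UTF-8 bytes decoded
as Latin-1).

```python
# Sketch: reproduce the mojibake shown above and reverse it.
# mangle() and repair() are illustrative names, not from the thread.

def mangle(text):
    """Encode as UTF-8, then wrongly decode the bytes as Latin-1."""
    return text.encode('utf-8').decode('latin-1')


def repair(garbled):
    """Undo mangle(): recover the original bytes, then decode as UTF-8."""
    return garbled.encode('latin-1').decode('utf-8')


original = 'Hello World! é ü à ũ'
garbled = mangle(original)

# ascii() keeps the output printable even on an ASCII-only stdout.
print(ascii(garbled))
print(repair(garbled) == original)
```

Repairing after the fact is a workaround at best. The real fix is to make
both ends agree on the encoding.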

I'm starting to fear a bug in Python 3.4, but since I have almost no
knowledge about wsgi and cgi, I can't be sure that this isn't just normal
expected behaviour :-(

-- 
Steven

#76435

From	wxjmfauth@gmail.com
Date	2014-08-17 08:56 -0700
Message-ID	<7eb1e2f0-a3ae-4ee1-b6ff-f25abc3f535f@googlegroups.com>
In reply to	#76426

On Sunday, 17 August 2014 15:00:53 UTC+2, Steven D'Aprano wrote:
> I'm starting to fear a bug in Python 3.4, but since I have almost no
> knowledge about wsgi and cgi, I can't be sure that this isn't just normal
> expected behaviour :-(

Not Python 3.4. Python 3. It has failed since day zero.
Do you remember the story of the Greek guy with
"his" Greek encoding on the server side?

jmf

#76424

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2014-08-17 13:35 +0100
Message-ID	<mailman.13065.1408278931.18130.python-list@python.org>
In reply to	#76416

On 17/08/2014 13:02, Dominique Ramaekers wrote:

if style == TOP_POSTING:
     *plonk*

-- 
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

#76448

From	Tony the Tiger <tony@tiger.invalid>
Date	2014-08-18 04:39 +0000
Message-ID	<53f1837e$0$25650$b1db1813$ba2d9d20@news.astraweb.com>
In reply to	#76424

On Sun, 17 Aug 2014 13:35:15 +0100, Mark Lawrence wrote:

> if style == TOP_POSTING:
>      *plonk*

Hear hear!


 /Grrr
-- 
          ___                  ___
 (\_--_/)  | _ ._    _|_|_  _   |o _  _ ._
 ( 9  9 )  |(_)| |\/  |_| |(/_  ||(_|(/_|
 stripes are forever - as overripe ferrets

#76427

From	Peter Otten <__peter__@web.de>
Date	2014-08-17 15:12 +0200
Message-ID	<mailman.13067.1408281166.18130.python-list@python.org>
In reply to	#76416

Dominique Ramaekers wrote:

> As I suspected, if I check the used encoding in wsgi I get:
> ANSI_X3.4-1968
> 
> I found you can define the coding of the script with a special comment:
> # -*- coding: utf-8 -*-
> 
> Now I don't get an error but my special chars still doesn't display well.
> The script:
> # -*- coding: utf-8 -*-
> import sys
> def application(environ, start_response):
>      status = '200 OK'
>      output = 'Hello World! é ü à ũ'
>      #output = sys.getfilesystemencoding() #1
> 
>      response_headers = [('Content-type', 'text/plain'),
>                          ('Content-Length', str(len(output)))]
>      start_response(status, response_headers)
> 
>      return [output]
> 
> Gives in the browser as output:
> 
> Hello World! Ã© Ã¼ Ã  Å©

That's UTF-8 interpreted as Latin-1. Try specifying the charset in the 
header:

...
response_headers = [('Content-type', 'text/plain; charset=utf-8'),
...
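
Spelled out in full, the suggestion might look like the sketch below. It
assumes a PEP 3333 style server, which expects the response body as bytes,
so the text is encoded once and Content-Length is computed from the bytes
rather than from the str. The variable name body is mine, not from the
original script.

```python
# -*- coding: utf-8 -*-
# Sketch of the script with an explicit charset and a bytes body.

def application(environ, start_response):
    status = '200 OK'
    # Encode once; a WSGI server sends bytes, not str.
    body = 'Hello World! é ü à ũ'.encode('utf-8')

    response_headers = [
        ('Content-Type', 'text/plain; charset=utf-8'),
        # Content-Length counts bytes, not characters.
        ('Content-Length', str(len(body))),
    ]
    start_response(status, response_headers)

    return [body]
```

The text has 20 characters but 24 bytes in UTF-8, so taking len() of the
str would understate Content-Length and the response could be cut short.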
 
> And if I check the encoding with the python script (uncommenting line
> #1), I still get ANSI_X3.4-1968

#76428

From	Peter Otten <__peter__@web.de>
Date	2014-08-17 16:06 +0200
Message-ID	<mailman.13068.1408284413.18130.python-list@python.org>
In reply to	#76416

Dominique Ramaekers wrote:

> And if I check the encoding with the python script (uncommenting line
> #1), I still get ANSI_X3.4-1968

That should not matter as long as

print(os.environ.get("PYTHONIOENCODING"))

prints

UTF-8

If you do get the correct PYTHONIOENCODING you should be able to replace the 
corresponding SetEnv with

SetEnv LANG en_US.UTF-8

or similar. If you don't get the expected value the SetEnv is probably not 
in the right place. In my experiments I put it into

/etc/apache2/sites-enabled/000-default.conf 

in an apache installation I think I have not tinkered with before ;)

While looking around in the apache configuration I also found the file
/etc/apache2/envvars. Here's an excerpt:

## The locale used by some modules like mod_dav
export LANG=C
## Uncomment the following line to use the system default locale instead:
#. /etc/default/locale

export LANG

If you uncomment the line

. /etc/default/locale

and replace 

SetEnv LANG en_US.UTF-8

with 

PassEnv LANG

you should get a similar effect assuming your system defaults to UTF-8.

#76415

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-08-17 15:54 +1000
Message-ID	<53f043af$0$29975$c3e8da3$5496439d@news.astraweb.com>
In reply to	#76409

Dominique Ramaekers wrote:

[...]
> 2) Your tip, to use 'encode' did not solve the problem and created a new
> one. My lines were incapsulted in quotes and I got a lot of \b's and
> \n's... and I still got the same error.

Just throwing random encode/decode calls into the mix is unlikely to fix
the problem. First, you need to find an Apache expert who can tell you what
encoding your Apache process is expecting. Hopefully it is UTF-8. Then you
need to confirm that your Python process is also using UTF-8. Nearly all
Unicode-related issues are due to mismatches between encodings in different
parts of the system. If only everyone could use UTF-8 for all storage and
transport layers, life would be so much simpler... but I digress.

[...]
> What seems to be the problem:
> My Script was ok. I know this because in the terminal I got my expected
> output. 

Did you test it at the terminal with input including ë and ü?

> Python3 uses UTF-8 coding as a standard. The problem is, when 
> python 'prints' to the apache interface, it translates the string to
> ascii. (Why, I never found an answer).

Try putting the lines:

import sys
print(sys.getfilesystemencoding())

at the start of your program, and see what it prints at the terminal and
what it prints under Apache. I predict that under Apache, it will say
something like "C locale" or "US ASCII". If so, *that* is your problem.
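
A slightly broader version of that check, as a sketch: it collects the
values that usually decide which encoding Python picks, so the terminal
and Apache runs can be compared side by side. encoding_report() is a
made-up helper name. Under Apache the output would have to go somewhere
visible, for example the response body or the error log.

```python
import locale
import os
import sys


def encoding_report():
    """Collect the settings that influence Python's default encodings."""
    return {
        'filesystem': sys.getfilesystemencoding(),
        'preferred': locale.getpreferredencoding(False),
        # sys.stdout may be replaced by an object without .encoding.
        'stdout': getattr(sys.stdout, 'encoding', None),
        'PYTHONIOENCODING': os.environ.get('PYTHONIOENCODING'),
        'LANG': os.environ.get('LANG'),
    }


for name, value in sorted(encoding_report().items()):
    print('{}: {}'.format(name, value))
```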

> Somewhere in the middle of my 
> index.html file, there are letters like ë and ü. If Python tries to
> translate these, Python throws an error. If I delete these letters in
> the file, the script works perfectly in a browser! In Python2.7 the
> script can easily be tweaked so the translation to ascii isn't done, 

Not quite. Under Python 2.7, you will likely get mojibake. For instance, if
your index.html contains "ë ü π" stored in UTF-8, Python 2.7 will throw its
hands in the air, say "I have no idea what ASCII characters they are, let's
pretend it's some sort of Latin-1" and you'll get:

Ã« Ã¼ Ï

instead. Or perhaps not. With Python 2.7, what you get is not quite random,
but it depends on the environment in some fairly obscure ways. Python 3 at
least raises an exception when there is a mismatch, instead of trying to
guess what you meant.

> but 
> in Python3, its a real pain in the a... I've read about people who
> managed to force Python3 to 'print' to apache in UTF-8, but none of
> their solutions worked for me.

There is very little point in throwing random solutions at a problem if you
don't understand the problem. First you need to find out why Python is
trying to convert to ASCII. That's probably because of something Apache is
doing. Do you have an Apache technician you can ask?

-- 
Steven

#76384

From	John Gordon <gordon@panix.com>
Date	2014-08-15 19:32 +0000
Message-ID	<lsln95$mvk$1@reader1.panix.com>
In reply to	#76382

In <mailman.13038.1408130249.18130.python-list@python.org> Dominique Ramaekers <dominique@ramaekers-stassart.be> writes:

> #!/usr/bin/env python3
> print("Content-Type: text/html")
> print("Cache-Control: no-cache, must-revalidate")    # HTTP/1.1
> print("Expires: Sat, 26 Jul 1997 05:00:00 GMT") # Date in the past
> print("")
> f = open("/var/www/cgi-data/index.html", "r")
> for line in f:
>      print(line,end='')

> If access the script through a webbrowser, apache gives an error:
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 
> 1791: ordinal not in range(128)

The error traceback should display exactly where the error occurs within
the script.  Which line is it?

-- 
John Gordon         Imagine what it must be like for a real medical doctor to
gordon@panix.com    watch 'House', or a real serial killer to watch 'Dexter'.

#76410

From	Dominique Ramaekers <dominique@ramaekers-stassart.be>
Date	2014-08-17 00:39 +0200
Message-ID	<mailman.13055.1408229269.18130.python-list@python.org>
In reply to	#76384

Hi John,

The error is in the line "print(line,end='')"... and it only happens
when the script is started from a web browser. In the terminal, the
script works fine.
See my previous mail for my findings after a lot of reading and trying...
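
For reference, a sketch of one way to make the CGI script independent of
the server's locale. The assumptions: index.html is stored as UTF-8, and
under Apache the process runs with an ASCII (C) locale, so open() and
print() fall back to ASCII when no encoding is given. emit_page() is a
made-up helper name. The idea is to decode the file explicitly and write
bytes, so no default encoding is ever consulted.

```python
def emit_page(path, out):
    """Write CGI headers and the file's contents to a binary stream."""
    headers = [
        'Content-Type: text/html; charset=utf-8',
        'Cache-Control: no-cache, must-revalidate',
    ]
    # Headers are plain ASCII; a blank line separates them from the body.
    out.write(('\r\n'.join(headers) + '\r\n\r\n').encode('ascii'))

    # Decode explicitly instead of relying on the locale default.
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            # Encode explicitly as well, bypassing sys.stdout's encoding.
            out.write(line.encode('utf-8'))
```

In the script itself this would be called as
emit_page("/var/www/cgi-data/index.html", sys.stdout.buffer) after
"import sys", since sys.stdout.buffer is the binary stream underneath
print().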

Regards



On 15-08-14 at 21:32, John Gordon wrote:
> In <mailman.13038.1408130249.18130.python-list@python.org> Dominique Ramaekers <dominique@ramaekers-stassart.be> writes:
>
>> #!/usr/bin/env python3
>> print("Content-Type: text/html")
>> print("Cache-Control: no-cache, must-revalidate")    # HTTP/1.1
>> print("Expires: Sat, 26 Jul 1997 05:00:00 GMT") # Date in the past
>> print("")
>> f = open("/var/www/cgi-data/index.html", "r")
>> for line in f:
>>       print(line,end='')
>> If access the script through a webbrowser, apache gives an error:
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
>> 1791: ordinal not in range(128)
> The error traceback should display exactly where the error occurs within
> the script.  Which line is it?
>
