| Started by | RAH <rene.heymans@gmail.com> |
|---|---|
| First post | 2015-08-25 14:19 -0700 |
| Last post | 2015-08-27 07:01 -0700 |
| Articles | 17 — 6 participants |
file.write() of non-ASCII characters differs in Interpreted Python than in script run RAH <rene.heymans@gmail.com> - 2015-08-25 14:19 -0700
Re: file.write() of non-ASCII characters differs in Interpreted Python than in script run Chris Kaynor <ckaynor@zindagigames.com> - 2015-08-25 14:28 -0700
Re: file.write() of non-ASCII characters differs in Interpreted Python than in script run RAH <rene.heymans@gmail.com> - 2015-08-26 02:12 -0700
Re: file.write() of non-ASCII characters differs in Interpreted Python than in script run Chris Angelico <rosuav@gmail.com> - 2015-08-26 19:24 +1000
Re: file.write() of non-ASCII characters differs in Interpreted Python than in script run RAH <rene.heymans@gmail.com> - 2015-08-26 07:24 -0700
Re: file.write() of non-ASCII characters differs in Interpreted Python than in script run Chris Angelico <rosuav@gmail.com> - 2015-08-26 09:16 +1000
Re: file.write() of non-ASCII characters differs in Interpreted Python than in script run RAH <rene.heymans@gmail.com> - 2015-08-26 07:18 -0700
Re: file.write() of non-ASCII characters differs in Interpreted Python than in script run dieter <dieter@handshake.de> - 2015-08-26 07:51 +0200
Re: file.write() of non-ASCII characters differs in Interpreted Python than in script run RAH <rene.heymans@gmail.com> - 2015-08-26 07:20 -0700
Re: file.write() of non-ASCII characters differs in Interpreted Python than in script run Pete Dowdell <contact@stridebird.com> - 2015-08-26 14:09 +0700
Re: file.write() of non-ASCII characters differs in Interpreted Python than in script run RAH <rene.heymans@gmail.com> - 2015-08-26 07:22 -0700
Re: file.write() of non-ASCII characters differs in Interpreted Python than in script run RAH <rene.heymans@gmail.com> - 2015-08-26 08:02 -0700
Re: file.write() of non-ASCII characters differs in Interpreted Python than in script run Chris Angelico <rosuav@gmail.com> - 2015-08-27 01:57 +1000
Re: file.write() of non-ASCII characters differs in Interpreted Python than in script run RAH <rene.heymans@gmail.com> - 2015-08-26 12:23 -0700
Re: file.write() of non-ASCII characters differs in Interpreted Python than in script run Chris Angelico <rosuav@gmail.com> - 2015-08-27 09:15 +1000
Re: file.write() of non-ASCII characters differs in Interpreted Python than in script run Marko Rauhamaa <marko@pacujo.net> - 2015-08-27 08:59 +0300
Re: file.write() of non-ASCII characters differs in Interpreted Python than in script run RAH <rene.heymans@gmail.com> - 2015-08-27 07:01 -0700
| From | RAH <rene.heymans@gmail.com> |
|---|---|
| Date | 2015-08-25 14:19 -0700 |
| Subject | file.write() of non-ASCII characters differs in Interpreted Python than in script run |
| Message-ID | <c085c6af-31f6-480c-a9b4-90f46441fdd1@googlegroups.com> |
Dear All,
I experienced an incomprehensible behavior (I've spent already many hours on this subject): the `file.write('string')` provides an error in run mode and not when interpreted at the console. The string must contain non-ASCII characters. If all ASCII, there is no error.
The following example shows what I can see. I must overlook something because I cannot think Python makes a difference between interpreted and run modes and yet ... Can someone please check that subject.
Thank you in advance.
René
Code extract from WSGI application (reply.py)
=============================================
request_body = environ['wsgi.input'].read(request_body_size) # bytes
rb = request_body.decode() # string
d = parse_qs(rb) # dict
f = open('logbytes', 'ab')
g = open('logstr', 'a')
h = open('logdict', 'a')
f.write(request_body)
g.write(str(type(request_body)) + '\t' + str(type(rb)) + '\t' + str(type(d)) + '\n')
h.write(str(d) + '\n') <--- line 28 of the application
h.close()
g.close()
f.close()
Tail of Apache2 error.log
=========================
[Tue Aug 25 20:24:04.657933 2015] [wsgi:error] [pid 3677:tid 3029764928] [remote 192.168.1.5:27575] File "reply.py", line 28, in application
[Tue Aug 25 20:24:04.658001 2015] [wsgi:error] [pid 3677:tid 3029764928] [remote 192.168.1.5:27575] h.write(str(d) + '\\n')
[Tue Aug 25 20:24:04.658201 2015] [wsgi:error] [pid 3677:tid 3029764928] [remote 192.168.1.5:27575] UnicodeEncodeError: 'ascii' codec can't encode character '\\xc7' in position 15: ordinal not in range(128)
Checking what has been logged
=============================
rse@Alibaba:~/test$ cat logbytes
userName=Ça va ! <--- this was indeed the input (notice the
french C + cedilla)
Unicode U+00C7 ALT-0199 UTF-8 C387
Reading the logbytes file one can verify
that Ç is indeed represented by the 2 bytes
\xC3 and \x87
rse@Alibaba:~/test$ cat logstr
<class 'bytes'> <class 'str'> <class 'dict'>
rse@Alibaba:~/test$ cat logdict
rse@Alibaba:~/test$ <--- Obviously empty because of error
Trying similar code within the Python interpreter
=================================================
rse@Alibaba:~/test$ python
Python 3.4.0 (default, Jun 19 2015, 14:18:46)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> di = {'userName': ['Ça va !']} <--- A dictionary
>>> str(di)
"{'userName': ['Ça va !']}" <--- and its string representation
>>> type(str(di))
<class 'str'> <--- Is a string indeed
>>> fi = open('essai', 'a')
>>> fi.write(str(di) + '\n')
26 <--- It works well
>>> fi.close()
>>>
Checking what has been written
==============================
rse@Alibaba:~/test$ cat essai
{'userName': ['Ça va !']} <--- The result is correct
rse@Alibaba:~/test$
No error if all ASCII
=====================
If the input is `userName=Rene`, for instance, then there is no error and
`logdict` does indeed contain the text of the dictionary
`{'userName': ['Rene']}`
| From | Chris Kaynor <ckaynor@zindagigames.com> |
|---|---|
| Date | 2015-08-25 14:28 -0700 |
| Message-ID | <mailman.34.1440538145.11709.python-list@python.org> |
| In reply to | #95640 |
On Tue, Aug 25, 2015 at 2:19 PM, RAH <rene.heymans@gmail.com> wrote:
> Dear All,
>
> I experienced an incomprehensible behavior (I've spent already many hours on this subject): the `file.write('string')` provides an error in run mode and not when interpreted at the console. The string must contain non-ASCII characters. If all ASCII, there is no error.
>
> The following example shows what I can see. I must overlook something because I cannot think Python makes a difference between interpreted and run modes and yet ... Can someone please check that subject.
>
> Thank you in advance.
> René
>
> Code extract from WSGI application (reply.py)
> =============================================
>
> request_body = environ['wsgi.input'].read(request_body_size) # bytes
> rb = request_body.decode() # string
> d = parse_qs(rb) # dict
>
> f = open('logbytes', 'ab')
> g = open('logstr', 'a')
> h = open('logdict', 'a')
>
> f.write(request_body)
> g.write(str(type(request_body)) + '\t' + str(type(rb)) + '\t' + str(type(d)) + '\n')
> h.write(str(d) + '\n') <--- line 28 of the application
>
> h.close()
> g.close()
> f.close()
>
>
> Tail of Apache2 error.log
> =========================
>
> [Tue Aug 25 20:24:04.657933 2015] [wsgi:error] [pid 3677:tid 3029764928] [remote 192.168.1.5:27575] File "reply.py", line 28, in application
> [Tue Aug 25 20:24:04.658001 2015] [wsgi:error] [pid 3677:tid 3029764928] [remote 192.168.1.5:27575] h.write(str(d) + '\\n')
> [Tue Aug 25 20:24:04.658201 2015] [wsgi:error] [pid 3677:tid 3029764928] [remote 192.168.1.5:27575] UnicodeEncodeError: 'ascii' codec can't encode character '\\xc7' in position 15: ordinal not in range(128)
>
What version of Python is Apache2 using? From the looks of the error,
it is probably using some version of Python2, in which case you'll
need to manually encode the string and pick an encoding for the file
(via an encoding argument to the open function). I'd recommend using
UTF-8.
You can log out the value of sys.version to find out the version number.
> Trying similar code within the Python interpreter
> =================================================
>
> rse@Alibaba:~/test$ python
> Python 3.4.0 (default, Jun 19 2015, 14:18:46)
> [GCC 4.8.2] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> di = {'userName': ['Ça va !']} <--- A dictionary
> >>> str(di)
> "{'userName': ['Ça va !']}" <--- and its string representation
> >>> type(str(di))
> <class 'str'> <--- Is a string indeed
> >>> fi = open('essai', 'a')
> >>> fi.write(str(di) + '\n')
> 26 <--- It works well
> >>> fi.close()
> >>>
In this run, you are using Python 3.4, which defaults to UTF-8.
| From | RAH <rene.heymans@gmail.com> |
|---|---|
| Date | 2015-08-26 02:12 -0700 |
| Message-ID | <cf98e944-f5fd-4075-8155-98ad1cbf54ff@googlegroups.com> |
| In reply to | #95641 |
Dear Chris,
I can confirm it is Python 3. Here is the line from the Apache2 log:
[Wed Aug 26 10:28:01.508194 2015] [mpm_worker:notice] [pid 1120:tid 3074398848] AH00292: Apache/2.4.7 (Ubuntu) OpenSSL/1.0.1f mod_wsgi/4.4.13 Python/3.4.0 configured -- resuming normal operations
As a matter of fact, I connect to the same machine that runs Apache2/mod_wsgi/Python via PuTTY and when I open the Python interpreter it responds:
> Python 3.4.0 (default, Jun 19 2015, 14:18:46)
> [GCC 4.8.2] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>>
Hence exactly the same Python 3.4.0 version.
By the way my whole installation is defaulted to UTF-8:
HTML: <head><meta charset="utf-8"></head>
Javascript: <script type="text/javascript" charset="UTF-8">
PuTTY: >Translation>Character set translation>Remote character set>UTF-8
Python code: # -*- coding: utf-8 -*-
Linux: $ echo $LANG ---> en_US.UTF-8
Finally, I also checked the coding of the `request_body` as written in the binary file (`logbytes`) and `Ç` is indeed coded as C387 (hex) or `é` is correctly written as C3A9 (hex).
Thank you Chris,
René
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-08-26 19:24 +1000 |
| Message-ID | <mailman.41.1440581085.11709.python-list@python.org> |
| In reply to | #95646 |
On Wed, Aug 26, 2015 at 7:12 PM, RAH <rene.heymans@gmail.com> wrote:
> By the way my whole installation is defaulted to UTF-8:
>
> HTML: <head><meta charset="utf-8"></head>
> Javascript: <script type="text/javascript" charset="UTF-8">
> PuTTY: >Translation>Character set translation>Remote character set>UTF-8
> Python code: # -*- coding: utf-8 -*-
> Linux: $ echo $LANG ---> en_US.UTF-8
Check that from actually inside your web script - os.environ["LANG"] should be "en_US.UTF-8". If it isn't, then that may be where your difference between web and interactive is, and so you'll need to be more explicit about encodings. It's common for background processes to have restricted environments, sometimes with LANG=C or similar.
ChrisA
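The check suggested here can be sketched as a small helper that is called from inside the web script rather than from the shell. The names `encoding_report`, `log_encoding_report` and the file name `envreport` are made up for this illustration; only the standard-library calls are real. Note that recent Python versions may report UTF-8 even when LANG=C, so the output depends on the interpreter as well as on the environment.

```python
import locale
import os
import sys


def encoding_report():
    """Collect the settings that influence the default encoding of open()."""
    return {
        "LANG": os.environ.get("LANG"),        # None if the variable is unset
        "LC_ALL": os.environ.get("LC_ALL"),
        "preferred_encoding": locale.getpreferredencoding(False),
        "filesystem_encoding": sys.getfilesystemencoding(),
    }


def log_encoding_report(path="envreport"):
    # The log file gets an explicit encoding so that writing the report
    # cannot fail for the very reason that is being diagnosed.
    with open(path, "a", encoding="utf-8") as log:
        for key, value in encoding_report().items():
            log.write("{}: {}\n".format(key, value))
```

Calling `log_encoding_report()` once from the WSGI application and once from an interactive session, then comparing the two outputs, shows whether the two environments really agree.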
| From | RAH <rene.heymans@gmail.com> |
|---|---|
| Date | 2015-08-26 07:24 -0700 |
| Message-ID | <cc1e2b5e-7798-4b7f-8911-172f856aa6eb@googlegroups.com> |
| In reply to | #95647 |
Thank you Chris. I'll share my findings in a moment. Please bear with me a bit more time. Cheers, René
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-08-26 09:16 +1000 |
| Message-ID | <mailman.35.1440544578.11709.python-list@python.org> |
| In reply to | #95640 |
On Wed, Aug 26, 2015 at 7:19 AM, RAH <rene.heymans@gmail.com> wrote:
> rb = request_body.decode() # string
I'd recommend avoiding this operation in Python 2. As of Python 3,
omitting the encoding means "UTF-8", but in Python 2 it means "use the
default encoding", and that often causes problems in scripts that run
in various background environments. Instead, explicitly say what
encoding you're using:
rb = request_body.decode("UTF-8")
Conversely, if you're using Python 3 for this, the default encoding is
coming from this line:
> h = open('logdict', 'a')
Again, if you want this to be a text file with a specific encoding, say so:
h = open('logdict', 'a', encoding='UTF-8')
Give that a try and see if your problems disappear. If not, this
should at least poke them with a pointy stick.
ChrisA
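A minimal sketch of the explicit-encoding advice, assuming Python 3. The helper names `append_utf8` and `raw_bytes` are invented for this example; the point is only that passing `encoding` to `open()` makes the result independent of LANG and the locale.

```python
def append_utf8(path, text):
    """Append text to a file, always encoding it as UTF-8."""
    with open(path, 'a', encoding='utf-8') as h:
        h.write(text)


def raw_bytes(path):
    """Return the file content as bytes, to inspect how characters were stored."""
    with open(path, 'rb') as raw:
        return raw.read()
```

After `append_utf8('logdict', str({'userName': ['Ça va !']}) + '\n')`, the bytes returned by `raw_bytes('logdict')` should contain `b'\xc3\x87'`, the UTF-8 form of Ç, whatever the process environment says.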
| From | RAH <rene.heymans@gmail.com> |
|---|---|
| Date | 2015-08-26 07:18 -0700 |
| Message-ID | <ff4475d9-7ee8-4a2c-ad3e-bb2894b0ae61@googlegroups.com> |
| In reply to | #95642 |
Dear Chris, Thank you. I got the answer (at least a partial one) that I will share in a while. I will first respond to the other posts I received to thank each and everyone. Please stay tuned. René
| From | dieter <dieter@handshake.de> |
|---|---|
| Date | 2015-08-26 07:51 +0200 |
| Message-ID | <mailman.37.1440568313.11709.python-list@python.org> |
| In reply to | #95640 |
RAH <rene.heymans@gmail.com> writes:
> I experienced an incomprehensible behavior (I've spent already many hours on this subject): the `file.write('string')` provides an error in run mode and not when interpreted at the console.
Maybe, I can explain the behavior: the interactive interpreter uses magic
to determine the console's encoding and automatically uses this
for console output. No such magic in non-interactive interpreter use.
Therefore, you can get "UnicodeEncoding" problems in non-interactive use
which you do not see in interactive use.
| From | RAH <rene.heymans@gmail.com> |
|---|---|
| Date | 2015-08-26 07:20 -0700 |
| Message-ID | <305ee8f5-0b37-41d3-a66a-a6a273c4072b@googlegroups.com> |
| In reply to | #95643 |
Dear Dieter, Indeed there is a difference. I will share my discoveries in a while after I respond to each one. Be in touch soon. Thanks. René
| From | Pete Dowdell <contact@stridebird.com> |
|---|---|
| Date | 2015-08-26 14:09 +0700 |
| Message-ID | <mailman.38.1440573177.11709.python-list@python.org> |
| In reply to | #95640 |
On 26/08/15 04:19, RAH wrote:
> UnicodeEncodeError: 'ascii' codec can't encode character '\\xc7' in position 15: ordinal not in range(128)
(Hi all, this is my first post to the list)
This can be a frustrating issue to resolve, but your issue might be solved with this environment variable:
PYTHONIOENCODING=UTF-8
Regards,
pd
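For reference, the Python documentation describes PYTHONIOENCODING as overriding the encoding of stdin, stdout and stderr. It is not documented to change the default encoding of files opened with `open()`, so it may not help with an `h.write()` error on a regular file. The sketch below, with the made-up helper `child_stdout_encoding`, starts a child interpreter to show the effect on standard output only.

```python
import os
import subprocess
import sys

# Child program: report the encoding of its own standard output.
CHILD = "import sys; print(sys.stdout.encoding)"


def child_stdout_encoding(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set; return what it reports."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("ascii").strip()
```

`child_stdout_encoding("UTF-8")` reports a UTF-8 encoding name for the child's stdout; the exact spelling of the name can vary between Python versions.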
| From | RAH <rene.heymans@gmail.com> |
|---|---|
| Date | 2015-08-26 07:22 -0700 |
| Message-ID | <726f08fe-ed98-4bcd-9351-2d894699097a@googlegroups.com> |
| In reply to | #95644 |
Thank you Pete. Indeed it has to do with choice of encoding. I'll be back in a short while with more details. Cheers, René
| From | RAH <rene.heymans@gmail.com> |
|---|---|
| Date | 2015-08-26 08:02 -0700 |
| Message-ID | <4fec8570-dfdf-4097-b6e6-c79fbd6e3022@googlegroups.com> |
| In reply to | #95640 |
Dear All,
First, thanks to each and everyone.
There is indeed a solution but I haven't yet found the root of the problem (I'll come back to that at the end of my post).
1) After many trials and errors, I found that the problem was with the write() function in `h.write(str(d) + '\n')` and not with the argument itself which is a perfect string.
2) Reading the documentation it refers to the open() function and its preferred encoding.
3) I checked with the interpreter and got:
rse@Alibaba:~/test$ python
Python 3.4.0 (default, Jun 19 2015, 14:18:46)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from locale import *
>>> getpreferredencoding()
'UTF-8'
>>>
4) I knew everything was set up with UTF-8 (refer to my first answer to Chris K.) so I couldn't believe it ! Another dead end ?
5) I had to make sure it was the same within the application, so I added a couple of statements to get and record the preferred encoding. And lo and behold I got this:
rse@Alibaba:~/test$ cat type
ANSI_X3.4-1968 <class 'bytes'> <class 'str'> <class 'dict'>
rse@Alibaba:~/test$
So, here the getpreferredencoding() function returns ANSI_X3.4-1968 instead of UTF-8 !?
6) The solution is then obvious: open the file by specifying the encoding; a suggestion made already by Chris A.
7) Now that the source of the problem is known, I must investigate why my run-time environment differs from the interpreter environment. I know it is the same machine, same Python 3.4.0. As the mod_wsgi module in Apache2 initiates Python for the run-time, I will look around there.
Dear All,
Thank you each and everyone for your contribution. I would suggest to close this subject. If I get a solution around mod_wsgi + python I will post it.
Kind regards to all,
René
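The finding in step 5 can be reproduced outside Apache. ANSI_X3.4-1968 is the formal name of ASCII, and Python accepts it as an alias of its `ascii` codec, so opening a file with that encoding triggers the same error as in the Apache log. The helper `try_write` is invented for this sketch and writes to a temporary file that it removes again.

```python
import codecs
import os
import tempfile

# Python resolves the locale's name for ASCII to its own 'ascii' codec.
print(codecs.lookup('ANSI_X3.4-1968').name)   # ascii


def try_write(encoding):
    """Write the dictionary text with the given encoding; return the error, if any."""
    text = str({'userName': ['Ça va !']}) + '\n'
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, 'a', encoding=encoding) as h:
            h.write(text)
        return None
    except UnicodeEncodeError as exc:
        return exc
    finally:
        os.remove(path)
```

`try_write('ANSI_X3.4-1968')` returns a `UnicodeEncodeError` whose `start` attribute is 15, the position of Ç reported in the Apache log, while `try_write('utf-8')` returns `None`.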
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-08-27 01:57 +1000 |
| Message-ID | <mailman.53.1440604680.11709.python-list@python.org> |
| In reply to | #95656 |
On Thu, Aug 27, 2015 at 1:02 AM, RAH <rene.heymans@gmail.com> wrote:
> 7) Now, that source of the problem is known, I must investigate why my run-time environment differs from the interpreter environment. I know it is the same machine, same Python 3.4.0. As the mod_wsgi module in Apache2 initiates Python for the run-time, I will look there around.
>
First place to look would be the environment. If os.environ["LANG"] has "C" when you run under Apache, that would be the explanation. But explicitly choosing the encoding is the best way for other reasons anyway, and it solves the problem, so researching this is merely a matter of curiosity.
ChrisA
| From | RAH <rene.heymans@gmail.com> |
|---|---|
| Date | 2015-08-26 12:23 -0700 |
| Message-ID | <9f5d0558-cfc4-4764-b0a2-2516c1cddc8e@googlegroups.com> |
| In reply to | #95662 |
On Wednesday, August 26, 2015 at 5:59:12 PM UTC+2, Chris Angelico wrote:
> On Thu, Aug 27, 2015 at 1:02 AM, RAH wrote:
> > 7) Now, that source of the problem is known, I must investigate why my run-time environment differs from the interpreter environment. I know it is the same machine, same Python 3.4.0. As the mod_wsgi module in Apache2 initiates Python for the run-time, I will look there around.
> >
> First place to look would be the environment. If os.environ["LANG"]
> has "C" when you run under Apache, that would be the explanation. But
> explicitly choosing the encoding is the best way for other reasons
> anyway, and it solves the problem, so researching this is merely a
> matter of curiosity.
>
> ChrisA
Hello Chris,
Thanks for your further input.
os.environ["LANG"] returns `en_US.UTF-8`, exactly the same as asking in Bash
`echo $LANG`. But again this is the interpreter.
Now if I ask the same os.environ["LANG"] within my application, it returns `C`.
So, again, there is a marked difference between what the interpreter shows and what the run-time shows.
In the meantime, I have checked the configuration directives of mod_wsgi. There is nothing there to set or choose a particular environment. And I wonder if there is a reason to check the Apache2 directives, like SetEnv ? Indeed Apache2 doesn't know anything about Python.
On the other hand the server I use (Ubuntu) still has Python 2.7 installed and I can't remove it with apt-get. I believe Ubuntu needs python a lot. The Python 3.4 has been installed separately and it could be that a doubtful configuration subsists.
Nevertheless I can't figure out why calling Python in the shell (interactive mode) or letting mod_wsgi start the same Python provide two different environments. I guess I must investigate that part because mod_wsgi gets to Python in what I would call 'auto-discovery' mode. Obviously it gets the same version 3.4.0 but maybe it picks up some 2.7 configuration files because the installation of 3.4 next to 2.7 might not be perfect. I'll look at it.
Thank you Chris. If I find something I'll post it here.
René
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-08-27 09:15 +1000 |
| Message-ID | <mailman.69.1440630926.11709.python-list@python.org> |
| In reply to | #95670 |
On Thu, Aug 27, 2015 at 5:23 AM, RAH <rene.heymans@gmail.com> wrote:
> Nevertheless I can't figure out why calling Python in the shell (interactive mode) or letting mod_wsgi start the same Python provide two different environments. I guess I must investigate that part because mod_wsgi gets to Python in what I would call 'auto-discovery' mode. Obviously it gets the same version 3.4.0 but maybe it picks up some 2.7 configuration files because the installation of 3.4 next to 2.7 might not be perfect. I'll look at it.
>
Apache itself most likely is running with LANG=C and other environmental changes. It's not a Python-specific thing.
ChrisA
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2015-08-27 08:59 +0300 |
| Message-ID | <87io812rw1.fsf@elektro.pacujo.net> |
| In reply to | #95681 |
Chris Angelico <rosuav@gmail.com>:
> Apache itself most likely is running with LANG=C and other
> environmental changes. It's not a Python-specific thing.
The topic is discussed also at: <URL: http://blog.dscpl.com.au/2014/09/setting-lang-and-lcall-when-using.html>.
Personally, I think the C locale is the only safe choice for Apache. CGI shell scripts, in particular, can act in surprising ways when the locale is changed.
Marko
| From | RAH <rene.heymans@gmail.com> |
|---|---|
| Date | 2015-08-27 07:01 -0700 |
| Message-ID | <3e207419-9152-4552-a6da-ec8364ca62ea@googlegroups.com> |
| In reply to | #95640 |
Dear All,
The solution / explanation follows.
Thanks to Graham Dumpleton, the author of mod_wsgi (the WSGI module for Apache2) the source of the problem could be traced back to variables in Apache2. Below are the details reproduced from https://groups.google.com/forum/#!topic/modwsgi/4wdfCOnMUkU
Now, everything is indeed UTF-8 !
Thanks again to each and everyone.
Best regards,
René
Solution
------------
There is a file /etc/apache2/envvars referred to by /etc/apache2/apache2.conf.
In that file, I found the following lines:
## The locale used by some modules like mod_dav
export LANG=C
## Uncomment the following line to use the system default locale instead:
#. /etc/default/locale
As I don't need mod_dav, and it is neither compiled into Apache2 (`apache2ctl -l`) nor loaded by Apache2 (`apache2ctl -M`), I commented out the first line and uncommented the second so that it now looks like:
#export LANG=C
. /etc/default/locale
export LANG
After a stop/start of Apache2, everything works fine and when I put the code:
from locale import getpreferredencoding
from os import environ
prefcoding = getpreferredencoding()
lang = environ["LANG"]
# 'with' closes the file, so both lines are flushed to disk
with open('envresults', 'a') as g:
    g.write('LANG: ' + lang + '\n')
    g.write('PrefCod: ' + prefcoding + '\n')
in my WSGI application, it gives me the same as the interpreter:
rse@Alibaba:~/test$ cat envresults
LANG: en_US.UTF-8
PrefCod: UTF-8
rse@Alibaba:~/test$
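Changing /etc/apache2/envvars fixes this one server, but the application can also be written so that it does not depend on LANG at all. The following is only a sketch of the logging part of such a WSGI application, assuming the request body is UTF-8 as in the original post; the response body and headers are placeholders added for completeness and are not from the thread.

```python
from urllib.parse import parse_qs


def application(environ, start_response):
    # Read exactly CONTENT_LENGTH bytes of the request body.
    try:
        size = int(environ.get('CONTENT_LENGTH') or 0)
    except ValueError:
        size = 0
    request_body = environ['wsgi.input'].read(size)    # bytes

    d = parse_qs(request_body.decode('utf-8'))         # dict of lists

    # Explicit encoding: works whether Apache runs with LANG=C or not.
    with open('logdict', 'a', encoding='utf-8') as h:
        h.write(str(d) + '\n')

    body = b'OK\n'                                     # placeholder response
    start_response('200 OK', [('Content-Type', 'text/plain; charset=utf-8'),
                              ('Content-Length', str(len(body)))])
    return [body]
```

With this version the `logdict` file receives `{'userName': ['Ça va !']}` for the input shown at the start of the thread, independent of the locale Apache was started with.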
-*- The End -*-