| Date | 2013-04-13 11:41 +1000 |
|---|---|
| From | Cameron Simpson <cs@zip.com.au> |
| Subject | Re: Unicode issue with Python v3.3 |
| References | <c80153db-9987-44f7-9065-708f97ccbc86@googlegroups.com> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.542.1365817342.3114.python-list@python.org> |
On 11Apr2013 09:55, Nikos <nagia.retsina@gmail.com> wrote:
| On Thursday, 11 April 2013 at 1:45:22 PM UTC+3, Cameron Simpson wrote:
| > On 10Apr2013 21:50, nagia.retsina@gmail.com <nagia.retsina@gmail.com> wrote:
| > | the doctype is coming from the attempt of script metrites.py to open and read the 'index.html' file.
| > | But I don't know how to try to open it as a byte file instead of a text file.
Lele Gaifax showed one way:
    from codecs import open
    with open('index.html', encoding='utf-8') as f:
        content = f.read()
But a plain open() should also do:
    with open('index.html') as f:
        content = f.read()
if you're not taking tight control of the file encoding.
The point here is to get _text_ (i.e. str) data from the file, not bytes.
If the text turns out to be incorrectly decoded (i.e. the file bytes
are read and assembled into text strings incorrectly) because
the default encoding is wrong, then you may need to reach for Lele's
more verbose open() example to select the correct encoding.
But first ignore that and get text (str) instead of bytes.
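To make the str/bytes distinction concrete, here is a minimal sketch. The helper names read_text() and read_bytes() and the temporary sample file are made up for illustration; they are not from metrites.py.

```python
import os
import tempfile

def read_text(path, encoding="utf-8"):
    """Return the file's contents as str, decoded with the given encoding."""
    with open(path, encoding=encoding) as f:
        return f.read()

def read_bytes(path):
    """Return the file's raw contents as bytes, with no decoding at all."""
    with open(path, "rb") as f:
        return f.read()

# Hypothetical sample file standing in for index.html.
sample_path = os.path.join(tempfile.mkdtemp(), "index.html")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write("<p>υλικό</p>\n")

text = read_text(sample_path)   # str: this is what you want to hand to print()
raw = read_bytes(sample_path)   # bytes: this is what prints as b'...'
print(type(text).__name__, type(raw).__name__)
```

If type(content) in your script reports bytes rather than str, the file is being opened in binary mode somewhere.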
If you're already getting text from the file, something later is
making bytes and handing it to print().
Another approach to try is to use
    sys.stdout.write()
instead of
    print()
The print() function will take _anything_ and write text of some form.
The write() function will throw an exception if it gets the wrong type of data.
If sys.stdout is opened in binary mode then write() will require
bytes as data; strings will need to be explicitly turned into bytes
via .encode() in order to not raise an exception.
If sys.stdout is open in text mode, write() will require str data.
The sys.stdout file itself will transcribe to bytes for you.
If you take that route, at least you will not have confusion about
str versus bytes.
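A small sketch of that strictness, using in-memory streams as stand-ins for a binary-mode and a text-mode sys.stdout (the stand-ins and the try_write() helper are assumptions for illustration only):

```python
import io

# Stand-ins for the two possible kinds of stdout.
binary_out = io.BytesIO()
text_out = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")

def try_write(stream, data):
    """Try stream.write(data); return None on success, else the exception name."""
    try:
        stream.write(data)
    except TypeError as e:
        return type(e).__name__
    return None

results = {
    "text stream, str": try_write(text_out, "γ"),
    "text stream, bytes": try_write(text_out, "γ".encode("utf-8")),
    "binary stream, bytes": try_write(binary_out, "γ".encode("utf-8")),
    "binary stream, str": try_write(binary_out, "γ"),
}
for case, outcome in results.items():
    print(case, "->", outcome or "ok")
```

The mismatched cases raise TypeError instead of quietly writing a "b'...'" representation, which is exactly why write() is the better diagnostic tool here.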
For an HTML output page I would advocate arranging that sys.stdout
is in text mode; that way you can do the natural thing and .write()
str data and lovely UTF-8 bytes will come out the other end.
If the above test (using .write() instead of print()) shows it to
be in binary mode we can fix that. But you need to find out.
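One possible way to arrange a text-mode, UTF-8 stdout is to wrap the underlying binary stream in an io.TextIOWrapper. This is only a sketch: utf8_text_stream() is a made-up helper name, and it is demonstrated against an in-memory buffer rather than your real CGI stdout.

```python
import io

def utf8_text_stream(binary_stream):
    """Wrap a binary stream so that str written to it comes out as UTF-8 bytes."""
    # newline="\n" turns off platform newline translation on output.
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")

# In a CGI script one might do (assuming sys.stdout has a .buffer attribute):
#     sys.stdout = utf8_text_stream(sys.stdout.buffer)

# Demonstration against an in-memory buffer instead of the real stdout.
sink = io.BytesIO()
out = utf8_text_stream(sink)
out.write("Content-Type: text/html; charset=utf-8\n\n")
out.write("<p>υλικό</p>\n")
out.flush()
payload = sink.getvalue()   # the bytes a web server would receive
```

After that, both print() and .write() of str data should produce UTF-8 regardless of the locale the web server runs the script under.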
You will want access to the error messages from the CGI environment;
do you have access to the web server's error_log? You can tail that
in a terminal while you reload the page to see what's going on.
| This works in the shell, but doesn't work on my website:
|
| $ cat utf8.txt
| υλικό!Πρόκειται γ
Ok, so your terminal is using UTF-8 as its output encoding. (And so
is your mail posting program, since we see it unmangled on my screen
here.)
| $ python3
| Python 3.2.3 (default, Oct 19 2012, 20:10:41)
| [GCC 4.6.3] on linux2
| Type "help", "copyright", "credits" or "license" for more information.
| >>> data = open('utf8.txt').read()
| >>> print(data)
| υλικό!Πρόκειται γ
Likewise.
However, in an exciting twist, I seem to recall that Python invoked
interactively with a terminal as output will have the terminal's default
encoding in place on sys.stdout, producing what you expect. _However_,
for Python invoked in a batch environment where stdout is not a terminal
(such as the CGI environment producing your web page), that is
_not_ necessarily the case.
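Rather than guess, you can ask Python what it decided. A rough sketch follows; describe_stdout() is a hypothetical helper, and it assumes that your web server sends the script's stderr to its error log, which is common but not guaranteed.

```python
import io
import sys

def describe_stdout(stream=None):
    """Report whether the stream is a terminal and which encoding it uses."""
    stream = sys.stdout if stream is None else stream
    isatty = getattr(stream, "isatty", lambda: False)()
    return {
        "isatty": bool(isatty),
        "encoding": getattr(stream, "encoding", None),
    }

info = describe_stdout()
# Report on stderr so the diagnostic does not end up inside the HTML output.
sys.stderr.write("stdout: isatty=%s encoding=%s\n"
                 % (info["isatty"], info["encoding"]))

# A non-terminal stand-in, for comparison with what the shell reports.
not_a_tty = describe_stdout(io.StringIO())
```

Run it once from your shell and once as a CGI script, then compare the two reports.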
| >>> print(data.encode('utf-8'))
| b'\xcf\x85\xce\xbb\xce\xb9\xce\xba\xcf\x8c!\xce\xa0\xcf\x81\xcf\x8c\xce\xba\xce\xb5\xce\xb9\xcf\x84\xce\xb1\xce\xb9 \xce\xb3\n'
|
| See, the last line is what I am getting on my website.
The above line takes your Unicode text in "data" and transcribes
it to bytes using UTF-8 as the encoding. And print() is then receiving
that bytes object and printing its str() representation as "b'....'".
That str is itself Unicode, and when print() passes it to sys.stdout,
_that_ transcribes the Unicode "b'...'" string to bytes for your
terminal, using UTF-8 going by the previous examples above. But
since all those characters are in the bottom 128 code points (the
ASCII range), the byte sequence will be the same if it uses ASCII or
ISO8859-1 or almost anything else:-)
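The same chain can be reproduced step by step without print(). This is just an illustration using a shortened sample string:

```python
data = "γ!\n"                    # text (str), like the contents of utf8.txt
encoded = data.encode("utf-8")   # bytes: b'\xce\xb3!\n'
shown = str(encoded)             # what print() writes when handed bytes

# shown is an ordinary str made of ASCII characters: b, quote, backslash, x, ...
# so it encodes to the same bytes under ASCII, ISO8859-1 or UTF-8.
same_everywhere = (
    shown.encode("ascii") == shown.encode("latin-1") == shown.encode("utf-8")
)
print(shown)
```

So the "b'\xce...'" on the web page is a correctly transmitted string; it is simply the repr of a bytes object and not the text you meant to send.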
As you can see, there's a lot of encoding/decoding going on behind
the scenes even in this superficially simple example.
| If I remove
| the encode('utf-8') part in metrites.py, the webpage will not show
| anything at all...
Ah, but data will be being output. The print() function _will_ be
writing "data" out in some form. I suggest you remove the .encode()
and then examine the _source_ text of the web page, not its visible
form.
So: remove .encode(), reload the web page, and "view page source"
(this depends on your browser; it is Ctrl-U in Firefox, or Cmd-U in
Firefox on a Mac).
I think a lot of the issue you have in this thread is that your
page is too complex. Make another page to do the same thing, and
start with nothing. Add stuff to it a single item at a time until
the page behaves incorrectly. Then you will know the exact item of
code that introduced the issue. And then that single item can be
examined in detail for the decode/encode issues.
The other issue in the thread is that people losing patience get
snarky. Respond only to the technical content. If a message is only
snarky, _ignore_ it. People like the last word; let them have it
and you won't get sidetracked into arguments.
Cheers,
--
Cameron Simpson <cs@zip.com.au>
PCs are like a submarine, it will work fine till you open Windows. - zollie101