Newsgroups: comp.lang.python
Date: Fri, 12 Apr 2013 21:50:45 -0700 (PDT)
In-Reply-To: <mailman.542.1365817342.3114.python-list@python.org>
Complaints-To: groups-abuse@google.com
Injection-Info: glegroupsg2000goo.googlegroups.com; posting-host=94.68.69.168; posting-account=hGu1uQoAAACZy7LiR653nG0NwqDrTyoS
References: <c80153db-9987-44f7-9065-708f97ccbc86@googlegroups.com> <mailman.542.1365817342.3114.python-list@python.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Subject: Re: Unicode issue with Python v3.3
From: nagia.retsina@gmail.com
To: comp.lang.python@googlegroups.com
Content-Type: text/plain; charset=ISO-8859-7
Content-Transfer-Encoding: quoted-printable
Cc: Nikos <nagia.retsina@gmail.com>, Νίκος Γκρ33κ <nikos.gr33k@gmail.com>, python-list@python.org
Precedence: list
Message-ID: <mailman.545.1365828648.3114.python-list@python.org>
Lines: 316
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:43501

On Saturday, 13 April 2013 4:41:57 AM UTC+3, user Cameron Simpson wrote:
> On 11Apr2013 09:55, Nikos <nagia.retsina@gmail.com> wrote:
> | On Thursday, 11 April 2013 1:45:22 PM UTC+3, user Cameron Simpson wrote:
> | > On 10Apr2013 21:50, nagia.retsina@gmail.com <nagia.retsina@gmail.com> wrote:
> | > | the doctype is coming from the attempt of script metrites.py to open and read the 'index.html' file.
> | > | But i don't know how to try to open it as a byte file instead of a text file.
>
> Lele Gaifax showed one way:
>
>     from codecs import open
>     with open('index.html', encoding='utf-8') as f:
>         content = f.read()
>
> But a plain open() should also do:
>
>     with open('index.html') as f:
>         content = f.read()
>
> if you're not taking tight control of the file encoding.
>
> The point here is to get _text_ (i.e. str) data from the file, not bytes.
>
> If the text turns out to be incorrectly decoded (i.e. incorrectly
> reading the file bytes and assembling them into text strings) because
> the default encoding is wrong, then you may need to reach for Lele's
> more verbose open() example to select the correct encoding.
>
> But first ignore that and get text (str) instead of bytes.
> If you're already getting text from the file, something later is
> making bytes and handing it to print().
>
> Another approach to try is to use
>   sys.stdout.write()
> instead of
>   print()
>
> The print() function will take _anything_ and write text of some form.
> The write() function will throw an exception if it gets the wrong type of data.
>
> If sys.stdout is opened in binary mode then write() will require
> bytes as data; strings will need to be explicitly turned into bytes
> via .encode() in order to not raise an exception.
>
> If sys.stdout is open in text mode, write() will require str data.
> The sys.stdout file itself will transcribe to bytes for you.
>
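A small sketch of the str-versus-bytes check described above. It uses in-memory streams as stand-ins for sys.stdout in each mode (that substitution is my assumption; the helper name try_write is made up for illustration), so it can be run from the shell without touching the real stdout:

```python
import io

# Stand-ins for sys.stdout in binary mode and in text mode (assumption:
# in-memory streams enforce the same argument types as the real ones).
binary_out = io.BytesIO()
text_out = io.TextIOWrapper(io.BytesIO(), encoding='utf-8')

def try_write(stream, data):
    """Return 'ok' if write() accepts the data, else the exception name."""
    try:
        stream.write(data)
        return 'ok'
    except TypeError as exc:
        return type(exc).__name__

results = {
    'text <- str': try_write(text_out, 'hello'),
    'text <- bytes': try_write(text_out, b'hello'),
    'binary <- bytes': try_write(binary_out, b'hello'),
    'binary <- str': try_write(binary_out, 'hello'),
}
for case, outcome in results.items():
    print(case, '->', outcome)
```

Unlike print(), write() refuses the wrong type outright, which is what makes it useful for finding out which mode a stream is in.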
> If you take that route, at least you will not have confusion about
> str versus bytes.
>
> For an HTML output page I would advocate arranging that sys.stdout
> is in text mode; that way you can do the natural thing and .write()
> str data and lovely UTF-8 bytes will come out the other end.
>
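One possible way to arrange that, sketched under assumptions: wrap the underlying binary stream in an io.TextIOWrapper with an explicit UTF-8 encoding. The helper name utf8_text_stream is hypothetical, and the demonstration writes to an in-memory buffer rather than the real stdout; I have not tested this under Apache.

```python
import io

def utf8_text_stream(binary_stream):
    """Wrap a binary stream so that str written to it comes out as UTF-8 bytes."""
    return io.TextIOWrapper(binary_stream, encoding='utf-8', newline='\n')

# In a CGI script the idea would be (assumption, needs "import sys"):
#     sys.stdout = utf8_text_stream(sys.stdout.buffer)

# Demonstration against an in-memory buffer instead of the real stdout.
raw = io.BytesIO()
out = utf8_text_stream(raw)
out.write('Content-Type: text/html; charset=utf-8\n\n')
out.write('<p>υλικό</p>\n')
out.flush()

payload = raw.getvalue()   # the bytes that would reach the web server
print(payload)
```

With the encoding fixed explicitly, the output no longer depends on whatever default the CGI environment happens to provide.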
> If the above test (using .write() instead of print()) shows it to
> be in binary mode we can fix that. But you need to find out.
>
> You will want access to the error messages from the CGI environment;
> do you have access to the web server's error_log? You can tail that
> in a terminal while you reload the page to see what's going on.
>
> | This works in the shell, but doesn't work on my website:
> |
> | $ cat utf8.txt
> | υλικό!Πρόκειται γ
>
> Ok, so your terminal is using UTF-8 as its output coding. (And so
> is your mail posting program, since we see it unmangled on my screen
> here.)
>
> | $ python3
> | Python 3.2.3 (default, Oct 19 2012, 20:10:41)
> | [GCC 4.6.3] on linux2
> | Type "help", "copyright", "credits" or "license" for more information.
> | >>> data = open('utf8.txt').read()
> | >>> print(data)
> | υλικό!Πρόκειται γ
>
> Likewise.
>
> However, in an exciting twist, I seem to recall that Python invoked
> interactively with a terminal as output will have the default terminal
> encoding in place on sys.stdout. Producing what you expect. _However_,
> for Python invoked in a batch environment where stdout is not a terminal
> (such as in the CGI environment producing your web page), that is
> _not_ necessarily the case.
>
> | >>> print(data.encode('utf-8'))
> | b'\xcf\x85\xce\xbb\xce\xb9\xce\xba\xcf\x8c!\xce\xa0\xcf\x81\xcf\x8c\xce\xba\xce\xb5\xce\xb9\xcf\x84\xce\xb1\xce\xb9 \xce\xb3\n'
> |
> | See, the last line is what i am getting on my website.
>
> The above line takes your Unicode text in "data" and transcribes
> it to bytes using UTF-8 as the encoding. And print() is then receiving
> that bytes object and printing its str() representation as "b'....'".
> That str is itself unicode, and when print passes it to sys.stdout,
> _that_ transcribes the unicode "b'...'" string as bytes to your
> terminal. Using UTF-8 based on the previous examples above, but
> since all those characters are in the ASCII range (code points 0 to
> 127) the byte sequence will be the same if it uses ASCII or ISO8859-1
> or almost anything else:-)
>
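The effect described above can be reproduced in a few lines. This is only an illustration with a shortened sample string of my own choosing, not the contents of the real file:

```python
data = 'υλικό'                    # str: Unicode text
encoded = data.encode('utf-8')    # bytes: the UTF-8 encoding of that text

# print() calls str() on whatever it is given. For a bytes object that
# yields the "b'...'" representation, which is itself plain ASCII text.
shown_for_bytes = str(encoded)
print(shown_for_bytes)

# Decoding turns the bytes back into the original text.
round_trip = encoded.decode('utf-8')
```

So seeing b'\xcf\x85...' in the page means a bytes object reached print(); the fix is to hand print() the str, not to encode it first.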
> As you can see, there's a lot of encoding/decoding going on behind
> the scenes even in this superficially simple example.
>
> | If i remove
> | the encode('utf-8') part in metrites.py, the webpage will not show
> | anything at all...
>
> Ah, but data will be being output. The print() function _will_ be
> writing "data" out in some form.  I suggest you remove the .encode()
> and then examine the _source_ text of the web page, not its visible
> form.
>
> So: remove .encode(), reload the web page, "view page source"
> (depends on your browser, it is ctrl-U in Firefox ((Cmd-U in firefox
> on a Mac))).
>
> I think a lot of the issue you have in this thread is that your
> page is too complex. Make another page to do the same thing, and
> start with nothing. Add stuff to it a single item at a time until
> the page behaves incorrectly. Then you will know the exact item of
> code that introduced the issue. And then that single item can be
> examined in detail for the decode/encode issues.
>
> The other issue in the thread is that people losing patience get
> snarky. Respond only to the technical content. If a message is only
> snarky, _ignore_ it. People like the last word; let them have it
> and you won't get sidetracked into arguments.
>
> Cheers,
> -- 
> Cameron Simpson <cs@zip.com.au>
>
> PCs are like a submarine, it will work fine till you open Windows. - zollie101

First of all, thank you very much, Cameron, for your detailed help and for taking the time to write to me.

It seems there was another problem I was not aware of: I was uploading files to /root/public_html/cgi-bin instead of /home/nikos/public_html/cgi-bin.

I realized this when I deliberately introduced an error into the metrites.py script and still got the same page.

Okay, with that corrected, I tried the plain open() solution and got this response back from the shell:

Traceback (most recent call last):
  File "metrites.py", line 213, in <module>
    htmldata = f.read()
  File "/root/.local/lib/python2.7/lib/python3.3/encodings/iso8859_7.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0xae in position 47: character maps to <undefined>

Then I switched to:

        with open('/home/nikos/www/' + page, encoding='utf-8') as f:
            htmldata = f.read()

and I got no error at all, just a clean run *from the shell*!
But I get an internal server error when I load the web page from the browser (Chrome).
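
For reference, the decode error in the traceback above can be reproduced with a small, self-contained sketch. The sample text is hypothetical (I picked a registered-trademark sign only because its UTF-8 encoding contains the byte 0xae); it is not the content of the real page. The point is that the same bytes fail under ISO-8859-7 and read fine once the encoding is named as UTF-8:

```python
import os
import tempfile

# Hypothetical stand-in for the real HTML file: UTF-8 text containing the
# registered-trademark sign, whose second UTF-8 byte is 0xae.
sample = '<title>\u00ae υλικό</title>\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'index.html')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(sample)

    # Decoding those bytes as ISO-8859-7 fails: 0xae maps to no character.
    try:
        with open(path, encoding='iso8859_7') as f:
            f.read()
        wrong_codec_error = None
    except UnicodeDecodeError as exc:
        wrong_codec_error = exc

    # Naming the encoding the file was actually saved in works.
    with open(path, encoding='utf-8') as f:
        htmldata = f.read()

print(wrong_codec_error)
```

This matches what happened in the shell: a plain open() picked ISO-8859-7 as the default encoding, while the file itself appears to be UTF-8.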

So, can you please tell me where I can find the Apache error log, so that I can post it here?

Is the Apache error_log always better than running 'python3 metrites.py', because even if the Python script has no errors, Apache will also log more web-related things?
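
Until the error log is located, one possible stopgap is to have the script report its own failure to the browser. The following is only a sketch under assumptions: run_cgi and broken_main are hypothetical names, and it only helps if the script gets as far as starting, so it cannot replace the server's log for problems such as wrong permissions or a bad interpreter line.

```python
import io
import sys
import traceback

def run_cgi(main, out=None):
    """Run main(); if it raises, send the traceback to the client as plain text."""
    out = out if out is not None else sys.stdout
    try:
        main()
    except Exception:
        out.write('Content-Type: text/plain; charset=utf-8\n\n')
        out.write(traceback.format_exc())

# Demonstration with a deliberately failing function and an in-memory stream.
def broken_main():
    raise ValueError('boom')

captured = io.StringIO()
run_cgi(broken_main, out=captured)
report = captured.getvalue()
print(report)
```

In the real script, the body of metrites.py would take the place of broken_main, and the default output stream (sys.stdout) would be used.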

Thank you Cameron.