To: python-list@python.org
From: Terry Reedy <tjreedy@udel.edu>
Subject: Re: Python 3.3, gettext and Unicode problems
Date: Sun, 30 Dec 2012 20:48:08 -0500
Newsgroups: comp.lang.python

On 12/30/2012 7:39 PM, Marcel Rodrigues wrote:
> I'm using Python 3.3 (CPython) and am having trouble getting the
> standard gettext module to handle Unicode messages.

I have never even looked at the doc before, but I will take a look.

> My problem can be isolated as follows:
>
> I have 3 files in a folder: greeting.py, greeting.po and msgfmt.py.
>
> -- greeting.py --
> import gettext
>
> t = gettext.translation("greeting", "locale", ["pt"])
> _ = t.lgettext

gettext.lgettext(message)
Equivalent to gettext(), but the translation is returned in the
preferred system encoding, if no other encoding was explicitly set with
bind_textdomain_codeset().

Given that 'preferred system encoding' apparently means
'locale.getpreferredencoding', and that this seems not to be what you want,
why are you using the 'l' version?
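
To make the difference concrete, here is a minimal sketch of the encode step that
the 'l' version performs. The helper name lgettext_like is hypothetical; it only
mimics the 'tmsg.encode(locale.getpreferredencoding())' line visible in the
traceback and does not call lgettext itself, so it does not depend on that method
being available in your Python version. The translated text is hard-coded here
as an assumption, standing in for what gettext() would return.

```python
import locale

# The str that plain gettext() would return for "hello" (hard-coded for the demo).
translated = "ol\u00e1"


def lgettext_like(text, encoding=None):
    """Hypothetical stand-in for the encode step seen in the traceback."""
    if encoding is None:
        encoding = locale.getpreferredencoding()
    return text.encode(encoding)


# With a UTF-8 locale the result is bytes, not str.
as_utf8 = lgettext_like(translated, "utf-8")

# With an ASCII locale the same call fails, as in the reported traceback.
try:
    lgettext_like(translated, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

Either way the 'l' version hands back bytes in the locale encoding, which is
rarely what you want to pass to print() in Python 3.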

>
> print("_charset = {0}\n".format(t._charset))
> print(_("hello"))

A strong suggestion: whenever you want to print a string and the
computation of the string (or bytes) involves encoding/decoding,
separate the computation and the printing (on two separate lines).

s = _("hello")
print(s)

The reason is that printing also requires encoding for the output device,
and that process can also generate a UnicodeError that may be hard to
distinguish from an error in the computation of s itself.
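
A small sketch of how the two steps can fail independently. The ASCII-only
output stream below is an assumption made for illustration: it is an in-memory
stand-in for a terminal whose encoding is ASCII, not your actual stdout.

```python
import io

# Simulated output device that can only encode ASCII (illustrative assumption).
ascii_out = io.TextIOWrapper(io.BytesIO(), encoding="ascii")

# Step 1: compute the string. Nothing is encoded here, so nothing can fail.
s = "ol\u00e1"

# Step 2: print it. The stream must encode s, and that is where the error arises.
try:
    print(s, file=ascii_out)
    ascii_out.flush()
    print_failed = False
except UnicodeEncodeError:
    print_failed = True
```

With the two steps on separate lines, the traceback points at either the
assignment or the print call, so there is no doubt about which one failed.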

> -- EOF --
>
> -- greeting.po --
> msgid ""
> msgstr ""
> "Project-Id-Version: 1.0\n"
> "MIME-Version: 1.0\n"
> "Content-Type: text/plain; charset=UTF-8\n"
> "Content-Transfer-Encoding: 8bit\n"
>
> msgid "hello"
> msgstr "olá"
> -- EOF --
>
> msgfmt.py was downloaded from
> http://hg.python.org/cpython/file/9e6ead98762e/Tools/i18n/msgfmt.py,
> since this tool apparently isn't included in the python3 package
> available on Arch Linux official repositories.
>
> It's probably also worth noting that the file greeting.po is encoded
> itself as UTF-8.
>
>  From that folder, I run the following commands:
>
> $ mkdir -p locale/pt/LC_MESSAGES
> $ python msgfmt.py -o !$/greeting.mo greeting.po
> $ python greeting.py
>
> The output is:
> _charset = UTF-8
>
> Traceback (most recent call last):
>    File "greeting.py", line 7, in <module>
>      print(_("hello"))
>    File "/usr/lib/python3.3/gettext.py", line 314, in lgettext
>      return tmsg.encode(locale.getpreferredencoding())
> UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in
> position 2: ordinal not in range(128)

In particular, we have seen, in previous posts here, this exact error
generated during printing rather than during the string computation, and
posters have wasted time looking for the error in the string or bytes
computation itself.

> My interpretation of this output is that even though gettext correctly
> detects the MO file charset as UTF-8, it tries to encode the translated
> message with the system's "preferred encoding", which happens to be ASCII.

Just as you seem to have requested ;-)

> Anyone know why this happens? Is this a bug on my code? Maybe I have
> misunderstood gettext...

You used lgettext (l = locale). As I said, I am new to this.
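
For what it is worth, here is an untested-on-your-setup sketch of what I would
try instead: bind _ to t.gettext, which returns a str, rather than t.lgettext.
To keep it self-contained, the catalog is built in memory by a hypothetical
helper, make_mo, that writes a minimal little-endian .mo layout; in your case
the greeting.mo produced by msgfmt.py would play that role.

```python
import gettext
import io
import struct


def make_mo(messages):
    """Hypothetical helper: build a minimal little-endian .mo catalog in memory."""
    ids = b""
    strs = b""
    spans = []
    for msgid in sorted(messages):  # entries are stored sorted by msgid
        raw_id = msgid.encode("utf-8")
        raw_str = messages[msgid].encode("utf-8")
        spans.append((len(raw_id), len(ids), len(raw_str), len(strs)))
        ids += raw_id + b"\0"
        strs += raw_str + b"\0"
    count = len(spans)
    ids_start = 7 * 4 + 16 * count  # 7-word header plus two index tables
    strs_start = ids_start + len(ids)
    id_table = []
    str_table = []
    for id_len, id_off, str_len, str_off in spans:
        id_table += [id_len, ids_start + id_off]
        str_table += [str_len, strs_start + str_off]
    # magic, version, count, msgid table offset, msgstr table offset, hash size/offset
    header = struct.pack("<7I", 0x950412DE, 0, count, 7 * 4, 7 * 4 + 8 * count, 0, 0)
    tables = struct.pack("<%dI" % (4 * count), *(id_table + str_table))
    return header + tables + ids + strs


# The empty msgid carries the metadata, including the catalog charset.
catalog = make_mo({
    "": "Content-Type: text/plain; charset=UTF-8\n",
    "hello": "ol\u00e1",
})

t = gettext.GNUTranslations(io.BytesIO(catalog))
_ = t.gettext      # not t.lgettext: gettext returns a str, with no locale encoding

s = _("hello")     # computation on its own line
print(ascii(s))    # ascii() keeps this demo independent of the terminal encoding
```

Whether print(s) itself then works on your terminal is a separate question that
depends on the encoding of your output device.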

-- 
Terry Jan Reedy