In-Reply-To: <d6250c5d-ff7d-46ae-9e0a-1c51a6e9b7dc@googlegroups.com>
References: <fbeee40a-bc8a-4cef-abe7-2b2d54f59625@googlegroups.com> <d6250c5d-ff7d-46ae-9e0a-1c51a6e9b7dc@googlegroups.com>
Date: Sun, 25 Aug 2013 20:23:41 +0200
Subject: Re: can't get utf8 / unicode strings from embedded python
From: Vlastimil Brom <vlastimil.brom@gmail.com>
To: "David M. Cotter" <me@davecotter.com>
Cc: python <python-list@python.org>
Newsgroups: comp.lang.python
Message-ID: <mailman.222.1377455031.19984.python-list@python.org>

2013/8/25 David M. Cotter <me@davecotter.com>:
> i'm sorry this is so confusing, let me try to re-state the problem in as
> clear a way as i can.
>
> I have a C++ program, with very well tested unicode support.  All logging
> is done in utf8.  I have conversion routines that work flawlessly, so i can
> assure you there is nothing wrong with logging and unicode support in the
> underlying program.
>
> I am embedding python 2.7 into the program, and extending python with
> routines in my C++ program.
>
> I have a script, encoded in utf8, and *marked* as utf8 with this line:
>     # -*- coding: utf-8 -*-
>
> In that script, i have inline unicode text.  When I pass that text to my
> C++ program, the Python interpreter decides that these bytes are macRoman,
> and handily "converts" them to unicode.  To compensate, i must "convert"
> these "macRoman" characters encoded as utf8, back to macRoman, then
> "interpret" them as utf8.  In this way i can recover the original unicode.
>
> When i return a unicode string back to python, i must do the reverse so
> that Python gets back what it expects.
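
For what it's worth, the round trip described above looks like the usual
pattern of UTF-8 bytes being decoded with the wrong codec. A minimal sketch
in plain Python, with no embedding involved; the helper names are made up
for illustration, and "mac_roman" is the standard library codec name. It
only shows that the described "convert back to macRoman, then interpret as
utf8" step is lossless, not where the misreading happens in your program:

```python
# -*- coding: utf-8 -*-
# Runs on Python 2.7 and Python 3.3+.

def misread_as_macroman(text):
    """Encode text as UTF-8, then (wrongly) decode those bytes as MacRoman."""
    return text.encode("utf-8").decode("mac_roman")

def undo_misread(garbled):
    """Reverse the mistake: recover the raw bytes, then decode them as UTF-8."""
    return garbled.encode("mac_roman").decode("utf-8")

for original in (u"frøânçïé", u"控件"):
    garbled = misread_as_macroman(original)
    # The garbled text differs from the original, but nothing is lost:
    assert garbled != original
    assert undo_misread(garbled) == original
```
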
>
> This is not related to printing, or sys.stdout, it does happen with that
> too but focusing on that is a red-herring.  Let's focus on just passing a
> string into C++ then back out.
>
> This would all actually make sense IF my script was marked as being
> "macRoman" even tho i entered UTF8 Characters, but that is not the case.
>
> Let's prove my statements.  Here is the script, *interpreted* as MacRoman:
> http://karaoke.kjams.com/screenshots/bugs/python_unicode/script_as_macroman.png
>
> and here it is again *interpreted* as utf8:
> http://karaoke.kjams.com/screenshots/bugs/python_unicode/script_as_utf8.png
>
> here is the string conversion code:
>
> SuperString             ScPyObject::GetAs_String()
> {
>         SuperString             str;    //      underlying format of SuperString is unicode
>
>         if (PyUnicode_Check(i_objP)) {
>                 ScPyObject              utf8Str(PyUnicode_AsUTF8String(i_objP));
>
>                 str = utf8Str.GetAs_String();
>         } else {
>                 const UTF8Char          *bytes_to_interpetZ = uc(PyString_AsString(i_objP));
>
>                 //      the "Set" call *interprets*, does not *convert*
>                 str.Set(bytes_to_interpetZ, kCFStringEncodingUTF8);
>
>                 //      str is now unicode characters which *represent* macRoman characters
>                 //      so *convert* these to actual macRoman
>
>                 //      fyi: Update_utf8 means "convert to this encoding and
>                 //      store the resulting bytes in the variable named "utf8"
>                 str.Update_utf8(kCFStringEncodingMacRoman);
>
>                 //      str is now unicode characters converted from macRoman
>                 //      so *reinterpret* them as UTF8
>
>                 //      FYI, we're just taking the pure bytes that are stored in the utf8 variable
>                 //      and *interpreting* them to this encoding
>                 bytes_to_interpetZ = str.utf8().c_str();
>
>                 str.Set(bytes_to_interpetZ, kCFStringEncodingUTF8);
>         }
>
>         return str;
> }
>
> PyObject*       PyString_FromString(const SuperString& str)
> {
>         SuperString                     localStr(str);
>
>         //      localStr is the real, actual unicode string
>         //      but we must *interpret* it as macRoman, then take these "macRoman" characters
>         //      and "convert" them to unicode for Python to "get it"
>         const UTF8Char          *bytes_to_interpetZ = localStr.utf8().c_str();
>
>         //      take the utf8 bytes (actual utf8 representation of string)
>         //      and say "no, these bytes are macRoman"
>         localStr.Set(bytes_to_interpetZ, kCFStringEncodingMacRoman);
>
>         //      okay so now we have unicode of MacRoman characters (!?)
>         //      return the underlying utf8 bytes of THAT as our string
>         return PyString_FromString(localStr.utf8Z());
> }
>
> And here is the results from running the script:
>    18: ---------------
>    18: Original string: frøânçïé
>    18: converting...
>    18: it worked: frøânçïé
>    18: ---------------
>    18: ---------------
>    18: Original string: 控件
>    18: converting...
>    18: it worked: 控件
>    18: ---------------
>
> Now the thing that absolutely utterly baffles me (if i'm not baffled
> enough) is that i get the EXACT same results on both Mac and Windows.  Why
> do they both insist on interpreting my script's bytes as MacRoman?
> --
> http://mail.python.org/mailman/listinfo/python-list

Hi,
unfortunately, I don't have experience with embedding Python in C++,
but the Python part (for Python 2) seems to be missing the u prefix on
the unicode literals, e.g.
u"frøânçïé"
Is the C++ part prepared for a Python unicode object, or does it require
a UTF-8 encoded string (or the respective bytes)?
Would
oldstr.encode("utf-8")
in the call make a difference?
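
To illustrate the distinction I mean, here is a small sketch (it should run
on Python 2.7 and 3.3+; the names text, encoded and describe are only for
illustration, and I am assuming the source file is saved as UTF-8 as you
describe). With the u prefix the literal is text; .encode("utf-8") gives the
bytes that a routine expecting UTF-8 would presumably want:

```python
# -*- coding: utf-8 -*-
# Runs on Python 2.7 and Python 3.3+.

text = u"frøânçïé"                # text: `unicode` on Python 2, `str` on Python 3
encoded = text.encode("utf-8")    # UTF-8 bytes: `str` on Python 2, `bytes` on Python 3

def describe(obj):
    """Return the type name and length, to tell text apart from bytes."""
    return "%s of length %d" % (type(obj).__name__, len(obj))

print(describe(text))     # length counts characters
print(describe(encoded))  # length counts bytes

# Decoding the bytes with the matching codec gives the text back.
assert encoded.decode("utf-8") == text
```

On Python 2, a plain "..." literal without the u prefix in a UTF-8 source
file holds the encoded bytes rather than the text, so which of the two
objects reaches the C++ side may well matter.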

regards,
   vbr