MIME-Version: 1.0
In-Reply-To: <d3e52d5b-84c9-4cb4-84bf-cbdd886425b1@googlegroups.com>
References: <fbeee40a-bc8a-4cef-abe7-2b2d54f59625@googlegroups.com> <d3e52d5b-84c9-4cb4-84bf-cbdd886425b1@googlegroups.com>
From: Benjamin Kaplan <benjamin.kaplan@case.edu>
Date: Sat, 24 Aug 2013 12:45:37 -0700
Subject: Re: can't get utf8 / unicode strings from embedded python
To: "python-list@python.org" <python-list@python.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.200.1377373941.19984.python-list@python.org>

On Sat, Aug 24, 2013 at 9:47 AM, David M. Cotter <me@davecotter.com> wrote:
>
> > What _are_ you using?
> i have scripts in a file, that i am invoking into my embedded python
> within a C++ program.  there is no terminal involved.  the "print"
> statement has been redirected (via sys.stdout) to my custom print class,
> which does not specify "encoding", so i tried the suggestion above to
> set it:
>
> static const char *s_RedirectScript =
>         "import " kEmbeddedModuleName "\n"
>         "import sys\n"
>         "\n"
>         "class CustomPrintClass:\n"
>         "       def write(self, stuff):\n"
>         "               " kEmbeddedModuleName "." kCustomPrint "(stuff)\n"
>         "class CustomErrClass:\n"
>         "       def write(self, stuff):\n"
>         "               " kEmbeddedModuleName "." kCustomErr "(stuff)\n"
>         "sys.stdout = CustomPrintClass()\n"
>         "sys.stderr = CustomErrClass()\n"
>         "sys.stdout.encoding = 'UTF-8'\n"
>         "sys.stderr.encoding = 'UTF-8'\n";
>
>
> but it didn't help.
>
> I'm still getting back a string that is a utf-8 string of characters
> that, if converted to "macRoman" and then interpreted as UTF8, shows the
> original, correct string.  who is specifying macRoman, and where, and
> how do i tell whoever that is that i really *really* want utf8?

If you're running this from a C++ program, then you aren't getting
back characters. You're getting back bytes. If you treat them as
UTF-8, they'll work properly. The only thing wrong is the text editor
you're using to open the file afterwards: since you aren't specifying
an encoding, it's assuming MacRoman. You can try putting the UTF-8 BOM
at the front of the file (it's not really a byte order mark, since
UTF-8 has no byte order to signal). Some editors use the bytes
0xEF 0xBB 0xBF to identify a file as UTF-8.
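
To make that concrete, here is a small Python sketch. The sample string
and the variable names are my own, not anything from your program, and
it only imitates what a MacRoman-assuming viewer would do with the same
bytes. It also shows the signature bytes being added and then stripped
again.

```python
import codecs

# Hypothetical sample text: "café", with one non-ASCII character.
original = u"caf\u00e9"

# What crosses from Python into the C++ side: UTF-8 bytes, not characters.
raw = original.encode("utf-8")

# A viewer that assumes MacRoman turns the two bytes of the e-acute
# into two unrelated characters (a square root sign and a copyright sign).
misread = raw.decode("mac_roman")

# Undoing the wrong guess recovers the text, which suggests the bytes
# were fine all along and only their interpretation was off.
recovered = misread.encode("mac_roman").decode("utf-8")

# Prefix the UTF-8 signature so editors that look for it can pick UTF-8.
with_signature = codecs.BOM_UTF8 + raw

# The "utf-8-sig" codec drops the signature again when decoding.
roundtrip = with_signature.decode("utf-8-sig")
```

If the output ends up in a file, writing with_signature instead of raw
is one way to give the editor that hint. Whether a particular editor
honors it is something you would have to check.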