To: python-list@python.org
From: Terry Reedy <tjreedy@udel.edu>
Subject: Re: can't get utf8 / unicode strings from embedded python
Date: Sun, 25 Aug 2013 14:59:36 -0400
References: <fbeee40a-bc8a-4cef-abe7-2b2d54f59625@googlegroups.com> <d6250c5d-ff7d-46ae-9e0a-1c51a6e9b7dc@googlegroups.com>
In-Reply-To: <d6250c5d-ff7d-46ae-9e0a-1c51a6e9b7dc@googlegroups.com>
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.223.1377457194.19984.python-list@python.org>

On 8/25/2013 1:57 PM, David M. Cotter wrote:
> i'm sorry this is so confusing, let me try to re-state the problem in as clear a way as i can.
>
> I have a C++ program, with very well tested unicode support.  All logging is done in utf8.  I have conversion routines that work flawlessly, so i can assure you there is nothing wrong with logging and unicode support in the underlying program.

> I am embedding python 2.7 into the program, and extending python with routines in my C++ program.

If you want 'well-tested' (correct) unicode support from Python, use 
3.3. Unicode in 2.x is somewhat buggy and definitely flaky. The first 
fix was to make unicode *the* text type, in 3.0. The second was to 
redesign the internals in 3.3. It is possible that 2.7 is too broken for 
what you want to do.

> I have a script, encoded in utf8, and *marked* as utf8 with this line:
>      # -*- coding: utf-8 -*-
>
> In that script, i have inline unicode text.

The example scripts that you posted pictures of do *not* have unicode 
text. They have byte-string literals with (encoded) non-ASCII chars 
inside them. This is not a great idea. I am not sure what bytes you end 
up with. Apparently, not what you expect.

To make them 'unicode text', you must prefix the literals with 'u'. 
Didn't someone say this before?
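
A minimal sketch of the difference, not taken from your script: the 
sample word and the variable names are made up, and the b'...' literal 
stands in for a plain '...' literal in 2.7 on the assumption that it 
ends up holding UTF-8 bytes. Escapes are used so it should run the same 
on 2.7 and 3.3+:

```python
# -*- coding: utf-8 -*-
# Assumption: the byte string holds the UTF-8 encoding of the sample word.
byte_literal = b'caf\xc3\xa9'   # bytes: 5 items, the last two encode one character
text_literal = u'caf\u00e9'     # unicode text: 4 characters

print(len(byte_literal))
print(len(text_literal))

# Decoding the bytes with the right codec gives the same text as the u'' literal.
print(byte_literal.decode('utf-8') == text_literal)
```

Only the second literal is text. The first is a sequence of bytes that 
some later step has to decode, and that step has to guess the encoding.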

> When I pass that text to my C++ program, the Python interpreter decides that these bytes are macRoman, and handily "converts" them to unicode.  To compensate, i must "convert" these "macRoman" characters encoded as utf8, back to macRoman, then "interpret" them as utf8.  In this way i can recover the original unicode.
>
> When i return a unicode string back to python, i must do the reverse so that Python gets back what it expects.
>
> This is not related to printing, or sys.stdout, it does happen with that too but focusing on that is a red-herring.  Let's focus on just passing a string into C++ then back out.
>
> This would all actually make sense IF my script was marked as being "macRoman" even tho i entered UTF8 Characters, but that is not the case.
>
> Let's prove my statements.  Here is the script, *interpreted* as MacRoman:
> http://karaoke.kjams.com/screenshots/bugs/python_unicode/script_as_macroman.png

Why are you posting pictures of code, instead of the (runnable) code 
itself, as you did with C code?
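
Something as short as the following would do. It is only a sketch of 
the double conversion you describe, assuming the bytes are UTF-8 and get 
decoded as MacRoman somewhere along the way. The sample string and the 
names are hypothetical, and it does not touch your C++ side at all:

```python
# -*- coding: utf-8 -*-
original = u'caf\u00e9'                  # the text you meant
utf8_bytes = original.encode('utf-8')    # the bytes stored in a utf-8 script file

# Hypothetical wrong step: the UTF-8 bytes are interpreted as MacRoman.
mangled = utf8_bytes.decode('mac_roman')

# The compensation you describe: back to MacRoman bytes, then read as UTF-8.
recovered = mangled.encode('mac_roman').decode('utf-8')

print(repr(mangled))
print(recovered == original)
```

If your real code behaves like this, the question becomes which step 
does the MacRoman decoding, and text that anyone can run answers that 
faster than a screenshot.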

-- 
Terry Jan Reedy