
can't get utf8 / unicode strings from embedded python

Started by	"David M. Cotter" <me@davecotter.com>
First post	2013-08-23 13:49 -0700
Last post	2013-08-28 10:46 -0700
Articles	20 — 9 participants


  can't get utf8 / unicode strings from embedded python "David M. Cotter" <me@davecotter.com> - 2013-08-23 13:49 -0700
    Re: can't get utf8 / unicode strings from embedded python Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-08-24 01:54 +0000
    Re: can't get utf8 / unicode strings from embedded python "David M. Cotter" <me@davecotter.com> - 2013-08-23 23:45 -0700
      Re: can't get utf8 / unicode strings from embedded python Dave Angel <davea@davea.name> - 2013-08-24 07:04 +0000
      Re: can't get utf8 / unicode strings from embedded python random832@fastmail.us - 2013-08-24 09:49 -0400
    Re: can't get utf8 / unicode strings from embedded python "David M. Cotter" <me@davecotter.com> - 2013-08-24 09:47 -0700
      Re: can't get utf8 / unicode strings from embedded python wxjmfauth@gmail.com - 2013-08-24 11:31 -0700
      Re: can't get utf8 / unicode strings from embedded python Benjamin Kaplan <benjamin.kaplan@case.edu> - 2013-08-24 12:45 -0700
      Re: can't get utf8 / unicode strings from embedded python random832@fastmail.us - 2013-08-24 20:01 -0400
    Re: can't get utf8 / unicode strings from embedded python "David M. Cotter" <me@davecotter.com> - 2013-08-25 10:57 -0700
      Re: can't get utf8 / unicode strings from embedded python Vlastimil Brom <vlastimil.brom@gmail.com> - 2013-08-25 20:23 +0200
      Re: can't get utf8 / unicode strings from embedded python Terry Reedy <tjreedy@udel.edu> - 2013-08-25 14:59 -0400
    Re: can't get utf8 / unicode strings from embedded python "David M. Cotter" <me@davecotter.com> - 2013-08-25 15:25 -0700
    Re: can't get utf8 / unicode strings from embedded python "David M. Cotter" <me@davecotter.com> - 2013-08-25 15:32 -0700
      Re: can't get utf8 / unicode strings from embedded python MRAB <python@mrabarnett.plus.com> - 2013-08-26 01:30 +0100
        Re: can't get utf8 / unicode strings from embedded python "David M. Cotter" <me@davecotter.com> - 2013-08-27 15:21 -0700
          Re: can't get utf8 / unicode strings from embedded python Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-08-27 23:24 +0000
            Re: can't get utf8 / unicode strings from embedded python "David M. Cotter" <me@davecotter.com> - 2013-08-27 22:57 -0700
              Re: can't get utf8 / unicode strings from embedded python Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-08-28 12:03 +0000
    Re: can't get utf8 / unicode strings from embedded python "David M. Cotter" <me@davecotter.com> - 2013-08-28 10:46 -0700

#52913 — can't get utf8 / unicode strings from embedded python

From	"David M. Cotter" <me@davecotter.com>
Date	2013-08-23 13:49 -0700
Subject	can't get utf8 / unicode strings from embedded python
Message-ID	<fbeee40a-bc8a-4cef-abe7-2b2d54f59625@googlegroups.com>

note everything works great if i use Ascii, but:

in my utf8-encoded script i have this:

>	print "frøânçïé"

in my embedded C++ i have this:

PyObject*	CPython_Script::print(PyObject *args)
{
	PyObject		*resultObjP	= NULL;
	const char		*utf8_strZ	= NULL;
	
	if (PyArg_ParseTuple(args, "s", &utf8_strZ)) {
		Log(utf8_strZ, false);

		resultObjP = Py_None;
		Py_INCREF(resultObjP);
	}
	
	return resultObjP;
}

Now, i know that my Log() can print utf8 (has for years, very well debugged)

but what it *actually* prints is this:

>	print "frøânçïé"
--> fr√∏√¢n√ß√Ø√©

another method i use looks like this:
>	kj_commands.menu("控件", "同步滑帧", "全局无滑帧")
or
>	kj_commands.menu(u"控件", u"同步滑帧", u"全局无滑帧")

and in my C++ i have:

SuperString		ScPyObject::GetAs_String()
{
	SuperString		str;
	
	if (PyUnicode_Check(i_objP)) {
		#if 1
		//	method 1
		{
			ScPyObject		utf8Str(PyUnicode_AsUTF8String(i_objP));
			
			str = utf8Str.GetAs_String();
		}
		#elif 0
		//	method 2
		{
			UTF8Char		*uniZ = (UTF8Char *)PyUnicode_AS_UNICODE(i_objP);
		
			str.assign(&uniZ[0], &uniZ[PyUnicode_GET_DATA_SIZE(i_objP)], kCFStringEncodingUTF16);
		}
		#else
		//	method 3
		{
			UTF32Vec			charVec(32768); CF_ASSERT(sizeof(UTF32Vec::value_type) == sizeof(wchar_t));
			PyUnicodeObject		*uniObjP = (PyUnicodeObject *)(i_objP);
			Py_ssize_t			sizeL(PyUnicode_AsWideChar(uniObjP, (wchar_t *)&charVec[0], charVec.size()));
			
			charVec.resize(sizeL);
			charVec.push_back(0);
			str.Set(SuperString(&charVec[0]));
		}
		#endif
	} else {
		str.Set(uc(PyString_AsString(i_objP)));
	}
	
	Log(str.utf8Z());
	
	return str;
}


for the string, "控件", i get:
--> Êéß‰ª∂

for the *unicode* string, u"控件", Methods 1, 2, and 3, i get the same thing:
--> Êéß‰ª∂

okay so what am i doing wrong???


#52921

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-08-24 01:54 +0000
Message-ID	<52181238$0$29986$c3e8da3$5496439d@news.astraweb.com>
In reply to	#52913

On Fri, 23 Aug 2013 13:49:23 -0700, David M. Cotter wrote:

> note everything works great if i use Ascii, but:
> 
> in my utf8-encoded script i have this:
> 
>>	print "frøânçïé"

I see you are using Python 2, in which case there are probably two or 
three errors being made here.

Firstly, in Python 2, the compiler assumes that the source code is 
encoded in ASCII, actually ASCII plus arbitrary bytes. Since your source 
code is *actually* UTF-8, the bytes in the file are:

70 72 69 6E 74 20 22 66 72 C3 B8 C3 A2 6E C3 A7 C3 AF C3 A9 22
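The byte listing above can be checked with a minimal sketch. It is written for Python 3 (not the Python 2.7 under discussion), the variable names are illustrative only, and the accented letters are spelled as escapes so the result does not depend on how the snippet itself is saved:

```python
# Python 3 sketch: reproduce the byte listing for the source line.
# Escapes are used so the snippet does not depend on its own file encoding.
source_line = 'print "fr\u00f8\u00e2n\u00e7\u00ef\u00e9"'

# Encode the line as UTF-8, the encoding the script file was saved in.
utf8_bytes = source_line.encode("utf-8")

# Format every byte as two upper-case hex digits, separated by spaces.
hex_dump = " ".join(f"{b:02X}" for b in utf8_bytes)
print(hex_dump)
```

Each accented letter shows up as a two-byte sequence starting with C3, which is why a byte-oriented reader sees more bytes than there are characters.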

But Python doesn't know the file is encoded in UTF-8, it thinks it is 
reading ASCII plus junk, so when it reads the file it parses those bytes 
into a line of code:

print "~~~~~"

where the ~~~~~ represents a bunch of 13 rubbish junk bytes. So that's 
the first problem to fix. You can fix this by adding an encoding cookie 
at the beginning of your module, in the first or second line:

# -*- coding: utf-8 -*-

The second problem is that even once you've fixed the source encoding, 
you're still not dealing with a proper Unicode string. In Python 2, you 
need to use u" ... " delimiters for Unicode, otherwise the results you 
get are completely arbitrary and depend on the encoding of your terminal. 
For example, if I set my terminal encoding to IBM-850, I get:

fr°Ônþ´Ú

from those bytes. If I set it to South European ISO-8859-3 I get this:

frĝânçïé

Clearly not what I intended. So change the line of code to:

print u"frøânçïé"

Those two changes ought to fix the problem, but if they don't, try 
setting your terminal encoding to UTF-8 as well and see if that helps.
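The terminal-encoding examples above can be reproduced with a small Python 3 sketch (an illustration under assumptions, not the code that was run for this post). One observation is mine rather than the post's: the two outputs quoted above match the Latin-1 form of the string (one byte per accented letter), not the UTF-8 bytes listed earlier, so the sketch decodes the Latin-1 bytes for those two cases and shows the UTF-8 case separately.

```python
# Python 3 sketch: the same bytes look different under different codecs.
text = "fr\u00f8\u00e2n\u00e7\u00ef\u00e9"

latin1_bytes = text.encode("latin-1")  # one byte per accented letter
utf8_bytes = text.encode("utf-8")      # two bytes per accented letter

# Decode the Latin-1 bytes the way a cp850 or ISO-8859-3 display would.
as_cp850 = latin1_bytes.decode("cp850")
as_latin3 = latin1_bytes.decode("iso8859_3")

# The UTF-8 bytes give yet another result under cp850.
utf8_as_cp850 = utf8_bytes.decode("cp850")

for label, shown in [
    ("latin-1 bytes as cp850", as_cp850),
    ("latin-1 bytes as iso8859_3", as_latin3),
    ("utf-8 bytes as cp850", utf8_as_cp850),
]:
    print(label, "->", shown)
```

In every case the bytes are unchanged; only the table used to turn them into characters differs.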

[...]
> but what it *actually* prints is this:
> 
>>	print "frøânçïé"
> --> fr√∏√¢n√ß√Ø√©

It's hard to say what *exactly* is happening here, because you don't 
explain how the python print statement somehow gets into your C++ Log 
code. Do I guess right that it catches stdout?

If so, then what I expect is happening is that Python has read in the 
source code of

print "~~~~~"

with ~~~~~ as a bunch of junk bytes, and then your terminal is displaying 
those junk bytes according to whatever encoding it happens to be using. 
Since you are seeing this:

fr√∏√¢n√ß√Ø√©

my guess is that you're using a Mac, and the encoding is set to the 
MacRoman encoding. Am I close?
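The MacRoman guess can be tested directly. A minimal Python 3 sketch, assuming the codec name "mac_roman" from the standard library and illustrative variable names: encode the intended text as UTF-8, then decode those bytes as MacRoman.

```python
# Python 3 sketch: UTF-8 bytes misread as MacRoman.
latin_text = "fr\u00f8\u00e2n\u00e7\u00ef\u00e9"
chinese_text = "\u63a7\u4ef6"

def misread_as_mac_roman(text):
    """Encode as UTF-8, then decode the bytes as if they were MacRoman."""
    return text.encode("utf-8").decode("mac_roman")

garbled_latin = misread_as_mac_roman(latin_text)
garbled_chinese = misread_as_mac_roman(chinese_text)

print(garbled_latin)
print(garbled_chinese)
```

The two printed strings are the same garbled forms reported in the original post, which supports the reading that the bytes are correct UTF-8 and only their interpretation is wrong.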

To summarise:

* Add an encoding cookie, to tell Python to use UTF-8 when parsing your 
source file.

* Use a Unicode string u"frøânçïé".

* Consider setting your terminal to use UTF-8, otherwise it may not be 
able to print all the characters you would like.

* You may need to change the way data gets into your C++ Log function. If 
it expects bytes, you may need to use u"...".encode('utf-8') rather than 
just u"...". But since I don't understand how data is getting into your 
Log function, I can't be sure about this.

I think that is everything. Does that fix your problem?

-- 
Steven


#52927

From	"David M. Cotter" <me@davecotter.com>
Date	2013-08-23 23:45 -0700
Message-ID	<d16f50e3-3e8b-428c-8f84-99bdbf6d73eb@googlegroups.com>
In reply to	#52913

> I see you are using Python 2
correct

> Firstly, in Python 2, the compiler assumes that the source code is encoded in ASCII
gar, i must have been looking at doc for v3, as i thought it was all assumed to be utf8

> # -*- coding: utf-8 -*- 
okay, did that, still no change

> you need to use u" ... " delimiters for Unicode, otherwise the results you get are completely arbitrary and depend on the encoding of your terminal. 
okay, well, i'm on a mac, and not using "terminal" at all.  but if i were, it would be utf8
but it's still not flying :(

> For example, if I set my terminal encoding to IBM-850
okay how do you even do that?  this is not an interactive session, this is embedded python, within a C++ app, so there's no terminal.  

but that is a good question: all the docs say "default encoding" everywhere (as in "If string is a Unicode object, this function computes the default encoding of string and operates on that"), but fail to specify just HOW i can set the default encoding.  if i could just say "hey, default encoding is utf8", i think i'd be done?

> So change the line of code to: 
> print u"frøânçïé" 
okay, sure... 
but i get the exact same results

> Those two changes ought to fix the problem, but if they don't, try setting your terminal encoding to UTF-8 as well
well, i'm not sure what you mean by that.  i don't have a terminal here.
i'm logging to a utf8 log file (when i print)


> but what it *actually* prints is this: 
> 
>        print "frøânçïé" 
> --> fr√∏√¢n√ß√Ø√© 

>It's hard to say what *exactly* is happening here, because you don't explain how the python print statement somehow gets into your C++ Log code. Do I guess right that it catches stdout?
yes, i'm redirecting stdout to my own custom print class, and then from that function i call into my embedded C++ print function

>If so, then what I expect is happening is that Python has read in the source code of 

>print "~~~~~" 

>with ~~~~~ as a bunch of junk bytes, and then your terminal is displaying those junk bytes according to whatever encoding it happens to be using. 
>Since you are seeing this: 

>fr√∏√¢n√ß√Ø√© 

>my guess is that you're using a Mac, and the encoding is set to the MacRoman encoding. Am I close?
you hit the nail on the head there, i think.  using that as a hint, i took this text "fr√∏√¢n√ß√Ø√©" and pasted that into a "macRoman" document, then *reinterpreted* it as UTF8, and voilà: "frøânçïé"

so, it seems that i AM getting my utf8 bytes, but i'm getting them converted to macRoman.  huh?  where is macRoman specified, and how do i change that to utf8?  i think that's the missing golden ticket
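The paste-and-reinterpret experiment described above can be written as a short Python 3 sketch. The function name is hypothetical and "mac_roman" is the standard library codec name; this only reverses the misreading, it does not say where the misreading happens.

```python
# Python 3 sketch: undo a "UTF-8 read as MacRoman" misinterpretation.
def repair_mac_roman_mojibake(garbled):
    """Turn the garbled characters back into bytes, then read them as UTF-8."""
    return garbled.encode("mac_roman").decode("utf-8")

# The garbled strings from the log, spelled as escapes.
garbled_latin = "fr\u221a\u220f\u221a\u00a2n\u221a\u00df\u221a\u00d8\u221a\u00a9"
garbled_chinese = "\u00ca\u00e9\u00df\u2030\u00aa\u2202"

repaired_latin = repair_mac_roman_mojibake(garbled_latin)
repaired_chinese = repair_mac_roman_mojibake(garbled_chinese)

print(repaired_latin)
print(repaired_chinese)
```

Because the repair is lossless here, the original bytes must have survived intact up to the point where something decoded them with the wrong table.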


#52928

From	Dave Angel <davea@davea.name>
Date	2013-08-24 07:04 +0000
Message-ID	<mailman.188.1377327902.19984.python-list@python.org>
In reply to	#52927

David M. Cotter wrote:
> Steven wrote:
>> I see you are using Python 2
> correct
>
>>It's hard to say what *exactly* is happening here, because you don't explain how the python print statement somehow gets into your C++ Log code. Do I guess right that it catches stdout?
> yes, i'm redirecting stdout to my own custom print class, and then from that function i call into my embedded C++ print function
>

I don't know much about embedding Python, but each file object has an
encoding property.

Why not examine sys.stdout.encoding? And change it to "UTF-8"?

import sys
print "encoding is", sys.stdout.encoding

sys.stdout.encoding = "UTF-8"

-- 
DaveA


#52940

From	random832@fastmail.us
Date	2013-08-24 09:49 -0400
Message-ID	<mailman.195.1377352173.19984.python-list@python.org>
In reply to	#52927

On Sat, Aug 24, 2013, at 2:45, David M. Cotter wrote:
> > you need to use u" ... " delimiters for Unicode, otherwise the results you get are completely arbitrary and depend on the encoding of your terminal. 
> okay, well, i'm on a mac, and not using "terminal" at all.  but if i
> were, it would be utf8
> but it's still not flying :(

> so, it seems that i AM getting my utf8 bytes, but i'm getting them
> converted to macRoman.  huh?  where is macRoman specified, and how do i
> change that to utf8?  i think that's the missing golden ticket

You say you're not using terminal. What _are_ you using?


#52943

From	"David M. Cotter" <me@davecotter.com>
Date	2013-08-24 09:47 -0700
Message-ID	<d3e52d5b-84c9-4cb4-84bf-cbdd886425b1@googlegroups.com>
In reply to	#52913

> What _are_ you using? 
i have scripts in a file, that i am invoking into my embedded python within a C++ program.  there is no terminal involved.  the "print" statement has been redirected (via sys.stdout) to my custom print class, which does not specify "encoding", so i tried the suggestion above to set it:

static const char *s_RedirectScript = 
	"import " kEmbeddedModuleName "\n"
	"import sys\n"
	"\n"
	"class CustomPrintClass:\n"
	"	def write(self, stuff):\n"
	"		" kEmbeddedModuleName "." kCustomPrint "(stuff)\n"
	"class CustomErrClass:\n"
	"	def write(self, stuff):\n"
	"		" kEmbeddedModuleName "." kCustomErr "(stuff)\n"
	"sys.stdout = CustomPrintClass()\n"
	"sys.stderr = CustomErrClass()\n"
	"sys.stdout.encoding = 'UTF-8'\n"
	"sys.stderr.encoding = 'UTF-8'\n";


but it didn't help.

I'm still getting back a string that is a utf-8 string of characters that, if converted to "macRoman" and then interpreted as UTF8, shows the original, correct string.  who is specifying macRoman, and where, and how do i tell whoever that is that i really *really* want utf8?


#52945

From	wxjmfauth@gmail.com
Date	2013-08-24 11:31 -0700
Message-ID	<3132f4c4-9a26-448b-b821-1c119a83ccd8@googlegroups.com>
In reply to	#52943

On Saturday, 24 August 2013 18:47:19 UTC+2, David M. Cotter wrote:
> > What _are_ you using? 
> I'm still getting back a string that is a utf-8 string of characters that, if converted to "macRoman" and then interpreted as UTF8, shows the original, correct string.  who is specifying macRoman, and where, and how do i tell whoever that is that i really *really* want utf8?

--------

Always encode a "unicode" into the coding of the "system"
which will host it.

Adapting the hosting system to your "unicode" (encoded
unicode) is not a valid solution. It makes no sense.

Setting sys.std***.encoding does nothing. These attributes only
give you information about the coding of the hosting system.

The "system" can be anything: a db, a terminal, a gui, ...

In short, your "writer" should encode your "stuff"
for your "host" in an adequate way. It is up to you to
manage coherence. If your passive "writer" supports only one
coding, adapt "stuff"; if "stuff" lives in its own coding
(due to C++?), adapt your "writer".
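A minimal Python 3 sketch of a "writer" that encodes for its host. The class name and the sink callable are hypothetical stand-ins for the redirect class and the embedded C++ print function discussed in this thread, and the host is assumed to want UTF-8 bytes.

```python
# Python 3 sketch: a file-like writer that always hands UTF-8 bytes to its host.
class Utf8Writer:
    """Encode text to UTF-8 before passing it to a byte-oriented sink."""

    def __init__(self, sink):
        self._sink = sink  # any callable accepting bytes (hypothetical host hook)

    def write(self, text):
        # Text is encoded explicitly; bytes are passed through unchanged.
        data = text.encode("utf-8") if isinstance(text, str) else bytes(text)
        self._sink(data)
        return len(text)

    def flush(self):
        pass  # nothing is buffered in this sketch


received = []                          # stands in for the host's log
writer = Utf8Writer(received.append)
print("fr\u00f8\u00e2n\u00e7\u00ef\u00e9", file=writer)
payload = b"".join(received)
print(payload)
```

The design choice is that the encoding decision lives in one place, the writer, rather than being implied by an attribute such as sys.stdout.encoding.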



Example from my interactive interpreter. It is in Python 3,
not important, basically the job is the same in Python 2.
This interpreter has the capability to support many codings,
and the coding of this host system can be changed on the
fly.

A commented session.

By default, a string, type str, is a unicode. The
host accepts "unicode". So, by default the sys.stdout
coding is '<unicode>'. 

>>> sys.stdout.encoding = '<unicode>'
>>> print("frøânçïé")
frøânçïé
>>> 

Setting the host to utf-8 and printing the above string gives
"something", but encoding into utf-8 works fine.

>>> sys.stdout.encoding = 'utf-8'
>>> sys.stdout.encoding
'utf-8'
>>> print("frøânçïé")
frÃ¸Ã¢nÃ§Ã¯Ã©
>>> print("frøânçïé".encode('utf-8'))
'frøânçïé'

Setting the host to 'mac-roman' works fine too,
as long it is properly encoded!

>>> sys.stdout.encoding = 'mac-roman'
>>> print("frøânçïé".encode('mac-roman'))
'frøânçïé'

But

>>> print("frøânçïé".encode('utf-8'))
'fr√∏√¢n√ß√Ø√©'

Ditto for cp850

>>> sys.stdout.encoding = 'cp850'
>>> print("frøânçïé".encode('cp850'))
'frøânçïé'

If the repertoire of characters of a coding scheme does not
contain the characters -> replace

>>> sys.stdout.encoding = 'cp437'
>>> print("frøânçïé".encode('cp437'))
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
  File "c:\python32\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\xf8' in position 2: character maps to <undefined>
>>> print("frøânçïé".encode('cp437', 'replace'))
'fr?ânçïé'


Curiosities

>>> sys.stdout.encoding = 'utf-16-be'
>>> print("frøânçïé")
 f r ø â n ç ï é 
>>> print("frøânçïé".encode('utf-16-be'))
'frøânçïé' 
>>> sys.stdout.encoding = 'utf-32-be'
>>> print("frøânçïé".encode('utf-32-be'))
'frøânçïé'


jmf


#52947

From	Benjamin Kaplan <benjamin.kaplan@case.edu>
Date	2013-08-24 12:45 -0700
Message-ID	<mailman.200.1377373941.19984.python-list@python.org>
In reply to	#52943

On Sat, Aug 24, 2013 at 9:47 AM, David M. Cotter <me@davecotter.com> wrote:
>
> > What _are_ you using?
> I'm still getting back a string that is a utf-8 string of characters that, if converted to "macRoman" and then interpreted as UTF8, shows the original, correct string.  who is specifying macRoman, and where, and how do i tell whoever that is that i really *really* want utf8?

If you're running this from a C++ program, then you aren't getting
back characters. You're getting back bytes. If you treat them as
UTF-8, they'll work properly. The only thing wrong is the text editor
you're using to open the file afterwards- since you aren't specifying
an encoding, it's assuming MacRoman. You can try putting the UTF-8 BOM
(it's not really a BOM) at the front of the file- the bytes 0xEF 0xBB
0xBF are used by some editors to identify a file as UTF-8.
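A hedged Python 3 sketch of the BOM suggestion, assuming the log is an ordinary file written from Python; the temporary file stands in for the real log, whose path is not given in the thread. The standard "utf-8-sig" codec writes the three signature bytes on output and strips them on input.

```python
# Python 3 sketch: write a log file that starts with the UTF-8 signature bytes.
import codecs
import os
import tempfile

text = "fr\u00f8\u00e2n\u00e7\u00ef\u00e9\n"

# A temporary file stands in for the real log file.
fd, path = tempfile.mkstemp(suffix=".log")
os.close(fd)
try:
    # "utf-8-sig" prepends EF BB BF when writing.
    with open(path, "w", encoding="utf-8-sig", newline="") as log_file:
        log_file.write(text)

    with open(path, "rb") as raw:
        raw_bytes = raw.read()

    # Reading with the same codec removes the signature again.
    with open(path, "r", encoding="utf-8-sig", newline="") as log_file:
        round_tripped = log_file.read()
finally:
    os.remove(path)

print(raw_bytes.startswith(codecs.BOM_UTF8))
```

Whether a given editor honours the signature is editor-specific, so this only helps viewers that look for it.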


#52958

From	random832@fastmail.us
Date	2013-08-24 20:01 -0400
Message-ID	<mailman.204.1377388898.19984.python-list@python.org>
In reply to	#52943

On Sat, Aug 24, 2013, at 12:47, David M. Cotter wrote:
> > What _are_ you using? 
> i have scripts in a file, that i am invoking into my embedded python
> within a C++ program.  there is no terminal involved.  the "print"
> statement has been redirected (via sys.stdout) to my custom print class,
> which does not specify "encoding", so i tried the suggestion above to set
> it:

That doesn't answer my real question. What does your "custom print
class" do with the text?


#52980

From	"David M. Cotter" <me@davecotter.com>
Date	2013-08-25 10:57 -0700
Message-ID	<d6250c5d-ff7d-46ae-9e0a-1c51a6e9b7dc@googlegroups.com>
In reply to	#52913

i'm sorry this is so confusing, let me try to re-state the problem in as clear a way as i can.

I have a C++ program, with very well tested unicode support.  All logging is done in utf8.  I have conversion routines that work flawlessly, so i can assure you there is nothing wrong with logging and unicode support in the underlying program.

I am embedding python 2.7 into the program, and extending python with routines in my C++ program.

I have a script, encoded in utf8, and *marked* as utf8 with this line:
    # -*- coding: utf-8 -*- 

In that script, i have inline unicode text.  When I pass that text to my C++ program, the Python interpreter decides that these bytes are macRoman, and handily "converts" them to unicode.  To compensate, i must "convert" these "macRoman" characters encoded as utf8, back to macRoman, then "interpret" them as utf8.  In this way i can recover the original unicode.

When i return a unicode string back to python, i must do the reverse so that Python gets back what it expects.

This is not related to printing, or sys.stdout; it does happen with that too, but focusing on that is a red herring.  Let's focus on just passing a string into C++ then back out.

This would all actually make sense IF my script was marked as being "macRoman" even tho i entered UTF8 Characters, but that is not the case.

Let's prove my statements.  Here is the script, *interpreted* as MacRoman:
http://karaoke.kjams.com/screenshots/bugs/python_unicode/script_as_macroman.png

and here it is again *interpreted* as utf8:
http://karaoke.kjams.com/screenshots/bugs/python_unicode/script_as_utf8.png

here is the string conversion code:

SuperString		ScPyObject::GetAs_String()
{
	SuperString		str;	//	underlying format of SuperString is unicode
	
	if (PyUnicode_Check(i_objP)) {
		ScPyObject		utf8Str(PyUnicode_AsUTF8String(i_objP));
		
		str = utf8Str.GetAs_String();
	} else {
		const UTF8Char		*bytes_to_interpetZ = uc(PyString_AsString(i_objP));
		
		//	the "Set" call *interprets*, does not *convert*
		str.Set(bytes_to_interpetZ, kCFStringEncodingUTF8);
		
		//	str is now unicode characters which *represent* macRoman characters
		//	so *convert* these to actual macRoman 
		
		//	fyi: Update_utf8 means "convert to this encoding and 
		//	store the resulting bytes in the variable named "utf8"
		str.Update_utf8(kCFStringEncodingMacRoman);	
		
		//	str is now unicode characters converted from macRoman
		//	so *reinterpret* them as UTF8
		
		//	FYI, we're just taking the pure bytes that are stored in the utf8 variable
		//	and *interpreting* them to this encoding
		bytes_to_interpetZ = str.utf8().c_str();
		
		str.Set(bytes_to_interpetZ, kCFStringEncodingUTF8);
	}
	
	return str;
}

PyObject*	PyString_FromString(const SuperString& str)
{
	SuperString			localStr(str);
	
	//	localStr is the real, actual unicode string
	//	but we must *interpret* it as macRoman, then take these "macRoman" characters
	//	and "convert" them to unicode for Python to "get it"
	const UTF8Char		*bytes_to_interpetZ = localStr.utf8().c_str();

	//	take the utf8 bytes (actual utf8 representation of string)
	//	and say "no, these bytes are macRoman"
	localStr.Set(bytes_to_interpetZ, kCFStringEncodingMacRoman);
	
	//	okay so now we have unicode of MacRoman characters (!?)
	//	return the underlying utf8 bytes of THAT as our string
	return PyString_FromString(localStr.utf8Z());
}

And here are the results from running the script:
   18: ---------------
   18: Original string: frøânçïé
   18: converting...
   18: it worked: frøânçïé
   18: ---------------
   18: ---------------
   18: Original string: 控件
   18: converting...
   18: it worked: 控件
   18: ---------------

Now the thing that absolutely utterly baffles me (if i'm not baffled enough) is that i get the EXACT same results on both Mac and Windows.  Why do they both insist on interpreting my script's bytes as MacRoman?


#52981

From	Vlastimil Brom <vlastimil.brom@gmail.com>
Date	2013-08-25 20:23 +0200
Message-ID	<mailman.222.1377455031.19984.python-list@python.org>
In reply to	#52980

2013/8/25 David M. Cotter <me@davecotter.com>:
> i'm sorry this is so confusing, let me try to re-state the problem in as clear a way as i can.
>
> I have a C++ program, with very well tested unicode support.  All logging is done in utf8.  I have conversion routines that work flawlessly, so i can assure you there is nothing wrong with logging and unicode support in the underlying program.
>
> I am embedding python 2.7 into the program, and extending python with routines in my C++ program.
>
> I have a script, encoded in utf8, and *marked* as utf8 with this line:
>     # -*- coding: utf-8 -*-
>
> In that script, i have inline unicode text.  When I pass that text to my C++ program, the Python interpreter decides that these bytes are macRoman, and handily "converts" them to unicode.  To compensate, i must "convert" these "macRoman" characters encoded as utf8, back to macRoman, then "interpret" them as utf8.  In this way i can recover the original unicode.

Hi,
unfortunately, I don't have experience with embedding python and C++,
but the python (for python 2) part seems to be missing the u prefix in
the unicode literals.
like
u"frøânçïé"
Is the c++ part prepared for python unicode object, or does it require
utf-8 encoded string (or the respective bytes)?
would
oldstr.encode("utf-8")
in the call make a difference?

regards,
   vbr


#52982

From	Terry Reedy <tjreedy@udel.edu>
Date	2013-08-25 14:59 -0400
Message-ID	<mailman.223.1377457194.19984.python-list@python.org>
In reply to	#52980

On 8/25/2013 1:57 PM, David M. Cotter wrote:
> i'm sorry this is so confusing, let me try to re-state the problem in as clear a way as i can.
>
> I have a C++ program, with very well tested unicode support.  All logging is done in utf8.  I have conversion routines that work flawlessly, so i can assure you there is nothing wrong with logging and unicode support in the underlying program.

> I am embedding python 2.7 into the program, and extending python with routines in my C++ program.

If you want 'well-tested' (correct) unicode support from Python, use 
3.3. Unicode in 2.x is somewhat buggy and definitely flaky. The first 
fix was to make unicode *the* text type, in 3.0. The second was to 
redesign the internals in 3.3. It is possible that 2.7 is too broken for 
what you want to do.

> I have a script, encoded in utf8, and *marked* as utf8 with this line:
>      # -*- coding: utf-8 -*-
>
> In that script, i have inline unicode text.

The example scripts that you posted pictures of do *not* have unicode 
text. They have bytestring literals with (encoded) non-ascii chars 
inside them. This is not a great idea. I am not sure what bytes you end 
up with. Apparently, not what you expect.

To make them 'unicode text', you must prepend the literals with 'u'. 
Didn't someone say this before?
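The difference between the two kinds of literal can be sketched in Python 3, where the types are spelled explicitly. This is an illustration under assumptions: b"..." plays the role of a Python 2 "..." literal in a UTF-8 source file, and a plain "..." plays the role of a Python 2 u"..." literal.

```python
# Python 3 sketch: byte string versus text string for the same visible word.
# Stand-in for the Python 2 byte literal "..." saved in a UTF-8 file.
byte_literal = b"fr\xc3\xb8\xc3\xa2n\xc3\xa7\xc3\xaf\xc3\xa9"

# Stand-in for the Python 2 unicode literal u"...".
text_literal = "fr\u00f8\u00e2n\u00e7\u00ef\u00e9"

# The byte string counts bytes, the text string counts characters.
print(len(byte_literal), len(text_literal))

# The two only match after an explicit decode with the right encoding.
decoded = byte_literal.decode("utf-8")
print(decoded == text_literal)
```

A C extension that receives the byte form gets no encoding information with it, so it has to be told, or has to assume, how to interpret the bytes.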

> When I pass that text to my C++ program, the Python interpreter decides that these bytes are macRoman, and handily "converts" them to unicode.  To compensate, i must "convert" these "macRoman" characters encoded as utf8, back to macRoman, then "interpret" them as utf8.  In this way i can recover the original unicode.
>
> When i return a unicode string back to python, i must do the reverse so that Python gets back what it expects.
>
> This is not related to printing, or sys.stdout, it does happen with that too but focusing on that is a red-herring.  Let's focus on just passing a string into C++ then back out.
>
> This would all actually make sense IF my script was marked as being "macRoman" even tho i entered UTF8 Characters, but that is not the case.
>
> Let's prove my statements.  Here is the script, *interpreted* as MacRoman:
> http://karaoke.kjams.com/screenshots/bugs/python_unicode/script_as_macroman.png

Why are you posting pictures of code, instead of the (runnable) code 
itself, as you did with C code?

-- 
Terry Jan Reedy

#52988

From	"David M. Cotter" <me@davecotter.com>
Date	2013-08-25 15:25 -0700
Message-ID	<323bb822-b967-412c-a6db-bc8f993e2227@googlegroups.com>
In reply to	#52913

fair enough.  I can provide further proof of strangeness.
here is my latest script:  this is saved on disk as a UTF8 encoded file, and when viewing as UTF8, it shows the correct characters.

==================
# -*- coding: utf-8 -*- 
import time, kjams, kjams_lib

def log_success(msg, successB, str):
	if successB:
		print msg + " worked: " + str
	else:
		print msg + " failed: " + str

def do_test(orig_str):
	cmd_enum = kjams.enum_cmds()
	
	print "---------------"
	print "Original string: " + orig_str
	print "converting..."

	oldstr = orig_str;
	newstr = kjams_lib.do_command(cmd_enum.kScriptCommand_Unicode_Test, oldstr)
	log_success("first", oldstr == newstr, newstr);
	
	oldstr = unicode(orig_str, "UTF-8")
	newstr = kjams_lib.do_command(cmd_enum.kScriptCommand_Unicode_Test, oldstr)
	newstr = unicode(newstr, "UTF-8")
	log_success("second", oldstr == newstr, newstr);
	
	oldstr = unicode(orig_str, "UTF-8")
	oldstr.encode("UTF-8")	# NOTE: the result is discarded, so oldstr is still a unicode object
	newstr = kjams_lib.do_command(cmd_enum.kScriptCommand_Unicode_Test, oldstr)
	newstr = unicode(newstr, "UTF-8")
	log_success("third", oldstr == newstr, newstr);

	print "---------------"
	
def main():
	do_test("frøânçïé")
	do_test("控件")

#-----------------------------------------------------
if __name__ == "__main__":
	main()

==================
and the latest results:

   20: ---------------
   20: Original string: frøânçïé
   20: converting...
   20: first worked: frøânçïé
   20: second worked: frøânçïé
   20: third worked: frøânçïé
   20: ---------------
   20: ---------------
   20: Original string: 控件
   20: converting...
   20: first worked: 控件
   20: second worked: 控件
   20: third worked: 控件
   20: ---------------

now, given the C++ source code, this should NOT work, given that i'm doing some crazy re-coding of the bytes.

so, you see, it does not matter whether i pass "unicode" strings or regular "strings", they all translate to the same, weird macroman.  

for completeness, here is the C++ code that the script calls:

===================
			case kScriptCommand_Unicode_Test: {
				pyArg = iterP.NextArg_OrSyntaxError();
				
				if (pyArg.get()) {
					SuperString str = pyArg.GetAs_String();
					
					resultObjP = PyString_FromString(str);
				}
				break;
			}

===================

#52989

From	"David M. Cotter" <me@davecotter.com>
Date	2013-08-25 15:32 -0700
Message-ID	<cf8eba75-045c-4fa2-abae-14b8cf02c915@googlegroups.com>
In reply to	#52913

i got it!!  OMG!  so sorry for the confusion, but i learned a lot, and i can share the result:

the CORRECT code *was* what i had assumed.  the Python side has always been correct (no need to put "u" in front of strings, it is known that the bytes are utf8 bytes)

it was my "run script" function which read in the file.  THAT was what was "reinterpreting" the utf8 bytes as macRoman (on both platforms).  correct code below:

SuperString		ScPyObject::GetAs_String()
{
	SuperString		str;
	
	if (PyUnicode_Check(i_objP)) {
		ScPyObject		utf8Str(PyUnicode_AsUTF8String(i_objP));
		
		str = utf8Str.GetAs_String();
	} else {
		//	calling "uc" on this means "assume this is utf8"
		str.Set(uc(PyString_AsString(i_objP)));
	}
	
	return str;
}

PyObject*	PyString_FromString(const SuperString& str)
{
	return PyString_FromString(str.utf8Z());
}

#52994

From	MRAB <python@mrabarnett.plus.com>
Date	2013-08-26 01:30 +0100
Message-ID	<mailman.231.1377477011.19984.python-list@python.org>
In reply to	#52989

On 25/08/2013 23:32, David M. Cotter wrote:
> i got it!!  OMG!  so sorry for the confusion, but i learned a lot,
> and i can share the result:
>
> the CORRECT code *was* what i had assumed.  the Python side has
> always been correct (no need to put "u" in front of strings, it is
> known that the bytes are utf8 bytes)
>
> it was my "run script" function which read in the file.  THAT was
> what was "reinterpreting" the utf8 bytes as macRoman (on both
> platforms).  correct code below:
>
When working with Unicode, what you should be doing is:

1. Specifying the encoding line in the special comment.

2. Setting the encoding of the source file.

3. Using Unicode string literals in the source file.

You're doing (1) and (2), but not (3).

If you want to pass UTF-8 to the C++, then encode the Unicode
string to bytes when you pass it. Using bytestring literals and relying
on the source file being UTF-8, like you're doing, is just asking for
trouble, as you've found out! :-)
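
The advice above (text inside the script, explicit UTF-8 bytes at the boundary) might look like the following sketch. `do_command` is a hypothetical stand-in for the real extension call, which is not available here; it is assumed to accept and return UTF-8 encoded bytes and simply echoes its argument:

```python
def do_command(payload):
    """Hypothetical stand-in for the C++ call: echoes the bytes it is given."""
    if not isinstance(payload, bytes):
        raise TypeError("expected UTF-8 encoded bytes")
    return payload

text = u"\u63a7\u4ef6"                     # unicode literal, as in point (3)
reply = do_command(text.encode("utf-8"))   # encode explicitly on the way out
result = reply.decode("utf-8")             # decode explicitly on the way back

assert result == text
```

With this arrangement the encoding is named at exactly two places in the code, instead of being implied by how the file happened to be saved.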

#53099

From	"David M. Cotter" <me@davecotter.com>
Date	2013-08-27 15:21 -0700
Message-ID	<8adbb9f0-5205-4cc7-8efe-7ea4b4e1b01c@googlegroups.com>
In reply to	#52994

i am already doing (3), and all is working perfectly.  bytestring literals are fine, i'm not sure what this trouble is that you speak of.

note that i'm not using PyRun_AnyFile(), i'm loading the script myself, assumed as utf8 (which was my original problem, i had assumed it was macRoman), then calling PyRun_SimpleString().  it works flawlessly now, on both mac and windows.
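
The loader bug described here can be reproduced outside the application. This is only a sketch in Python (the real loader is C++ and is not shown in the thread); it writes a one-line script as UTF-8 to a temporary file and reads it back twice, once with the right codec and once as MacRoman:

```python
import io
import os
import tempfile

# One line of script source containing the thread's sample word.
source = u'print("fr\u00f8\u00e2n\u00e7\u00ef\u00e9")\n'

fd, path = tempfile.mkstemp(suffix=".py")
os.close(fd)
try:
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(source)
    with io.open(path, "r", encoding="utf-8") as f:
        right = f.read()      # decoded with the codec the file was saved in
    with io.open(path, "r", encoding="mac_roman") as f:
        wrong = f.read()      # same bytes, wrong codec: no error, just garbage
finally:
    os.remove(path)

assert right == source
assert wrong != source
```

Note that the wrong decode raises no exception, which is why this kind of mistake tends to surface far from where it was made.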

#53103

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-08-27 23:24 +0000
Message-ID	<521d3535$0$29986$c3e8da3$5496439d@news.astraweb.com>
In reply to	#53099

On Tue, 27 Aug 2013 15:21:00 -0700, David M. Cotter wrote:

> i am already doing (3), and all is working perfectly.  bytestring
> literals are fine, i'm not sure what this trouble is that you speak of.

Neither is anyone else, because your post is completely devoid of any 
context. Who are you talking to?

Wait, let me see if I can peer into my crystal ball and see if the 
spirits tell me what you are talking about... I see a post... no, 
repeated posts, by many people, telling you not to embed Unicode 
characters in Python 2.x plain byte strings...

You know what? You obviously know so much more about Unicode and Python 
than the entire Python community, you must be right. There is no possible 
way that misusing byte strings in this manner could possibly go wrong. 
Since byte string literals containing Unicode data are "fine", it was 
clearly a complete waste of time to introduce Unicode strings in the 
first place.

Why bother using the official interface designed to work correctly with 
Unicode, when you can rely on an accident of implementation that just 
happens to work correctly in your environment but with no guarantee it will 
work correctly anywhere else? What could *possibly* go wrong by relying 
on code working by accident like this?

-- 
Steven

#53110

From	"David M. Cotter" <me@davecotter.com>
Date	2013-08-27 22:57 -0700
Message-ID	<ef554d17-75a6-4f85-9111-d12f51d0ab32@googlegroups.com>
In reply to	#53103

I am very sorry that I have offended you to such a degree you feel it necessary to publicly eviscerate me.

Perhaps I could have worded it like this:  "So far I have not seen any troubles including unicode characters in my strings, they *seem* to be fine for my use-case.  What kind of trouble has been seen with this by others?"

Really, I wonder why you are so angry at me for having made a mistake?  I'm going to guess that you don't have kids.

#53138

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-08-28 12:03 +0000
Message-ID	<521de712$0$6599$c3e8da3$5496439d@news.astraweb.com>
In reply to	#53110

On Tue, 27 Aug 2013 22:57:45 -0700, David M. Cotter wrote:

> I am very sorry that I have offended you to such a degree you feel it
> necessary to publicly eviscerate me.

You know David, you are right. I did over-react. And I apologise for 
that. I am sorry, I was excessively confrontational. (Although I think 
"eviscerate" is a bit strong.)

Putting aside my earlier sarcasm, the basic message remains the same: 
Python byte strings are not designed to work with Unicode characters, and 
if they do work, it is an accident, not defined behaviour.
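
One concrete way byte strings fail to behave like text, as a small sketch that should run on both Python 2.7 and 3.x. The sample is the two-character string "控件" from the test script, written with escapes:

```python
text = u"\u63a7\u4ef6"          # two characters
data = text.encode("utf-8")     # six bytes, three per character

# Slicing text always yields whole characters.
assert text[:1] == u"\u63a7"

# Slicing bytes can cut a character in half; the remainder is not valid UTF-8.
try:
    data[:4].decode("utf-8")
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False

print(len(text), len(data), truncated_ok)
```

The same applies to `len()`, indexing, case conversion and anything else that operates per element: on a byte string the element is a byte, not a character.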

> Perhaps I could have worded it like this:  "So far I have not seen any
> troubles including unicode characters in my strings, they *seem* to be
> fine for my use-case.  What kind of trouble has been seen with this by
> others?"

Exactly the same sort of trouble you were having earlier when you were 
inadvertently decoding the source file as MacRoman rather than UTF-8. 
Mojibake, garbage characters in your text, corrupted data.

http://en.wikipedia.org/wiki/Mojibake
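
For reference, the specific MacRoman mojibake from earlier in this thread can be reproduced in a few lines. This is a sketch using Python's built-in `mac_roman` codec, with a single accented letter as the sample:

```python
original = u"\u00e9"                        # e with acute accent
utf8_bytes = original.encode("utf-8")       # the two bytes 0xC3 0xA9

# Decoding with the wrong codec gives two unrelated characters, not an error.
garbled = utf8_bytes.decode("mac_roman")
assert garbled == u"\u221a\u00a9"           # square-root sign, copyright sign

# The "compensation" described earlier: back to MacRoman bytes, then read as UTF-8.
repaired = garbled.encode("mac_roman").decode("utf-8")
assert repaired == original
```

The repair only works because MacRoman assigns a character to every byte value; with a codec that has unassigned bytes, the first wrong decode can fail or lose data.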

The point is, you might not see these errors, because by accident all the 
relevant factors conspire to give you the correct result. You might test 
it on a Mac and on Windows and it all works well. You might even test it 
on a dozen different machines, and it works fine on all of them. But 
since you're relying on an accident of implementation, none of this is 
guaranteed. And then in eighteen months time, *something* changes -- a 
minor update to Python, a different version of Mac OS X, an unusual 
Registry setting in Windows, who knows what?, and all of a sudden the 
factors no longer line up to give you the correct results and it all 
comes tumbling down in a big stinking mess. If you are lucky you will get 
a nice clear exception telling you something is broken, but more likely 
you'll just get corrupted data and mojibake and you, or the poor guy who 
maintains the code after you, will have no idea why. And you'll probably 
come here asking for our help to solve it.

If you came back and said "I tried it with the u prefix, and it broke a 
bunch of other code, and I don't have time to fix it now so I'm reverting 
to the u-less byte string form" I wouldn't *like* it but I could *accept* 
it as one of those sub-optimal compromises people make in Real Life. I've 
done the same thing myself, we probably all have: written code we knew 
was broken, but fixing it was too hard or too low a priority.

> Really, I wonder why you are so angry at me for having made a mistake? 
> I'm going to guess that you don't have kids.

What do kids have to do with this? Are you an adult or a child? *wink*

You didn't offend me so much as frustrate me. You had multiple people 
telling you the same thing, don't embed Unicode characters in a byte 
string, but you chose to not just ignore them but effectively declare 
that they were all wrong to give that advice, not just the people here 
but essentially the entire Python development community responsible for 
adding Unicode strings to the language. Can you blame me for feeling that 
your reply seemed rather arrogant?

In any case, I'm glad you responded with a little more restraint than I 
did, and I hope you can see my point of view and hopefully I haven't 
soured you on this forum.

-- 
Steven

#53173

From	"David M. Cotter" <me@davecotter.com>
Date	2013-08-28 10:46 -0700
Message-ID	<68ed2759-6df9-456c-b49d-bc9805e29e76@googlegroups.com>
In reply to	#52913

Thank you for your thoughtful and thorough response.  I now understand much better what you (and apparently the others) were warning me against and I will certainly consider that moving forward.

I very much appreciate your help as I learn about python and embedding and all these crazy encoding problems.

> What do kids have to do with this?
When a person has children, they quickly learn the best way to deal with someone who seems not to be listening or is having a tantrum: show understanding and compassion, restraint and patience, as you, in the most neutral way that you can, gently but firmly guide said person back on track.  You learn that if you instead express your frustration at said person, it never, ever helps the situation, and only causes more hurt to be spread around to the very people you are ostensibly attempting to help.

> Are you an adult or a child?
Perhaps my comment was lost in translation, but this is rather the question that I was obliquely asking you.  *wink right back*

In any case I thank you for your help, which has in fact been quite great!  My demo script is working, and I know now to properly advise my script writers regarding how to properly encode strings.
