Re: can't get utf8 / unicode strings from embedded python

Date Sun, 25 Aug 2013 20:23:41 +0200
Subject Re: can't get utf8 / unicode strings from embedded python
From Vlastimil Brom <vlastimil.brom@gmail.com>
To "David M. Cotter" <me@davecotter.com>
Newsgroups comp.lang.python
Message-ID <mailman.222.1377455031.19984.python-list@python.org>


2013/8/25 David M. Cotter <me@davecotter.com>:
> i'm sorry this is so confusing, let me try to re-state the problem in as clear a way as i can.
>
> I have a C++ program, with very well tested unicode support.  All logging is done in utf8.  I have conversion routines that work flawlessly, so i can assure you there is nothing wrong with logging and unicode support in the underlying program.
>
> I am embedding python 2.7 into the program, and extending python with routines in my C++ program.
>
> I have a script, encoded in utf8, and *marked* as utf8 with this line:
>     # -*- coding: utf-8 -*-
>
> In that script, i have inline unicode text.  When I pass that text to my C++ program, the Python interpreter decides that these bytes are macRoman, and handily "converts" them to unicode.  To compensate, i must "convert" these "macRoman" characters encoded as utf8, back to macRoman, then "interpret" them as utf8.  In this way i can recover the original unicode.
>
> When i return a unicode string back to python, i must do the reverse so that Python gets back what it expects.
>
> This is not related to printing, or sys.stdout, it does happen with that too but focusing on that is a red-herring.  Let's focus on just passing a string into C++ then back out.
>
> This would all actually make sense IF my script was marked as being "macRoman" even tho i entered UTF8 Characters, but that is not the case.
>
> Let's prove my statements.  Here is the script, *interpreted* as MacRoman:
> http://karaoke.kjams.com/screenshots/bugs/python_unicode/script_as_macroman.png
>
> and here it is again *interpreted* as utf8:
> http://karaoke.kjams.com/screenshots/bugs/python_unicode/script_as_utf8.png
>
> here is the string conversion code:
>
> SuperString             ScPyObject::GetAs_String()
> {
>         SuperString             str;    //      underlying format of SuperString is unicode
>
>         if (PyUnicode_Check(i_objP)) {
>                 ScPyObject              utf8Str(PyUnicode_AsUTF8String(i_objP));
>
>                 str = utf8Str.GetAs_String();
>         } else {
>                 const UTF8Char          *bytes_to_interpetZ = uc(PyString_AsString(i_objP));
>
>                 //      the "Set" call *interprets*, does not *convert*
>                 str.Set(bytes_to_interpetZ, kCFStringEncodingUTF8);
>
>                 //      str is now unicode characters which *represent* macRoman characters
>                 //      so *convert* these to actual macRoman
>
>                 //      fyi: Update_utf8 means "convert to this encoding and
>                 //      store the resulting bytes in the variable named "utf8"
>                 str.Update_utf8(kCFStringEncodingMacRoman);
>
>                 //      str is now unicode characters converted from macRoman
>                 //      so *reinterpret* them as UTF8
>
>                 //      FYI, we're just taking the pure bytes that are stored in the utf8 variable
>                 //      and *interpreting* them to this encoding
>                 bytes_to_interpetZ = str.utf8().c_str();
>
>                 str.Set(bytes_to_interpetZ, kCFStringEncodingUTF8);
>         }
>
>         return str;
> }
>
> PyObject*       PyString_FromString(const SuperString& str)
> {
>         SuperString                     localStr(str);
>
>         //      localStr is the real, actual unicode string
>         //      but we must *interpret* it as macRoman, then take these "macRoman" characters
>         //      and "convert" them to unicode for Python to "get it"
>         const UTF8Char          *bytes_to_interpetZ = localStr.utf8().c_str();
>
>         //      take the utf8 bytes (actual utf8 representation of string)
>         //      and say "no, these bytes are macRoman"
>         localStr.Set(bytes_to_interpetZ, kCFStringEncodingMacRoman);
>
>         //      okay so now we have unicode of MacRoman characters (!?)
>         //      return the underlying utf8 bytes of THAT as our string
>         return PyString_FromString(localStr.utf8Z());
> }
>
> And here are the results from running the script:
>    18: ---------------
>    18: Original string: frøânçïé
>    18: converting...
>    18: it worked: frøânçïé
>    18: ---------------
>    18: ---------------
>    18: Original string: 控件
>    18: converting...
>    18: it worked: 控件
>    18: ---------------
>
> Now the thing that absolutely utterly baffles me (if i'm not baffled enough) is that i get the EXACT same results on both Mac and Windows.  Why do they both insist on interpreting my script's bytes as MacRoman?
> --
> http://mail.python.org/mailman/listinfo/python-list
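
[Editorial sketch] The round trip described in the quoted message (UTF-8 bytes being read as MacRoman, then undone by converting back to MacRoman and re-reading as UTF-8) can be reproduced in a few lines. This is only an illustration written for Python 3, not the poster's code; the function names `misread_as_mac_roman` and `recover` are hypothetical, and the test strings are written as escapes for the two strings shown in the log output.

```python
# Hypothetical helpers illustrating the double conversion; not from the thread.

def misread_as_mac_roman(text: str) -> str:
    """Encode text as UTF-8, then wrongly decode those bytes as MacRoman."""
    return text.encode("utf-8").decode("mac_roman")


def recover(garbled: str) -> str:
    """Undo the misreading: back to the raw bytes, then decode as UTF-8."""
    return garbled.encode("mac_roman").decode("utf-8")


# The two strings from the quoted log, written with escapes.
samples = ["fr\u00f8\u00e2n\u00e7\u00ef\u00e9", "\u63a7\u4ef6"]

for original in samples:
    garbled = misread_as_mac_roman(original)
    # The garbled form differs from the original but is fully recoverable.
    assert garbled != original
    assert recover(garbled) == original
```

The recovery works because Python's `mac_roman` codec assigns a character to every byte value, so no information is lost by the wrong decode. That is consistent with the workaround in the quoted C++ code succeeding, although it does not by itself show where the wrong decode happens.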

Hi,
Unfortunately, I don't have experience with embedding Python in C++,
but the Python part (for Python 2) seems to be missing the u prefix on
the unicode literals, e.g.
u"frøânçïé"
Is the C++ part prepared for a Python unicode object, or does it require
a UTF-8 encoded string (i.e. the respective bytes)?
Would
oldstr.encode("utf-8")
in the call make a difference?

regards,
   vbr
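
[Editorial sketch] The distinction raised in the reply, a unicode object versus its UTF-8 encoded bytes, can be shown as follows. This is a minimal sketch in Python 3, where every string literal is already unicode; in Python 2.7, the version used in the thread, the literal would need the `u` prefix to get the same kind of object. The variable names `text` and `payload` are illustrative only.

```python
# In Python 2.7 this literal would be written u"fr\u00f8\u00e2n\u00e7\u00ef\u00e9".
text = "fr\u00f8\u00e2n\u00e7\u00ef\u00e9"

# Explicitly encode to UTF-8 before handing the data to code that expects bytes.
payload = text.encode("utf-8")

# Eight characters; the five accented ones each take two bytes in UTF-8.
assert len(text) == 8
assert len(payload) == 13

# Decoding with the same encoding restores the original text.
assert payload.decode("utf-8") == text
```

Whether the C++ side should receive the unicode object or the encoded bytes depends on what the extension function checks for, which is the question the reply asks.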
