Date: Mon, 26 Aug 2013 01:30:15 +0100
From: MRAB
Newsgroups: comp.lang.python
Subject: Re: can't get utf8 / unicode strings from embedded python

On 25/08/2013 23:32, David M. Cotter wrote:
> i got it!! OMG! so sorry for the confusion, but i learned a lot,
> and i can share the result:
>
> the CORRECT code *was* what i had assumed. the Python side has
> always been correct (no need to put "u" in front of strings, it is
> known that the bytes are utf8 bytes)
>
> it was my "run script" function which read in the file. THAT was
> what was "reinterpreting" the utf8 bytes as macRoman (on both
> platforms). correct code below:
>
When working with Unicode, what you should be doing is:

1. Specifying the encoding in the special comment at the top of the
   source file.

2. Saving the source file in that same encoding.

3. Using Unicode string literals in the source file.

You're doing (1) and (2), but not (3).

If you want to pass UTF-8 to the C++, then encode the Unicode string to
bytes at the point where you pass it.

Using bytestring literals and relying on the source file being UTF-8,
like you're doing, is just asking for trouble, as you've found out! :-)
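
A minimal sketch of the three steps above, and of the kind of
reinterpretation described in the quoted message. The name send_to_cpp
is hypothetical: it only stands in for whatever function the embedding
C++ application exposes, which the thread does not show.

```python
# -*- coding: utf-8 -*-
# (1) The special comment above declares the source encoding.
# (2) The file itself must actually be saved as UTF-8 to match it.

# Hypothetical stand-in for the host application's C++ entry point;
# here it just records the bytes it was given.
received = []

def send_to_cpp(data):
    received.append(data)

# (3) A Unicode string literal: this is text, not bytes.
text = u"café"

# Encode explicitly at the boundary, so the C++ side always gets UTF-8
# no matter how the source file happened to be read.
payload = text.encode("utf-8")
send_to_cpp(payload)

# What happens if those UTF-8 bytes are decoded as MacRoman instead:
# the two bytes of the accented letter turn into two unrelated characters.
mojibake = payload.decode("mac_roman")
print(repr(mojibake))
```

With the encode step made explicit, the only place an encoding is
chosen is the one line that calls encode(), which makes a mismatch
like the MacRoman one much easier to spot.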