From: Chris Angelico
Cc: "python-list@python.org"
Newsgroups: comp.lang.python
Subject: Re: Encoding of Python 2 string literals
Date: Thu, 23 Jul 2015 00:54:16 +1000
In-Reply-To: <55afaad8$0$1646$c3e8da3$5496439d@news.astraweb.com>
References: <55afaad8$0$1646$c3e8da3$5496439d@news.astraweb.com>

On Thu, Jul 23, 2015 at 12:38 AM, Steven D'Aprano wrote:
> On Wed, 22 Jul 2015 08:17 pm, anatoly techtonik wrote:
>
>> Hi,
>>
>> Is there a way to know encoding of string (bytes) literal
>> defined in source file? For example, given that source:
>>
>> # -*- coding: utf-8 -*-
>> from library import Entry
>> Entry("текст")
>>
>> Is there any way for Entry() constructor to know that
>> string "текст" passed into it is the utf-8 string?
>
> No.
>
> The entry constructor will receive a BYTE string, not a Unicode string,
> containing some sequence of bytes.
>
> If the coding cookie is accurate, then it will be the UTF-8 encoding of that
> string, namely:
>
> '\xd1\x82\xd0\xb5\xd0\xba\xd1\x81\xd1\x82'
>
> If you print those bytes, at least under Linux, your terminal will probably
> interpret them as UTF-8 and display it as текст but don't be fooled, the
> string has length 10 (not 5).
>
> If the coding cookie is not accurate, you will get something else. Probably
> garbage, possibly a syntax error. Let's say you saved the text file using
> the koi8-r encoding, but the coding cookie says utf-8. Then the text file
> will actually contain bytes \xd4\xc5\xcb\xd3\xd4, but Python will try to
> read those bytes as UTF-8, which is invalid. So at best you will get a
> syntax error, at worst garbage text.

AIUI the problem is more along these lines:

1) Put Unicode text into .py file, with a correct coding cookie
2) Part of that text is a quoted byte-string literal, which will
capture those bytes.
3) The byte string is then passed along to another module.
4) ???
5) The other module decodes the bytes to Unicode, using the specified encoding.
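
To make those steps concrete, here is a minimal sketch. The byte string is
written out explicitly so it behaves the same on 2.7 and 3.x, and receive()
is a made-up stand-in for "the other module", not anything from the real
library. Note that it only works because the caller hands over the encoding,
which is exactly the piece that step 4 is missing.

```python
# -*- coding: utf-8 -*-
# Sketch only: receive() is a hypothetical stand-in for the other module.

# Steps 1-2: the bytes a Python 2 literal "текст" captures when the source
# file really is saved as UTF-8.
raw = b'\xd1\x82\xd0\xb5\xd0\xba\xd1\x81\xd1\x82'


def receive(data, encoding):
    """Step 5: decode the bytes, using an encoding supplied by the caller."""
    return data.decode(encoding)


text = receive(raw, 'utf-8')
print(len(raw), len(text))  # 10 bytes, 5 characters

# The same text saved as KOI8-R produces different bytes, and those bytes
# are not valid UTF-8, so a wrong guess at step 4 fails (or yields garbage
# with a more forgiving codec).
koi8 = text.encode('koi8-r')
try:
    receive(koi8, 'utf-8')
except UnicodeDecodeError as exc:
    print('wrong guess:', exc.reason)
```
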

The hole is step 4, as there's no way (AFAIK) to find out what
encoding a source file used. But the solution isn't to find out the
encoding... the solution is...

> The right way to deal with this is to use an actual Unicode string:
>
> Entry(u"текст")
>
> and make sure that the file is saved using UTF-8, as the encoding cookie
> says.

... this. Downside is that this MAY require changes to the API, as it
now has to take Unicode strings everywhere instead of byte strings.
Upside: That's probably what your code is trying to do anyway.

> It is acceptable to drop support for Python 3.1 and 3.2, and only support
> 3.3 and better. The advantage of this is that 3.3 supports the u'' string
> prefix. If you must support 3.1 and 3.2 as well, there is no good solution,
> just ugly ones.

Definitely. If you're only just migrating now, 3.2 is in
security-fix-only mode, and will be out of that within a year. Aim at
3.3+ and take advantage of u"..." compatibility, or even go a bit
further, aim at 3.5+ and make use of bytestring percent formatting.
Depends what you need, but I wouldn't bother supporting 3.2 if it's
any hassle.

ChrisA
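
PS: A rough sketch of what a Unicode-taking API might look like. The Entry
class below is invented purely for illustration (I haven't seen the real
one), so treat the names and the type check as assumptions. The u"..."
literal is accepted by 2.7 and by 3.3+.

```python
# -*- coding: utf-8 -*-
# Hypothetical Entry class, for illustration only; not the real library's API.


class Entry(object):
    def __init__(self, title):
        # Insist on text. On Python 2, `bytes` is an alias for `str`, so
        # this rejects plain byte-string literals there as well.
        if isinstance(title, bytes):
            raise TypeError("Entry() wants a Unicode string, not bytes")
        self.title = title


entry = Entry(u"текст")   # u"" prefix: valid on 2.7 and on 3.3+
print(len(entry.title))   # 5 characters, not 10 bytes

# Callers that still hold bytes decode at the boundary, where the encoding
# is actually known, instead of leaving the library to guess.
legacy = b'\xd1\x82\xd0\xb5\xd0\xba\xd1\x81\xd1\x82'
entry2 = Entry(legacy.decode('utf-8'))
```

The point of the type check is to fail loudly at the call site rather than
let undecoded bytes travel further into the library.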