Re: Encoding of Python 2 string literals

Started by	Steven D'Aprano <steve@pearwood.info>
First post	2015-07-23 00:38 +1000
Last post	2015-07-23 16:13 +1000
Articles	4 — 3 participants

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: Encoding of Python 2 string literals Steven D'Aprano <steve@pearwood.info> - 2015-07-23 00:38 +1000
    Re: Encoding of Python 2 string literals Chris Angelico <rosuav@gmail.com> - 2015-07-23 00:54 +1000
    Re: Encoding of Python 2 string literals dieter <dieter@handshake.de> - 2015-07-23 07:58 +0200
    Re: Encoding of Python 2 string literals Chris Angelico <rosuav@gmail.com> - 2015-07-23 16:13 +1000

#94369 — Re: Encoding of Python 2 string literals

From	Steven D'Aprano <steve@pearwood.info>
Date	2015-07-23 00:38 +1000
Subject	Re: Encoding of Python 2 string literals
Message-ID	<55afaad8$0$1646$c3e8da3$5496439d@news.astraweb.com>

On Wed, 22 Jul 2015 08:17 pm, anatoly techtonik wrote:

> Hi,
> 
> Is there a way to know encoding of string (bytes) literal
> defined in source file? For example, given that source:
> 
>     # -*- coding: utf-8 -*-
>     from library import Entry
>     Entry("текст")
> 
> Is there any way for Entry() constructor to know that
> string "текст" passed into it is the utf-8 string?

No.

The Entry constructor will receive a BYTE string, not a Unicode string,
containing some sequence of bytes.

If the coding cookie is accurate, then it will be the UTF-8 encoding of that
string, namely:

'\xd1\x82\xd0\xb5\xd0\xba\xd1\x81\xd1\x82'

If you print those bytes, at least under Linux, your terminal will probably
interpret them as UTF-8 and display them as текст, but don't be fooled: the
string has length 10 (not 5).
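
The difference between the two lengths can be checked directly. This is only a small sketch, with illustrative variable names that are not from the original question, and it should run unchanged on Python 2.7 and 3.3+:

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; the variable names are assumptions.
text = u"текст"                     # Unicode string: 5 code points
utf8_bytes = text.encode("utf-8")   # byte string: 2 bytes per Cyrillic letter

print(len(text))         # 5
print(len(utf8_bytes))   # 10
```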

If the coding cookie is not accurate, you will get something else. Probably
garbage, possibly a syntax error. Let's say you saved the text file using
the koi8-r encoding, but the coding cookie says utf-8. Then the text file
will actually contain the bytes \xd4\xc5\xcb\xd3\xd4, but Python will try to
read them as UTF-8, and they are not valid UTF-8. So at best you will get a
syntax error, at worst garbage text.
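
The byte-level mismatch can be reproduced without saving a mis-encoded source file. The sketch below only mimics the decoding step on an ordinary byte string; it does not claim to show what the compiler does internally:

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; the names are assumptions.
koi8_bytes = u"текст".encode("koi8-r")   # what an editor set to KOI8-R would save

try:
    koi8_bytes.decode("utf-8")           # what a utf-8 cookie asks for
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(decoded_ok)   # False
```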

The right way to deal with this is to use an actual Unicode string:

Entry(u"текст")

and make sure that the file is saved using UTF-8, as the encoding cookie
says.

> I need to better prepare SCons for Python 3 migration.

The first step is to use proper Unicode strings u'' in Python 2.

It is acceptable to drop support for Python 3.1 and 3.2, and only support
3.3 and better. The advantage of this is that 3.3 supports the u'' string
prefix. If you must support 3.1 and 3.2 as well, there is no good solution,
just ugly ones.

-- 
Steven

#94371

From	Chris Angelico <rosuav@gmail.com>
Date	2015-07-23 00:54 +1000
Message-ID	<mailman.868.1437576859.3674.python-list@python.org>
In reply to	#94369

On Thu, Jul 23, 2015 at 12:38 AM, Steven D'Aprano <steve@pearwood.info> wrote:
> On Wed, 22 Jul 2015 08:17 pm, anatoly techtonik wrote:
>
>> Hi,
>>
>> Is there a way to know encoding of string (bytes) literal
>> defined in source file? For example, given that source:
>>
>>     # -*- coding: utf-8 -*-
>>     from library import Entry
>>     Entry("текст")
>>
>> Is there any way for Entry() constructor to know that
>> string "текст" passed into it is the utf-8 string?
>
> No.
>
> The entry constructor will receive a BYTE string, not a Unicode string,
> containing some sequence of bytes.
>
> If the coding cookie is accurate, then it will be the UTF-8 encoding of that
> string, namely:
>
> '\xd1\x82\xd0\xb5\xd0\xba\xd1\x81\xd1\x82'
>
> If you print those bytes, at least under Linux, your terminal will probably
> interpret them as UTF-8 and display it as текст but don't be fooled, the
> string has length 10 (not 5).
>
> If the coding cookie is not accurate, you will get something else. Probably
> garbage, possibly a syntax error. Let's say you saved the text file using
> the koi8-r encoding, but the coding cookie says utf-8. Then the text file
> will actually contain bytes \xd4\xc5\xcb\xd3\xd4, but Python will try to
> read those bytes as UTF-8, which is invalid. So at best you will get a
> syntax error, at worst garbage text.

AIUI the problem is more along these lines:

1) Put Unicode text into .py file, with a correct coding cookie
2) Part of that text is a quoted byte-string literal, which will
capture those bytes.
3) The byte string is then passed along to another module.
4) ???
5) The other module decodes the bytes to Unicode, using the specified encoding.
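
As a rough sketch of why the bytes captured in step 2 cannot simply be decoded in step 5: the same five bytes decode without error under more than one codec, and nothing in the byte string says which one was meant. The codec pair below is just an example.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; the names and the choice of codecs are assumptions.
raw = b"\xd4\xc5\xcb\xd3\xd4"        # five bytes, no encoding attached

as_koi8 = raw.decode("koi8-r")       # the text the author meant
as_cp1251 = raw.decode("cp1251")     # decodes just as happily, to other text

print(as_koi8 == u"текст")     # True
print(as_koi8 == as_cp1251)    # False
```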

The hole is step 4, as there's no way (AFAIK) to find out what
encoding a source file used. But the solution isn't to find out the
encoding... the solution is...

> The right way to deal with this is to use an actual Unicode string:
>
> Entry(u"текст")
>
> and make sure that the file is saved using UTF-8, as the encoding cookie
> says.

... this. Downside is that this MAY require changes to the API, as it
now has to take Unicode strings everywhere instead of byte strings.
Upside: That's probably what your code is trying to do anyway.
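
A minimal sketch of what such an API change might look like. The Entry class here is hypothetical (the real library's class is not shown in this thread); it simply refuses byte strings, so callers have to pass u"..." literals:

```python
# -*- coding: utf-8 -*-
# Hypothetical Entry class, for illustration only.
text_type = type(u"")   # unicode on Python 2, str on Python 3


class Entry(object):
    def __init__(self, name):
        # Reject byte strings up front instead of guessing their encoding.
        if not isinstance(name, text_type):
            raise TypeError("Entry() needs a Unicode string, got %r" % type(name))
        self.name = name


entry = Entry(u"текст")      # fine: already decoded text

try:
    Entry(b"\xd1\x82")       # byte string: rejected
    rejected = False
except TypeError:
    rejected = True
```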

> It is acceptable to drop support for Python 3.1 and 3.2, and only support
> 3.3 and better. The advantage of this is that 3.3 supports the u'' string
> prefix. If you must support 3.1 and 3.2 as well, there is no good solution,
> just ugly ones.

Definitely. If you're only just migrating now, 3.2 is in
security-fix-only mode, and will be out of that within a year. Aim at
3.3+ and take advantage of u"..." compatibility, or even go a bit
further, aim at 3.5+ and make use of bytestring percent formatting.
Depends what you need, but I wouldn't bother supporting 3.2 if it's
any hassle.
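
For reference, a tiny example of the bytestring percent formatting mentioned above. It works on Python 2 and on 3.5+, but not on 3.0 through 3.4; the header text is made up for illustration.

```python
# Percent formatting applied directly to a byte string.
header = b"Content-Length: %d\r\n" % 42

print(header == b"Content-Length: 42\r\n")   # True
```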

ChrisA

#94427

From	dieter <dieter@handshake.de>
Date	2015-07-23 07:58 +0200
Message-ID	<mailman.900.1437631206.3674.python-list@python.org>
In reply to	#94369

Steven D'Aprano <steve@pearwood.info> writes:
> On Wed, 22 Jul 2015 08:17 pm, anatoly techtonik wrote:
>> Is there a way to know encoding of string (bytes) literal
>> defined in source file? For example, given that source:
>> 
>>     # -*- coding: utf-8 -*-
>>     from library import Entry
>>     Entry("текст")
>> 
>> Is there any way for Entry() constructor to know that
>> string "текст" passed into it is the utf-8 string?
> ...
> The right way to deal with this is to use an actual Unicode string:
>
> Entry(u"текст")
>
> and make sure that the file is saved using UTF-8, as the encoding cookie
> says.

In order to follow this recommendation, is there an easy way to
learn about the "encoding cookie"'s value -- rather than parsing
the first two lines of the source file (which may not always be available)?

#94428

From	Chris Angelico <rosuav@gmail.com>
Date	2015-07-23 16:13 +1000
Message-ID	<mailman.901.1437631990.3674.python-list@python.org>
In reply to	#94369

On Thu, Jul 23, 2015 at 3:58 PM, dieter <dieter@handshake.de> wrote:
> Steven D'Aprano <steve@pearwood.info> writes:
>> On Wed, 22 Jul 2015 08:17 pm, anatoly techtonik wrote:
>>> Is there a way to know encoding of string (bytes) literal
>>> defined in source file? For example, given that source:
>>>
>>>     # -*- coding: utf-8 -*-
>>>     from library import Entry
>>>     Entry("текст")
>>>
>>> Is there any way for Entry() constructor to know that
>>> string "текст" passed into it is the utf-8 string?
>> ...
>> The right way to deal with this is to use an actual Unicode string:
>>
>> Entry(u"текст")
>>
>> and make sure that the file is saved using UTF-8, as the encoding cookie
>> says.
>
> In order to follow this recommendation, is there an easy way to
> learn about the "encoding cookie"'s value -- rather than parsing
> the first two lines of the source file (which may not always be available).

No; you don't need to. If you use a Unicode string literal (as marked
by the u"..." notation), the Python compiler will handle the decoding
for you. The string that's passed to Entry() will simply be a string
of Unicode codepoints - no encoding information needed. If you then
want that in UTF-8, you can encode it explicitly.
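
A short sketch of that last step, with illustrative names: the Unicode string is the same whatever encoding the source file was saved in, and the bytes only come into being when you pick an encoding at the output boundary.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; the names are assumptions.
name = u"текст"                     # decoded by the compiler from the source file

as_utf8 = name.encode("utf-8")      # 10 bytes
as_koi8 = name.encode("koi8-r")     # 5 bytes

# Different bytes, but the same text once each is decoded with its own codec.
print(as_utf8 != as_koi8)                                     # True
print(as_utf8.decode("utf-8") == as_koi8.decode("koi8-r"))    # True
```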

ChrisA
