
Python 3.5, bytes, and %-interpolation (aka PEP 461)

Started by	Ethan Furman <ethan@stoneleaf.us>
First post	2014-02-24 11:54 -0800
Last post	2014-02-25 07:29 +0000
Articles	11 — 7 participants


  Python 3.5, bytes, and %-interpolation  (aka PEP 461) Ethan Furman <ethan@stoneleaf.us> - 2014-02-24 11:54 -0800
    Re: Python 3.5, bytes, and %-interpolation  (aka PEP 461) Marko Rauhamaa <marko@pacujo.net> - 2014-02-24 22:46 +0200
      Re: Python 3.5, bytes, and %-interpolation  (aka PEP 461) random832@fastmail.us - 2014-02-24 16:04 -0500
        Re: Python 3.5, bytes, and %-interpolation  (aka PEP 461) Marko Rauhamaa <marko@pacujo.net> - 2014-02-25 00:18 +0200
          Re: Python 3.5, bytes, and %-interpolation  (aka PEP 461) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-02-24 23:55 +0000
            Re: Python 3.5, bytes, and %-interpolation  (aka PEP 461) wxjmfauth@gmail.com - 2014-02-25 00:07 -0800
              Re: Python 3.5, bytes, and %-interpolation  (aka PEP 461) Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-02-25 08:29 +0000
      Re: Python 3.5, bytes, and %-interpolation  (aka PEP 461) Ethan Furman <ethan@stoneleaf.us> - 2014-02-24 13:53 -0800
    Re: Python 3.5, bytes, and %-interpolation  (aka PEP 461) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-02-24 23:55 +0000
      Re: Python 3.5, bytes, and %-interpolation  (aka PEP 461) Ethan Furman <ethan@stoneleaf.us> - 2014-02-24 16:10 -0800
        Re: Python 3.5, bytes, and %-interpolation  (aka PEP 461) Steven D'Aprano <steve@pearwood.info> - 2014-02-25 07:29 +0000

#67000 — Python 3.5, bytes, and %-interpolation (aka PEP 461)

From	Ethan Furman <ethan@stoneleaf.us>
Date	2014-02-24 11:54 -0800
Subject	Python 3.5, bytes, and %-interpolation (aka PEP 461)
Message-ID	<mailman.7328.1393271688.18130.python-list@python.org>

Greetings!

A PEP is under discussion to add %-interpolation back to the bytes type in Python 3.5.

Assuming the PEP is accepted, what *will* be added back is:

Numerics:

   b'%d' % 10  --> b'10'
   b'%02x' % 10  --> b'0a'

Single byte:

   b'%c' % 80  --> b'P'

and generic:

   b'%s' % some_binary_blob --> b'tHE*&92h4' (or whatever)

What is under debate is whether we should also add %a:

   b'%a' % some_obj  --> b'some_obj_repr'

What %a would do:

   get the repr of some_obj

   convert it to ascii using backslashreplace (to handle any code points over 127)

   encode to bytes using 'ascii'
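
The three steps above can be sketched in plain Python roughly as follows. This is only an illustration: the helper name ascii_repr_bytes and the sample values are invented here, and the two encoding steps are folded into a single encode() call.

```python
def ascii_repr_bytes(obj):
    """Hypothetical helper mirroring the three steps described for %a."""
    text = repr(obj)                                  # step 1: get the repr
    # steps 2 and 3: escape anything above code point 127, then produce ASCII bytes
    return text.encode('ascii', 'backslashreplace')

samples = ['caf\xe9', '\u20ac', [1, 2, 3, 4], 3.14]
results = [ascii_repr_bytes(s) for s in samples]
```

On an interpreter where the PEP has landed with %a included (assumed here to be 3.5 or later), b'%a' % (obj,) should give the same bytes for these samples.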

Can anybody think of a use-case for this particular feature?

--
~Ethan~


#67003

From	Marko Rauhamaa <marko@pacujo.net>
Date	2014-02-24 22:46 +0200
Message-ID	<87ha7o9r2q.fsf@elektro.pacujo.net>
In reply to	#67000

Ethan Furman <ethan@stoneleaf.us>:

> Can anybody think of a use-case for this particular feature?

Internet protocol entities constantly have to format (and parse)
ASCII-esque octet strings:

    headers.append(b'Content-length: %d\r\n' % len(blob))

    headers.append(b'Content-length: {}\r\n'.format(len(blob)))

Now you must do:

    headers.append(('Content-length: %d\r\n' % len(blob)).encode())

    headers.append('Content-length: {}\r\n'.format(len(blob)).encode())

That is:

 1. inefficient (encode/decode shuffle)

 2. unnatural (strings usually have no place in protocols)

 3. confusing (what is stored as bytes, what is stored as strings?)

 4. error-prone (UTF-8 decoding exceptions etc)


To be sure, %s will definitely be needed as well:

   uri = b'http://%s/robots.txt' % host
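
A minimal sketch of the two spellings side by side, assuming Python 3.5 or later with the PEP applied; blob and host are placeholder values made up for the example.

```python
blob = b'hello'
host = b'example.org'

# Workaround without bytes interpolation: format as str, then encode.
via_str = ('Content-length: %d\r\n' % len(blob)).encode('ascii')

# With bytes interpolation there is no round trip through str.
via_bytes = b'Content-length: %d\r\n' % len(blob)

# %s accepts a bytes value directly.
uri = b'http://%s/robots.txt' % host
```

Both header spellings should produce the same bytes. Note that the sketch covers only the % operator; bytes has no format() method, so the {} spelling still has to go through str.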


Marko


#67004

From	random832@fastmail.us
Date	2014-02-24 16:04 -0500
Message-ID	<mailman.7331.1393275892.18130.python-list@python.org>
In reply to	#67003

On Mon, Feb 24, 2014, at 15:46, Marko Rauhamaa wrote:
> That is:
> 
>  1. inefficient (encode/decode shuffle)
> 
>  2. unnatural (strings usually have no place in protocols)

That's not at all clear. Why _aren't_ these protocols considered text
protocols? Why can't you add a string directly to headers?


#67005

From	Marko Rauhamaa <marko@pacujo.net>
Date	2014-02-25 00:18 +0200
Message-ID	<87d2ic2lyq.fsf@elektro.pacujo.net>
In reply to	#67004

random832@fastmail.us:

> On Mon, Feb 24, 2014, at 15:46, Marko Rauhamaa wrote:
>> That is:
>> 
>>  1. inefficient (encode/decode shuffle)
>> 
>>  2. unnatural (strings usually have no place in protocols)
>
> That's not at all clear. Why _aren't_ these protocols considered text
> protocols? Why can't you add a string directly to headers?

Text expresses a written human language. In prosaic terms, a Python
string is a sequence of ISO 10646 characters, whose codepoints are not
octets.

Most network protocols are defined in terms of octets, although many of
them can carry textual, audio or video payloads (among others). So when
RFC 3507 (ICAP) shows an example starting:

   RESPMOD icap://icap.example.org/satisf ICAP/1.0
   Host: icap.example.org
   Encapsulated: req-hdr=0, res-hdr=137, res-body=296

it consists of 8-bit octets and not some human language.

In practical terms, you get the bytes off the socket as, well, bytes. It
makes little sense to "decode" those bytes into a string for
manipulation. Manipulating bytes directly is both more efficient and
more natural from the point of view of the standard.
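
As an illustration of working on the bytes directly, here is a small sketch that picks apart the Encapsulated header from the example without ever decoding it. The helper name parse_header is hypothetical, and real ICAP parsing needs far more care than this.

```python
def parse_header(line):
    """Split one b'Name: value' line into (name, value), staying in bytes."""
    name, _, value = line.rstrip(b'\r\n').partition(b':')
    return name.strip().lower(), value.strip()

raw = b'Encapsulated: req-hdr=0, res-hdr=137, res-body=296\r\n'
name, value = parse_header(raw)

# The offsets stay bytes until they are turned into integers;
# int() accepts an ASCII bytes object as well as a str.
offsets = {}
for item in value.split(b','):
    key, _, number = item.strip().partition(b'=')
    offsets[key] = int(number)
```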

Many internet protocols happen to look like text. It makes it nicer for
human network programmers to work with them. However, they are primarily
meant for computers, and the message formats are really a form of binary
code.

Marko


#67008

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-02-24 23:55 +0000
Message-ID	<530bdbf8$0$29985$c3e8da3$5496439d@news.astraweb.com>
In reply to	#67005

On Tue, 25 Feb 2014 00:18:53 +0200, Marko Rauhamaa wrote:

> random832@fastmail.us:
> 
>> On Mon, Feb 24, 2014, at 15:46, Marko Rauhamaa wrote:
>>> That is:
>>> 
>>>  1. inefficient (encode/decode shuffle)
>>> 
>>>  2. unnatural (strings usually have no place in protocols)
>>
>> That's not at all clear. Why _aren't_ these protocols considered text
>> protocols? Why can't you add a string directly to headers?

You cannot mix text strings and byte strings in Python 3. Python 2 allows 
you to do so, and it leads to hard-to-diagnose bugs and confusing 
behaviour. This is why Python 3 insists on a strict separation between 
the two.

But of course you can add *byte* strings directly to byte headers. Just 
prefix your strings with a b, as in b'Header' instead of 'Header', and it 
will work fine.
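
A quick sketch of both points; the header value is just a placeholder taken from the ICAP example.

```python
header = b'Host: '

# Mixing bytes and str is refused in Python 3.
try:
    header + 'icap.example.org'
except TypeError as exc:
    mixing_error = exc
else:
    mixing_error = None

# Prefixing the literal with b keeps everything in bytes.
line = header + b'icap.example.org' + b'\r\n'
```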

However, you don't really want to be adding large numbers of byte strings 
together, due to efficiency. Better to use % interpolation to insert them 
all at once. Hence the push to add % to bytes in Python 3.
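
To make the comparison concrete, here is a sketch of the same two header lines built both ways, assuming Python 3.5 or later with the PEP applied. The values are placeholders, and no timing is claimed; anyone who cares about the speed difference should measure it with timeit on their own data.

```python
host, req, res, body = b'icap.example.org', 0, 137, 296

# Piecewise concatenation: every + builds a new intermediate bytes object,
# and each number has to be converted and encoded by hand.
concatenated = (b'Host: ' + host + b'\r\n'
                + b'Encapsulated: req-hdr=' + str(req).encode('ascii')
                + b', res-hdr=' + str(res).encode('ascii')
                + b', res-body=' + str(body).encode('ascii') + b'\r\n')

# One interpolation fills in all four values at once.
interpolated = (b'Host: %s\r\n'
                b'Encapsulated: req-hdr=%d, res-hdr=%d, res-body=%d\r\n'
                % (host, req, res, body))
```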

Marko replied:
> Text expresses a written human language. In prosaic terms, a Python
> string is a sequence of ISO 10646 characters, whose codepoints are not
> octets.

Almost correct, but not quite. Python strings are Unicode, not ISO-10646. 
The two are not the same.

http://www.unicode.org/faq/unicode_iso.html

> Most network protocols are defined in terms of octets, although many of
> them can carry textual, audio or video payloads (among others). So when
> RFC 3507 (ICAP) shows an example starting:
> 
>    RESPMOD icap://icap.example.org/satisf ICAP/1.0
>    Host: icap.example.org
>    Encapsulated: req-hdr=0, res-hdr=137, res-body=296
> 
> it consists of 8-bit octets and not some human language.

Not really relevant. In practical terms, whether they are implemented as 
octets or not, the sequence "Host" *is* human language, specifically it 
is the English word Host that just happens to be encoded in ASCII. 
Likewise the sequence "Encapsulated" *is* the English word Encapsulated 
encoded in ASCII.

> In practical terms, you get the bytes off the socket as, well, bytes. It
> makes little sense to "decode" those bytes into a string for
> manipulation. Manipulating bytes directly is both more efficient and
> more natural from the point of view of the standard.

But not necessarily more natural from the point of view of the programmer,
which is what matters.

I agree that if you don't need to interpret the data as Unicode text, 
then there's no real benefit to decoding to text. (In fact, if your data 
can contain arbitrary bytes, you may not be able to decode to text, since 
not all byte sequences are legal UTF-8.)
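
For instance, a sketch with a made-up four-byte payload that is not valid UTF-8:

```python
payload = b'\xff\xfe\x00\x01'   # arbitrary bytes; 0xff can never start a UTF-8 sequence

try:
    payload.decode('utf-8')
except UnicodeDecodeError as exc:
    decode_error = exc
else:
    decode_error = None

# The data can still be sliced and searched as bytes without decoding.
first_two = payload[:2]
has_nul = b'\x00' in payload
```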

> Many internet protocols happen to look like text. It makes it nicer for
> human network programmers to work with them. However, they are primarily
> meant for computers, and the message formats are really a form of binary
> code.

The reason that, say, the subject header line in emails starts with the 
word "Subject" rather than some arbitrary binary code is because it is 
intended to be human-readable. Not just human-readable, but *semantically 
meaningful*. That's why the subject line is labelled "Subject" rather 
than "Field 23" or "SJT".

Fortunately, such headers are usually (always?) ASCII, and byte strings 
in Python privilege ASCII-encoded text. When you write b'Subject', you 
get the same sequence of bytes as 'Subject'.encode('ascii').

-- 
Steven


#67032

From	wxjmfauth@gmail.com
Date	2014-02-25 00:07 -0800
Message-ID	<c14b37dd-139f-4110-98e0-08ca437f91af@googlegroups.com>
In reply to	#67008

Le mardi 25 février 2014 00:55:36 UTC+1, Steven D'Aprano a écrit :
> However, you don't really want to be adding large numbers of byte strings
> together, due to efficiency. Better to use % interpolation to insert them
> all at once. Hence the push to add % to bytes in Python 3.

>>> import timeit
>>> timeit.timeit("'abc' * 1000 + '\u20ac'")
2.3244550589543564
>>> timeit.timeit("x * 1000 + y", "x = 'abc'.encode('utf-8'); y = '\u20ac'.encode('utf-8')")
0.9365105183684364
>>> timeit.timeit("'\u0153' + 'abc' * 1000 + '\u20ac'")
3.0469319226397715
>>> timeit.timeit("z + x * 1000 + y", "x = 'abc'.encode('utf-8'); y = '\u20ac'.encode('utf-8'); z = '\u0153'.encode('utf-8')")
1.9215464486771339
>>> 

Interpolation will not help.

What is wrong by design will always stay wrong by design.

jmf



#67033

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2014-02-25 08:29 +0000
Message-ID	<mailman.7346.1393316990.18130.python-list@python.org>
In reply to	#67032

On 25/02/2014 08:07, wxjmfauth@gmail.com wrote:
>
> What is wrong by design will always stay wrong by design.
>

Why are you making the statement that PEP 461 is wrong by design?


-- 
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.

Mark Lawrence



#67006

From	Ethan Furman <ethan@stoneleaf.us>
Date	2014-02-24 13:53 -0800
Message-ID	<mailman.7332.1393281705.18130.python-list@python.org>
In reply to	#67003

On 02/24/2014 01:04 PM, random832@fastmail.us wrote:
> On Mon, Feb 24, 2014, at 15:46, Marko Rauhamaa wrote:
>> That is:
>>
>>   1. inefficient (encode/decode shuffle)
>>
>>   2. unnatural (strings usually have no place in protocols)
>
> That's not at all clear. Why _aren't_ these protocols considered text
> protocols? Why can't you add a string directly to headers?

Because text is a high-order abstraction.  You don't store text in files, you
don't transmit text over the wire or through the air -- those actions are done
with a lower abstraction, that of bytes.

Your framework may allow you to add a string, but under the covers it's
converting to bytes -- at which point it does so is up to the framework.

--
~Ethan~


#67007

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-02-24 23:55 +0000
Message-ID	<530bdbe2$0$29985$c3e8da3$5496439d@news.astraweb.com>
In reply to	#67000

On Mon, 24 Feb 2014 11:54:54 -0800, Ethan Furman wrote:

> Greetings!
> 
> A PEP is under discussion to add %-interpolation back to the bytes type
> in Python 3.5.
> 
> Assuming the PEP is accepted, what *will* be added back is:
> 
> Numerics:
> 
>    b'%d' % 10  --> b'10'
>    b'%02x' % 10  --> b'0a'
> 
> Single byte:
> 
>    b'%c' % 80  --> b'P'

Will %c also accept a length-1 bytes object?

b'%c' % b'x'
=> b'x'



> and generic:
> 
>    b'%s' % some_binary_blob --> b'tHE*&92h4' (or whatever)

Will b'%s' take any arbitrary object, as in:

b'Key: %s' % [1, 2, 3, 4]
=> b'Key: [1, 2, 3, 4]'


or only something which is already bytes (i.e. a bytes or bytearray 
object)?



> What is under debate is whether we should also add %a:
> 
>    b'%a' % some_obj  --> b'some_obj_repr'
> 
> What %a would do:
> 
>    get the repr of some_obj
> 
>    convert it to ascii using backslashreplace (to handle any code points
>    over 127)
> 
>    encode to bytes using 'ascii'
> 
> Can anybody think of a use-case for this particular feature?


Not me.




-- 
Steven


#67009

From	Ethan Furman <ethan@stoneleaf.us>
Date	2014-02-24 16:10 -0800
Message-ID	<mailman.7333.1393288494.18130.python-list@python.org>
In reply to	#67007

On 02/24/2014 03:55 PM, Steven D'Aprano wrote:
> On Mon, 24 Feb 2014 11:54:54 -0800, Ethan Furman wrote:
>
>> Greetings!
>>
>> A PEP is under discussion to add %-interpolation back to the bytes type
>> in Python 3.5.
>>
>> Assuming the PEP is accepted, what *will* be added back is:
>>
>> Numerics:
>>
>>     b'%d' % 10  --> b'10'
>>     b'%02x' % 10  --> b'0a'
>>
>> Single byte:
>>
>>     b'%c' % 80  --> b'P'
>
> Will %c also accept a length-1 bytes object?
>
> b'%c' % b'x'
> => b'x'

Yes.


>> and generic:
>>
>>     b'%s' % some_binary_blob --> b'tHE*&92h4' (or whatever)
>
> Will b'%s' take any arbitrary object, as in:
>
> b'Key: %s' % [1, 2, 3, 4]
> => b'Key: [1, 2, 3, 4]'

No.

> or only something which is already bytes (i.e. a bytes or bytearray
> object)?

It must already be bytes, or have a __bytes__ method (that returns bytes, obviously ;) ).
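
A sketch of those rules, assuming Python 3.5 or later with the PEP applied. The HostName class is hypothetical, and the arguments are wrapped in one-element tuples only to keep the example unambiguous.

```python
class HostName:
    """Hypothetical class that knows its own bytes form."""
    def __init__(self, name):
        self.name = name

    def __bytes__(self):
        return self.name.encode('ascii')

# An object with __bytes__ is accepted by %s.
ok = b'http://%s/robots.txt' % (HostName('example.org'),)

# A list is neither bytes nor provides __bytes__, so %s refuses it.
try:
    b'Key: %s' % ([1, 2, 3, 4],)
except TypeError as exc:
    list_error = exc
else:
    list_error = None

# %c takes an integer or a length-1 bytes object.
from_int = b'%c' % (80,)
from_bytes = b'%c' % (b'x',)
```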


>> What is under debate is whether we should also add %a:
>>
>>     b'%a' % some_obj  --> b'some_obj_repr'
>>
>> What %a would do:
>>
>>     get the repr of some_obj
>>
>>     convert it to ascii using backslashreplace (to handle any code points
>>     over 127)
>>
>>     encode to bytes using 'ascii'
>>
>> Can anybody think of a use-case for this particular feature?
>
> Not me.

I find that humorous, as %a would work with your list example above.  :)

--
~Ethan~


#67030

From	Steven D'Aprano <steve@pearwood.info>
Date	2014-02-25 07:29 +0000
Message-ID	<530c466c$0$29980$c3e8da3$5496439d@news.astraweb.com>
In reply to	#67009

On Mon, 24 Feb 2014 16:10:53 -0800, Ethan Furman wrote:

> On 02/24/2014 03:55 PM, Steven D'Aprano wrote:

>> Will b'%s' take any arbitrary object, as in:
>>
>> b'Key: %s' % [1, 2, 3, 4]
>> => b'Key: [1, 2, 3, 4]'
> 
> No.

Very glad to hear it.

[...]
>>> Can anybody think of a use-case for this particular feature?
>>
>> Not me.
> 
> I find that humorous, as %a would work with your list example above.  :)

I know. But why would I want to do it? "It won't fail" is not a use-case.
I can subclass int and give it a __getitem__ method that raises SystemExit,
but that's not a use-case for doing so :-)

I cannot think of any reason to want to ASCII-ise the repr of arbitrary 
objects, and on the rare occasion that I did, I could say 

repr(obj).encode('ascii', 'backslashreplace')

I don't object to this feature, but nor do I want it.

-- 
Steven

