
Re: enhancement request: make py3 read/write py2 pickle format

Started by: Robert Kern <robert.kern@gmail.com>
First post: 2015-06-10 12:22 +0100
Last post: 2015-06-10 20:47 -0400
Articles: 6 (4 participants)


This discussion started before the indexed window; earlier articles aren't shown. The article labeled "Started by" below is the oldest one visible, not the original post.


Contents

  Re: enhancement request: make py3 read/write py2 pickle format Robert Kern <robert.kern@gmail.com> - 2015-06-10 12:22 +0100
    Re: enhancement request: make py3 read/write py2 pickle format Marko Rauhamaa <marko@pacujo.net> - 2015-06-10 15:08 +0300
      Re: enhancement request: make py3 read/write py2 pickle format random832@fastmail.us - 2015-06-10 09:38 -0400
      Re: enhancement request: make py3 read/write py2 pickle format Robert Kern <robert.kern@gmail.com> - 2015-06-10 14:52 +0100
        Re: enhancement request: make py3 read/write py2 pickle format Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2015-06-11 11:30 +1200
          Re: enhancement request: make py3 read/write py2 pickle format random832@fastmail.us - 2015-06-10 20:47 -0400

#92394 — Re: enhancement request: make py3 read/write py2 pickle format

From: Robert Kern <robert.kern@gmail.com>
Date: 2015-06-10 12:22 +0100
Subject: Re: enhancement request: make py3 read/write py2 pickle format
Message-ID: <mailman.337.1433935377.13271.python-list@python.org>

On 2015-06-10 12:04, Neal Becker wrote:
> Chris Warrick wrote:
>
>> On Tue, Jun 9, 2015 at 8:08 PM, Neal Becker <ndbecker2@gmail.com> wrote:
>>> One of the most annoying problems with py2/3 interoperability is that the
>>> pickle formats are not compatible.  There must be many who, like myself,
>>> often use pickle format for data storage.
>>>
>>> It certainly would be a big help if py3 could read/write py2 pickle
>>> format. You know, backward compatibility?
>>
>> Don’t use pickle. It’s unsafe — it executes arbitrary code, which
>> means someone can give you a pickle file that will delete all your
>> files or eat your cat.
>>
>> Instead, use a safe format that has no ability to execute code, like
>> JSON. It will also work with other programming languages and
>> environments if you ever need to talk to anyone else.
>>
>> But, FYI: there is backwards compatibility if you ask for it, in the
>> form of protocol versions. That’s all you should know — again, don’t
>> use pickle.
>
> I believe a good native serialization system is essential for any modern
> programming language.  If pickle isn't it, we need something else that can
> serialize all language objects.  Or, are you saying, it's impossible to do
> this safely?

By the very nature of the stated problem: serializing all language objects. 
Being able to construct any object, including instances of arbitrary classes, 
means that arbitrary code can be executed. All I have to do is make a pickle 
file for an object that claims that its constructor is shutil.rmtree().
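
A minimal sketch of that mechanism, assuming Python 3 and the standard pickle module. The class name Sneaky is made up for illustration, and the builtin print stands in for something destructive like shutil.rmtree:

```python
import pickle


class Sneaky:
    """Hypothetical class whose pickle names an arbitrary callable."""

    def __reduce__(self):
        # Tell pickle: "to rebuild me, call print(<this message>)".
        # A hostile file could just as well name shutil.rmtree and a path.
        return (print, ("side effect: this ran during pickle.loads()",))


payload = pickle.dumps(Sneaky())

# Merely loading the bytes calls the named callable with the given args.
result = pickle.loads(payload)

# What comes back is whatever the callable returned (None for print),
# not a Sneaky instance.
print("loads() returned:", result)
```

Nothing in the payload refers to the Sneaky class at all; it only records which callable to invoke and with what arguments, which is why the reader of the file does not need any cooperating code installed.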

This is fine in some use cases (e.g. wire format for otherwise-secured 
communication between two endpoints under your complete control), but it is 
worrying in others, like your use case of data storage (and presumably sharing).

Python 2/3 is also the least of your compatibility worries there. Refactor a 
class to a different module, or did one of your third-party dependencies do 
this? Poof! Your pickle files no longer work.
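
A rough way to see the refactoring problem, assuming Python 3. The module names shapes_v1 and shapes_v2 are hypothetical and are faked through sys.modules so that the sketch fits in one file:

```python
import pickle
import sys
import types


class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y


def install(module_name):
    """Pretend Point is defined in `module_name` (a fake module)."""
    mod = types.ModuleType(module_name)
    mod.Point = Point
    Point.__module__ = module_name
    Point.__qualname__ = "Point"
    sys.modules[module_name] = mod


install("shapes_v1")
blob = pickle.dumps(Point(1, 2))  # the pickle records "shapes_v1" and "Point"

# "Refactor": the class moves and the old module goes away.
del sys.modules["shapes_v1"]
install("shapes_v2")

try:
    pickle.loads(blob)
    outcome = "loaded"
except ImportError as exc:
    # The old pickle still asks for shapes_v1, which can no longer be imported.
    outcome = "failed: " + type(exc).__name__

print(outcome)
```

The class itself is unchanged; only the name under which pickle recorded it has stopped resolving.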

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
  that is made terrible by our own mad attempt to interpret it as though it had
  an underlying truth."
   -- Umberto Eco



#92396

From: Marko Rauhamaa <marko@pacujo.net>
Date: 2015-06-10 15:08 +0300
Message-ID: <878ubr3gv8.fsf@elektro.pacujo.net>
In reply to: #92394

Robert Kern <robert.kern@gmail.com>:

> By the very nature of the stated problem: serializing all language
> objects. Being able to construct any object, including instances of
> arbitrary classes, means that arbitrary code can be executed. All I
> have to do is make a pickle file for an object that claims that its
> constructor is shutil.rmtree().

You can't serialize/migrate arbitrary objects. Consider open TCP
connections, open files and other objects that extend outside the Python
VM. Also objects hold references to each other, leading to a huge
reference mesh.

For example:

   a.buddy = b
   b.buddy = a
   with open("a", "wb") as f: f.write(serialize(a))
   with open("b", "wb") as f: f.write(serialize(b))

   with open("a", "rb") as f: aa = deserialize(f.read())
   with open("b", "rb") as f: bb = deserialize(f.read())
   assert aa.buddy is bb


Marko



#92398

From: random832@fastmail.us
Date: 2015-06-10 09:38 -0400
Message-ID: <mailman.340.1433943523.13271.python-list@python.org>
In reply to: #92396

On Wed, Jun 10, 2015, at 08:08, Marko Rauhamaa wrote:
> You can't serialize/migrate arbitrary objects. Consider open TCP
> connections, open files and other objects that extend outside the Python
> VM. Also objects hold references to each other, leading to a huge
> reference mesh.
> 
> For example:
> 
>    a.buddy = b
>    b.buddy = a
>    with open("a", "wb") as f: f.write(serialize(a))
>    with open("b", "wb") as f: f.write(serialize(b))
> 
>    with open("a", "rb") as f: aa = deserialize(f.read())
>    with open("b", "rb") as f: bb = deserialize(f.read())
>    assert aa.buddy is bb

Of course, if you serialize a single dict with e.g. {'a': a, 'b': b},
you can expect (with advanced serialization tools, anyway - I suspect
JSON will just make a mess or exceed maximum recursion depth) that
result['a'].buddy is result['b'].
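
A small check of both halves of that claim, assuming Python 3 and using plain dicts in place of the a and b objects. With its default settings, the standard json module refuses a cyclic structure with a ValueError rather than recursing forever, while pickle keeps the cycle intact within a single dump:

```python
import json
import pickle

# Two dicts standing in for the objects a and b, each pointing at the other.
a, b = {"name": "a"}, {"name": "b"}
a["buddy"], b["buddy"] = b, a

try:
    json.dumps({"a": a, "b": b})
    json_outcome = "ok"
except ValueError as exc:
    # json's default circular-reference check fires here.
    json_outcome = "ValueError: " + str(exc)

# One pickle of the containing dict preserves identity between its members.
result = pickle.loads(pickle.dumps({"a": a, "b": b}))

print(json_outcome)
print(result["a"]["buddy"] is result["b"])
```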



#92399

From: Robert Kern <robert.kern@gmail.com>
Date: 2015-06-10 14:52 +0100
Message-ID: <mailman.341.1433944371.13271.python-list@python.org>
In reply to: #92396

On 2015-06-10 13:08, Marko Rauhamaa wrote:
> Robert Kern <robert.kern@gmail.com>:
>
>> By the very nature of the stated problem: serializing all language
>> objects. Being able to construct any object, including instances of
>> arbitrary classes, means that arbitrary code can be executed. All I
>> have to do is make a pickle file for an object that claims that its
>> constructor is shutil.rmtree().
>
> You can't serialize/migrate arbitrary objects. Consider open TCP
> connections, open files and other objects that extend outside the Python
> VM.

Yes, yes, but that's really beside the point. Yes, there are some objects that
it doesn't even make sense to serialize. But my point is that even in this
slightly smaller set of objects that *can* be serialized (and pickle currently 
does serialize), being able to serialize all of them entails arbitrary code 
execution to deserialize them. To allow people to write their own types that can 
be serialized, you have to let them specify arbitrary callables that will do the 
reconstruction. If you whitelist the possible reconstruction callables, you have 
greatly restricted the types that can participate in the serialization system.

> Also objects hold references to each other, leading to a huge
> reference mesh.
>
> For example:
>
>     a.buddy = b
>     b.buddy = a
>     with open("a", "wb") as f: f.write(serialize(a))
>     with open("b", "wb") as f: f.write(serialize(b))
>
>     with open("a", "rb") as f: aa = deserialize(f.read())
>     with open("b", "rb") as f: bb = deserialize(f.read())
>     assert aa.buddy is bb

Yeah, no one expects that to work. For example, if I deserialize the same string 
twice, you can't expect to get identical returned objects (as in, 
"deserialize(pickle) is deserialize(pickle)"). However, pickle does correctly 
handle fairly arbitrary reference graphs within the context of a single 
serialization, which is the most that can be asked of a serialization system. 
That isn't really a concern here.

>>> class A(object):
...     pass
...
>>> a = A()
>>> b = A()
>>> a.buddy = b
>>> b.buddy = a
>>> data = [a, b]
>>> data[0].buddy is data[1]
True
>>> data[1].buddy is data[0]
True
>>> import cPickle
>>> unpickled = cPickle.loads(cPickle.dumps(data))
>>> unpickled[0].buddy is unpickled[1]
True
>>> unpickled[1].buddy is unpickled[0]
True

-- 
Robert Kern



#92431

From: Gregory Ewing <greg.ewing@canterbury.ac.nz>
Date: 2015-06-11 11:30 +1200
Message-ID: <ctrvkhFfutdU1@mid.individual.net>
In reply to: #92399

Robert Kern wrote:
> To allow people to write their own types that can be serialized, 
> you have to let them specify arbitrary callables that will do the 
> reconstruction. If you whitelist the possible reconstruction callables, 
> you have greatly restricted the types that can participate in the 
> serialization system.

If whitelisting a type is the *only* thing you need to
do to make it serialisable, I think that comes close
enough to the stated goal of being able to "serialise
all [potentially serialisable] language objects".

Having to be explicit about which types are deserialisable
is probably a good thing anyway. It gives you an opportunity
to specify the mapping between the external format and
class names, so that your serialised data doesn't contain
assumptions about implementation details of your program.
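
One way such an explicit mapping might look, as a sketch only. It assumes Python 3 and JSON as the external format; the names REGISTRY, serialisable, dump and load are invented for illustration, and nested objects and reference cycles are deliberately not handled:

```python
import json

REGISTRY = {}  # external tag -> class; only these types are ever rebuilt


def serialisable(tag):
    """Class decorator that whitelists a type under an external tag."""
    def decorator(cls):
        REGISTRY[tag] = cls
        cls._tag = tag
        return cls
    return decorator


@serialisable("point")
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y


def dump(obj):
    # The data carries the tag "point", not a module path or class name.
    return json.dumps({"type": obj._tag, "fields": vars(obj)})


def load(text):
    doc = json.loads(text)
    cls = REGISTRY[doc["type"]]  # KeyError for anything not registered
    return cls(**doc["fields"])


text = dump(Point(1, 2))
p = load(text)
print(text, p.x, p.y)
```

Because the stored tag is independent of where Point lives, the class can be renamed or moved without invalidating existing data, as long as it stays registered under the same tag.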

-- 
Greg



#92438

From: random832@fastmail.us
Date: 2015-06-10 20:47 -0400
Message-ID: <mailman.371.1433983690.13271.python-list@python.org>
In reply to: #92431

On Wed, Jun 10, 2015, at 19:30, Gregory Ewing wrote:
> If whitelisting a type is the *only* thing you need to
> do to make it serialisable, I think that comes close
> enough to the stated goal of being able to "serialise
> all [potentially serialisable] language objects".

IMO the serialization framework should handle this by letting you provide
your own way to look them up (almost but not entirely unlike providing your
own globals table to eval) rather than by having a whitelist.
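
With today's pickle, something in this spirit can be approximated by overriding Unpickler.find_class, which is the documented hook for resolving global names. The sketch below assumes Python 3; LookupUnpickler, loads and the table layout are hypothetical names, not an existing API:

```python
import io
import pickle
from collections import OrderedDict


class LookupUnpickler(pickle.Unpickler):
    """Resolve every global through a caller-supplied table."""

    def __init__(self, file, table):
        super().__init__(file)
        self._table = table

    def find_class(self, module, name):
        try:
            return self._table[(module, name)]
        except KeyError:
            raise pickle.UnpicklingError(
                f"{module}.{name} is not in the lookup table"
            ) from None


def loads(data, table):
    return LookupUnpickler(io.BytesIO(data), table).load()


# The caller decides what each (module, name) pair resolves to.
table = {("collections", "OrderedDict"): OrderedDict}
good = loads(pickle.dumps(OrderedDict(a=1)), table)


class Sneaky:
    def __reduce__(self):
        return (print, ("this would have run",))


try:
    loads(pickle.dumps(Sneaky()), table)
    verdict = "loaded"
except pickle.UnpicklingError:
    verdict = "refused"  # builtins.print is not in the table

print(good, verdict)
```

Since the table maps names to whatever object the caller chooses, the same hook could also point an old (module, name) pair at a class that has since moved.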


