Re: enhancement request: make py3 read/write py2 pickle format

Started by	Chris Angelico <rosuav@gmail.com>
First post	2015-06-10 09:06 +1000
Last post	2015-06-10 11:03 +1000
Articles	20 on this page of 23 — 9 participants

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: enhancement request: make py3 read/write py2 pickle format Chris Angelico <rosuav@gmail.com> - 2015-06-10 09:06 +1000
    Re: enhancement request: make py3 read/write py2 pickle format Irmen de Jong <irmen.NOSPAM@xs4all.nl> - 2015-06-10 02:17 +0200
      Re: enhancement request: make py3 read/write py2 pickle format Devin Jeanpierre <jeanpierreda@gmail.com> - 2015-06-09 17:47 -0700
        Re: enhancement request: make py3 read/write py2 pickle format Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-06-10 13:52 +1000
          Re: enhancement request: make py3 read/write py2 pickle format random832@fastmail.us - 2015-06-09 23:57 -0400
            Re: enhancement request: make py3 read/write py2 pickle format Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-06-10 15:43 +1000
          Re: enhancement request: make py3 read/write py2 pickle format Chris Angelico <rosuav@gmail.com> - 2015-06-10 14:00 +1000
          Re: enhancement request: make py3 read/write py2 pickle format Devin Jeanpierre <jeanpierreda@gmail.com> - 2015-06-09 21:48 -0700
            Re: enhancement request: make py3 read/write py2 pickle format Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-06-10 19:36 +1000
              Re: enhancement request: make py3 read/write py2 pickle format Irmen de Jong <irmen.NOSPAM@xs4all.nl> - 2015-06-10 19:34 +0200
              Re: enhancement request: make py3 read/write py2 pickle format Devin Jeanpierre <jeanpierreda@gmail.com> - 2015-06-10 15:10 -0700
                Re: enhancement request: make py3 read/write py2 pickle format Steven D'Aprano <steve@pearwood.info> - 2015-06-11 13:21 +1000
                  Re: enhancement request: make py3 read/write py2 pickle format Devin Jeanpierre <jeanpierreda@gmail.com> - 2015-06-10 22:39 -0700
                    Re: enhancement request: make py3 read/write py2 pickle format Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-06-11 19:20 +1000
              Re: enhancement request: make py3 read/write py2 pickle format Terry Reedy <tjreedy@udel.edu> - 2015-06-10 19:25 -0400
              Re: enhancement request: make py3 read/write py2 pickle format Devin Jeanpierre <jeanpierreda@gmail.com> - 2015-06-10 16:39 -0700
              Re: enhancement request: make py3 read/write py2 pickle format Devin Jeanpierre <jeanpierreda@gmail.com> - 2015-06-10 16:48 -0700
              Re: enhancement request: make py3 read/write py2 pickle format Terry Reedy <tjreedy@udel.edu> - 2015-06-10 19:46 -0400
              Re: enhancement request: make py3 read/write py2 pickle format Chris Angelico <rosuav@gmail.com> - 2015-06-11 09:58 +1000
              Re: enhancement request: make py3 read/write py2 pickle format Devin Jeanpierre <jeanpierreda@gmail.com> - 2015-06-10 17:02 -0700
                Re: enhancement request: make py3 read/write py2 pickle format Marko Rauhamaa <marko@pacujo.net> - 2015-06-11 07:08 +0300
              Re: enhancement request: make py3 read/write py2 pickle format Serhiy Storchaka <storchaka@gmail.com> - 2015-06-11 14:11 +0300
      Re: enhancement request: make py3 read/write py2 pickle format Chris Angelico <rosuav@gmail.com> - 2015-06-10 11:03 +1000

#92373 — Re: enhancement request: make py3 read/write py2 pickle format

From	Chris Angelico <rosuav@gmail.com>
Date	2015-06-10 09:06 +1000
Subject	Re: enhancement request: make py3 read/write py2 pickle format
Message-ID	<mailman.325.1433891190.13271.python-list@python.org>

On Wed, Jun 10, 2015 at 6:07 AM, Devin Jeanpierre
<jeanpierreda@gmail.com> wrote:
> There's a lot of subtle issues with pickle compatibility. e.g.
> old-style vs new-style classes. It's kinda hard and it's better to
> give up. I definitely agree it's better to use something else instead.
> For example, we switched to using protocol buffers, which have much
> better compatibility properties and are a bit more testable to boot
> (since text format protobufs are always output in a canonical (sorted)
> form.)

Or use JSON, if your data fits within that structure. It's easy to
read and write, it's human-readable, and it's safe (no chance of
arbitrary code execution). Forcing yourself to use a format that can
basically be processed by ast.literal_eval() is a good discipline -
means you don't accidentally save/load too much.
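
A minimal sketch of that point, assuming the data is built only from dicts with string keys, lists, strings, numbers, booleans, and None. The `record` value is made up for illustration. Both `json` and `ast.literal_eval` round-trip it without executing any code:

```python
import ast
import json

# Hypothetical record restricted to types that both formats can represent.
record = {"name": "spam", "count": 3, "tags": ["a", "b"], "ratio": 0.5, "extra": None}

# JSON round trip.
as_json = json.dumps(record, sort_keys=True)
from_json = json.loads(as_json)

# Python-literal round trip: repr() writes, ast.literal_eval() reads.
as_literal = repr(record)
from_literal = ast.literal_eval(as_literal)

print(from_json == record, from_literal == record)
```

Tuples and sets would survive the literal round trip but not the JSON one, since JSON has no such types.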

ChrisA

#92380

From	Irmen de Jong <irmen.NOSPAM@xs4all.nl>
Date	2015-06-10 02:17 +0200
Message-ID	<55778208$0$2899$e4fe514c@news2.news.xs4all.nl>
In reply to	#92373

On 10-6-2015 1:06, Chris Angelico wrote:
> On Wed, Jun 10, 2015 at 6:07 AM, Devin Jeanpierre
> <jeanpierreda@gmail.com> wrote:
>> There's a lot of subtle issues with pickle compatibility. e.g.
>> old-style vs new-style classes. It's kinda hard and it's better to
>> give up. I definitely agree it's better to use something else instead.
>> For example, we switched to using protocol buffers, which have much
>> better compatibility properties and are a bit more testable to boot
>> (since text format protobufs are always output in a canonical (sorted)
>> form.)
> 
> Or use JSON, if your data fits within that structure. It's easy to
> read and write, it's human-readable, and it's safe (no chance of
> arbitrary code execution). Forcing yourself to use a format that can
> basically be processed by ast.literal_eval() is a good discipline -
> means you don't accidentally save/load too much.
> 
> ChrisA
> 

I made a specialized serializer for this, which is more expressive than JSON. It outputs
python literal expressions that can be directly parsed by ast.literal_eval(). You can
find it on pypi (https://pypi.python.org/pypi/serpent). It's the default serializer of
Pyro, and as an added bonus it includes Java and .NET versions as well.


Irmen

#92381

From	Devin Jeanpierre <jeanpierreda@gmail.com>
Date	2015-06-09 17:47 -0700
Message-ID	<mailman.328.1433897321.13271.python-list@python.org>
In reply to	#92380

Passing around data that can be put into ast.literal_eval is
synonymous with passing around data that can be put into eval. It
sounds like a trap.

Other points against JSON / etc.: the lack of schema makes it easier
to stuff anything in there (not as easily as pickle, mind), and by
returning a plain dict, it becomes easier to require a field than to
allow a field to be missing, which is bad for robustness and bad for
data format migrations. (Protobuf (v3) has schemas and gives every
field a default value.)
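
A rough sketch of the robustness concern, assuming a JSON document decoded into a plain dict. The `DEFAULTS` table and `load_settings` helper are hypothetical names, not part of any library. Indexing with `data["key"]` makes every field mandatory, so tolerating a missing field has to be done by hand:

```python
import json

# Hypothetical defaults, standing in for what a schema would declare.
DEFAULTS = {"width": 10.0, "landscape": False}

def load_settings(text):
    """Decode JSON and fill in defaults for any field that is absent."""
    data = json.loads(text)
    # data["width"] would raise KeyError for older files that lack the field;
    # dict.get with a default keeps such files loadable.
    return {key: data.get(key, default) for key, default in DEFAULTS.items()}

settings = load_settings('{"width": 3}')
print(settings)
```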

For human readable serialized data, text format protocol buffers are
seriously underrated. (Relatedly: underdocumented, too.)

/me lifts head out of kool-aid and gasps for air

-- Devin


#92384

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-06-10 13:52 +1000
Message-ID	<5577b46d$0$12975$c3e8da3$5496439d@news.astraweb.com>
In reply to	#92381

On Wednesday 10 June 2015 10:47, Devin Jeanpierre wrote:

> Passing around data that can be put into ast.literal_eval is
> synonymous with passing around data that can be put into eval. It
> sounds like a trap.

In what way?

literal_eval will cleanly and safely refuse to evaluate strings like:

    "len(None)"
    "100**100**100"
    "__import__('os').system('rm this')"

and so on, which makes it significantly safer when given untrusted data. I 
suppose that one might be able to perform a DOS attack by passing it:

    "1000 ... 0"

where the ... represents, say, a gigabyte of zeroes, but if an attacker has 
the ability to feed you gigabytes of data, they don't need literal_eval to 
DOS you.
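
The refusals listed above can be checked with a short sketch. The `is_rejected` helper is a hypothetical name introduced here for illustration:

```python
import ast

# The three strings quoted in the post above.
suspicious = [
    "len(None)",
    "100**100**100",
    "__import__('os').system('rm this')",
]

def is_rejected(text):
    """Return True if ast.literal_eval refuses to evaluate text."""
    try:
        ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return True
    return False

results = [is_rejected(text) for text in suspicious]
print(results)
```

The standard library documentation does caution that very large or deeply nested input can still exhaust memory or the interpreter stack, which matches the denial-of-service caveat above.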

If you can think of an actual attack against literal_eval, please tell us or 
report it, so it can be fixed.

> For human readable serialized data, text format protocol buffers are
> seriously underrated. (Relatedly: underdocumented, too.)

Ironically, literal_eval is designed to process text-format protocols using 
human-readable Python syntax for common data types like int, str, and dict.

-- 
Steve

#92385

From	random832@fastmail.us
Date	2015-06-09 23:57 -0400
Message-ID	<mailman.330.1433908633.13271.python-list@python.org>
In reply to	#92384

On Tue, Jun 9, 2015, at 23:52, Steven D'Aprano wrote:
> > For human readable serialized data, text format protocol buffers are
> > seriously underrated. (Relatedly: underdocumented, too.)
> 
> Ironically, literal_eval is designed to process text-format protocols
> using 
> human-readable Python syntax for common data types like int, str, and
> dict.

"protocol buffers" is the name of a specific tool.

#92388

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-06-10 15:43 +1000
Message-ID	<5577ce71$0$12998$c3e8da3$5496439d@news.astraweb.com>
In reply to	#92385

On Wednesday 10 June 2015 13:57, random832@fastmail.us wrote:

> On Tue, Jun 9, 2015, at 23:52, Steven D'Aprano wrote:
>> > For human readable serialized data, text format protocol buffers are
>> > seriously underrated. (Relatedly: underdocumented, too.)
>> 
>> Ironically, literal_eval is designed to process text-format protocols
>> using
>> human-readable Python syntax for common data types like int, str, and
>> dict.
> 
> "protocol buffers" is the name of a specific tool.

It is? It sounds like a generic term for, you know, a buffer used by a 
protocol. I live and learn.

https://developers.google.com/protocol-buffers/docs/pythontutorial

You have to:

- write a data template, in a separate file; just don't call it a schema, 
because this isn't XML;

- don't forget the technically-optional-but-recommended (and required if you 
use other languages) "package" header, which is completely redundant in 
Python;

- run a separate compiler over that template, which will generate Python 
classes for you; just don't think that these classes are first class 
citizens that you can extend using inheritance, because they're not;

- import the generated module containing those classes;

- and now you have your very own private pickle-like format, yay!


I'm sure that this has its uses for big, complex projects, but for 
lightweight needs, it seems over-engineered and unPythonic.

-- 
Steve

#92386

From	Chris Angelico <rosuav@gmail.com>
Date	2015-06-10 14:00 +1000
Message-ID	<mailman.331.1433908864.13271.python-list@python.org>
In reply to	#92384

On Wed, Jun 10, 2015 at 1:57 PM,  <random832@fastmail.us> wrote:
> On Tue, Jun 9, 2015, at 23:52, Steven D'Aprano wrote:
>> > For human readable serialized data, text format protocol buffers are
>> > seriously underrated. (Relatedly: underdocumented, too.)
>>
>> Ironically, literal_eval is designed to process text-format protocols
>> using
>> human-readable Python syntax for common data types like int, str, and
>> dict.
>
> "protocol buffers" is the name of a specific tool.

Yes, it is. But the point is that literal_eval, JSON, and other such
tools are _also_ text-format protocols that serialize to/from human
readable data. I'm not sure what the advantage of protocol buffers is,
but it's not like "human readable" is such a rarity. (It is still a
strike against pickle.)

ChrisA

#92387

From	Devin Jeanpierre <jeanpierreda@gmail.com>
Date	2015-06-09 21:48 -0700
Message-ID	<mailman.332.1433911768.13271.python-list@python.org>
In reply to	#92384

On Tue, Jun 9, 2015 at 8:52 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Wednesday 10 June 2015 10:47, Devin Jeanpierre wrote:
>
>> Passing around data that can be put into ast.literal_eval is
>> synonymous with passing around data that can be put into eval. It
>> sounds like a trap.
>
> In what way?

I misspoke, and instead of "synonymous", meant "also means".
(Implication, not equivalence.)

>> For human readable serialized data, text format protocol buffers are
>> seriously underrated. (Relatedly: underdocumented, too.)
>
> Ironically, literal_eval is designed to process text-format protocols using
> human-readable Python syntax for common data types like int, str, and dict.

"Protocol buffers" are a specific technology, not an abstract concept,
and literal_eval is not a great idea.

* the common serializer (repr) does not output a canonical form, and
  can serialize things in a way that they can't be deserialized
* there is no schema
* there is no well understood migration story for when the data you
  load and store changes
* it is not usable from other programming languages
* it encourages the use of eval when literal_eval becomes inconvenient
  or insufficient
* It is not particularly well specified or documented compared to the
  alternatives.
* The types you get back differ in python 2 vs 3

For most apps, the alternatives are better. Irmen's serpent library is
strictly better on every front, for example. (Except potentially
security, who knows.)

At least it's better than pickle, security wise. Reliability wise,
repr is a black hole, so no dice. :(

-- Devin

#92392

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-06-10 19:36 +1000
Message-ID	<5578053a$0$11102$c3e8da3@news.astraweb.com>
In reply to	#92387

On Wednesday 10 June 2015 14:48, Devin Jeanpierre wrote:

[...]
> and literal_eval is not a great idea.
> 
> * the common serializer (repr) does not output a canonical form, and
>   can serialize things in a way that they can't be deserialized

For literals, the canonical form is that understood by Python. I'm pretty 
sure that these have been stable since the days of Python 1.0, and will 
remain so pretty much forever:

ints: 12345
floats: 1.2345
strings: "spam"
None
True
False
lists, tuples, dicts and sets containing the above

There may be a few differences between Python 2 and 3, e.g. no set literal 
in Python 2, but in general the Python syntax is well-known and understood 
by anyone programming in Python.

> * there is no schema
> * there is no well understood migration story for when the data you
>   load and store changes

literal_eval is not a serialisation format itself. It is a primitive 
operation usable when serialising. E.g. you might write out a simple Unix-
style rc file of key:value pairs:

length=23.45
width=10.95
landscape=False

split on "=" and call literal_eval on the value.

This is a perfectly reasonable light-weight solution for simple 
serialisation needs.

> * it is not usable from other programming languages

That's okay, we're not writing in other programming languages :-)

> * it encourages the use of eval when literal_eval becomes inconvenient
>   or insufficient

I don't think so. I think that people who make the effort to import ast and 
call ast.literal_eval are fully aware of the dangers of eval and aren't 
silly enough to start using eval.

> * It is not particularly well specified or documented compared to the
>   alternatives.
> * The types you get back differ in python 2 vs 3

Doesn't matter. The types you *write* are different in Python 2 vs 3, so of 
course you do.

> For most apps, the alternatives are better. Irmen's serpent library is
> strictly better on every front, for example. (Except potentially
> security, who knows.)

Beyond simple needs, like rc files, literal_eval is not sufficient. You 
can't use it to deserialise arbitrary objects. That might be a feature, but 
if you need something more powerful than basic ints, floats, strings and a 
few others, literal_eval will not be powerful enough.

I think we are in violent agreement :-)

-- 
Steve

#92412

From	Irmen de Jong <irmen.NOSPAM@xs4all.nl>
Date	2015-06-10 19:34 +0200
Message-ID	<5578750e$0$2950$e4fe514c@news2.news.xs4all.nl>
In reply to	#92392

On 10-6-2015 11:36, Steven D'Aprano wrote:
>> For most apps, the alternatives are better. Irmen's serpent library is
>> strictly better on every front, for example. (Except potentially
>> security, who knows.)
> 
> Beyond simple needs, like rc files, literal_eval is not sufficient. You 
> can't use it to deserialise arbitrary objects. That might be a feature, but 
> if you need something more powerful than basic ints, floats, strings and a 
> few others, literal_eval will not be powerful enough.

Just to have this off my chest:

I guess that "serialization format" is not the most correct term for what serpent does
(or in general, for the literal expressions that literal_eval accepts). Serpent doesn't
strive to (de)serialize everything perfectly. It is meant as a pythonic data transfer
format.

You can do this by explicitly mapping your application's object model to and from the
wire data format, or do it in a more pythonic way (IMO) and let python take care of most
of it automatically. Serpent is smart (I hope) about a number of non-primitive types. If
needed, use its hooks to teach it about types it doesn't readily recognize.
Yes, it does force you to reduce the arbitrary types you want to process to the set of
types that are accepted in a python literal expression. Thankfully lists, sets, tuples
and dicts are also among them.

Raison d'être for serpent is that I was looking for a safe pythonic alternative for
pickle, and with fewer limitations than JSON. I chose to use ast.literal_eval from the
standard library to do the "deserialization" for me, and so only had to build some code
to "serialize" object trees into python literal expressions :)

Regarding security: I simply trust the docstring of ast.literal_eval here;
"Safely evaluate an expression node or a string containing a Python expression. [...]"

Irmen

#92427

From	Devin Jeanpierre <jeanpierreda@gmail.com>
Date	2015-06-10 15:10 -0700
Message-ID	<mailman.361.1433974273.13271.python-list@python.org>
In reply to	#92392

FWIW most of the objections below also apply to JSON, so this doesn't
just have to be about repr/literal_eval. I'm definitely a huge
proponent of widespread use of something like protocol buffers, both
for production code and personal hacky projects.

On Wed, Jun 10, 2015 at 2:36 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Wednesday 10 June 2015 14:48, Devin Jeanpierre wrote:
>
> [...]
>> and literal_eval is not a great idea.
>>
>> * the common serializer (repr) does not output a canonical form, and
>>   can serialize things in a way that they can't be deserialized
>
> For literals, the canonical form is that understood by Python. I'm pretty
> sure that these have been stable since the days of Python 1.0, and will
> remain so pretty much forever:

The problem is that there are two different ways repr might write out
a dict equal to {'a': 1, 'b': 2}. This can make tests brittle -- e.g.
it's why doctest fails badly at examples involving dictionaries. Text
format protocol buffers output everything sorted, so that you can do
textual diffs for compatibility tests and such.
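
A small sketch of the canonical-form issue, using `json.dumps(..., sort_keys=True)` as a stand-in for a serializer with sorted output (it is not the protobuf text format). It assumes Python 3.7 or later, where dicts keep insertion order and `repr` follows it:

```python
import json

first = {"a": 1, "b": 2}
second = {"b": 2, "a": 1}

# Equal as dicts, but repr() follows insertion order, so the text can differ.
same_value = first == second
same_repr = repr(first) == repr(second)

# Sorting the keys gives one textual form per value, which suits
# golden files and textual diffs.
canonical_first = json.dumps(first, sort_keys=True)
canonical_second = json.dumps(second, sort_keys=True)

print(same_value, same_repr, canonical_first == canonical_second)
```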

At work, one thing we do in places is mock out services using "golden"
expected protobuf responses, so that you can test that the server
returns exactly that, and test what the client does with that,
separately. These are checked into perforce in text format.

>> * there is no schema
>> * there is no well understood migration story for when the data you
>>   load and store changes
>
> literal_eval is not a serialisation format itself. It is a primitive
> operation usable when serialising. E.g. you might write out a simple Unix-
> style rc file of key:value pairs:
>
-snip-
>
> split on "=" and call literal_eval on the value.
>
> This is a perfectly reasonable light-weight solution for simple
> serialisation needs.

I could spend a bunch of time writing yet another config file format,
or I could use text format protocol buffers, YAML, or TOML and call it
a day.

>> * it encourages the use of eval when literal_eval becomes inconvenient
>>   or insufficient
>
> I don't think so. I think that people who make the effort to import ast and
> call ast.literal_eval are fully aware of the dangers of eval and aren't
> silly enough to start using eval.

The problem is when you have your config file format using python
literals, and another programmer wants to deal with it and doesn't
look at your codebase, and things like that. When transferring data,
this can happen a lot, since you are often not the user of the data
you wrote, and you can't control how others consume it. They might use
eval even if you didn't mean for them to. For example, in JavaScript,
this was once a common problem for services exposing JSON, and it
still happens even now.

>> * It is not particularly well specified or documented compared to the
>>   alternatives.
>> * The types you get back differ in python 2 vs 3
>
> Doesn't matter. The types you *write* are different in Python 2 vs 3, so of
> course you do.

In a shared 2/3 codebase, if I write bytes I expect to get bytes, and
if I write unicode I expect to get unicode. (There is a third category
of thing, which should be bytes on 2.x and string on 3.x, but it's
probably best to handle that outside of the deserializer). If you
thread it through repr and literal_eval using different versions for
each, unicode in python 3 becomes bytes in python 2, and vice versa.
So it makes migrating to Python 3 even harder.
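
A sketch of that mismatch, meant to be run on Python 3. The two input strings are what Python 2's `repr()` writes for a byte string and a unicode string; that part is stated here rather than executed:

```python
import ast

# Text as Python 2's repr() would have written it.
py2_repr_of_bytes = "'abc'"     # repr('abc') on Python 2, where 'abc' is bytes
py2_repr_of_unicode = "u'abc'"  # repr(u'abc') on Python 2

# Read back on Python 3: both come out as str (unicode).
type_from_bytes = type(ast.literal_eval(py2_repr_of_bytes))
type_from_unicode = type(ast.literal_eval(py2_repr_of_unicode))

# Only an explicit b'' prefix yields bytes on Python 3.
type_from_prefixed = type(ast.literal_eval("b'abc'"))

print(type_from_bytes, type_from_unicode, type_from_prefixed)
```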

>> For most apps, the alternatives are better. Irmen's serpent library is
>> strictly better on every front, for example. (Except potentially
>> security, who knows.)
>
> Beyond simple needs, like rc files, literal_eval is not sufficient. You
> can't use it to deserialise arbitrary objects. That might be a feature, but
> if you need something more powerful than basic ints, floats, strings and a
> few others, literal_eval will not be powerful enough.

No, it is powerful enough. After all, JSON has the same limitations.
Protobuf only adds enums and structs to JSON's types, and it's
potentially the most-used serialization format in the world by
operations per second.

Serialization libraries/formats usually need handholding to serialize
complex Python objects into simple serializable types. [Except pickle,
and that's the very reason it's insecure (per previous discussion in
thread.)]

-- Devin

#92446

From	Steven D'Aprano <steve@pearwood.info>
Date	2015-06-11 13:21 +1000
Message-ID	<5578feb6$0$12984$c3e8da3$5496439d@news.astraweb.com>
In reply to	#92427

On Thu, 11 Jun 2015 08:10 am, Devin Jeanpierre wrote:

[...]
>> For literals, the canonical form is that understood by Python. I'm pretty
>> sure that these have been stable since the days of Python 1.0, and will
>> remain so pretty much forever:
> 
> The problem is that there are two different ways repr might write out
> a dict equal to {'a': 1, 'b': 2}. This can make tests brittle -- e.g.
> it's why doctest fails badly at examples involving dictionaries. 

Only if they are badly written.

Yes, dicts are *less convenient* for doctests, but if they fail, the blame
is on the author of the tests themselves, not doctest.

Unordered output is not a problem for dicts, because dicts also have
unordered *input*. It doesn't matter whether you input {'a':1,'b':2} or
{'b':2,'a':1}, you will get the same dict either way.

[...]
> I could spend a bunch of time writing yet another config file format,
> or I could use text format protocol buffers, YAML, or TOML and call it
> a day.

Writing a rc parser is so trivial that it's almost easier to just write it
than it is to look up the APIs for YAML or JSON, to say nothing of the
rigmarole of defining a protocol buffer config file, compiling it,
importing the module, and using that.

import ast
import collections

def read(configfile):
    config = collections.OrderedDict()
    with open(configfile) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and comments.
            if not line or line.startswith("#"):
                continue
            key, value = line.split("=", 1)
            key = key.rstrip()
            value = value.lstrip()
            config[key] = ast.literal_eval(value)
    return config

That's a basic, *but acceptable*, rc parser written in literally under a
minute. At the risk of ending up with egg on my face, I reckon that it's so
simple and so obviously correct that I can tell it works correctly without
even testing it. (Famous last words, huh?)

Unlike any of the richer, more powerful serialisation formats like YAML,
JSON, or protocol buffers, it's not only human readable but human writable
too. By which I mean, while it is *possible* for a sufficiently motivated
person to write correctly formatted JSON, YAML or even XML, it's not really
something you would choose to do willingly. But Unix sys admins hand-edit
rc files every day.

But of course this also means it's less powerful and can deal with few types
of data. Power comes at a cost of complexity, and simplicity itself can be
a virtue. I wouldn't use JSON etc. for config files until I was sure that a
simpler INI or RC file wasn't sufficient for my needs.

Somehow I have drifted away from serialisation in general to specifically
config files... never mind.

[...]
> The problem is when you have your config file format using python
> literals, and another programmer wants to deal with it and doesn't
> look at your codebase, and things like that. When transferring data,
> this can happen a lot, since you are often not the user of the data
> you wrote, and you can't control how others consume it. 

Not only can I not control how they consume it, but I don't care how they
consume it :-)

I hear what you are saying, and I don't disagree with it. I'm just standing
up for simplicity as a virtue when appropriate. If I'm writing a script to
save a bunch of values to pass to another script after some human editing,
it's faster for me to just write out the key:value pairs than it is to
learn how to use protocol buffer, deal with a separate compilation step,
etc. It's actually easier to write out, and read in, the key:values than to
use the configparser module. If you don't need multiple sections, default
values, or variable interpolation, even configparser is overkill.

But if I'm swapping data with others, or if I have to use a richer set of
types or functionality, then naturally I'm going to need something more
powerful, preferably something standard so I don't have to document the
internal format, just say "use XML with this schema" or whatever.

> They might use 
> eval even if you didn't mean for them to. For example, in JavaScript,
> this was once a common problem for services exposing JSON, and it
> still happens even now.

<shrug> If they choose to use eval, *that's not my fault*. You can't stop
them from deserialising your data and then passing any and all strings to
eval, so why should I be expected to stop them from something similar?

[...]
>> Beyond simple needs, like rc files, literal_eval is not sufficient. You
>> can't use it to deserialise arbitrary objects. That might be a feature,
>> but if you need something more powerful than basic ints, floats, strings
>> and a few others, literal_eval will not be powerful enough.
> 
> No, it is powerful enough. After all, JSON has the same limitations.

In the sense that you can build arbitrary objects from a combination of a
few basic types, yes, literal_eval is "powerful enough" if you are prepared
to re-invent JSON, YAML, or protocol buffer.

But I'm not talking about re-inventing what already exists. If I want JSON,
I'll use JSON, not spend weeks or months re-writing it from scratch. I
can't do this:

class MyClass:
    pass

a = MyClass()
serialised = repr(a)
b = ast.literal_eval(serialised)
assert a == b

which is what I mean when I say literal_eval isn't powerful enough to handle
arbitrary types. That's not a bug, that's a feature of literal_eval. It is
*designed* to have that limitation.
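
A short sketch of that limitation. The default `repr` of an instance is not a Python literal, so `ast.literal_eval` rejects it instead of rebuilding the object:

```python
import ast

class MyClass(object):
    pass

# Something like "<__main__.MyClass object at 0x...>", which is not valid
# literal syntax.
text = repr(MyClass())

try:
    ast.literal_eval(text)
    round_tripped = True
except (ValueError, SyntaxError):
    round_tripped = False

print(round_tripped)
```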

-- 
Steven

#92455

From	Devin Jeanpierre <jeanpierreda@gmail.com>
Date	2015-06-10 22:39 -0700
Message-ID	<mailman.382.1434001230.13271.python-list@python.org>
In reply to	#92446

Snipped aplenty.

On Wed, Jun 10, 2015 at 8:21 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> On Thu, 11 Jun 2015 08:10 am, Devin Jeanpierre wrote:
> [...]
>> I could spend a bunch of time writing yet another config file format,
>> or I could use text format protocol buffers, YAML, or TOML and call it
>> a day.
>
> Writing a rc parser is so trivial that it's almost easier to just write it
> than it is to look up the APIs for YAML or JSON, to say nothing of the
> rigmarole of defining a protocol buffer config file, compiling it,
> importing the module, and using that.
>
-snip
>
> That's a basic, *but acceptable*, rc parser written in literally under a
> minute. At the risk of ending up with egg on my face, I reckon that it's so
> simple and so obviously correct that I can tell it works correctly without
> even testing it. (Famous last words, huh?)

I won't try to egg you. That said, you have to write tests. Also,
everyone who uses it has to learn the format and API, and it may have
corner cases you aren't aware of, it has to get ported to python 3 if
you wrote it for python 2, the parsing errors are obscure and might
need improvement, and so on. There's a place for this, but I suspect
it is small compared to the place where it seemed like a good idea at
the time.
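
As a sketch of the "obscure parsing errors" point, here is one way such a parser could report the offending line. `parse_rc` is a hypothetical helper that works on a string rather than a file; it is not the code from the earlier post:

```python
import ast

def parse_rc(text):
    """Parse 'key = literal' lines, naming the line number on failure."""
    config = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        key = key.strip()
        if not sep:
            raise ValueError("line %d: expected key=value, got %r" % (lineno, raw))
        try:
            config[key] = ast.literal_eval(value.strip())
        except (ValueError, SyntaxError) as exc:
            raise ValueError("line %d: bad value for %r: %s" % (lineno, key, exc))
    return config

example = parse_rc("length=23.45\n# a comment\n\nlandscape=False\n")
print(example)
```

Even this small variant has corner cases to decide on, such as duplicate keys and empty values, which is part of the maintenance cost being described.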

>>> Beyond simple needs, like rc files, literal_eval is not sufficient. You
>>> can't use it to deserialise arbitrary objects. That might be a feature,
>>> but if you need something more powerful than basic ints, floats, strings
>>> and a few others, literal_eval will not be powerful enough.
>>
>> No, it is powerful enough. After all, JSON has the same limitations.
>
> In the sense that you can build arbitrary objects from a combination of a
> few basic types, yes, literal_eval is "powerful enough" if you are prepared
> to re-invent JSON, YAML, or protocol buffer.
>
> But I'm not talking about re-inventing what already exists. If I want JSON,
> I'll use JSON, not spend weeks or months re-writing it from scratch. I
> can't do this:
>
> class MyClass:
>     pass
>
> a = MyClass()
> serialised = repr(a)
> b = ast.literal_eval(serialised)
> assert a == b

I don't understand. You can't do that in JSON, YAML, XML, or protocol
buffers, either. They only provide a small set of types, comparable to
(but smaller than) the set of types you get from literal_eval/repr.
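
A minimal sketch of that type-coverage difference, assuming Python 3 and
only the standard library. The sample value and variable names are made up
for illustration.

```python
import ast
import json

# A value built only from literals that repr() can write out.
value = {"point": (1, 2), "tags": {"a"}, "raw": b"\x00\x01", "z": 1 + 2j}

# repr()/ast.literal_eval round-trips tuples, sets, bytes and complex numbers.
restored = ast.literal_eval(repr(value))
assert restored == value

# JSON has no tuple type: the tuple comes back as a list.
json_point = json.loads(json.dumps({"point": (1, 2)}))
assert json_point == {"point": [1, 2]}

# JSON has no set type at all: the standard encoder refuses it.
try:
    json.dumps({"tags": {"a"}})
    set_is_json_serialisable = True
except TypeError:
    set_is_json_serialisable = False
```

Neither approach handles an arbitrary class instance without extra work.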

-- Devin


#92459

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-06-11 19:20 +1000
Message-ID	<557952eb$0$21718$c3e8da3@news.astraweb.com>
In reply to	#92455

On Thursday 11 June 2015 15:39, Devin Jeanpierre wrote:

>> But I'm not talking about re-inventing what already exists. If I want
>> JSON, I'll use JSON, not spend weeks or months re-writing it from
>> scratch. I can't do this:
>>
>> class MyClass:
>>     pass
>>
>> a = MyClass()
>> serialised = repr(a)
>> b = ast.literal_eval(serialised)
>> assert a == b
> 
> I don't understand. You can't do that in JSON, YAML, XML, or protocol
> buffers, either. They only provide a small set of types, comparable to
> (but smaller than) the set of types you get from literal_eval/repr.

Well, what do people do when they want to serialise something like MyClass, 
but have to use (say) JSON rather than pickle?

I'd write a method to export enough information (as JSON) to reconstruct the 
instance, and another method to take that JSON and build an instance. If I'm 
going to do all that, *I would use JSON* rather than try to create my own 
format invented from scratch using only literal_eval.
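
A sketch of what such a pair of methods might look like, assuming Python 3.
The attribute names (name, size) and the method names to_json/from_json are
hypothetical, chosen only for illustration.

```python
import json


class MyClass:
    def __init__(self, name, size):
        self.name = name
        self.size = size

    def __eq__(self, other):
        # Two instances are equal when their attributes are equal.
        return isinstance(other, MyClass) and vars(self) == vars(other)

    def to_json(self):
        """Export just enough state to rebuild the instance."""
        return json.dumps({"name": self.name, "size": self.size}, sort_keys=True)

    @classmethod
    def from_json(cls, text):
        """Build a new instance from the output of to_json()."""
        data = json.loads(text)
        return cls(data["name"], data["size"])


a = MyClass("spam", 3)
b = MyClass.from_json(a.to_json())
assert a == b
```

The class decides which attributes go on the wire, so nothing outside that
explicit list is ever reconstructed.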

Although... I suppose if I really wanted to be quick and dirty about it...

py> import ast
py> class MyClass(object): pass
... 
py> a = MyClass()
py> s = repr(a.__dict__)
py> b = object.__new__(MyClass)
py> b.__dict__ = ast.literal_eval(s)
py> b
<__main__.MyClass object at 0xb725218c>

;-)

-- 
Steve


#92430

From	Terry Reedy <tjreedy@udel.edu>
Date	2015-06-10 19:25 -0400
Message-ID	<mailman.364.1433978778.13271.python-list@python.org>
In reply to	#92392

On 6/10/2015 6:10 PM, Devin Jeanpierre wrote:

> The problem is that there are two different ways repr might write out
> a dict equal to {'a': 1, 'b': 2}. This can make tests brittle

Not if one compares objects rather than string representations of 
objects.  I am strongly of the view that code and tests should be 
written to directly compare objects as much as possible.

> it's why doctest fails badly at examples involving dictionaries.

or sets or addresses or object ids or locale-dependent strings or random 
numbers or values dependent on random numbers.
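
A small illustration of the difference, assuming CPython 3.7 or later, where
dicts keep insertion order. On older versions the repr order depended on
hashing instead, but the brittleness is the same.

```python
d1 = {"a": 1, "b": 2}
d2 = {"b": 2, "a": 1}  # same contents, different insertion order

# Comparing objects: the order the keys were added in does not matter.
assert d1 == d2

# Comparing string representations: the two reprs list the keys in
# different orders, so a test written against the text would fail.
reprs_match = repr(d1) == repr(d2)
```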

-- 
Terry Jan Reedy


#92432

From	Devin Jeanpierre <jeanpierreda@gmail.com>
Date	2015-06-10 16:39 -0700
Message-ID	<mailman.365.1433979632.13271.python-list@python.org>
In reply to	#92392

On Wed, Jun 10, 2015 at 4:25 PM, Terry Reedy <tjreedy@udel.edu> wrote:
> On 6/10/2015 6:10 PM, Devin Jeanpierre wrote:
>
>> The problem is that there are two different ways repr might write out
>> a dict equal to {'a': 1, 'b': 2}. This can make tests brittle
>
>
> Not if one compares objects rather than string representations of objects.
> I am strongly of the view that code and tests should be written to directly
> compare objects as much as possible.

For serialization formats that always output the same string for the
same data (like text format protos), there is no practical difference
between the two, except that if you're comparing text, you can easily
supply a diff to update one to match the other.

-- Devin


#92433

From	Devin Jeanpierre <jeanpierreda@gmail.com>
Date	2015-06-10 16:48 -0700
Message-ID	<mailman.366.1433980134.13271.python-list@python.org>
In reply to	#92392

On Wed, Jun 10, 2015 at 4:39 PM, Devin Jeanpierre
<jeanpierreda@gmail.com> wrote:
> On Wed, Jun 10, 2015 at 4:25 PM, Terry Reedy <tjreedy@udel.edu> wrote:
>> On 6/10/2015 6:10 PM, Devin Jeanpierre wrote:
>>
>>> The problem is that there are two different ways repr might write out
>>> a dict equal to {'a': 1, 'b': 2}. This can make tests brittle
>>
>>
>> Not if one compares objects rather than string representations of objects.
>> I am strongly of the view that code and tests should be written to directly
>> compare objects as much as possible.
>
> For serialization formats that always output the same string for the
> same data (like text format protos), there is no practical difference
> between the two, except that if you're comparing text, you can easily
> supply a diff to update one to match the other.

Ugh, there's also the fiddly difference between what goes in and what
you read. A serialized data structure might contain lots of data that
is ignored by the deserializer (in protobuf), or it might contain data
which can't be loaded by the deserializer or produces weird /
incorrect results. Being able to inspect and test the serialized data
separately from the deserialized data is useful in that regard, so
that you know where the failure lies, but it's sort of fuzzy.

Some examples of where this crops up: pickles after you've moved a
class, JSON encoders that try to be clever and output invalid JSON,
protocol buffers with unexpected fields.
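
A rough sketch of the "unexpected fields" case, using JSON in place of
protocol buffers to stay within the standard library. load_user and the
field names are hypothetical.

```python
import json


def load_user(text):
    """Hypothetical deserializer: keeps only the fields it knows about."""
    data = json.loads(text)
    return {"name": data["name"], "id": data["id"]}


old_wire = '{"id": 7, "name": "alice"}'
new_wire = '{"id": 7, "name": "alice", "nickname": "al"}'

# Object-level comparison passes: the extra field is silently dropped.
assert load_user(old_wire) == load_user(new_wire)

# Text-level comparison shows that what was sent actually changed.
wire_changed = old_wire != new_wire
```

Checking both levels separates "the sender changed" from "the receiver
misread it".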

Overall, though, the diff thing is probably the bigger reason everyone
wants to do this sort of thing with serialized data. If you do it
right and are principled about it, I don't see a problem with it.

-- Devin


#92434

From	Terry Reedy <tjreedy@udel.edu>
Date	2015-06-10 19:46 -0400
Message-ID	<mailman.367.1433980205.13271.python-list@python.org>
In reply to	#92392

On 6/10/2015 7:39 PM, Devin Jeanpierre wrote:
> On Wed, Jun 10, 2015 at 4:25 PM, Terry Reedy <tjreedy@udel.edu> wrote:
>> On 6/10/2015 6:10 PM, Devin Jeanpierre wrote:
>>
>>> The problem is that there are two different ways repr might write out
>>> a dict equal to {'a': 1, 'b': 2}. This can make tests brittle

You commented about *tests*

>> Not if one compares objects rather than string representations of objects.
>> I am strongly of the view that code and tests should be written to directly
>> compare objects as much as possible.

I responded about *tests*

> For serialization formats that always output the same string for the
> same data (like text format protos), there is no practical difference
> between the two, except that if you're comparing text, you can easily
> supply a diff to update one to match the other.

Serialization is a different issue.

-- 
Terry Jan Reedy


#92435

From	Chris Angelico <rosuav@gmail.com>
Date	2015-06-11 09:58 +1000
Message-ID	<mailman.368.1433980692.13271.python-list@python.org>
In reply to	#92392

On Thu, Jun 11, 2015 at 8:10 AM, Devin Jeanpierre
<jeanpierreda@gmail.com> wrote:
> The problem is that there are two different ways repr might write out
> a dict equal to {'a': 1, 'b': 2}. This can make tests brittle -- e.g.
> it's why doctest fails badly at examples involving dictionaries. Text
> format protocol buffers output everything sorted, so that you can do
> textual diffs for compatibility tests and such.

With Python's JSON module [1], you can pass sort_keys=True to
stipulate that the keys be lexically ordered, which should make the
output "canonical". Pike's Standards.JSON.encode() [2] can take a flag
value to canonicalize the output, which currently has the same effect
(sort mappings by their indices). I did a quick check for Ruby and
didn't find anything in its standard library JSON module [3], but knowing
Ruby, it'll be available somewhere in a gem. A web search for 'perl
json' brought up a CPAN link [4] that has a canonical option for
sorting by keys. So that's three out of four definite, one uncertain,
where it's pretty easy to ensure that you get byte-for-byte identical
output from a JSON encoder.
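
For the Python case, a minimal sketch using only the standard json module.
The sample data is made up.

```python
import json

# Two equal structures whose keys were inserted in different orders.
first = {"user": {"name": "alice", "id": 7}, "active": True}
second = {"active": True, "user": {"id": 7, "name": "alice"}}

# sort_keys=True orders keys at every nesting level, so equal data
# always serialises to the same string.
canonical_first = json.dumps(first, sort_keys=True)
canonical_second = json.dumps(second, sort_keys=True)
assert canonical_first == canonical_second
```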

Even though failing doctests are a separate problem, it's useful to
have canonical output. Your diffs get less noisy, for instance.
Coupled with a human-readability flag (eg "indent=4" in Python,
"Standards.JSON.HUMAN_READABLE" in Pike) that splits the result over
multiple lines, it can make a pretty easy-to-diff file. Definitely
worth doing... and definitely worth using a JSON encoder rather than
repr().
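
A sketch of how quiet the diff gets with sorted keys plus indent=4, assuming
Python 3 and the standard difflib module. The helper name canonical and the
sample data are illustrative only.

```python
import difflib
import json

old = {"name": "spam", "size": 3, "tags": ["x", "y"]}
new = {"tags": ["x", "y"], "size": 4, "name": "spam"}  # reordered, one change


def canonical(obj):
    """Serialise with sorted keys, one item per line."""
    return json.dumps(obj, sort_keys=True, indent=4)


diff = list(difflib.unified_diff(
    canonical(old).splitlines(),
    canonical(new).splitlines(),
    fromfile="old.json",
    tofile="new.json",
    lineterm="",
))

# Keep only the added/removed lines, skipping the ---/+++ file headers.
changed = [
    line for line in diff
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
print("\n".join(diff))
```

Only the "size" line appears as changed. The key reordering produces no
noise at all.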

ChrisA

[1] https://docs.python.org/3/library/json.html#json.dump
[2] http://pike.lysator.liu.se/generated/manual/modref/ex/predef_3A_3A/Standards/JSON.html
[3] http://ruby-doc.org/stdlib-2.0.0/libdoc/json/rdoc/JSON.html
[4] http://search.cpan.org/~makamaka/JSON-2.90/lib/JSON.pm#PERL_-%3E_JSON


#92436

From	Devin Jeanpierre <jeanpierreda@gmail.com>
Date	2015-06-10 17:02 -0700
Message-ID	<mailman.369.1433980969.13271.python-list@python.org>
In reply to	#92392

On Wed, Jun 10, 2015 at 4:46 PM, Terry Reedy <tjreedy@udel.edu> wrote:
> On 6/10/2015 7:39 PM, Devin Jeanpierre wrote:
>>
>> On Wed, Jun 10, 2015 at 4:25 PM, Terry Reedy <tjreedy@udel.edu> wrote:
>>>
>>> On 6/10/2015 6:10 PM, Devin Jeanpierre wrote:
>>>
>>>> The problem is that there are two different ways repr might write out
>>>> a dict equal to {'a': 1, 'b': 2}. This can make tests brittle
>
>
> You commented about *tests*
>
>>> Not if one compares objects rather than string representations of
>>> objects.
>>> I am strongly of the view that code and tests should be written to
>>> directly
>>> compare objects as much as possible.
>
>
> I responded about *tests*
>
>> For serialization formats that always output the same string for the
>> same data (like text format protos), there is no practical difference
>> between the two, except that if you're comparing text, you can easily
>> supply a diff to update one to match the other.
>
>
> Serialization is a different issue.

Yes, tests of code that uses serialization (caching, RPCs, etc.).

I mentioned above a sort of test that divides tests of a client and
server along RPC boundaries by providing fake queries and responses,
and testing that those are the queries and responses given by the
client and server. This way you don't need to actually start the
client and server to test them both and their interactions. This is
one example, there are other uses, but they go along the same lines.
For example, one can also imagine testing that a serialized structure
is identical across version changes, so that it's guaranteed to be
forwards/backwards compatible. Testing the deserialized form is not
the right check here: it may differ substantially between versions,
which is fine as long as the communicated serialized structure is the
same.
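
A minimal sketch of such a cross-version check, assuming JSON with sorted
keys as the wire format. GOLDEN and serialize_v2 are hypothetical names, and
the recorded string stands in for output captured from an earlier release.

```python
import json

# Hypothetical "golden" wire format recorded from an earlier release.
GOLDEN = '{"id": 7, "name": "alice"}'


def serialize_v2(user_id, name):
    """Hypothetical newer serializer: internals changed, wire format must not."""
    record = {"name": name, "id": user_id}  # built in a different order now
    return json.dumps(record, sort_keys=True)


# The compatibility test compares serialized text, not deserialized objects.
assert serialize_v2(7, "alice") == GOLDEN
```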

-- Devin
