


How to know that two pyc files contain the same code

Started by: Gelonida N <gelonida@gmail.com>
First post: 2012-03-10 15:48 +0100
Last post: 2012-03-11 06:30 +0100
Articles 7 — 4 participants



Contents

  How to know that two pyc files contain the same code Gelonida N <gelonida@gmail.com> - 2012-03-10 15:48 +0100
    Re: How to know that two pyc files contain the same code Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-10 22:52 +0000
      Re: How to know that two pyc files contain the same code Chris Angelico <rosuav@gmail.com> - 2012-03-11 12:15 +1100
        Re: How to know that two pyc files contain the same code Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-11 07:06 +0000
          Re: How to know that two pyc files contain the same code Rick Johnson <rantingrickjohnson@gmail.com> - 2012-03-11 08:22 -0700
          Re: How to know that two pyc files contain the same code Gelonida N <gelonida@gmail.com> - 2012-03-12 00:56 +0100
      Re: How to know that two pyc files contain the same code Gelonida N <gelonida@gmail.com> - 2012-03-11 06:30 +0100

#21454 — How to know that two pyc files contain the same code

From: Gelonida N <gelonida@gmail.com>
Date: 2012-03-10 15:48 +0100
Subject: How to know that two pyc files contain the same code
Message-ID: <mailman.544.1331390950.3037.python-list@python.org>
Hi,

I want to know whether two .pyc files are identical.

By identical I mean that they contain the same byte code.

Unfortunately, it seems that .pyc files also contain something like the
time stamp of the related source file.

So even though two .pyc files contain the same byte code, they will not
be byte-identical.

One option that I found is to run
python -m unpyclib.application -d filename.pyc
on each file and check whether the results are identical.


However, even this will fail if the files were not compiled under the
same absolute path, because the source file name appears twice (at
least for my trivial example) in the disassembler's output.


Thanks a lot for any other idea.
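
One way to side-step both the time stamp and the embedded path is to compare the code objects rather than the files. The following is only a minimal sketch under stated assumptions: the code objects are produced here with compile(), whereas for real .pyc files one would first skip the version-dependent header and marshal.load the rest. The helper name same_bytecode and the two paths are hypothetical.

```python
import types

def same_bytecode(a, b):
    """Recursively compare two code objects, ignoring co_filename."""
    if a.co_code != b.co_code or a.co_names != b.co_names:
        return False
    if len(a.co_consts) != len(b.co_consts):
        return False
    for ca, cb in zip(a.co_consts, b.co_consts):
        if isinstance(ca, types.CodeType) and isinstance(cb, types.CodeType):
            # Nested functions and classes are stored as code constants.
            if not same_bytecode(ca, cb):
                return False
        elif type(ca) is not type(cb) or ca != cb:
            return False
    return True

# Same source, compiled under two different (made-up) absolute paths.
source = "def func():\n    return 42\n"
one = compile(source, "/home/alice/mod.py", "exec")
two = compile(source, "/tmp/build/mod.py", "exec")
print(same_bytecode(one, two))
```

Line numbers are deliberately left out of this comparison, so two sources that differ only in blank lines would also compare equal here.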




#21473

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2012-03-10 22:52 +0000
Message-ID: <4f5bdb14$0$29891$c3e8da3$5496439d@news.astraweb.com>
In reply to: #21454
On Sat, 10 Mar 2012 15:48:48 +0100, Gelonida N wrote:

> Hi,
> 
> I want to know whether two .pyc files are identical.
> 
> With identical I mean whether they contain the same byte code.

Define "identical" and "the same".

If I compile these two files:


# file ham.py
x = 23
def func():
    a = 23
    return a + 19



# file = spam.py
def func():
    return 42

tmp = 19
x = 4 + tmp
del tmp


do you expect spam.pyc and ham.pyc to count as "the same"?
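
For what it is worth, the question can be checked directly. This is a small sketch that compiles the two sources above from strings (rather than from files on disk, which is an assumption made for brevity) and compares the module-level bytecode and the observable results.

```python
ham_src = "x = 23\ndef func():\n    a = 23\n    return a + 19\n"
spam_src = "def func():\n    return 42\n\ntmp = 19\nx = 4 + tmp\ndel tmp\n"

ham = compile(ham_src, "ham.py", "exec")
spam = compile(spam_src, "spam.py", "exec")

# Compare the raw module-level bytecode and the name tables.
print(ham.co_code == spam.co_code)
print(ham.co_names, spam.co_names)

# Run both modules and compare what they leave behind.
ham_ns, spam_ns = {}, {}
exec(ham, ham_ns)
exec(spam, spam_ns)
print(ham_ns["x"] == spam_ns["x"], ham_ns["func"]() == spam_ns["func"]())
```

The two modules end up with the same x and the same func() result, yet the compiled code differs, which is exactly the gap between "same effect" and "same code".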


-- 
Steven



#21475

From: Chris Angelico <rosuav@gmail.com>
Date: 2012-03-11 12:15 +1100
Message-ID: <mailman.554.1331428520.3037.python-list@python.org>
In reply to: #21473
On Sun, Mar 11, 2012 at 9:52 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Sat, 10 Mar 2012 15:48:48 +0100, Gelonida N wrote:
> Define "identical" and "the same".
>
> If I compile these two files:
>
>
> # file ham.py
> x = 23
> def func():
>    a = 23
>    return a + 19
>
>
>
> # file = spam.py
> def func():
>    return 42
>
> tmp = 19
> x = 4 + tmp
> del tmp
>
>
> do you expect spam.pyc and ham.pyc to count as "the same"?

They do not contain the same code. They may contain code which has the
same effect, but it is not the same code.

I don't think Python has the level of aggressive optimization that
would make these compile to the same bytecode, but if it did, then
they would _become identical_ per the OP's description - that they
contain identical bytecode. In fact, I think the OP defined it quite
clearly.

ChrisA



#21482

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2012-03-11 07:06 +0000
Message-ID: <4f5c4f0d$0$29891$c3e8da3$5496439d@news.astraweb.com>
In reply to: #21475
On Sun, 11 Mar 2012 12:15:11 +1100, Chris Angelico wrote:

> On Sun, Mar 11, 2012 at 9:52 AM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> On Sat, 10 Mar 2012 15:48:48 +0100, Gelonida N wrote:
>> Define "identical" and "the same".
>>
>> If I compile these two files:
>>
>>
>> # file ham.py
>> x = 23
>> def func():
>>    a = 23
>>    return a + 19
>>
>>
>>
>> # file = spam.py
>> def func():
>>    return 42
>>
>> tmp = 19
>> x = 4 + tmp
>> del tmp
>>
>>
>> do you expect spam.pyc and ham.pyc to count as "the same"?
> 
> They do not contain the same code. They may contain code which has the
> same effect, but it is not the same code.

To me, they do: they contain a function "func" which takes no arguments 
and returns 42, and a global "x" initialised to 23. Everything else is an 
implementation detail.

I'm not being facetious. One should be asking what is the *purpose* of 
this question -- is it to detect when two pyc files contain the same 
*interface*, or to determine if they were generated from identical source 
code files (and if the latter, do comments and whitespace matter)?

What if one merely changed the order of definition? Instead of:

def foo(): pass
def bar(): pass

one had this?

def bar(): pass
def foo(): pass

It depends on why the OP cares if they are "identical". I can imagine
use-cases where the right solution is to forget ideas about identical
code, and just checksum the files (ignoring any timestamps).
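
As an illustration of the reordering case, here is a short sketch that compiles both orderings from strings under the same (arbitrary) file name and compares the name tables and the marshalled form, which is what a checksum of the .pyc body would see.

```python
import marshal

first = compile("def foo(): pass\ndef bar(): pass\n", "mod.py", "exec")
second = compile("def bar(): pass\ndef foo(): pass\n", "mod.py", "exec")

# The names are recorded in order of first use, so the tables differ.
print(first.co_names)
print(second.co_names)

# The serialized code objects differ as well, so a checksum would too.
print(marshal.dumps(first) == marshal.dumps(second))
```

So a checksum answers "were these compiled from the same source text?" rather than "do these modules offer the same interface?".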


-- 
Steven



#21495

From: Rick Johnson <rantingrickjohnson@gmail.com>
Date: 2012-03-11 08:22 -0700
Message-ID: <90223b79-6a21-448a-945f-c5569a46b9d3@v7g2000yqb.googlegroups.com>
In reply to: #21482
On Mar 11, 2:06 am, Steven D'Aprano
<steve+comp.lang.pyt...@pearwood.info> wrote:

> I'm not being facetious. [...]

Just in case anybody does not know already: the "D'A" in "Steven
D'Aprano" stands for "Devils Advocate"; his real name is "Steven
Prano".



#21512

From: Gelonida N <gelonida@gmail.com>
Date: 2012-03-12 00:56 +0100
Message-ID: <mailman.578.1331510204.3037.python-list@python.org>
In reply to: #21482
On 03/11/2012 08:06 AM, Steven D'Aprano wrote:
> What if one merely changed the order of definition? Instead of:
> 
> def foo(): pass
> def bar(): pass
> 
> one had this?
> 
> def bar(): pass
> def foo(): pass
> 
> It depends on why the OP cares if they are "identical". I can imagine
> use-cases where the right solution is to forget ideas about identical
> code, and just checksum the files (ignoring any timestamps).

I guess this is what I will do for my use case:
perform a checksum that ignores the time stamp.

What I did not know, though, is where the time stamp is located.
It seems it's in bytes 4-7 for all CPython versions so far.

What is regrettable, though, is that the absolute path name is part of
the .pyc file, as I do not care about it.


The following snippet calculates the hash of a .pyc file, ignoring only
the time stamp:

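import hashlib

def md5_for_pyc(fname):
    """Return the MD5 hex digest of a .pyc file, skipping bytes 4-7."""
    hasher = hashlib.md5()
    with open(fname, 'rb') as fin:
        # Bytes 0-3: magic number identifying the Python version.
        version = fin.read(4)
        hasher.update(version)
        # Bytes 4-7: time stamp of the source file; read and discard it.
        _tstamp = fin.read(4)
        # Rest of the file: the marshalled code object in this layout.
        bytecode = fin.read()
        hasher.update(bytecode)
    return hasher.hexdigest()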
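
A note for readers on later CPython releases: the header grew after this thread was written. As far as I know it is 12 bytes in 3.3 to 3.6 (a source size field was added) and 16 bytes from 3.7 on (PEP 552 put a flags field in bytes 4-7 and moved the time stamp to bytes 8-11). The variant below is only a sketch under those assumptions. It also assumes the file was written by the interpreter that is running the check, and it swaps MD5 for SHA-256. The function names are made up for this example.

```python
import hashlib
import sys

def pyc_header_size(version_info=sys.version_info):
    """Assumed header sizes: 16 bytes (3.7+), 12 bytes (3.3-3.6), else 8."""
    if version_info >= (3, 7):
        return 16
    if version_info >= (3, 3):
        return 12
    return 8

def digest_pyc_bytes(data, header_size=None):
    """Hash the magic number plus everything after the header."""
    if header_size is None:
        header_size = pyc_header_size()
    hasher = hashlib.sha256()
    hasher.update(data[:4])            # magic number
    hasher.update(data[header_size:])  # marshalled code object
    return hasher.hexdigest()

def digest_pyc(fname, header_size=None):
    """Same as digest_pyc_bytes, but reads the bytes from a file."""
    with open(fname, "rb") as fin:
        return digest_pyc_bytes(fin.read(), header_size)
```

Like the snippet above, this still hashes the embedded source path, so files compiled under different absolute paths will get different digests.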






#21478

From: Gelonida N <gelonida@gmail.com>
Date: 2012-03-11 06:30 +0100
Message-ID: <mailman.557.1331443814.3037.python-list@python.org>
In reply to: #21473
Hi Steven,

On 03/10/2012 11:52 PM, Steven D'Aprano wrote:
> On Sat, 10 Mar 2012 15:48:48 +0100, Gelonida N wrote:
>
>> Hi,
>>
>> I want to know whether two .pyc files are identical.
>>
>> With identical I mean whether they contain the same byte code.
>
> Define "identical" and "the same".
Indeed! Identical is not that simple to define and depends on the context.

One definition of identical that would suit me at the moment would be:

If I have two .pyc files that were the result of compiling two
identical .py files, then I would like to treat these two .pyc files
as identical, even if they were compiled at different times (absolutely
necessary) and under a different absolute path (would be nice).

This definition of identical byte code would also mean that any error
message referring to a given line number would be identical for both
.pyc files.

>
> If I compile these two files:
>
>
> # file ham.py
> x = 23
> def func():
>     a = 23
>     return a + 19
>
>
>
> # file = spam.py
> def func():
>     return 42
>
> tmp = 19
> x = 4 + tmp
> del tmp
>
>
> do you expect spam.pyc and ham.pyc to count as "the same"?
>
For most Pythons I would not expect ham.py and spam.py to result in
the same byte code, and thus they would not even have the same
performance.

I agree, though, that an efficient compiler might generate the same byte
code, though I wonder whether an optimizing compiler would/should be
allowed to optimize away the global variable tmp, as it would be visible
(though only for a short time) in a multithreading environment.

If the byte code were different in two .pyc files, then I would
like to have them treated as different .pyc files.

If by coincidence the generated byte code were the same, then I
wouldn't mind if they were treated as identical, but I wouldn't insist.

To my knowledge Python (or at least CPython) stores line numbers in
the .pyc files, so that it can report exact line numbers referring to the
originating source code in case of an exception or for backtraces.

So there is the choice to say that two .pyc files with exactly the same
byte code would be treated as identical even if the white space / line
numbers of their sources were different, or the choice to say that they
are different.

Being conservative, I'd treat them as different.

Ideally I'd like to be able, depending on my use case, to distinguish
the following cases:
a) .pyc files with identical byte code
b) .pyc files with identical byte code AND source code line numbers
c) same as b) AND identical source file names.
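
The three cases could be sketched roughly as follows, working on code objects rather than on raw file bytes. This is an illustration only: the function name signature and its two flags are made up for this example, constants are compared by type name and repr, and dis.findlinestarts is used to obtain the line number information.

```python
import dis
import types

def signature(code, lines=False, filenames=False):
    """Build a comparable tuple for a code object.

    lines=False, filenames=False -> case a) byte code only
    lines=True                   -> case b) plus line numbers
    lines=True, filenames=True   -> case c) plus source file names
    """
    consts = tuple(
        # Nested code objects (functions, classes) are handled recursively.
        signature(c, lines, filenames) if isinstance(c, types.CodeType)
        else (type(c).__name__, repr(c))
        for c in code.co_consts
    )
    sig = [code.co_code, consts, code.co_names, code.co_varnames]
    if lines:
        sig.append(tuple(dis.findlinestarts(code)))
    if filenames:
        sig.append(code.co_filename)
    return tuple(sig)

src = "x = 1\ndef f():\n    return x + 1\n"
a = compile(src, "/a/mod.py", "exec")         # reference
b = compile(src, "/b/mod.py", "exec")         # different path only
c = compile("\n" + src, "/a/mod.py", "exec")  # extra blank line on top

print(signature(a) == signature(c))
print(signature(a, lines=True) == signature(c, lines=True))
print(signature(a, True, True) == signature(b, True, True))
```

Two files would then be "identical" at a given level when their signatures at that level compare equal.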






