| Started by | Anas Belemlih <anas.belemlih@gmail.com> |
|---|---|
| First post | 2015-11-11 08:34 -0800 |
| Last post | 2015-11-12 21:24 +0000 |
| Articles | 15 — 10 participants |
new to python, help please !! Anas Belemlih <anas.belemlih@gmail.com> - 2015-11-11 08:34 -0800
Re: new to python, help please !! John Gordon <gordon@panix.com> - 2015-11-11 16:58 +0000
Re: new to python, help please !! Tim Chase <python.list@tim.thechases.com> - 2015-11-11 11:06 -0600
Re: new to python, help please !! Ben Finney <ben+python@benfinney.id.au> - 2015-11-12 04:16 +1100
Re: new to python, help please !! Quivis <quivis@domain.invalid> - 2015-11-11 17:48 +0000
Re: new to python, help please !! Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-11-12 13:58 +1100
Re: new to python, help please !! Marko Rauhamaa <marko@pacujo.net> - 2015-11-12 08:21 +0200
Re: new to python, help please !! Tim Chase <python.list@tim.thechases.com> - 2015-11-12 05:48 -0600
Re: new to python, help please !! <paul.hermeneutic@gmail.com> - 2015-11-12 07:27 -0700
Re: new to python, help please !! Quivis <quivis@domain.invalid> - 2015-11-12 17:55 +0000
Re: new to python, help please !! Denis McMahon <denismfmcmahon@gmail.com> - 2015-11-12 19:49 +0000
Re: new to python, help please !! Peter Otten <__peter__@web.de> - 2015-11-12 15:56 +0100
Re: new to python, help please !! Tim Chase <python.list@tim.thechases.com> - 2015-11-12 09:00 -0600
Re: new to python, help please !! Peter Otten <__peter__@web.de> - 2015-11-12 16:41 +0100
Re: new to python, help please !! Denis McMahon <denismfmcmahon@gmail.com> - 2015-11-12 21:24 +0000
| From | Anas Belemlih <anas.belemlih@gmail.com> |
|---|---|
| Date | 2015-11-11 08:34 -0800 |
| Subject | new to python, help please !! |
| Message-ID | <93aef8e5-3d6f-41f4-a625-cd3c2007686e@googlegroups.com> |
I am a beginning programmer, and I am trying to write some simple code to compare two character sets in 2 separate files (2 hash value files, basically).
The idea is:
open both files and measure the length to loop over.
if the lengths don't match, the files do not match.
if the lengths match, loop, comparing each character from each file to see if they match.
Please tell me what I am doing wrong. I am using Python 2.7.
**********************************
hash1= open ("file1.md5", "r")
line1 =hash1.read()
hash2 = open("file2.md5","r")
line2= hash2.read()
number1 = len(line1)
number2 = len(line2)
#**************************
i=0
s1=line1[i]
s2=line2[i]
count = 0
if number1 != number2:
print " hash table not the same size"
else:
while count < number1:
if s1 == s2:
print " character", line1[i]," matchs"
i=i+1
count=count+1
else
print "Hash values corrupt"
| From | John Gordon <gordon@panix.com> |
|---|---|
| Date | 2015-11-11 16:58 +0000 |
| Message-ID | <n1vs3e$6ih$1@reader1.panix.com> |
| In reply to | #98647 |
In <93aef8e5-3d6f-41f4-a625-cd3c2007686e@googlegroups.com> Anas Belemlih <anas.belemlih@gmail.com> writes:
> i=0
> s1=line1[i]
> s2=line2[i]
> count = 0
> if number1 != number2:
> print " hash table not the same size"
> else:
> while count < number1:
> if s1 == s2:
> print " character", line1[i]," matchs"
> i=i+1
> count=count+1
> else
> print "Hash values corrupt"
It looks like you're expecting s1 and s2 to automatically update their
values when i gets incremented, but it doesn't work like that. When you
increment i, you also have to reassign s1 and s2.
--
John Gordon A is for Amy, who fell down the stairs
gordon@panix.com B is for Basil, assaulted by bears
-- Edward Gorey, "The Gashlycrumb Tinies"
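A minimal sketch of the fix described above, written for Python 3 (the original post targets 2.7). The function name `compare_chars` is hypothetical and only there so the loop can be tried on two strings without the files. `s1` and `s2` are looked up inside the loop so they follow `i`, and the function returns at the first mismatch instead of looping on.

```python
def compare_chars(line1, line2):
    """Return True if both strings match character by character."""
    if len(line1) != len(line2):
        print("hash values are not the same size")
        return False
    i = 0
    while i < len(line1):
        # Look the characters up on every pass; s1 and s2 do not
        # follow i on their own.
        s1 = line1[i]
        s2 = line2[i]
        if s1 != s2:
            print("hash values differ at position", i)
            return False
        i = i + 1
    return True
```

With the two file contents read into `line1` and `line2` as in the original post, `compare_chars(line1, line2)` gives a single True/False answer.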
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2015-11-11 11:06 -0600 |
| Message-ID | <mailman.246.1447261727.16136.python-list@python.org> |
| In reply to | #98647 |
On 2015-11-11 08:34, Anas Belemlih wrote:
> i am a beginning programmer, i am trying to write a simple code
> to compare two character sets in 2 seperate files. ( 2 hash value
> files basically) idea is: open both files, measure the length of
> the loop on.
>
> if the length doesn't match, == files do not match
>
> if length matchs, loop while comparing each character from each
> file if they match. please tell me what i am doing wrong ? i am
> using python 2.7
>
> **********************************
> hash1= open ("file1.md5", "r")
> line1 =hash1.read()
> hash2 = open("file2.md5","r")
> line2= hash2.read()
>
> number1 = len(line1)
> number2 = len(line2)
>
> #**************************
> i=0
> s1=line1[i]
> s2=line2[i]
> count = 0
>
> if number1 != number2:
> print " hash table not the same size"
> else:
> while count < number1:
> if s1 == s2:
> print " character", line1[i]," matchs"
> i=i+1
> count=count+1
> else
> print "Hash values corrupt"
Well, the immediate answer is that you don't update s1 or s2 inside
your loop. Also, the indent on "count=count+1" is wrong. Finally,
if the hashes don't match, you don't break out of your while loop.
That said, the pythonesque way of writing this would likely look
something much more like
with open("file1.md5") as a, open("file2.md5") as b:
    for s1, s2 in zip(a, b):
        if s1 != s2:
            print("Files differ")
You can compare the strings to get the actual offset if you want, or
check the lengths if you really want a more verbatim translation of
your code:
with open("file1.md5") as a, open("file2.md5") as b:
    for s1, s2 in zip(a, b):
        if len(s1) != len(s2):
            print("not the same size")
        else:
            for i, (c1, c2) in enumerate(zip(s1, s2)):
                if c1 == c2:
                    print(" character %s matches" % c1)
                else:
                    print(" %r and %r differ at position %i" % (s1, s2, i))
-tkc
| From | Ben Finney <ben+python@benfinney.id.au> |
|---|---|
| Date | 2015-11-12 04:16 +1100 |
| Message-ID | <mailman.247.1447262197.16136.python-list@python.org> |
| In reply to | #98647 |
Anas Belemlih <anas.belemlih@gmail.com> writes:
> i am a beginning programmer, i am trying to write a simple code to
> compare two character sets in 2 seperate files. ( 2 hash value files
> basically)
Welcome, and congratulations on arriving at Python for your programming!
As a beginning programmer, you will benefit from joining the ‘tutor’
forum <URL:https://mail.python.org/mailman/listinfo/tutor>, which is
much better suited to collaborative teaching of newcomers.
--
 \     “As scarce as truth is, the supply has always been in excess of |
  `\                                    the demand.” -- Josh Billings |
_o__)                                                                  |
Ben Finney
| From | Quivis <quivis@domain.invalid> |
|---|---|
| Date | 2015-11-11 17:48 +0000 |
| Message-ID | <obL0y.222880$6i2.63495@fx35.am4> |
| In reply to | #98647 |
On Wed, 11 Nov 2015 08:34:30 -0800, Anas Belemlih wrote:
> md5
If those are md5 values stored inside files, wouldn't it be easier to
just hash them?
import hashlib
m1 = hashlib.sha224(open('f1').read()).hexdigest()
m2 = hashlib.sha224(open('f2').read()).hexdigest()
if m1 == m2:
    print 'Equal!'
else:
    print 'Different!'
--
_____ __ __ __ __ __ __ __
(( )) || || || \\ // || ((
\\_/X| \\_// || \V/ || \_))
Omnia paratus *~*~*~*~*~*~*
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2015-11-12 13:58 +1100 |
| Message-ID | <5644005e$0$2932$c3e8da3$76491128@news.astraweb.com> |
| In reply to | #98655 |
On Thursday 12 November 2015 04:48, Quivis wrote:
> On Wed, 11 Nov 2015 08:34:30 -0800, Anas Belemlih wrote:
>
>> md5
>
> If those are md5 values stored inside files, wouldn't it be easier to
> just hash them?
>
> import hashlib
>
> m1 = hashlib.sha224(open('f1').read()).hexdigest()
> m2 = hashlib.sha224(open('f2').read()).hexdigest()
I presume that the purpose of the exercise is to learn basic Python skills
like looping.
Also, using sha224 when all you want is a simple "different"/"equal" is
horribly inefficient. Sha224 needs to read the entire file, every single
byte, *and* perform a bunch of expensive cryptographic operations. Consider
reading two five GB files, the first starting with byte \x30 and the second
starting with byte \x60. The two bytes are different, so we know the files
differ, but sha224 still needs to do a massive amount of work.
--
Steve
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2015-11-12 08:21 +0200 |
| Message-ID | <8737wbu49x.fsf@elektro.pacujo.net> |
| In reply to | #98666 |
Steven D'Aprano <steve+comp.lang.python@pearwood.info>:
> On Thursday 12 November 2015 04:48, Quivis wrote:
>
>> On Wed, 11 Nov 2015 08:34:30 -0800, Anas Belemlih wrote:
>>
>>> md5
>>
>> If those are md5 values stored inside files, wouldn't it be easier to
>> just hash them?
>>
>> import hashlib
>>
>> m1 = hashlib.sha224(open('f1').read()).hexdigest()
>> m2 = hashlib.sha224(open('f2').read()).hexdigest()
>
> I presume that the purpose of the exercise is to learn basic Python
> skills like looping.
And if you really wanted to compare two files that are known to contain
MD5 checksums, the simplest way is:
with open('f1.md5') as f1, open('f2.md5') as f2:
    if f1.read() == f2.read():
        ...
    else:
        ...
Marko
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2015-11-12 05:48 -0600 |
| Message-ID | <mailman.269.1447337476.16136.python-list@python.org> |
| In reply to | #98674 |
On 2015-11-12 08:21, Marko Rauhamaa wrote:
> And if you really wanted to compare two files that are known to
> contain MD5 checksums, the simplest way is:
>
> with open('f1.md5') as f1, open('f2.md5') as f2:
>     if f1.read() == f2.read():
>         ...
>     else:
>         ...
Though that suffers if the files are large. Might try
CHUNK_SIZE = 4 * 1024 # read 4k chunks
# chunk_offset = 0
with open('f1.md5') as f1, open('f2.md5') as f2:
    while True:
        c1 = f1.read(CHUNK_SIZE)
        c2 = f2.read(CHUNK_SIZE)
        if c1 or c2:
            # chunk_offset += 1
            if c1 != c2:
                not_the_same(c1, c2)
                # not_the_same(chunk_offset * CHUNK_SIZE, c1, c2)
                break
        else: # EOF
            the_same()
            break
which should perform better if the files are huge
-tkc
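The commented-out `chunk_offset` lines above hint at reporting where the files diverge. A hedged sketch of that idea for Python 3: the function name `first_difference` is hypothetical, and opening the files in binary mode is an assumption made here so that offsets count bytes.

```python
CHUNK_SIZE = 4 * 1024  # read 4k chunks, as in the post above


def first_difference(path1, path2, chunk_size=CHUNK_SIZE):
    """Return the byte offset of the first difference, or None if equal."""
    offset = 0  # bytes compared so far
    # Binary mode, so offsets count bytes and newlines are not translated.
    with open(path1, "rb") as f1, open(path2, "rb") as f2:
        while True:
            c1 = f1.read(chunk_size)
            c2 = f2.read(chunk_size)
            if not c1 and not c2:
                return None  # both files ended together: same content
            if c1 != c2:
                # Walk the mismatching chunk to find the exact position.
                for i, (b1, b2) in enumerate(zip(c1, c2)):
                    if b1 != b2:
                        return offset + i
                # One chunk is a prefix of the other: a file ended early.
                return offset + min(len(c1), len(c2))
            offset += len(c1)
```

Only one chunk per file is held in memory at a time, and reading stops at the first chunk that differs.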
| From | <paul.hermeneutic@gmail.com> |
|---|---|
| Date | 2015-11-12 07:27 -0700 |
| Message-ID | <mailman.270.1447338456.16136.python-list@python.org> |
| In reply to | #98674 |
Would some form of subprocess.Popen() on cmp or fc /b be easier?
On Nov 12, 2015 7:13 AM, "Tim Chase" <python.list@tim.thechases.com> wrote:
> On 2015-11-12 08:21, Marko Rauhamaa wrote:
> > And if you really wanted to compare two files that are known to
> > contain MD5 checksums, the simplest way is:
> >
> > with open('f1.md5') as f1, open('f2.md5') as f2:
> >     if f1.read() == f2.read():
> >         ...
> >     else:
> >         ...
>
> Though that suffers if the files are large. Might try
>
> CHUNK_SIZE = 4 * 1024 # read 4k chunks
> # chunk_offset = 0
> with open('f1.md5') as f1, open('f2.md5') as f2:
>     while True:
>         c1 = f1.read(CHUNK_SIZE)
>         c2 = f2.read(CHUNK_SIZE)
>         if c1 or c2:
>             # chunk_offset += 1
>             if c1 != c2:
>                 not_the_same(c1, c2)
>                 # not_the_same(chunk_offset * CHUNK_SIZE, c1, c2)
>                 break
>         else: # EOF
>             the_same()
>             break
>
> which should perform better if the files are huge
>
> -tkc
| From | Quivis <quivis@domain.invalid> |
|---|---|
| Date | 2015-11-12 17:55 +0000 |
| Message-ID | <po41y.186836$wR.71600@fx43.am4> |
| In reply to | #98666 |
On Thu, 12 Nov 2015 13:58:35 +1100, Steven D'Aprano wrote:
> horribly inefficient
Assuming it was md5 values, who cares? Those are small.
--
_____ __ __ __ __ __ __ __
(( )) || || || \\ // || ((
\\_/X| \\_// || \V/ || \_))
Omnia paratus *~*~*~*~*~*~*
| From | Denis McMahon <denismfmcmahon@gmail.com> |
|---|---|
| Date | 2015-11-12 19:49 +0000 |
| Message-ID | <n22qh2$m26$1@dont-email.me> |
| In reply to | #98709 |
On Thu, 12 Nov 2015 17:55:33 +0000, Quivis wrote:
> On Thu, 12 Nov 2015 13:58:35 +1100, Steven D'Aprano wrote:
>
>> horribly inefficient
>
> Assuming it was md5 values, who cares? Those are small.
A file of 160 million md5 hashes as 32 character hex strings is a huge
file.
Your method calculates the hash over both files to test whether the
contents are different.
If the input files are both lists of 160 million md5 hashes, you're
calculating the hash of two 5 gigabyte files.
In your method the size of the lines of data is irrelevant to the
execution time, the execution time varies with the size of the datafiles.
--
Denis McMahon, denismfmcmahon@gmail.com
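The 5 gigabyte figure can be checked with a line of arithmetic. The sketch below assumes one hash per line with a single newline terminator; that layout is an assumption, not something stated in the thread.

```python
HASHES = 160 * 10**6      # 160 million MD5 hashes
BYTES_PER_LINE = 32 + 1   # 32 hex characters plus an assumed "\n"

total_bytes = HASHES * BYTES_PER_LINE
print(total_bytes)                  # 5280000000
print(round(total_bytes / 1e9, 2))  # 5.28 (decimal gigabytes)
```

So each file comes to a little over 5 GB, and a hash-both-files approach has to read all of it twice over even when the first bytes already differ.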
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2015-11-12 15:56 +0100 |
| Message-ID | <mailman.271.1447340210.16136.python-list@python.org> |
| In reply to | #98647 |
Tim Chase wrote:
> with open("file1.md5") as a, open("file2.md5") as b:
>     for s1, s2 in zip(a, b):
>         if s1 != s2:
>             print("Files differ")
Note that this will not detect extra lines in one of the files.
I recommend that you use itertools.zip_longest (izip_longest in Python 2)
instead of the built-in zip().
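A minimal sketch of that recommendation, assuming Python 3. The function name `lines_differ` is hypothetical; under Python 2 the import would be `izip_longest`, as noted above.

```python
from itertools import zip_longest  # izip_longest in Python 2


def lines_differ(path_a, path_b):
    """Return True if the files differ in any line, including extra lines."""
    with open(path_a) as a, open(path_b) as b:
        # zip_longest pads the shorter file with None, which never equals
        # a line read from the other file.
        for s1, s2 in zip_longest(a, b):
            if s1 != s2:
                return True
    return False
```

Because the padding value is `None` and every line read from a file is a string, a file with extra trailing lines is always reported as different.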
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2015-11-12 09:00 -0600 |
| Message-ID | <mailman.272.1447341014.16136.python-list@python.org> |
| In reply to | #98647 |
On 2015-11-12 15:56, Peter Otten wrote:
> Tim Chase wrote:
>
> > with open("file1.md5") as a, open("file2.md5") as b:
> >     for s1, s2 in zip(a, b):
> >         if s1 != s2:
> >             print("Files differ")
>
> Note that this will not detect extra lines in one of the files.
> I recommend that you use itertools.zip_longest (izip_longest in
> Python 2) instead of the built-in zip().
Yeah, I noticed that after pushing <send> but then posted a later
version that just read chunks of the file which should catch that
file-size difference. Or, as in that other message, prefix it with
an fstat() check to compare file-sizes so that you don't even have to
open the files if the sizes differ.
-tkc
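A hedged sketch of the size-check-first idea for Python 3. It uses `os.path.getsize()` rather than `os.fstat()`, since `fstat()` needs an already-open file descriptor while `getsize()` works on a path; the function name `same_content` and the 4096-byte chunk size are choices made here for illustration.

```python
import os


def same_content(path1, path2, chunk_size=4096):
    """Return True if both files hold the same bytes."""
    # Cheap check first: different sizes mean different content,
    # and neither file has to be opened.
    if os.path.getsize(path1) != os.path.getsize(path2):
        return False
    with open(path1, "rb") as f1, open(path2, "rb") as f2:
        while True:
            c1 = f1.read(chunk_size)
            c2 = f2.read(chunk_size)
            if c1 != c2:
                return False
            if not c1:  # both at EOF, since c1 == c2
                return True
```

The size comparison only rules files out; equal sizes still need the chunked read to confirm the contents match.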
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2015-11-12 16:41 +0100 |
| Message-ID | <mailman.274.1447342924.16136.python-list@python.org> |
| In reply to | #98647 |
Tim Chase wrote:
> On 2015-11-12 15:56, Peter Otten wrote:
>> Tim Chase wrote:
>>
>> > with open("file1.md5") as a, open("file2.md5") as b:
>> >     for s1, s2 in zip(a, b):
>> >         if s1 != s2:
>> >             print("Files differ")
>>
>> Note that this will not detect extra lines in one of the files.
>> I recommend that you use itertools.zip_longest (izip_longest in
>> Python 2) instead of the built-in zip().
>
> Yeah, I noticed that after pushing <send> but then posted a later
> version that just read chunks of the file which should catch that
> file-size difference. Or, as in that other message, prefix it with
> an fstat() check to compare file-sizes so that you don't even have to
> open the files if the sizes differ.
>
> -tkc
>>> os.path.getsize("file1.md5")
10
>>> os.path.getsize("file2.md5")
10
>>> with open("file1.md5") as a, open("file2.md5") as b:
...     for s, t in zip(a, b):
...         if s != t: print("different")
...
>>> from itertools import zip_longest
>>> with open("file1.md5") as a, open("file2.md5") as b:
...     for s, t in zip_longest(a, b):
...         if s != t: print("different")
...
different
I admit I cheated and used Python 3 ;)
| From | Denis McMahon <denismfmcmahon@gmail.com> |
|---|---|
| Date | 2015-11-12 21:24 +0000 |
| Message-ID | <n23020$m26$2@dont-email.me> |
| In reply to | #98647 |
On Wed, 11 Nov 2015 08:34:30 -0800, Anas Belemlih wrote:
> i am a beginning programmer, i am trying to write a simple code to
> compare two character sets in 2 seperate files. ( 2 hash value files
> basically)
Why? If you simply wish to compare two files, most operating systems
provide executable tools at the OS level which are more efficient than
anything you will write in a scripting language.
Lesson 1 of computing. Use the right tool for the job. Writing a new
program is not always the right tool.
--
Denis McMahon, denismfmcmahon@gmail.com
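In the same spirit, if the comparison has to stay inside Python, the standard library already ships a tool for it. A minimal sketch using `filecmp.cmp()`; the wrapper name `files_match` is hypothetical.

```python
import filecmp


def files_match(path1, path2):
    """Return True if the two files have identical contents."""
    # shallow=False compares the file contents rather than relying
    # only on the os.stat() signatures (type, size, modification time).
    return filecmp.cmp(path1, path2, shallow=False)
```

For the files in this thread that would be `files_match("file1.md5", "file2.md5")`.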