| References | (1 earlier) <50B3E865.9070700@davea.name> <CAKhY55OUNvGFhCjZLS7HNoDydKqHxZrnaN+bWySCsmKSk1vqsw@mail.gmail.com> <50B43246.2010902@davea.name> <mailman.313.1354026304.29569.python-list@python.org> <ahk36lFeqmlU3@mid.individual.net> |
|---|---|
| Date | 2012-11-27 19:57 +0100 |
| Subject | Re: Compare list entry from csv files |
| From | Anatoli Hristov <tolidtm@gmail.com> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.320.1354042676.29569.python-list@python.org> (permalink) |
On Tue, Nov 27, 2012 at 4:05 PM, Neil Cerutti <neilc@norwich.edu> wrote:
> On 2012-11-27, Anatoli Hristov <tolidtm@gmail.com> wrote:
>> Thanks for your help. I will do my best for the forum :)
>>
>> I advanced a little bit with the algorithm and at least I can
>> now extract and compare the fields :) For my beginner skills I
>> think this is too much for me. Now next step is to add the
>> second field with the number to the Namelist and copy it to a
>> third filename I suppose.
>
> I had to write a similar type of program, and I imagine it's a
> common problem. Sometimes new students provide incorrect SSN's or
> simply leave them blank. This makes it impossible for us to match
> their application for financial aid to their admissions record.
>
> You have to analyze how you're going to match records.
>
> In my case, missing SSN's are one case. A likely match in this
> case is when the names are eerily similar.
>
> In the other case, where they simply got their SSN wrong, I have
> to check for both a similar SSN and a similar name.
>
> But you still have to define "similar." I looked up an algorithm
> on the web called Levenshtein Distance, and implemented it like
> so.
>
> def levenshteindistance(first, second):
>     """Find the Levenshtein distance between two strings."""
>     if len(first) > len(second):
>         first, second = second, first
>     if len(second) == 0:
>         return len(first)
>     first_length = len(first) + 1
>     second_length = len(second) + 1
>     distance_matrix = [[0] * second_length for x in range(first_length)]
>     for i in range(first_length):
>         distance_matrix[i][0] = i
>     for j in range(second_length):
>         distance_matrix[0][j] = j
>     for i in range(1, first_length):
>         for j in range(1, second_length):
>             deletion = distance_matrix[i-1][j] + 1
>             insertion = distance_matrix[i][j-1] + 1
>             substitution = distance_matrix[i-1][j-1]
>             if first[i-1] != second[j-1]:
>                 substitution += 1
>             distance_matrix[i][j] = min(insertion, deletion, substitution)
>     return distance_matrix[first_length-1][second_length-1]
>
> The algorithm returns a count of the differences between the two
> strings, from 0 to the length of the longer string.
>
> Python provides difflib, which implements a similar algorithm, so
> I used that as well (kinda awkwardly). I used
> difflib.get_close_matches to get candidates, and then
> difflib.SequenceMatcher to provide me a score measuring the
> closeness.
>
> matches = difflib.get_close_matches(s1, s2)  # s2: list of candidate strings
> for m in matches:
>     scorer = difflib.SequenceMatcher(None, s1, m)
>     ratio = scorer.ratio()
>     if ratio == 1.0:
>         pass  # perfect match: ratio() is 1.0 for identical strings
>     elif ratio > MAX_RATIO: # You gotta choose this. I used 0.1
>         pass  # close match
>
> The two algorithms come up with different guesses, and I pass on
> their suggestions for fixes to a human being. Both versions of
> the program take roughly 5 minutes to run the comparison on
> 2000-12000 records between the two files.
>
> I like the results of Levenshtein distance a little better, but
> difflib finds some stuff that it misses.
>
> In your case, the name is munged horribly in one of the files so
> you'll first have to sort it out somehow.
>
> --
> Neil Cerutti
> --
> http://mail.python.org/mailman/listinfo/python-list
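For reference, the difflib approach described in the quoted message can be sketched as a small runnable example. This is only an illustration under assumptions: the function name rank_close_matches, the sample names, and the cutoff of 0.6 (difflib's default) are hypothetical and not taken from Neil's program, and the code is written for Python 3.

```python
import difflib

def rank_close_matches(name, candidates, cutoff=0.6):
    """Return (candidate, ratio) pairs for candidates similar to name, best first.

    ratio is SequenceMatcher.ratio(): 1.0 means identical, 0.0 means
    nothing in common.
    """
    # get_close_matches filters out candidates scoring below the cutoff.
    matches = difflib.get_close_matches(name, candidates, n=5, cutoff=cutoff)
    scored = []
    for m in matches:
        ratio = difflib.SequenceMatcher(None, name, m).ratio()
        scored.append((m, ratio))
    # Sort explicitly, since ratio() can vary slightly with argument order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored

# Hypothetical sample data for illustration only.
names = ["anatoli hristov", "neil cerutti", "dave angel"]
result = rank_close_matches("anatoly hristov", names)
for candidate, ratio in result:
    print("%s: %.3f" % (candidate, ratio))
```

The scores could then be passed on to a human for review, as the quoted message describes.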
Thank you all for the help, but I figured it out and the program now
works perfectly. I would appreciate any notes on my script, as I'm
a noob :)
Here is the code:
import csv

origf = open('c:/Working/Test_phonebook.csv', 'rt')
secfile = open('c:/Working/phones.csv', 'rt')
phonelist = []
namelist = []
names = csv.reader(origf, delimiter=';')
phones = csv.reader(secfile, delimiter=';')

# Load every row of the phone file into memory
for tel in phones:
    phonelist.append(tel)

def finder(name_row,rows):
    # Copy the last field of each matching phone row into the name row
    for ex_phone in phonelist:
        telstr = ex_phone[0].lower()
        if telstr.find(name_row) >= 0:
            print "\nName found: %s" % name_row
            namelist[rows][-1] = ex_phone[-1].lower()
        else:
            pass
    return

def name_find():
    rows = 0
    for row in names:
        namelist.append(row)
        name_row = row[0].lower()
        finder(name_row,rows)
        rows = rows+1

name_find()

# Write the updated name rows to a third file
wfile = open('c:/Working/ttest.csv', "wb")
writer = csv.writer(wfile, delimiter=';')
for insert in namelist:
    writer.writerow(insert)
wfile.close()
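One possible way to restructure the script above, sketched for Python 3 under a few assumptions: the function names merge_phones and merge_files are hypothetical, and the matching rule (the name's first field must occur in the phone row's first field, the last match wins) is kept the same as in the script. Separating the matching from the file handling makes the logic testable without any files, and the with statement closes the files even if an error occurs.

```python
import csv

def merge_phones(name_rows, phone_rows):
    """Return copies of name_rows with the last field filled in from phone_rows.

    A phone row matches when the name row's first field occurs
    (case-insensitively) in the phone row's first field.
    """
    merged = [list(row) for row in name_rows]
    for row in merged:
        name = row[0].lower()
        for phone_row in phone_rows:
            if name in phone_row[0].lower():
                # No break: the last matching phone row wins, as in the script.
                row[-1] = phone_row[-1].lower()
    return merged

def merge_files(names_path, phones_path, out_path, delimiter=';'):
    """Read both CSV files, merge them, and write the result to out_path."""
    with open(names_path, newline='') as f:
        name_rows = list(csv.reader(f, delimiter=delimiter))
    with open(phones_path, newline='') as f:
        phone_rows = list(csv.reader(f, delimiter=delimiter))
    with open(out_path, 'w', newline='') as f:
        csv.writer(f, delimiter=delimiter).writerows(
            merge_phones(name_rows, phone_rows))

# Small in-memory check with made-up rows.
print(merge_phones([["John Smith", ""]], [["john smith jr", "555-1234"]]))
```

With the paths from the script, the call would look like merge_files('c:/Working/Test_phonebook.csv', 'c:/Working/phones.csv', 'c:/Working/ttest.csv').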