Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #33979

Re: Compare list entry from csv files

Path csiph.com!usenet.pasdenom.info!news.albasani.net!newsfeed.freenet.ag!news2.euro.net!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
Return-Path <tolidtm@gmail.com>
X-Original-To python-list@python.org
Delivered-To python-list@mail.python.org
X-Spam-Status OK 0.000
X-Spam-Evidence '*H*': 1.00; '*S*': 0.00; 'else:': 0.03; 'algorithm': 0.03; 'string.': 0.04; 'beginner': 0.05; 'case.': 0.05; 'fixes': 0.05; '%s"': 0.07; 'filename': 0.07; 'implements': 0.07; 'matches': 0.07; 'python': 0.09; '0.1': 0.09; 'deletion': 0.09; 'incorrect': 0.09; 'insertion': 0.09; 'record.': 0.09; 'rows': 0.09; 'subject:files': 0.09; 'substitution': 0.09; 'wrong,': 0.09; 'cc:addr:python-list': 0.10; 'looked': 0.10; 'def': 0.10; 'files.': 0.13; '"""find': 0.16; 'csv': 0.16; 'perfect.': 0.16; 'row': 0.16; 'somehow.': 0.16; 'suppose.': 0.16; 'writer': 0.16; 'wrote:': 0.17; 'versions': 0.20; 'define': 0.20; 'sort': 0.21; 'bit': 0.21; 'import': 0.21; 'received:209.85.214.174': 0.21; 'help.': 0.22; 'cc:2**0': 0.23; 'insert': 0.23; 'cc:no real name:2**0': 0.24; 'second': 0.24; 'so.': 0.24; 'script': 0.24; 'pass': 0.25; 'least': 0.25; 'cc:addr:python.org': 0.25; 'header :In-Reply-To:1': 0.25; 'common': 0.26; 'leave': 0.26; 'first,': 0.27; 'implemented': 0.27; 'message-id:@mail.gmail.com': 0.27; 'subject:list': 0.28; 'run': 0.28; 'comparison': 0.29; 'finds': 0.29; 'second,': 0.29; 'strings,': 0.29; 'url:mailman': 0.29; 'case,': 0.29; 'this.': 0.29; "i'm": 0.29; 'notes': 0.30; 'stuff': 0.30; 'problem.': 0.32; 'url:python': 0.32; 'help,': 0.32; 'print': 0.32; 'url:listinfo': 0.32; 'extract': 0.33; 'code:': 0.33; 'skip:d 20': 0.34; 'program,': 0.34; 'received:google.com': 0.34; 'third': 0.34; 'thanks': 0.34; 'nov': 0.35; 'pm,': 0.35; 'sometimes': 0.35; 'similar': 0.35; 'received:209.85': 0.35; 'next': 0.35; 'add': 0.36; 'but': 0.36; 'url:org': 0.36; 'compare': 0.36; 'thank': 0.36; 'too': 0.36; 'skip:p 20': 0.36; 'two': 0.37; 'received:209': 0.37; 'subject:: ': 0.38; 'files': 0.38; 'skip:l 20': 0.38; 'skip:o 20': 0.38; 'some': 0.38; 'received:209.85.214': 0.39; 'step': 0.39; 'called': 0.39; 'little': 0.39; 'application': 0.40; 'where': 0.40; 'header:Received:5': 0.40; 'url:mail': 0.40; 'think': 0.40; 'your': 0.60; 'first': 0.61; "you'll": 0.62; 'distance': 0.62; 'provide': 0.62; 'between': 0.63; 'close': 0.63; 'different': 0.63; 'perfect': 0.63; 'skip:n 10': 0.63; 'here': 0.65; 'choose': 0.65; 'results': 0.65; 'records.': 0.65; 'records': 0.68; 'score': 0.75; 'tel': 0.75; 'algorithm,': 0.84; 'longest': 0.84; 'neil': 0.84; 'phones:': 0.84; 'ratio': 0.91; 'imagine': 0.96
DKIM-Signature v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=AgN7B4obPNDeFBefmgT6E0/C/KhpEzaFiMtGfaLx6ZA=; b=JXZPwg1HYF5E2eqDypk53hv/u/II4qg8WWp6cTRo/kgC8R3+eNk30wcf93IjibntsU /vITnr3+A/sXZsv7w1w7x0RIsFx+nmL4oFuA6tErhg6IWYZ6cAPO71yGkj2gw0NuBklQ mJgsRAiRa9lZa9/e4U/Urm2J/gYAIeaYRun671b6J+7G3/d0P0xLbSP5f1LKnzR50Xl3 eIwzpx31vAEl4bdiOhru25KbCQsDjd7vZzKujV33QZ079zbM8tjstM/keAtuhWRBA3Ic XoFf3SpXOs24O7buKK33SipChHWGpj/27GX/nasaXujR1HhPQ2z5sppTAVo9TDlNQcWf oI3w==
MIME-Version 1.0
In-Reply-To <ahk36lFeqmlU3@mid.individual.net>
References <CAKhY55M3iOhhMbn=VuwJ7OMz_JUHVxwGy1h_BDpfBY8jQ5aRkA@mail.gmail.com> <50B3E865.9070700@davea.name> <CAKhY55OUNvGFhCjZLS7HNoDydKqHxZrnaN+bWySCsmKSk1vqsw@mail.gmail.com> <50B43246.2010902@davea.name> <mailman.313.1354026304.29569.python-list@python.org> <ahk36lFeqmlU3@mid.individual.net>
Date Tue, 27 Nov 2012 19:57:53 +0100
Subject Re: Compare list entry from csv files
From Anatoli Hristov <tolidtm@gmail.com>
To Neil Cerutti <neilc@norwich.edu>
Content-Type text/plain; charset=UTF-8
Cc python-list@python.org
X-BeenThere python-list@python.org
X-Mailman-Version 2.1.15
Precedence list
List-Id General discussion list for the Python programming language <python-list.python.org>
List-Unsubscribe <http://mail.python.org/mailman/options/python-list>, <mailto:python-list-request@python.org?subject=unsubscribe>
List-Archive <http://mail.python.org/pipermail/python-list/>
List-Post <mailto:python-list@python.org>
List-Help <mailto:python-list-request@python.org?subject=help>
List-Subscribe <http://mail.python.org/mailman/listinfo/python-list>, <mailto:python-list-request@python.org?subject=subscribe>
Newsgroups comp.lang.python
Message-ID <mailman.320.1354042676.29569.python-list@python.org> (permalink)
Lines 128
NNTP-Posting-Host 2001:888:2000:d::a6
X-Trace 1354042676 news.xs4all.nl 6876 [2001:888:2000:d::a6]:46203
X-Complaints-To abuse@xs4all.nl
Xref csiph.com comp.lang.python:33979

Show key headers only | View raw


On Tue, Nov 27, 2012 at 4:05 PM, Neil Cerutti <neilc@norwich.edu> wrote:
> On 2012-11-27, Anatoli Hristov <tolidtm@gmail.com> wrote:
>> Thanks for your help. I will do my best for the forum :)
>>
>> I advanced a little bit with the algorithm and at least I can
>> now extract and compare the fields :) For my beginner skills I
>> think this is too much for me. Now next step is to add the
>> second field with the number to the Namelist and copy it to a
>> third filename I suppose.
>
> I had to write a similar type of program, and I imagine it's a
> common problem. Sometimes new students provide incorrect SSN's or
> simply leave them blank. This makes it impossible for us to match
> their application for financial aid to their admissions record.
>
> You have to analyze how you're going to match records.
>
> In my case, missing SSN's are one case. A likeley match in this
> case is when the names are eerily similar.
>
> In the other case, where they simply got their SSN wrong, I have
> to check for both a similar SSN and a similar name.
>
> But you still have to define "similar." I looked up an algorithm
> on the web called Levenshtein Distance, and implemented it like
> so.
>
> def levenshteindistance(first, second):
>     """Find the Levenshtein distance between two strings."""
>     if len(first) > len(second):
>         first, second = second, first
>     if len(second) == 0:
>         return len(first)
>     first_length = len(first) + 1
>     second_length = len(second) + 1
>     distance_matrix = [[0] * second_length for x in range(first_length)]
>     for i in range(first_length):
>        distance_matrix[i][0] = i
>     for j in range(second_length):
>        distance_matrix[0][j]=j
>     for i in range(1, first_length):
>         for j in range(1, second_length):
>             deletion = distance_matrix[i-1][j] + 1
>             insertion = distance_matrix[i][j-1] + 1
>             substitution = distance_matrix[i-1][j-1]
>             if first[i-1] != second[j-1]:
>                 substitution += 1
>             distance_matrix[i][j] = min(insertion, deletion, substitution)
>     return distance_matrix[first_length-1][second_length-1]
>
> The algorithm return a count of every difference between the two
> strings, from 0 to the length of the longest string.
>
> Python provides difflib, which implements a similar algorithm, so
> I used that as well (kinda awkwardly). I used
> difflib.get_close_matches to get candidates, and then
> difflib.SequenceMatcher to provide me a score measuring the
> closeness.
>
> matches = difflib.get_close_matches(s1, s2)
> for m in matches:
>     scorer = difflib.SequenceMatcher(None, s1, m)
>     ratio = scorer.ratio()
>     if ratio == 0.0:
>         # perfect match
>     if ratio > MAX_RATIO: # You gotta choose this. I used 0.1
>         # close match
>
> The two algorithms come up with different guesses, and I pass on
> their suggestions for fixes to a human being. Both versions of
> the program take roughly 5 minutes to run the comparison on
> 2000-12000 records between the two files.
>
> I like the results of Levenshtein distance a little better, but
> difflib finds some stuff that it misses.
>
> In your case, the name is munged horribly in one of the files so
> you'll first have to first sort it out somehow.
>
> --
> Neil Cerutti
> --
> http://mail.python.org/mailman/listinfo/python-list


Thank you all for the help, but I figured that out and the program now
works perfect. I would appreciate if you have some notes about my
script as I'm noob :)
Here is the code:

import csv

origf = open('c:/Working/Test_phonebook.csv', 'rt')
secfile = open('c:/Working/phones.csv', 'rt')

phonelist = []
namelist = []

names = csv.reader(origf, delimiter=';')
phones = csv.reader(secfile, delimiter=';')
for tel in phones:
    phonelist.append(tel)



def finder(name_row,rows):
    for ex_phone in phonelist:
        telstr = ex_phone[0].lower()
        if telstr.find(name_row) >= 0:
            print "\nName found: %s" % name_row
            namelist[rows][-1] = ex_phone[-1].lower()
        else:
            pass
    return

def name_find():
    rows = 0
    for row in names:
        namelist.append(row)
        name_row = row[0].lower()
        finder(name_row,rows)
        rows = rows+1
name_find()
ofile  = open('c:/Working/ttest.csv', "wb")
writer = csv.writer(wfile, delimiter=';')
for insert in namelist:
    writer.writerow(insert)
wfile.close()

Back to comp.lang.python | Previous | NextPrevious in thread | Next in thread | Find similar | Unroll thread


Thread

Re: Compare list entry from csv files Anatoli Hristov <tolidtm@gmail.com> - 2012-11-27 15:24 +0100
  Re: Compare list entry from csv files Neil Cerutti <neilc@norwich.edu> - 2012-11-27 15:05 +0000
    Re: Compare list entry from csv files Anatoli Hristov <tolidtm@gmail.com> - 2012-11-27 19:57 +0100
      Re: Compare list entry from csv files Neil Cerutti <neilc@norwich.edu> - 2012-11-27 20:41 +0000
        Re: Compare list entry from csv files Anatoli Hristov <tolidtm@gmail.com> - 2012-11-29 11:22 +0100
        Re: Compare list entry from csv files Thomas Bach <thbach@students.uni-mainz.de> - 2012-11-29 13:07 +0100
        RE: Compare list entry from csv files "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-11-29 23:13 +0000
        Re: Compare list entry from csv files Dave Angel <d@davea.name> - 2012-11-29 22:17 -0500
        Re: Compare list entry from csv files Anatoli Hristov <tolidtm@gmail.com> - 2012-11-30 10:26 +0100
        Re: Compare list entry from csv files Anatoli Hristov <tolidtm@gmail.com> - 2012-11-30 10:29 +0100
    Re: Compare list entry from csv files Dave Angel <d@davea.name> - 2012-11-27 14:59 -0500

csiph-web