Groups > comp.lang.python > #12021 > unrolled thread

Re: Optimizing Text Similarity Algorithm

Started by	Peter Otten <__peter__@web.de>
First post	2011-08-22 14:53 +0200
Last post	2011-08-22 13:05 +0000
Articles	2 — 2 participants

Back to article view | Back to comp.lang.python

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: Optimizing Text Similarity Algorithm Peter Otten <__peter__@web.de> - 2011-08-22 14:53 +0200
    Re: Optimizing Text Similarity Algorithm Neil Cerutti <neilc@norwich.edu> - 2011-08-22 13:05 +0000

#12021 — Re: Optimizing Text Similarity Algorithm

From	Peter Otten <__peter__@web.de>
Date	2011-08-22 14:53 +0200
Subject	Re: Optimizing Text Similarity Algorithm
Message-ID	<mailman.312.1314017633.27778.python-list@python.org>

Yaşar Arabacı wrote:

> I originally posted this question on stackoverflow, you can find it here:
> http://stackoverflow.com/q/7133350/886669
> 
> I just want people check what I am doing and express their opinion about
> the thing I am doing is acceptable, or are there some expects of it that
> could change.

You are using only word frequencies to calculate the similarity of the 
document pairs. You can calculate these frequencies before you enter the the 
expensive 

for documentA in ...:
    for documentB in ...:
        calculate_similarity(documentA, documentB)

loops and therefore bring that portion of your could from O(n*n) to O(n). 
Here's is a sample where I applied that technique blindly to your code. I 
also had to remove the django dependency, so I've changed it to operate on 
text files.

import sys
import math
import re

from collections import Counter
from itertools import combinations

def get_words(text):
    # FIXME
    regex = re.compile("\W+", flags=re.UNICODE)
    return re.split(regex, text)

def pairs(files):
    """Generate (title, wordlist) pairs.

    (filename is used as title)
    """
    for filename in files:
        with open(filename) as f:
            text = f.read()
        yield filename, get_words(text)

def process(pairs):
    triples = []
    total = Counter()
    for title, words in pairs:
        c = Counter(words)
        total.update(c.iterkeys())
        sigma = sum(math.log(freq, 1.6) for freq in c.itervalues())
        triples.append((title, c, sigma))

    for (title1, freq1, sum1), (title2, freq2, sum2) in combinations(
        triples, 2):
        upper_part = 0

        for word in freq1 & freq2:
            adjusted1 = math.log(freq1[word], 1.6)
            adjusted2 = math.log(freq2[word], 1.6)
            upper_part += (
                adjusted1 * adjusted2 * math.log(len(triples)/total[word]))

        lower_part = math.sqrt(sum1 * sum2)

        title1, title2 = sorted([title1, title2])
        yield title1, title2, upper_part/lower_part


def main():
    files = sys.argv[1:]
    
    results = process(pairs(files))
    results = sorted(results, key=lambda x: x[:2])
    results.sort(key=lambda x: x[2], reverse=True)
    print "\n".join("%s and %s => %f" % xcv for xcv in results)

if __name__ == "__main__":
    main()

[toc] | [next] | [standalone]

#12023

From	Neil Cerutti <neilc@norwich.edu>
Date	2011-08-22 13:05 +0000
Message-ID	<9bf2fvFer9U1@mid.individual.net>
In reply to	#12021

On 2011-08-22, Peter Otten <__peter__@web.de> wrote:
> Ya??ar Arabac?? wrote:
>> I originally posted this question on stackoverflow, you can find it here:
>> http://stackoverflow.com/q/7133350/886669
>> 
>> I just want people check what I am doing and express their opinion about
>> the thing I am doing is acceptable, or are there some expects of it that
>> could change.

Perhaps check out difflib.SequenceMatcher.ratio to see if the
library function is good enough.

-- 
Neil Cerutti

[toc] | [prev] | [standalone]

csiph-web

Re: Optimizing Text Similarity Algorithm

Contents

#12021 — Re: Optimizing Text Similarity Algorithm

#12023