Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #12021 > unrolled thread
| Started by | Peter Otten <__peter__@web.de> |
|---|---|
| First post | 2011-08-22 14:53 +0200 |
| Last post | 2011-08-22 13:05 +0000 |
| Articles | 2 — 2 participants |
Back to article view | Back to comp.lang.python
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: Optimizing Text Similarity Algorithm Peter Otten <__peter__@web.de> - 2011-08-22 14:53 +0200
Re: Optimizing Text Similarity Algorithm Neil Cerutti <neilc@norwich.edu> - 2011-08-22 13:05 +0000
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2011-08-22 14:53 +0200 |
| Subject | Re: Optimizing Text Similarity Algorithm |
| Message-ID | <mailman.312.1314017633.27778.python-list@python.org> |
Yaşar Arabacı wrote:
> I originally posted this question on stackoverflow, you can find it here:
> http://stackoverflow.com/q/7133350/886669
>
> I just want people check what I am doing and express their opinion about
> the thing I am doing is acceptable, or are there some expects of it that
> could change.
You are using only word frequencies to calculate the similarity of the
document pairs. You can calculate these frequencies before you enter the the
expensive
for documentA in ...:
for documentB in ...:
calculate_similarity(documentA, documentB)
loops and therefore bring that portion of your could from O(n*n) to O(n).
Here's is a sample where I applied that technique blindly to your code. I
also had to remove the django dependency, so I've changed it to operate on
text files.
import sys
import math
import re
from collections import Counter
from itertools import combinations
def get_words(text):
# FIXME
regex = re.compile("\W+", flags=re.UNICODE)
return re.split(regex, text)
def pairs(files):
"""Generate (title, wordlist) pairs.
(filename is used as title)
"""
for filename in files:
with open(filename) as f:
text = f.read()
yield filename, get_words(text)
def process(pairs):
triples = []
total = Counter()
for title, words in pairs:
c = Counter(words)
total.update(c.iterkeys())
sigma = sum(math.log(freq, 1.6) for freq in c.itervalues())
triples.append((title, c, sigma))
for (title1, freq1, sum1), (title2, freq2, sum2) in combinations(
triples, 2):
upper_part = 0
for word in freq1 & freq2:
adjusted1 = math.log(freq1[word], 1.6)
adjusted2 = math.log(freq2[word], 1.6)
upper_part += (
adjusted1 * adjusted2 * math.log(len(triples)/total[word]))
lower_part = math.sqrt(sum1 * sum2)
title1, title2 = sorted([title1, title2])
yield title1, title2, upper_part/lower_part
def main():
files = sys.argv[1:]
results = process(pairs(files))
results = sorted(results, key=lambda x: x[:2])
results.sort(key=lambda x: x[2], reverse=True)
print "\n".join("%s and %s => %f" % xcv for xcv in results)
if __name__ == "__main__":
main()
[toc] | [next] | [standalone]
| From | Neil Cerutti <neilc@norwich.edu> |
|---|---|
| Date | 2011-08-22 13:05 +0000 |
| Message-ID | <9bf2fvFer9U1@mid.individual.net> |
| In reply to | #12021 |
On 2011-08-22, Peter Otten <__peter__@web.de> wrote: > Ya??ar Arabac?? wrote: >> I originally posted this question on stackoverflow, you can find it here: >> http://stackoverflow.com/q/7133350/886669 >> >> I just want people check what I am doing and express their opinion about >> the thing I am doing is acceptable, or are there some expects of it that >> could change. Perhaps check out difflib.SequenceMatcher.ratio to see if the library function is good enough. -- Neil Cerutti
[toc] | [prev] | [standalone]
Back to top | Article view | comp.lang.python
csiph-web