From: Dan Sommers
Newsgroups: comp.lang.python
Subject: Re: Python's re module and genealogy problem
Date: Sat, 14 Jun 2014 05:14:50 +0000 (UTC)

On Fri, 13 Jun 2014 17:17:06 +0200, BrJohan wrote:

> Or to put the namevariants in some sequence of sets having elements
> like: ("Kristina", "Christina", "Cristine", "Kristine")
> Matching is then just applying the 'in' operator.

That's definitely a better approach, for the reasons you mentioned.

> Comments?

A soundex (or similar) algorithm will be better in the long run for the less common, but more often misspelled names. It's fairly simple to guess at a number of common spellings for names that *you* think are common now, but what about names that run in families that aren't yours, or aren't that common outside of that family, or were wildly popular a couple of hundred years ago but have fallen out of favor now?

My wife's ancestors (she's the genealogist, I just get to hear the horror stories) are notorious for being somewhat illiterate; for changing their names, on purpose, after a feud, in order to "distance" themselves from their relatives; and also for using not-common-now (or even not-so-common-then) names.
Add in somewhat illiterate record keepers and hospital workers (or midwives or neighbors), not to mention bad copies of bad copies of centuries-old smudged documents, and you have an instant soup of names that sound alike but are spelled differently in ways you cannot guess ahead of time.

Your users will appreciate *some* sort of fuzzy matching, or runtime extensibility, atop the "obvious" spellings you take the time to include in your software.

And that's *not* a comment on your abilities; it's a comment on the abilities and creativity of their ancestors.

Dan
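
P.S. In case it helps, here is a rough sketch of how the two ideas might sit together: hand-built variant sets checked with 'in', a Soundex code as the fallback, and a way to add spellings at run time. Everything named here (soundex, same_name, add_variant, VARIANTS) is made up for illustration and is not from your code. It implements plain American Soundex as I understand the usual rules, which may not be the best phonetic scheme for your particular records.

```python
# Letter-to-digit table for American Soundex.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return the 4-character Soundex code of name ('' if it has no A-Z letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")
    for c in letters[1:]:
        if c in "HW":
            continue          # H and W do not separate duplicate codes
        code = _CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code           # a vowel resets prev, so a repeat after it counts
    return (first + "".join(digits) + "000")[:4]


# Hand-maintained groups of spellings; this one is the example from the thread.
VARIANTS = [{"Kristina", "Christina", "Cristine", "Kristine"}]


def same_name(a, b, variant_sets=VARIANTS):
    """True if a and b share a variant set or, failing that, a Soundex code."""
    a, b = a.strip().casefold(), b.strip().casefold()
    if a == b:
        return True
    for group in variant_sets:
        lowered = {n.casefold() for n in group}
        if a in lowered and b in lowered:
            return True
    code = soundex(a)
    return code != "" and code == soundex(b)


def add_variant(known, new, variant_sets=VARIANTS):
    """Runtime extensibility: file new under the group that contains known."""
    key = known.casefold()
    for group in variant_sets:
        if key in {n.casefold() for n in group}:
            group.add(new)
            return
    variant_sets.append({known, new})
```

Note that Soundex keeps the first letter, so "Kristina" and "Christina" get different codes; that pair only matches because of the variant set. Going the other way, a spelling nobody thought to list, such as "Kristyna", still matches "Kristina" through the Soundex fallback. That is why I would keep both layers instead of picking one.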