Groups > comp.lang.python > #73165 > unrolled thread

Python's re module and genealogy problem

Started by	BrJohan <brjohan@gmail.com>
First post	2014-06-11 14:23 +0200
Last post	2014-06-14 08:35 +0000
Articles	12 — 11 participants

Back to article view | Back to comp.lang.python

  Python's re module and genealogy problem BrJohan <brjohan@gmail.com> - 2014-06-11 14:23 +0200
    Re: Python's re module and genealogy problem Robert Kern <robert.kern@gmail.com> - 2014-06-11 14:26 +0100
      Re: Python's re module and genealogy problem Mark H Harris <harrismh777@gmail.com> - 2014-06-11 09:08 -0500
    Re: Python's re module and genealogy problem Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de> - 2014-06-11 15:55 +0200
    Re: Python's re module and genealogy problem Michael Torrie <torriem@gmail.com> - 2014-06-11 09:34 -0600
    Re: Python's re module and genealogy problem Nick Cash <nick.cash@npcinternational.com> - 2014-06-11 16:21 +0000
    Re: Python's re module and genealogy problem Simon Ward <simon@bleah.co.uk> - 2014-06-11 18:21 +0100
    Re: Python's re module and genealogy problem Vlastimil Brom <vlastimil.brom@gmail.com> - 2014-06-11 20:09 +0200
    Re: Python's re module and genealogy problem BrJohan <brjohan@gmail.com> - 2014-06-13 17:17 +0200
      Re: Python's re module and genealogy problem Peter Otten <__peter__@web.de> - 2014-06-13 18:26 +0200
      Re: Python's re module and genealogy problem Dan Sommers <dan@tombstonezero.net> - 2014-06-14 05:14 +0000
    Re: Python's re module and genealogy problem Tony the Tiger <tony@tiger.invalid> - 2014-06-14 08:35 +0000

#73165 — Python's re module and genealogy problem

From	BrJohan <brjohan@gmail.com>
Date	2014-06-11 14:23 +0200
Subject	Python's re module and genealogy problem
Message-ID	<bvr01iFu926U1@mid.individual.net>

For some genealogical purposes I consider using Python's re module.

Rather many names can be spelled in a number of similar ways, and in 
order to match names even if they are spelled differently, I will build 
regular expressions, each of which is supposed to match  a number of 
similar names.

I guess that there will be a few hundred such regular expressions 
covering most popular names.

Now, my problem: Is there a way to decide whether any two - or more - of 
those regular expressions will match the same string?

Or, stated a little differently:

Can it, for a pair of regular expressions be decided whether at least 
one string matching both of those regular expressions, can be constructed?

If it is possible to make such a decision, then how? Anyone aware of an 
algorithm for this?

[toc] | [next] | [standalone]

#73170

From	Robert Kern <robert.kern@gmail.com>
Date	2014-06-11 14:26 +0100
Message-ID	<mailman.11008.1402493220.18130.python-list@python.org>
In reply to	#73165

On 2014-06-11 13:23, BrJohan wrote:
> For some genealogical purposes I consider using Python's re module.
>
> Rather many names can be spelled in a number of similar ways, and in order to
> match names even if they are spelled differently, I will build regular
> expressions, each of which is supposed to match  a number of similar names.
>
> I guess that there will be a few hundred such regular expressions covering most
> popular names.
>
> Now, my problem: Is there a way to decide whether any two - or more - of those
> regular expressions will match the same string?
>
> Or, stated a little differently:
>
> Can it, for a pair of regular expressions be decided whether at least one string
> matching both of those regular expressions, can be constructed?
>
> If it is possible to make such a decision, then how? Anyone aware of an
> algorithm for this?

And if that isn't the best straight line for the old saying, I don't know what is.

   http://en.wikiquote.org/wiki/Jamie_Zawinski

Anyways, to your new problem, yes it's possible. Search for "regular expression 
intersection" for possible approaches. You will probably have to translate the 
regular expression to a different formalism or at least a different library to 
implement this.

Consider just listing out the different possibilities. All of your regexes 
should be "well-behaved" given the constraints of the domain (tightly bounded, 
at least). There are tools that help generate matching strings from a Python 
regex. This will help you QA your regexes, too, to be sure that they match what 
you expect them to and not match non-names.

   https://github.com/asciimoo/exrex

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
  that is made terrible by our own mad attempt to interpret it as though it had
  an underlying truth."
   -- Umberto Eco

[toc] | [prev] | [next] | [standalone]

#73175

From	Mark H Harris <harrismh777@gmail.com>
Date	2014-06-11 09:08 -0500
Message-ID	<ln9nsl$b4o$1@speranza.aioe.org>
In reply to	#73170

On 6/11/14 8:26 AM, Robert Kern wrote:
> Anyways, to your new problem, yes it's possible. Search for "regular
> expression intersection" for possible approaches.

I agree,  I would not use a decision (decision tree) but would consider 
a set of filters from most specific to least specific.

marcus

[toc] | [prev] | [next] | [standalone]

#73174

From	Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de>
Date	2014-06-11 15:55 +0200
Message-ID	<ln9n4b$pog$1@r01.glglgl.de>
In reply to	#73165

Am 11.06.2014 14:23 schrieb BrJohan:

> Can it, for a pair of regular expressions be decided whether at least
> one string matching both of those regular expressions, can be constructed?
>
> If it is possible to make such a decision, then how? Anyone aware of an
> algorithm for this?

Just a feeling-based statement: I don't think that is easily possible.

Every RE can potentially match an infinite number of statements.

Just have a look at

re1 = re.compile('A43.*')
re2 = re.compile('.*[0-9][A-F]')

It can easily seen that the area these REs work on is different; they 
are orthogonal.

So there is an infinite number of strings these REs match.


Thomas

[toc] | [prev] | [next] | [standalone]

#73177

From	Michael Torrie <torriem@gmail.com>
Date	2014-06-11 09:34 -0600
Message-ID	<mailman.11012.1402500909.18130.python-list@python.org>
In reply to	#73165

On 06/11/2014 06:23 AM, BrJohan wrote:
> For some genealogical purposes I consider using Python's re module.
> 
> Rather many names can be spelled in a number of similar ways, and in 
> order to match names even if they are spelled differently, I will build 
> regular expressions, each of which is supposed to match  a number of 
> similar names.
> 
> I guess that there will be a few hundred such regular expressions 
> covering most popular names.
> 
> Now, my problem: Is there a way to decide whether any two - or more - of 
> those regular expressions will match the same string?
> 
> Or, stated a little differently:
> 
> Can it, for a pair of regular expressions be decided whether at least 
> one string matching both of those regular expressions, can be constructed?
> 
> If it is possible to make such a decision, then how? Anyone aware of an 
> algorithm for this?

You might want to search for fuzzy matching algorithms. Years ago, there
was an algorithm called soundex that would generate fuzzy fingerprints
for words that would hide differences in spelling, etc.  Unfortunately
such an algorithm would be language dependent.  The problem you are
trying to solve is one of those very hard problems in computers and math.

[toc] | [prev] | [next] | [standalone]

#73180

From	Nick Cash <nick.cash@npcinternational.com>
Date	2014-06-11 16:21 +0000
Message-ID	<mailman.11014.1402504629.18130.python-list@python.org>
In reply to	#73165

On 06/11/2014 10:35 AM, Michael Torrie wrote:
> On 06/11/2014 06:23 AM, BrJohan wrote:
>> For some genealogical purposes I consider using Python's re module.
>>
>> Rather many names can be spelled in a number of similar ways, and in 
>> order to match names even if they are spelled differently, I will build 
>> regular expressions, each of which is supposed to match  a number of 
>> similar names.
> You might want to search for fuzzy matching algorithms. Years ago, there
> was an algorithm called soundex that would generate fuzzy fingerprints
> for words that would hide differences in spelling, etc.  Unfortunately
> such an algorithm would be language dependent.  The problem you are
> trying to solve is one of those very hard problems in computers and math.
>

Soundex is actually not horrible, but it is definitely only for English
names. Newer variants of Metaphone
(http://en.wikipedia.org/wiki/Metaphone) are significantly better, and
support quite a few other languages.  Either one would most likely be
better than the regex approach.

Side note: if your data happens to be in MySQL then it has a builtin
"sounds_like()" function that compares strings using soundex.

[toc] | [prev] | [next] | [standalone]

#73183

From	Simon Ward <simon@bleah.co.uk>
Date	2014-06-11 18:21 +0100
Message-ID	<mailman.11017.1402508398.18130.python-list@python.org>
In reply to	#73165

On 11 June 2014 13:23:14 BST, BrJohan <brjohan@gmail.com> wrote:
>For some genealogical purposes I consider using Python's re module.
>
>Rather many names can be spelled in a number of similar ways, and in 
>order to match names even if they are spelled differently, I will build
>
>regular expressions, each of which is supposed to match  a number of 
>similar names.

As has been mentioned, you probably want to look at fuzzy matching algorithms rather than aiming at regular expressions, although a quick search suggests there has been some work on fuzzy matching with regular expressions[1].

>Now, my problem: Is there a way to decide whether any two - or more -
>of 
>those regular expressions will match the same string?

If your regexes are truly regular expressions (see [2]*) then they represent regular languages[3], which are really sets. The intersection of these,  is another regular language. If you test the string against this it will also match both original languages.

(*this only mentions back references, but I think the look-ahead/behind assertions are also non-regular)

[1]: http://laurikari.net/tre/about/
[2]: https://en.wikipedia.org/wiki/Regular_expression#Patterns_for_non-regular_languages
[3]: https://en.wikipedia.org/wiki/Regular_language

Simon
-- 
Sent from Kaiten Mail. Please excuse my brevity.

[toc] | [prev] | [next] | [standalone]

#73185

From	Vlastimil Brom <vlastimil.brom@gmail.com>
Date	2014-06-11 20:09 +0200
Message-ID	<mailman.11018.1402510564.18130.python-list@python.org>
In reply to	#73165

2014-06-11 14:23 GMT+02:00 BrJohan <brjohan@gmail.com>:
> For some genealogical purposes I consider using Python's re module.
>...
>
> Now, my problem: Is there a way to decide whether any two - or more - of
> those regular expressions will match the same string?
>
> Or, stated a little differently:
>
> Can it, for a pair of regular expressions be decided whether at least one
> string matching both of those regular expressions, can be constructed?
> --
> https://mail.python.org/mailman/listinfo/python-list

Hi,
i guess, you could reuse some available generators for strings
matching a given regular expression, see e.g.:
http://stackoverflow.com/questions/492716/reversing-a-regular-expression-in-python/
for example a pyparsing recipe:
http://stackoverflow.com/questions/492716/reversing-a-regular-expression-in-python/5006339#5006339

which might be general enough for your needs - of course, you cannot
use unbound quantifiers, backreferences, etc.

Then you can test for identical strings in the generated outputs -
e.g. using the set(...) and its intersection method.

You might also check a much more powerful regex library
https://pypi.python.org/pypi/regex

which, beyond other features, also supports the mentioned fuzzy matches, cf.

>>> regex.findall(r"\bSm(?:ith){e<3}\b", "Smith Smithe Smyth Smythe Smijth")
['Smith', 'Smithe', 'Smyth', 'Smythe', 'Smijth']
>>>
(but, of course, you will have to be careful with this feature in
order to reduce false positives)

hth,
   vbr

[toc] | [prev] | [next] | [standalone]

#73267

From	BrJohan <brjohan@gmail.com>
Date	2014-06-13 17:17 +0200
Message-ID	<c00ivgF5cjpU1@mid.individual.net>
In reply to	#73165

On 11/06/2014 14:23, BrJohan wrote:
> For some genealogical purposes I consider using Python's re module.
>
> Rather many names can be spelled in a number of similar ways, and in
> order to match names even if they are spelled differently, I will build
> regular expressions, each of which is supposed to match  a number of
> similar names.
>
> I guess that there will be a few hundred such regular expressions
> covering most popular names.
>
> Now, my problem: Is there a way to decide whether any two - or more - of
> those regular expressions will match the same string?
>
> Or, stated a little differently:
>
> Can it, for a pair of regular expressions be decided whether at least
> one string matching both of those regular expressions, can be constructed?
>
> If it is possible to make such a decision, then how? Anyone aware of an
> algorithm for this?

Thank you all for valuable input and interesting thoughts.

After having reconsidered my problem, it might be better to approach it 
a little differently.

Either to state the regexps simply like:
"(Kristina)|(Christina)|(Cristine)|(Kristine)"
instead of "((K|(Ch))ristina)|([CK]ristine)"

Or to put the namevariants in some sequence of sets having elements like:
("Kristina", "Christina", "Cristine", "Kristine")
Matching is then just applying the 'in' operator.

I see two distinct advantages.
1. Readability and maintainability
2. Any namevariant occurring in just one regexp or set means no risk of 
erroneous matching.

Comments?

[toc] | [prev] | [next] | [standalone]

#73269

From	Peter Otten <__peter__@web.de>
Date	2014-06-13 18:26 +0200
Message-ID	<mailman.11059.1402676838.18130.python-list@python.org>
In reply to	#73267

BrJohan wrote:

> On 11/06/2014 14:23, BrJohan wrote:
>> For some genealogical purposes I consider using Python's re module.
>>
>> Rather many names can be spelled in a number of similar ways, and in
>> order to match names even if they are spelled differently, I will build
>> regular expressions, each of which is supposed to match  a number of
>> similar names.
>>
>> I guess that there will be a few hundred such regular expressions
>> covering most popular names.
>>
>> Now, my problem: Is there a way to decide whether any two - or more - of
>> those regular expressions will match the same string?
>>
>> Or, stated a little differently:
>>
>> Can it, for a pair of regular expressions be decided whether at least
>> one string matching both of those regular expressions, can be
>> constructed?
>>
>> If it is possible to make such a decision, then how? Anyone aware of an
>> algorithm for this?
> 
> Thank you all for valuable input and interesting thoughts.
> 
> After having reconsidered my problem, it might be better to approach it
> a little differently.
> 
> Either to state the regexps simply like:
> "(Kristina)|(Christina)|(Cristine)|(Kristine)"
> instead of "((K|(Ch))ristina)|([CK]ristine)"
> 
> Or to put the namevariants in some sequence of sets having elements like:
> ("Kristina", "Christina", "Cristine", "Kristine")
> Matching is then just applying the 'in' operator.
> 
> I see two distinct advantages.
> 1. Readability and maintainability
> 2. Any namevariant occurring in just one regexp or set means no risk of
> erroneous matching.
> 
> Comments?

I like the simple variant

kristinas = ("Kristina", "Christina", "Cristine", "Kristine")

But instead of matching with "in" you could build a dict that maps the name 
variants to a normalised name

normalized_names = {
    "Kristina": "Kristina",
    "Christina": "Kristina",
    ...
    "John": "John",
    "Johann": "John",
    ...
}
def normalized(name):
    return normalized_names.get(name, name)

If you put persons in another dict or a database indexed by the normalised 
name 

lookup = {
    "Kristina": ["Kristina Smith", "Christina Miller"],
    ...
}

you can find all Kristinas with two look-ups:

>>> lookup[normalized("Kristine")]
['Kristina Smith', 'Christina Miller']

PS: A problem with this approach might be that (name in nameset_A) and (name 
in nameset_B) implies nameset_A == nameset_B

[toc] | [prev] | [next] | [standalone]

#73274

From	Dan Sommers <dan@tombstonezero.net>
Date	2014-06-14 05:14 +0000
Message-ID	<lnglo9$pg4$1@dont-email.me>
In reply to	#73267

On Fri, 13 Jun 2014 17:17:06 +0200, BrJohan wrote:

> Or to put the namevariants in some sequence of sets having elements
> like:  ("Kristina", "Christina", "Cristine", "Kristine")

> Matching is then just applying the 'in' operator.

That's definitely a better approach, for the reasons you mentioned.

> Comments?

A soundex (or similar) algorithm will be better in the long run for the
less common, but more often misspelled names.  It's fairly simple to
guess at a number of common spellings for names that *you* think are
common now, but what about names that run in families that aren't yours,
or aren't that common outside of that family, or were wildly popular a
couple of hundred years ago but have fallen out of favor now?

My wife's ancestors (she's the genealogist, I just get to hear the
horror stories) are notorious for being somewhat illiterate; for
changing their names, on purpose, after a feud, in order to "distance"
themselves from their relatives; and also for using not-common-now (or
even not-so-common-then) names.  Add in somewhat illiterate records
keepers and hospital workers (or midwives or neighbors), not to mention
bad copies of bad copies of centuries-old smudged documents, and you
have an instant soup of names that sound alike but are spelled
differently in ways you cannot guess ahead of time.

Your users will appreciate *some* sort of fuzzy matching, or runtime
extensibility, atop the "obvious" spellings you take the time to include
in your software.  And that's *not* a comment on your abilities; it's a
comment on the abilities and creativity of their ancestors.

Dan

[toc] | [prev] | [next] | [standalone]

#73277

From	Tony the Tiger <tony@tiger.invalid>
Date	2014-06-14 08:35 +0000
Message-ID	<539c0969$0$59978$c3e8da3$26f9d57d@news.astraweb.com>
In reply to	#73165

On Wed, 11 Jun 2014 14:23:14 +0200, BrJohan wrote:

> I guess that there will be a few hundred such regular expressions
> covering most popular names.

Perhaps

http://code.activestate.com/recipes/52213-soundex-algorithm/
https://en.wikipedia.org/wiki/Soundex

will work?


 /Grrr
-- 
          ___                  ___
 (\_--_/)  | _ ._    _|_|_  _   |o _  _ ._
 ( 9  9 )  |(_)| |\/  |_| |(/_  ||(_|(/_|
 stripes are forever - as overripe ferrets

[toc] | [prev] | [standalone]

csiph-web

Python's re module and genealogy problem

Contents

#73165 — Python's re module and genealogy problem

#73170

#73175

#73174

#73177

#73180

#73183

#73185

#73267

#73269

#73274

#73277