Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #73165 > unrolled thread
| Started by | BrJohan <brjohan@gmail.com> |
|---|---|
| First post | 2014-06-11 14:23 +0200 |
| Last post | 2014-06-14 08:35 +0000 |
| Articles | 12 — 11 participants |
Back to article view | Back to comp.lang.python
Python's re module and genealogy problem BrJohan <brjohan@gmail.com> - 2014-06-11 14:23 +0200
Re: Python's re module and genealogy problem Robert Kern <robert.kern@gmail.com> - 2014-06-11 14:26 +0100
Re: Python's re module and genealogy problem Mark H Harris <harrismh777@gmail.com> - 2014-06-11 09:08 -0500
Re: Python's re module and genealogy problem Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de> - 2014-06-11 15:55 +0200
Re: Python's re module and genealogy problem Michael Torrie <torriem@gmail.com> - 2014-06-11 09:34 -0600
Re: Python's re module and genealogy problem Nick Cash <nick.cash@npcinternational.com> - 2014-06-11 16:21 +0000
Re: Python's re module and genealogy problem Simon Ward <simon@bleah.co.uk> - 2014-06-11 18:21 +0100
Re: Python's re module and genealogy problem Vlastimil Brom <vlastimil.brom@gmail.com> - 2014-06-11 20:09 +0200
Re: Python's re module and genealogy problem BrJohan <brjohan@gmail.com> - 2014-06-13 17:17 +0200
Re: Python's re module and genealogy problem Peter Otten <__peter__@web.de> - 2014-06-13 18:26 +0200
Re: Python's re module and genealogy problem Dan Sommers <dan@tombstonezero.net> - 2014-06-14 05:14 +0000
Re: Python's re module and genealogy problem Tony the Tiger <tony@tiger.invalid> - 2014-06-14 08:35 +0000
| From | BrJohan <brjohan@gmail.com> |
|---|---|
| Date | 2014-06-11 14:23 +0200 |
| Subject | Python's re module and genealogy problem |
| Message-ID | <bvr01iFu926U1@mid.individual.net> |
For some genealogical purposes I consider using Python's re module. Rather many names can be spelled in a number of similar ways, and in order to match names even if they are spelled differently, I will build regular expressions, each of which is supposed to match a number of similar names. I guess that there will be a few hundred such regular expressions covering most popular names. Now, my problem: Is there a way to decide whether any two - or more - of those regular expressions will match the same string? Or, stated a little differently: Can it, for a pair of regular expressions be decided whether at least one string matching both of those regular expressions, can be constructed? If it is possible to make such a decision, then how? Anyone aware of an algorithm for this?
[toc] | [next] | [standalone]
| From | Robert Kern <robert.kern@gmail.com> |
|---|---|
| Date | 2014-06-11 14:26 +0100 |
| Message-ID | <mailman.11008.1402493220.18130.python-list@python.org> |
| In reply to | #73165 |
On 2014-06-11 13:23, BrJohan wrote: > For some genealogical purposes I consider using Python's re module. > > Rather many names can be spelled in a number of similar ways, and in order to > match names even if they are spelled differently, I will build regular > expressions, each of which is supposed to match a number of similar names. > > I guess that there will be a few hundred such regular expressions covering most > popular names. > > Now, my problem: Is there a way to decide whether any two - or more - of those > regular expressions will match the same string? > > Or, stated a little differently: > > Can it, for a pair of regular expressions be decided whether at least one string > matching both of those regular expressions, can be constructed? > > If it is possible to make such a decision, then how? Anyone aware of an > algorithm for this? And if that isn't the best straight line for the old saying, I don't know what is. http://en.wikiquote.org/wiki/Jamie_Zawinski Anyways, to your new problem, yes it's possible. Search for "regular expression intersection" for possible approaches. You will probably have to translate the regular expression to a different formalism or at least a different library to implement this. Consider just listing out the different possibilities. All of your regexes should be "well-behaved" given the constraints of the domain (tightly bounded, at least). There are tools that help generate matching strings from a Python regex. This will help you QA your regexes, too, to be sure that they match what you expect them to and not match non-names. https://github.com/asciimoo/exrex -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
[toc] | [prev] | [next] | [standalone]
| From | Mark H Harris <harrismh777@gmail.com> |
|---|---|
| Date | 2014-06-11 09:08 -0500 |
| Message-ID | <ln9nsl$b4o$1@speranza.aioe.org> |
| In reply to | #73170 |
On 6/11/14 8:26 AM, Robert Kern wrote: > Anyways, to your new problem, yes it's possible. Search for "regular > expression intersection" for possible approaches. I agree, I would not use a decision (decision tree) but would consider a set of filters from most specific to least specific. marcus
[toc] | [prev] | [next] | [standalone]
| From | Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de> |
|---|---|
| Date | 2014-06-11 15:55 +0200 |
| Message-ID | <ln9n4b$pog$1@r01.glglgl.de> |
| In reply to | #73165 |
Am 11.06.2014 14:23 schrieb BrJohan:
> Can it, for a pair of regular expressions be decided whether at least
> one string matching both of those regular expressions, can be constructed?
>
> If it is possible to make such a decision, then how? Anyone aware of an
> algorithm for this?
Just a feeling-based statement: I don't think that is easily possible.
Every RE can potentially match an infinite number of statements.
Just have a look at
re1 = re.compile('A43.*')
re2 = re.compile('.*[0-9][A-F]')
It can easily seen that the area these REs work on is different; they
are orthogonal.
So there is an infinite number of strings these REs match.
Thomas
[toc] | [prev] | [next] | [standalone]
| From | Michael Torrie <torriem@gmail.com> |
|---|---|
| Date | 2014-06-11 09:34 -0600 |
| Message-ID | <mailman.11012.1402500909.18130.python-list@python.org> |
| In reply to | #73165 |
On 06/11/2014 06:23 AM, BrJohan wrote: > For some genealogical purposes I consider using Python's re module. > > Rather many names can be spelled in a number of similar ways, and in > order to match names even if they are spelled differently, I will build > regular expressions, each of which is supposed to match a number of > similar names. > > I guess that there will be a few hundred such regular expressions > covering most popular names. > > Now, my problem: Is there a way to decide whether any two - or more - of > those regular expressions will match the same string? > > Or, stated a little differently: > > Can it, for a pair of regular expressions be decided whether at least > one string matching both of those regular expressions, can be constructed? > > If it is possible to make such a decision, then how? Anyone aware of an > algorithm for this? You might want to search for fuzzy matching algorithms. Years ago, there was an algorithm called soundex that would generate fuzzy fingerprints for words that would hide differences in spelling, etc. Unfortunately such an algorithm would be language dependent. The problem you are trying to solve is one of those very hard problems in computers and math.
[toc] | [prev] | [next] | [standalone]
| From | Nick Cash <nick.cash@npcinternational.com> |
|---|---|
| Date | 2014-06-11 16:21 +0000 |
| Message-ID | <mailman.11014.1402504629.18130.python-list@python.org> |
| In reply to | #73165 |
On 06/11/2014 10:35 AM, Michael Torrie wrote: > On 06/11/2014 06:23 AM, BrJohan wrote: >> For some genealogical purposes I consider using Python's re module. >> >> Rather many names can be spelled in a number of similar ways, and in >> order to match names even if they are spelled differently, I will build >> regular expressions, each of which is supposed to match a number of >> similar names. > You might want to search for fuzzy matching algorithms. Years ago, there > was an algorithm called soundex that would generate fuzzy fingerprints > for words that would hide differences in spelling, etc. Unfortunately > such an algorithm would be language dependent. The problem you are > trying to solve is one of those very hard problems in computers and math. > Soundex is actually not horrible, but it is definitely only for English names. Newer variants of Metaphone (http://en.wikipedia.org/wiki/Metaphone) are significantly better, and support quite a few other languages. Either one would most likely be better than the regex approach. Side note: if your data happens to be in MySQL then it has a builtin "sounds_like()" function that compares strings using soundex.
[toc] | [prev] | [next] | [standalone]
| From | Simon Ward <simon@bleah.co.uk> |
|---|---|
| Date | 2014-06-11 18:21 +0100 |
| Message-ID | <mailman.11017.1402508398.18130.python-list@python.org> |
| In reply to | #73165 |
On 11 June 2014 13:23:14 BST, BrJohan <brjohan@gmail.com> wrote: >For some genealogical purposes I consider using Python's re module. > >Rather many names can be spelled in a number of similar ways, and in >order to match names even if they are spelled differently, I will build > >regular expressions, each of which is supposed to match a number of >similar names. As has been mentioned, you probably want to look at fuzzy matching algorithms rather than aiming at regular expressions, although a quick search suggests there has been some work on fuzzy matching with regular expressions[1]. >Now, my problem: Is there a way to decide whether any two - or more - >of >those regular expressions will match the same string? If your regexes are truly regular expressions (see [2]*) then they represent regular languages[3], which are really sets. The intersection of these, is another regular language. If you test the string against this it will also match both original languages. (*this only mentions back references, but I think the look-ahead/behind assertions are also non-regular) [1]: http://laurikari.net/tre/about/ [2]: https://en.wikipedia.org/wiki/Regular_expression#Patterns_for_non-regular_languages [3]: https://en.wikipedia.org/wiki/Regular_language Simon -- Sent from Kaiten Mail. Please excuse my brevity.
[toc] | [prev] | [next] | [standalone]
| From | Vlastimil Brom <vlastimil.brom@gmail.com> |
|---|---|
| Date | 2014-06-11 20:09 +0200 |
| Message-ID | <mailman.11018.1402510564.18130.python-list@python.org> |
| In reply to | #73165 |
2014-06-11 14:23 GMT+02:00 BrJohan <brjohan@gmail.com>:
> For some genealogical purposes I consider using Python's re module.
>...
>
> Now, my problem: Is there a way to decide whether any two - or more - of
> those regular expressions will match the same string?
>
> Or, stated a little differently:
>
> Can it, for a pair of regular expressions be decided whether at least one
> string matching both of those regular expressions, can be constructed?
> --
> https://mail.python.org/mailman/listinfo/python-list
Hi,
i guess, you could reuse some available generators for strings
matching a given regular expression, see e.g.:
http://stackoverflow.com/questions/492716/reversing-a-regular-expression-in-python/
for example a pyparsing recipe:
http://stackoverflow.com/questions/492716/reversing-a-regular-expression-in-python/5006339#5006339
which might be general enough for your needs - of course, you cannot
use unbound quantifiers, backreferences, etc.
Then you can test for identical strings in the generated outputs -
e.g. using the set(...) and its intersection method.
You might also check a much more powerful regex library
https://pypi.python.org/pypi/regex
which, beyond other features, also supports the mentioned fuzzy matches, cf.
>>> regex.findall(r"\bSm(?:ith){e<3}\b", "Smith Smithe Smyth Smythe Smijth")
['Smith', 'Smithe', 'Smyth', 'Smythe', 'Smijth']
>>>
(but, of course, you will have to be careful with this feature in
order to reduce false positives)
hth,
vbr
[toc] | [prev] | [next] | [standalone]
| From | BrJohan <brjohan@gmail.com> |
|---|---|
| Date | 2014-06-13 17:17 +0200 |
| Message-ID | <c00ivgF5cjpU1@mid.individual.net> |
| In reply to | #73165 |
On 11/06/2014 14:23, BrJohan wrote:
> For some genealogical purposes I consider using Python's re module.
>
> Rather many names can be spelled in a number of similar ways, and in
> order to match names even if they are spelled differently, I will build
> regular expressions, each of which is supposed to match a number of
> similar names.
>
> I guess that there will be a few hundred such regular expressions
> covering most popular names.
>
> Now, my problem: Is there a way to decide whether any two - or more - of
> those regular expressions will match the same string?
>
> Or, stated a little differently:
>
> Can it, for a pair of regular expressions be decided whether at least
> one string matching both of those regular expressions, can be constructed?
>
> If it is possible to make such a decision, then how? Anyone aware of an
> algorithm for this?
Thank you all for valuable input and interesting thoughts.
After having reconsidered my problem, it might be better to approach it
a little differently.
Either to state the regexps simply like:
"(Kristina)|(Christina)|(Cristine)|(Kristine)"
instead of "((K|(Ch))ristina)|([CK]ristine)"
Or to put the namevariants in some sequence of sets having elements like:
("Kristina", "Christina", "Cristine", "Kristine")
Matching is then just applying the 'in' operator.
I see two distinct advantages.
1. Readability and maintainability
2. Any namevariant occurring in just one regexp or set means no risk of
erroneous matching.
Comments?
[toc] | [prev] | [next] | [standalone]
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2014-06-13 18:26 +0200 |
| Message-ID | <mailman.11059.1402676838.18130.python-list@python.org> |
| In reply to | #73267 |
BrJohan wrote:
> On 11/06/2014 14:23, BrJohan wrote:
>> For some genealogical purposes I consider using Python's re module.
>>
>> Rather many names can be spelled in a number of similar ways, and in
>> order to match names even if they are spelled differently, I will build
>> regular expressions, each of which is supposed to match a number of
>> similar names.
>>
>> I guess that there will be a few hundred such regular expressions
>> covering most popular names.
>>
>> Now, my problem: Is there a way to decide whether any two - or more - of
>> those regular expressions will match the same string?
>>
>> Or, stated a little differently:
>>
>> Can it, for a pair of regular expressions be decided whether at least
>> one string matching both of those regular expressions, can be
>> constructed?
>>
>> If it is possible to make such a decision, then how? Anyone aware of an
>> algorithm for this?
>
> Thank you all for valuable input and interesting thoughts.
>
> After having reconsidered my problem, it might be better to approach it
> a little differently.
>
> Either to state the regexps simply like:
> "(Kristina)|(Christina)|(Cristine)|(Kristine)"
> instead of "((K|(Ch))ristina)|([CK]ristine)"
>
> Or to put the namevariants in some sequence of sets having elements like:
> ("Kristina", "Christina", "Cristine", "Kristine")
> Matching is then just applying the 'in' operator.
>
> I see two distinct advantages.
> 1. Readability and maintainability
> 2. Any namevariant occurring in just one regexp or set means no risk of
> erroneous matching.
>
> Comments?
I like the simple variant
kristinas = ("Kristina", "Christina", "Cristine", "Kristine")
But instead of matching with "in" you could build a dict that maps the name
variants to a normalised name
normalized_names = {
"Kristina": "Kristina",
"Christina": "Kristina",
...
"John": "John",
"Johann": "John",
...
}
def normalized(name):
return normalized_names.get(name, name)
If you put persons in another dict or a database indexed by the normalised
name
lookup = {
"Kristina": ["Kristina Smith", "Christina Miller"],
...
}
you can find all Kristinas with two look-ups:
>>> lookup[normalized("Kristine")]
['Kristina Smith', 'Christina Miller']
PS: A problem with this approach might be that (name in nameset_A) and (name
in nameset_B) implies nameset_A == nameset_B
[toc] | [prev] | [next] | [standalone]
| From | Dan Sommers <dan@tombstonezero.net> |
|---|---|
| Date | 2014-06-14 05:14 +0000 |
| Message-ID | <lnglo9$pg4$1@dont-email.me> |
| In reply to | #73267 |
On Fri, 13 Jun 2014 17:17:06 +0200, BrJohan wrote:
> Or to put the namevariants in some sequence of sets having elements
> like: ("Kristina", "Christina", "Cristine", "Kristine")
> Matching is then just applying the 'in' operator.
That's definitely a better approach, for the reasons you mentioned.
> Comments?
A soundex (or similar) algorithm will be better in the long run for the
less common, but more often misspelled names. It's fairly simple to
guess at a number of common spellings for names that *you* think are
common now, but what about names that run in families that aren't yours,
or aren't that common outside of that family, or were wildly popular a
couple of hundred years ago but have fallen out of favor now?
My wife's ancestors (she's the genealogist, I just get to hear the
horror stories) are notorious for being somewhat illiterate; for
changing their names, on purpose, after a feud, in order to "distance"
themselves from their relatives; and also for using not-common-now (or
even not-so-common-then) names. Add in somewhat illiterate records
keepers and hospital workers (or midwives or neighbors), not to mention
bad copies of bad copies of centuries-old smudged documents, and you
have an instant soup of names that sound alike but are spelled
differently in ways you cannot guess ahead of time.
Your users will appreciate *some* sort of fuzzy matching, or runtime
extensibility, atop the "obvious" spellings you take the time to include
in your software. And that's *not* a comment on your abilities; it's a
comment on the abilities and creativity of their ancestors.
Dan
[toc] | [prev] | [next] | [standalone]
| From | Tony the Tiger <tony@tiger.invalid> |
|---|---|
| Date | 2014-06-14 08:35 +0000 |
| Message-ID | <539c0969$0$59978$c3e8da3$26f9d57d@news.astraweb.com> |
| In reply to | #73165 |
On Wed, 11 Jun 2014 14:23:14 +0200, BrJohan wrote:
> I guess that there will be a few hundred such regular expressions
> covering most popular names.
Perhaps
http://code.activestate.com/recipes/52213-soundex-algorithm/
https://en.wikipedia.org/wiki/Soundex
will work?
/Grrr
--
___ ___
(\_--_/) | _ ._ _|_|_ _ |o _ _ ._
( 9 9 ) |(_)| |\/ |_| |(/_ ||(_|(/_|
stripes are forever - as overripe ferrets
[toc] | [prev] | [standalone]
Back to top | Article view | comp.lang.python
csiph-web