Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #73269

Re: Python's re module and genealogy problem

Path csiph.com!newsfeed.hal-mli.net!feeder3.hal-mli.net!news.stack.nl!newsfeed.xs4all.nl!newsfeed4a.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
Return-Path <python-python-list@m.gmane.org>
X-Original-To python-list@python.org
Delivered-To python-list@mail.python.org
X-Spam-Status OK 0.000
X-Spam-Evidence '*H*': 1.00; '*S*': 0.00; 'algorithm': 0.04; 'subject:Python': 0.06; 'expressions': 0.07; 'problem:': 0.07; 'string': 0.09; 'comments?': 0.09; 'differently.': 0.09; 'lookup': 0.09; 'received:80.91': 0.09; 'received:80.91.229': 0.09; 'received:gmane.org': 0.09; 'received:list': 0.09; 'spelled': 0.09; 'subject:module': 0.09; 'variant': 0.09; 'def': 0.12; '"in"': 0.16; 'decision,': 0.16; 'dict': 0.16; 'distinct': 0.16; 'expressions,': 0.16; 'implies': 0.16; 'name)': 0.16; 'operator.': 0.16; 'readability': 0.16; 'received:80.91.229.3': 0.16; 'received:dip0.t-ipconnect.de': 0.16; 'received:plane.gmane.org': 0.16; 'received:t-ipconnect.de': 0.16; 'variants': 0.16; 'elements': 0.16; 'wrote:': 0.18; "python's": 0.19; '>>>': 0.22; 'input': 0.22; 'this?': 0.23; 'header:User-Agent:1': 0.23; 'skip:l 30': 0.24; 'subject:problem': 0.24; 'decide': 0.24; 'purposes': 0.26; 'skip:" 30': 0.26; 'skip:" 40': 0.26; 'least': 0.26; 'header:X-Complaints-To:1': 0.27; 'matching': 0.30; 'sets': 0.30; 'names.': 0.31; 'anyone': 0.31; 'supposed': 0.32; 'regular': 0.32; 'another': 0.32; 'guess': 0.33; 'could': 0.34; 'problem': 0.35; 'but': 0.35; 'there': 0.35; 'indexed': 0.36; 'module.': 0.36; 'sequence': 0.36; 'possible': 0.36; 'similar': 0.36; 'two': 0.37; 'thank': 0.38; 'to:addr:python-list': 0.38; 'rather': 0.38; 'little': 0.38; 'to:addr:python.org': 0.39; 'either': 0.39; 'received:org': 0.40; 'even': 0.60; 'most': 0.60; 'simply': 0.61; 'simple': 0.61; 'name': 0.63; 'such': 0.63; 'valuable': 0.63; 'skip:n 10': 0.64; 'decided': 0.64; 'more': 0.64; 'stated': 0.69; 'risk': 0.72; 'applying': 0.72; 'differently:': 0.84; 'regexp': 0.84; 'hundred': 0.95
X-Injected-Via-Gmane http://gmane.org/
To python-list@python.org
From Peter Otten <__peter__@web.de>
Subject Re: Python's re module and genealogy problem
Date Fri, 13 Jun 2014 18:26:55 +0200
Organization None
References <bvr01iFu926U1@mid.individual.net> <c00ivgF5cjpU1@mid.individual.net>
Mime-Version 1.0
Content-Type text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding 7Bit
X-Gmane-NNTP-Posting-Host p57bd9019.dip0.t-ipconnect.de
User-Agent KNode/4.11.5
X-BeenThere python-list@python.org
X-Mailman-Version 2.1.15
Precedence list
List-Id General discussion list for the Python programming language <python-list.python.org>
List-Unsubscribe <https://mail.python.org/mailman/options/python-list>, <mailto:python-list-request@python.org?subject=unsubscribe>
List-Archive <http://mail.python.org/pipermail/python-list/>
List-Post <mailto:python-list@python.org>
List-Help <mailto:python-list-request@python.org?subject=help>
List-Subscribe <https://mail.python.org/mailman/listinfo/python-list>, <mailto:python-list-request@python.org?subject=subscribe>
Newsgroups comp.lang.python
Message-ID <mailman.11059.1402676838.18130.python-list@python.org> (permalink)
Lines 79
NNTP-Posting-Host 2001:888:2000:d::a6
X-Trace 1402676838 news.xs4all.nl 2863 [2001:888:2000:d::a6]:43510
X-Complaints-To abuse@xs4all.nl
Xref csiph.com comp.lang.python:73269

Show key headers only | View raw


BrJohan wrote:

> On 11/06/2014 14:23, BrJohan wrote:
>> For some genealogical purposes I consider using Python's re module.
>>
>> Rather many names can be spelled in a number of similar ways, and in
>> order to match names even if they are spelled differently, I will build
>> regular expressions, each of which is supposed to match  a number of
>> similar names.
>>
>> I guess that there will be a few hundred such regular expressions
>> covering most popular names.
>>
>> Now, my problem: Is there a way to decide whether any two - or more - of
>> those regular expressions will match the same string?
>>
>> Or, stated a little differently:
>>
>> Can it, for a pair of regular expressions be decided whether at least
>> one string matching both of those regular expressions, can be
>> constructed?
>>
>> If it is possible to make such a decision, then how? Anyone aware of an
>> algorithm for this?
> 
> Thank you all for valuable input and interesting thoughts.
> 
> After having reconsidered my problem, it might be better to approach it
> a little differently.
> 
> Either to state the regexps simply like:
> "(Kristina)|(Christina)|(Cristine)|(Kristine)"
> instead of "((K|(Ch))ristina)|([CK]ristine)"
> 
> Or to put the namevariants in some sequence of sets having elements like:
> ("Kristina", "Christina", "Cristine", "Kristine")
> Matching is then just applying the 'in' operator.
> 
> I see two distinct advantages.
> 1. Readability and maintainability
> 2. Any namevariant occurring in just one regexp or set means no risk of
> erroneous matching.
> 
> Comments?

I like the simple variant

kristinas = ("Kristina", "Christina", "Cristine", "Kristine")

But instead of matching with "in" you could build a dict that maps the name 
variants to a normalised name

normalized_names = {
    "Kristina": "Kristina",
    "Christina": "Kristina",
    ...
    "John": "John",
    "Johann": "John",
    ...
}
def normalized(name):
    return normalized_names.get(name, name)

If you put persons in another dict or a database indexed by the normalised 
name 

lookup = {
    "Kristina": ["Kristina Smith", "Christina Miller"],
    ...
}

you can find all Kristinas with two look-ups:

>>> lookup[normalized("Kristine")]
['Kristina Smith', 'Christina Miller']

PS: A problem with this approach might be that (name in nameset_A) and (name 
in nameset_B) implies nameset_A == nameset_B

Back to comp.lang.python | Previous | NextPrevious in thread | Next in thread | Find similar | Unroll thread


Thread

Python's re module and genealogy problem BrJohan <brjohan@gmail.com> - 2014-06-11 14:23 +0200
  Re: Python's re module and genealogy problem Robert Kern <robert.kern@gmail.com> - 2014-06-11 14:26 +0100
    Re: Python's re module and genealogy problem Mark H Harris <harrismh777@gmail.com> - 2014-06-11 09:08 -0500
  Re: Python's re module and genealogy problem Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de> - 2014-06-11 15:55 +0200
  Re: Python's re module and genealogy problem Michael Torrie <torriem@gmail.com> - 2014-06-11 09:34 -0600
  Re: Python's re module and genealogy problem Nick Cash <nick.cash@npcinternational.com> - 2014-06-11 16:21 +0000
  Re: Python's re module and genealogy problem Simon Ward <simon@bleah.co.uk> - 2014-06-11 18:21 +0100
  Re: Python's re module and genealogy problem Vlastimil Brom <vlastimil.brom@gmail.com> - 2014-06-11 20:09 +0200
  Re: Python's re module and genealogy problem BrJohan <brjohan@gmail.com> - 2014-06-13 17:17 +0200
    Re: Python's re module and genealogy problem Peter Otten <__peter__@web.de> - 2014-06-13 18:26 +0200
    Re: Python's re module and genealogy problem Dan Sommers <dan@tombstonezero.net> - 2014-06-14 05:14 +0000
  Re: Python's re module and genealogy problem Tony the Tiger <tony@tiger.invalid> - 2014-06-14 08:35 +0000

csiph-web