| Path | csiph.com!newsfeed.hal-mli.net!feeder3.hal-mli.net!news.stack.nl!newsfeed.xs4all.nl!newsfeed4a.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail |
|---|---|
| Return-Path | <python-python-list@m.gmane.org> |
| X-Original-To | python-list@python.org |
| Delivered-To | python-list@mail.python.org |
| X-Injected-Via-Gmane | http://gmane.org/ |
| To | python-list@python.org |
| From | Peter Otten <__peter__@web.de> |
| Subject | Re: Python's re module and genealogy problem |
| Date | Fri, 13 Jun 2014 18:26:55 +0200 |
| Organization | None |
| References | <bvr01iFu926U1@mid.individual.net> <c00ivgF5cjpU1@mid.individual.net> |
| Mime-Version | 1.0 |
| Content-Type | text/plain; charset="ISO-8859-1" |
| Content-Transfer-Encoding | 7Bit |
| X-Gmane-NNTP-Posting-Host | p57bd9019.dip0.t-ipconnect.de |
| User-Agent | KNode/4.11.5 |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.11059.1402676838.18130.python-list@python.org> (permalink) |
| Lines | 79 |
| NNTP-Posting-Host | 2001:888:2000:d::a6 |
| X-Trace | 1402676838 news.xs4all.nl 2863 [2001:888:2000:d::a6]:43510 |
| X-Complaints-To | abuse@xs4all.nl |
| Xref | csiph.com comp.lang.python:73269 |
BrJohan wrote:
> On 11/06/2014 14:23, BrJohan wrote:
>> For some genealogical purposes I consider using Python's re module.
>>
>> Rather many names can be spelled in a number of similar ways, and in
>> order to match names even if they are spelled differently, I will build
>> regular expressions, each of which is supposed to match a number of
>> similar names.
>>
>> I guess that there will be a few hundred such regular expressions
>> covering most popular names.
>>
>> Now, my problem: Is there a way to decide whether any two - or more - of
>> those regular expressions will match the same string?
>>
>> Or, stated a little differently:
>>
>> Can it, for a pair of regular expressions be decided whether at least
>> one string matching both of those regular expressions, can be
>> constructed?
>>
>> If it is possible to make such a decision, then how? Anyone aware of an
>> algorithm for this?
>
> Thank you all for valuable input and interesting thoughts.
>
> After having reconsidered my problem, it might be better to approach it
> a little differently.
>
> Either to state the regexps simply like:
> "(Kristina)|(Christina)|(Cristine)|(Kristine)"
> instead of "((K|(Ch))ristina)|([CK]ristine)"
>
> Or to put the namevariants in some sequence of sets having elements like:
> ("Kristina", "Christina", "Cristine", "Kristine")
> Matching is then just applying the 'in' operator.
>
> I see two distinct advantages.
> 1. Readability and maintainability
> 2. Any namevariant occurring in just one regexp or set means no risk of
> erroneous matching.
>
> Comments?

I like the simple variant

    kristinas = ("Kristina", "Christina", "Cristine", "Kristine")

But instead of matching with "in" you could build a dict that maps the name
variants to a normalised name

    normalized_names = {
        "Kristina": "Kristina",
        "Christina": "Kristina",
        ...
        "John": "John",
        "Johann": "John",
        ...
    }

    def normalized(name):
        return normalized_names.get(name, name)
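
Writing the dict out by hand gets tedious for a few hundred names, so it could instead be generated from the tuples of variants. The following is only a sketch: `variant_groups` and `build_normalized_names` are hypothetical names, and treating the first entry of each tuple as the normalised spelling is an assumption made for illustration.

```python
# Hypothetical variant groups. Assumption for this sketch: the first
# entry of each tuple is the normalised spelling of the whole group.
variant_groups = [
    ("Kristina", "Christina", "Cristine", "Kristine"),
    ("John", "Johann"),
]


def build_normalized_names(groups):
    """Map every variant to the first name of its group."""
    mapping = {}
    for group in groups:
        canonical = group[0]
        for variant in group:
            mapping[variant] = canonical
    return mapping


normalized_names = build_normalized_names(variant_groups)


def normalized(name):
    # Names that appear in no group are returned unchanged.
    return normalized_names.get(name, name)
```

With this in place, `normalized("Cristine")` gives `"Kristina"`, and a name that is in no group comes back as it was passed in.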

If you put persons in another dict or a database indexed by the normalised
name

    lookup = {
        "Kristina": ["Kristina Smith", "Christina Miller"],
        ...
    }

you can find all Kristinas with two look-ups:

    >>> lookup[normalized("Kristine")]
    ['Kristina Smith', 'Christina Miller']
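
The index of persons could likewise be built from a flat list rather than typed in. This is a rough sketch, not a finished design: `build_lookup` is a hypothetical helper, and it assumes that the given name is the first whitespace-separated word of each person's full name, which will not hold for all genealogical data.

```python
from collections import defaultdict

# Small mapping of variants to a normalised name, as in the post.
normalized_names = {
    "Kristina": "Kristina",
    "Christina": "Kristina",
    "Cristine": "Kristina",
    "Kristine": "Kristina",
}


def normalized(name):
    return normalized_names.get(name, name)


def build_lookup(persons):
    """Group full names under the normalised form of their given name."""
    lookup = defaultdict(list)
    for person in persons:
        # Assumption: the given name is the first word of the full name.
        given_name = person.split()[0]
        lookup[normalized(given_name)].append(person)
    return lookup


lookup = build_lookup(["Kristina Smith", "Christina Miller"])
print(lookup[normalized("Kristine")])
# prints ['Kristina Smith', 'Christina Miller']
```

Because `lookup` is a `defaultdict`, indexing it with an unknown name creates an empty entry; `lookup.get(name, [])` avoids that side effect.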

PS: A problem with this approach might be that a variant can belong to only one
set of names: (name in nameset_A) and (name in nameset_B) implies
nameset_A == nameset_B
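
Whether the data violates that restriction can be checked before the dict is built. A minimal sketch, assuming the variants are kept as tuples; `find_conflicts` is a hypothetical helper, and the second group below, including the name "Christa", is invented purely to trigger a conflict.

```python
def find_conflicts(groups):
    """Return (variant, first_group_index, other_group_index) for every
    variant that occurs in more than one group."""
    seen = {}  # variant -> index of the first group that contains it
    conflicts = []
    for index, group in enumerate(groups):
        # dict.fromkeys drops duplicates within a group but keeps order.
        for variant in dict.fromkeys(group):
            if variant in seen:
                conflicts.append((variant, seen[variant], index))
            else:
                seen[variant] = index
    return conflicts


groups = [
    ("Kristina", "Christina", "Cristine", "Kristine"),
    ("Christina", "Christa"),  # invented group that overlaps the first
]

print(find_conflicts(groups))
# prints [('Christina', 0, 1)]
```

An empty result means every variant maps to exactly one normalised name; any reported variant has to be assigned to one group by hand, or the groups merged.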
Python's re module and genealogy problem BrJohan <brjohan@gmail.com> - 2014-06-11 14:23 +0200
Re: Python's re module and genealogy problem Robert Kern <robert.kern@gmail.com> - 2014-06-11 14:26 +0100
Re: Python's re module and genealogy problem Mark H Harris <harrismh777@gmail.com> - 2014-06-11 09:08 -0500
Re: Python's re module and genealogy problem Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de> - 2014-06-11 15:55 +0200
Re: Python's re module and genealogy problem Michael Torrie <torriem@gmail.com> - 2014-06-11 09:34 -0600
Re: Python's re module and genealogy problem Nick Cash <nick.cash@npcinternational.com> - 2014-06-11 16:21 +0000
Re: Python's re module and genealogy problem Simon Ward <simon@bleah.co.uk> - 2014-06-11 18:21 +0100
Re: Python's re module and genealogy problem Vlastimil Brom <vlastimil.brom@gmail.com> - 2014-06-11 20:09 +0200
Re: Python's re module and genealogy problem BrJohan <brjohan@gmail.com> - 2014-06-13 17:17 +0200
Re: Python's re module and genealogy problem Peter Otten <__peter__@web.de> - 2014-06-13 18:26 +0200
Re: Python's re module and genealogy problem Dan Sommers <dan@tombstonezero.net> - 2014-06-14 05:14 +0000
Re: Python's re module and genealogy problem Tony the Tiger <tony@tiger.invalid> - 2014-06-14 08:35 +0000