From: BrJohan
Newsgroups: comp.lang.python
Subject: Re: Python's re module and genealogy problem
Date: Fri, 13 Jun 2014 17:17:06 +0200

On 11/06/2014 14:23, BrJohan wrote:
> For some genealogical purposes I consider using Python's re module.
>
> Rather many names can be spelled in a number of similar ways, and in
> order to match names even if they are spelled differently, I will build
> regular expressions, each of which is supposed to match a number of
> similar names.
>
> I guess that there will be a few hundred such regular expressions
> covering most popular names.
>
> Now, my problem: Is there a way to decide whether any two - or more - of
> those regular expressions will match the same string?
>
> Or, stated a little differently:
>
> Can it, for a pair of regular expressions be decided whether at least
> one string matching both of those regular expressions, can be constructed?
>
> If it is possible to make such a decision, then how? Anyone aware of an
> algorithm for this?

Thank you all for valuable input and interesting thoughts.

Having reconsidered my problem, I think it might be better to approach it a little differently. Either state the regexps simply, like:

    "(Kristina)|(Christina)|(Cristine)|(Kristine)"

instead of

    "((K|(Ch))ristina)|([CK]ristine)"

Or put the name variants in some sequence of sets having elements like:

    ("Kristina", "Christina", "Cristine", "Kristine")

Matching is then just a matter of applying the 'in' operator.
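A minimal sketch of both variants might look like the following. The names NAME_GROUPS, find_group and same_name are only illustrative, and the second group is made-up example data, not part of any real name list.

```python
import re

# Each frozenset holds the spelling variants of one name.
# The second group is invented purely for illustration.
NAME_GROUPS = [
    frozenset({"Kristina", "Christina", "Cristine", "Kristine"}),
    frozenset({"Johan", "Johann", "Johannes"}),
]

def find_group(name):
    """Return the set of variants that contains name, or None."""
    for group in NAME_GROUPS:
        if name in group:  # plain membership test, no regexp needed
            return group
    return None

def same_name(a, b):
    """True if a and b are listed as variants of the same name."""
    group = find_group(a)
    return group is not None and b in group

# The flat alternation and the compact regexp from the post,
# compiled so they can be compared with re.fullmatch.
FLAT_RE = re.compile(r"(Kristina)|(Christina)|(Cristine)|(Kristine)")
COMPACT_RE = re.compile(r"((K|(Ch))ristina)|([CK]ristine)")
```

With a few hundred groups the linear scan in find_group should be fast enough; a dict mapping each variant to its group would be the obvious next step if it is not.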
I see two distinct advantages:

1. Readability and maintainability.
2. As long as each name variant occurs in just one regexp or set, there is no risk of erroneous matching.

Comments?
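With explicit lists of variants, the condition in point 2 can be checked mechanically, which is the simple-case answer to the original question about overlapping regexps. A rough sketch, assuming the groups are kept as a plain list of tuples or sets; find_overlaps is a hypothetical helper, and the second group is made up so that it overlaps on purpose:

```python
from collections import defaultdict

def find_overlaps(groups):
    """Map each variant found in more than one group to the indexes of those groups."""
    seen = defaultdict(list)
    for index, group in enumerate(groups):
        for variant in set(group):  # set() ignores duplicates inside one group
            seen[variant].append(index)
    return {variant: idxs for variant, idxs in seen.items() if len(idxs) > 1}

groups = [
    ("Kristina", "Christina", "Cristine", "Kristine"),
    ("Kerstin", "Kirsten", "Kristine"),  # invented group, shares "Kristine"
]
overlaps = find_overlaps(groups)
```

An empty result means that no name can match two groups.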