From: Nick Mellor
Newsgroups: comp.lang.python
Subject: Re: Generating list of unique search sub-phrases
Date: Wed, 17 Jun 2015 15:55:35 -0700 (PDT)

On Saturday, 30 May 2015 06:39:44 UTC+10, Nick Mellor wrote:
> Hi all,
>
> My own solution works but I'm sure it could be simpler or read better. How would you do it?
>
> Say you've got a list of companies:
>
> Aerosonde Ltd
> Amcor
> ANCA
> Austal Ships
> Australia Post
> Australian Air Express
> Australian Defence Industries
> Australian Railroad Group
> Australian Submarine Corporation
>
> and you need to extract phrases from the company names that uniquely identify that company.
> The results for the above list of companies should be:
>
> Company: 'Aerosonde Ltd'
> Aliases: Aerosonde,Ltd,Aerosonde Ltd
>
> Company: 'Amcor'
> Aliases: Amcor
>
> Company: 'ANCA'
> Aliases: ANCA
>
> Company: 'Austal Ships'
> Aliases: Austal,Ships,Austal Ships
>
> Company: 'Australia Post'
> Aliases: Post,Australia Post
>
> Company: 'Australian Air Express'
> Aliases: Air,Express,Australian Air,Air Express,Australian Air Express
>
> Company: 'Australian Defence Industries'
> Aliases: Defence,Industries,Australian Defence,Defence Industries,Australian Defence Industries
>
> Company: 'Australian Railroad Group'
> Aliases: Railroad,Group,Australian Railroad,Railroad Group,Australian Railroad Group
>
> Company: 'Australian Submarine Corporation'
> Aliases: Submarine,Corporation,Australian Submarine,Submarine Corporation,Australian Submarine Corporation
>
> Here's my solution:
>
> from itertools import combinations, chain
>
> companies = [
>     "Aerosonde Ltd",
>     "Amcor",
>     "ANCA",
>     "Austal Ships",
>     "Australia Post",
>     "Australian Air Express",
>     "Australian Defence Industries",
>     "Australian Railroad Group",
>     "Australian Submarine Corporation",
> ]
>
> def flatten(i):
>     return list(chain.from_iterable(i))
>
> companies_as_text_stream = ' '.join(companies)
> for company in companies:
>     word_combinations = [list(combinations(company.split(), r)) for r in range(1, len(company))]
>     phrases = [' '.join(phrase) for phrase in flatten(word_combinations)]
>     unique_phrases = [phrase for phrase in phrases if companies_as_text_stream.count(phrase) == 1]
>     aliases = ','.join(unique_phrases)
>     print("Company: '{0}'\n Aliases: {1}\n".format(company, aliases))

Great reply, Peter, thank you. Lots to think about.

Cheers,

Nick
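
For later readers of the thread, here is one possible tightening of the quoted solution. It is only a sketch, and it is not Peter's reply. It assumes that aliases should be contiguous runs of words (for this list, that is what the quoted code ends up keeping, because the non-adjacent combinations do not occur in the joined text), and it keeps the same substring-count test for uniqueness. The names contiguous_phrases and unique_aliases are made up for illustration.

```python
companies = [
    "Aerosonde Ltd",
    "Amcor",
    "ANCA",
    "Austal Ships",
    "Australia Post",
    "Australian Air Express",
    "Australian Defence Industries",
    "Australian Railroad Group",
    "Australian Submarine Corporation",
]


def contiguous_phrases(name):
    """Yield every contiguous run of words in name, shortest runs first."""
    words = name.split()
    # Iterate over the word count, not the character count of the name.
    for size in range(1, len(words) + 1):
        for start in range(len(words) - size + 1):
            yield ' '.join(words[start:start + size])


def unique_aliases(names):
    """Map each name to its phrases that occur exactly once in the joined names."""
    text = ' '.join(names)
    # Assumption: same uniqueness test as the quoted code (plain substring count).
    return {
        name: [p for p in contiguous_phrases(name) if text.count(p) == 1]
        for name in names
    }


for company, aliases in unique_aliases(companies).items():
    print("Company: '{0}'\n Aliases: {1}\n".format(company, ','.join(aliases)))
```

Because the test is still str.count on the joined names, a phrase that also appears inside a longer word (as 'Australia' does inside 'Australian') is treated as non-unique, which matches the behaviour of the quoted code. Matching on whole words instead would be a different design and could change the results.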