From: Peter Otten <__peter__@web.de>
Newsgroups: comp.lang.python
Subject: Re: Whittle it on down
Date: Fri, 06 May 2016 09:45:18 +0200

DFS wrote:

> On 5/5/2016 1:39 AM, Stephen Hansen wrote:
>
>> Given:
>>
>>>>> input = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs &
>>>>> Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city
>>>>> guide', 'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS &
>>>>> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS',
>>>>> 'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE',
>>>>> 'www.lafitness.com', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS',
>>>>> 'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com',
>>>>> 'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE &
>>>>> PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS &
>>>>> GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS',
>>>>> '5', '4', '3', '2', '1', 'Yellow Pages', 'About Us', 'Contact Us',
>>>>> 'Support', 'Terms of Use', 'Privacy Policy', 'Advertise With Us',
>>>>> 'Add /Update Listing', 'Business Profile Login', 'F.A.Q.']
>>
>> Then:
>>
>>>>> pattern = re.compile(r"^[A-Z\s&]+$")
>>>>> output = [x for x in input if pattern.match(x)]
>>>>> output
>>
>> ['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS',
>> 'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE', 'GYMNASIUMS',
>> 'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS', 'PERSONAL FITNESS
>> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS
>> PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS &
>> GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS']
>
> Should've looked earlier. Their master list of categories
> http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes,
> and the ampersands we talked about.
>
> "OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the comma.
>
> "AUTOMOBILE - DEALERS" gets removed because of the dash.
>
> I updated your regex and it seems to have fixed it.
>
> orig: (r"^[A-Z\s&]+$")
> new : (r"^[A-Z\s&,-]+$")
>
> Thanks again.

If there is a "master list", compare your candidates against it instead of
using a heuristic, i.e.

categories = set(master_list)
output = [category for category in input if category in categories]

You can find the categories with

>>> import urllib.request
>>> import bs4
>>> soup = bs4.BeautifulSoup(urllib.request.urlopen("http://www.usdirectory.com/cat/g0").read(), "html.parser")
>>> categories = set()
>>> for li in soup.find_all("li"):
...     assert li.parent.parent["class"][0].startswith("category_items")
...     categories.add(li.text)
...
>>> print("\n".join(sorted(categories)[:10]))
Accounting & Bookkeeping Services
Adoption Services
Adult Entertainment
Advertising
Agricultural Equipment & Supplies
Agricultural Production
Agricultural Services
Aids Resources
Aircraft Charters & Rentals
Aircraft Dealers & Services
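
One caveat with the membership test: judging from the names printed above, the master list seems to use mixed case, while the scraped candidates are upper case, so an exact comparison may reject everything. A minimal sketch of a case-insensitive lookup follows. The helper normalize() and both sample lists are made up for illustration; in practice master_list would hold the names collected from the page.

```python
# Hypothetical sample data; the real master_list would come from the
# scraped category page, the candidates from the search result.
master_list = [
    "Health & Fitness Clubs",
    "Health Clubs & Gymnasiums",
    "Office Services, Supplies & Equipment",
    "Automobile - Dealers",
]

candidates = [
    "HEALTH CLUBS & GYMNASIUMS",
    "www.lafitness.com",
    "OFFICE SERVICES, SUPPLIES & EQUIPMENT",
    "AUTOMOBILE - DEALERS",
    "Yellow Pages",
    "5",
]


def normalize(text):
    """Collapse runs of whitespace and ignore case (illustrative helper)."""
    return " ".join(text.split()).casefold()


# Build the lookup set once, then keep every candidate whose
# normalized form is a known category.
categories = {normalize(name) for name in master_list}
output = [c for c in candidates if normalize(c) in categories]

print(output)
```

With these sample lists the categories containing a comma or a dash survive without touching a regex, and the URL, 'Yellow Pages' and '5' are dropped because they are not in the set.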