| Path | csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail |
|---|---|
| From | Peter Otten <__peter__@web.de> |
| Newsgroups | comp.lang.python |
| Subject | Re: Whittle it on down |
| Date | Fri, 06 May 2016 09:45:18 +0200 |
| Message-ID | <mailman.428.1462520743.32212.python-list@python.org> (permalink) |
DFS wrote:
> On 5/5/2016 1:39 AM, Stephen Hansen wrote:
>
>> Given:
>>
>>>>> input = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs &
>>>>> Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city
>>>>> guide', 'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS &
>>>>> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS',
>>>>> 'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE',
>>>>> 'www.lafitness.com', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS',
>>>>> 'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com',
>>>>> 'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE &
>>>>> PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS &
>>>>> GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS',
>>>>> '5', '4', '3', '2', '1', 'Yellow Pages', 'About Us', 'Contact Us',
>>>>> 'Support', 'Terms of Use', 'Privacy Policy', 'Advertise With Us', 'Add
>>>>> /Update Listing', 'Business Profile Login', 'F.A.Q.']
>>
>> Then:
>>
>>>>> pattern = re.compile(r"^[A-Z\s&]+$")
>>>>> output = [x for x in input if pattern.match(x)]
>>>>> output
>
>> ['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS',
>> 'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE', 'GYMNASIUMS',
>> 'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS', 'PERSONAL FITNESS
>> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS
>> PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS
>> & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS']
>
>
> Should've looked earlier. Their master list of categories
> http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes,
> and the ampersands we talked about.
>
> "OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the comma.
>
> "AUTOMOBILE - DEALERS" gets removed because of the dash.
>
> I updated your regex and it seems to have fixed it.
>
> orig: (r"^[A-Z\s&]+$")
> new : (r"^[A-Z\s&,-]+$")
>
>
> Thanks again.
If there is a "master list", compare your candidates against it instead of
using a heuristic, i.e.
categories = set(master_list)
output = [category for category in input if category in categories]
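A minimal, self-contained sketch of that idea follows. The master_list here is a hypothetical three-entry stand-in for the real list, and candidates is a short excerpt of the input quoted above (renamed so it does not shadow the built-in input):

```python
# Hypothetical stand-in for the real master list (an assumption for illustration).
master_list = [
    "HEALTH CLUBS & GYMNASIUMS",
    "OFFICE SERVICES, SUPPLIES & EQUIPMENT",
    "AUTOMOBILE - DEALERS",
]

# A few candidates taken from the input quoted above.
candidates = ["Tweet", "HEALTH CLUBS & GYMNASIUMS", "www.lafitness.com", "5"]

# A set gives average constant-time membership tests.
categories = set(master_list)

# Keep only candidates that appear verbatim in the master list.
output = [c for c in candidates if c in categories]
print(output)
```

Unlike the regex, this needs no special handling for commas or dashes: anything in the master list passes, and anything else is dropped.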
You can find the categories with
>>> import urllib.request
>>> import bs4
>>> soup = bs4.BeautifulSoup(
...     urllib.request.urlopen("http://www.usdirectory.com/cat/g0").read())
>>> categories = set()
>>> for li in soup.find_all("li"):
...     assert li.parent.parent["class"][0].startswith("category_items")
...     categories.add(li.text)
...
>>> print("\n".join(sorted(categories)[:10]))
Accounting & Bookkeeping Services
Adoption Services
Adult Entertainment
Advertising
Agricultural Equipment & Supplies
Agricultural Production
Agricultural Services
Aids Resources
Aircraft Charters & Rentals
Aircraft Dealers & Services
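Note that the names printed above are in title case while the candidates in the quoted input are upper case, so a plain "in" test would not match them. One possible way around that, assuming case is the only difference between the two spellings (an assumption, not something verified against the full page), is to normalize both sides with str.casefold(). The scraped set and the candidates below are hand-picked for illustration; "Health Clubs & Gymnasiums" in particular is an assumed spelling that does not appear in the ten names shown:

```python
# Excerpt standing in for the scraped set; the last name is assumed for illustration.
scraped = {"Adoption Services", "Advertising", "Health Clubs & Gymnasiums"}

# Hypothetical candidates in the upper-case style of the quoted input.
candidates = ["ADVERTISING", "HEALTH CLUBS & GYMNASIUMS", "Tweet", "www.lafitness.com"]

# Normalize the master names once, then compare case-insensitively.
normalized = {name.casefold() for name in scraped}
output = [c for c in candidates if c.casefold() in normalized]
print(output)
```

The candidates keep their original spelling in output; only the comparison is case-insensitive.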