| Started by | DFS <nospam@dfs.com> |
|---|---|
| First post | 2016-05-05 00:58 -0400 |
| Last post | 2016-05-05 17:45 -0400 |
| Articles | 20 on this page of 41 — 8 participants |
Whittle it on down DFS <nospam@dfs.com> - 2016-05-05 00:58 -0400
Re: Whittle it on down Stephen Hansen <me+python@ixokai.io> - 2016-05-04 22:39 -0700
Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-05 08:44 -0400
Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-05 19:31 -0400
Re: Whittle it on down Peter Otten <__peter__@web.de> - 2016-05-06 09:45 +0200
Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-06 09:58 -0400
Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-06 10:41 -0400
Re: Whittle it on down Peter Otten <__peter__@web.de> - 2016-05-06 17:44 +0200
Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-06 18:43 -0400
Re: Whittle it on down alister <alister.ware@ntlworld.com> - 2016-05-06 10:01 +0000
Re: Whittle it on down Jussi Piitulainen <jussi.piitulainen@helsinki.fi> - 2016-05-05 08:53 +0300
Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-05 08:57 -0400
Re: Whittle it on down Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-05-05 16:04 +1000
Re: Whittle it on down Stephen Hansen <me+python@ixokai.io> - 2016-05-04 23:46 -0700
Re: Whittle it on down Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-05-05 17:04 +1000
Re: Whittle it on down Stephen Hansen <me+python@ixokai.io> - 2016-05-05 00:34 -0700
Re: Whittle it on down Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-05-05 18:41 +1000
Re: Whittle it on down Random832 <random832@fastmail.com> - 2016-05-05 09:13 -0400
Re: Whittle it on down Steven D'Aprano <steve@pearwood.info> - 2016-05-06 03:13 +1000
Re: Whittle it on down Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-05-05 17:36 +1000
Re: Whittle it on down Peter Otten <__peter__@web.de> - 2016-05-05 10:17 +0200
Re: Whittle it on down Steven D'Aprano <steve@pearwood.info> - 2016-05-06 01:39 +1000
Re: Whittle it on down Random832 <random832@fastmail.com> - 2016-05-05 09:21 -0400
Re: Whittle it on down Steven D'Aprano <steve@pearwood.info> - 2016-05-06 04:03 +1000
Re: Whittle it on down Random832 <random832@fastmail.com> - 2016-05-05 14:52 -0400
Re: Whittle it on down Stephen Hansen <me+python@ixokai.io> - 2016-05-05 12:09 -0700
Re: Whittle it on down Stephen Hansen <me+python@ixokai.io> - 2016-05-05 06:32 -0700
Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-05 10:36 -0400
Re: Whittle it on down Steven D'Aprano <steve@pearwood.info> - 2016-05-06 03:43 +1000
Re: Whittle it on down Stephen Hansen <me+python@ixokai.io> - 2016-05-05 11:55 -0700
Re: Whittle it on down Jussi Piitulainen <jussi.piitulainen@helsinki.fi> - 2016-05-05 20:49 +0300
Re: Whittle it on down Steven D'Aprano <steve@pearwood.info> - 2016-05-06 04:14 +1000
Re: Whittle it on down Jussi Piitulainen <jussi.piitulainen@helsinki.fi> - 2016-05-05 21:27 +0300
Re: Whittle it on down Random832 <random832@fastmail.com> - 2016-05-05 14:54 -0400
Re: Whittle it on down Steven D'Aprano <steve@pearwood.info> - 2016-05-06 10:57 +1000
Re: Whittle it on down Jussi Piitulainen <jussi.piitulainen@helsinki.fi> - 2016-05-06 07:19 +0300
Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-05 08:31 -0400
Re: Whittle it on down Steven D'Aprano <steve@pearwood.info> - 2016-05-06 03:54 +1000
Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-05 17:36 -0400
Re: Whittle it on down Stephen Hansen <me+python@ixokai.io> - 2016-05-05 11:56 -0700
Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-05 17:45 -0400
| From | DFS <nospam@dfs.com> |
|---|---|
| Date | 2016-05-05 00:58 -0400 |
| Subject | Whittle it on down |
| Message-ID | <ngejmj$gc4$1@dont-email.me> |
Want to whittle a list like this:

[u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs &
Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city guide',
'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS & TRAINERS',
'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS',
'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE',
'www.lafitness.com', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS',
'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com',
'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE &
PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS &
GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS',
'5', '4', '3', '2', '1', 'Yellow Pages', 'About Us', 'Contact Us',
'Support', 'Terms of Use', 'Privacy Policy', 'Advertise With Us',
'Add/Update Listing', 'Business Profile Login', 'F.A.Q.']

down to

['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS',
'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE', 'GYMNASIUMS',
'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS', 'PERSONAL FITNESS
TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS
PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS
& GYMNASIUMS', 'PERSONAL FITNESS TRAINERS']

Want to keep all elements containing only upper case letters or upper
case letters and ampersand (where ampersand is surrounded by spaces)

Is it easier to extract elements meeting those conditions, or remove
elements meeting the following conditions:

* elements with a lower-case letter in them
* elements with a number in them
* elements with a period in them

?

So far all I figured out is remove items with a period:
newlist = [ x for x in oldlist if "." not in x ]

Thanks for help, python gurus.
| From | Stephen Hansen <me+python@ixokai.io> |
|---|---|
| Date | 2016-05-04 22:39 -0700 |
| Message-ID | <mailman.397.1462426759.32212.python-list@python.org> |
| In reply to | #108158 |
On Wed, May 4, 2016, at 09:58 PM, DFS wrote:
> Want to whittle a list like this:
>
> [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs &
> Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city guide',
> 'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS & TRAINERS',
> 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS',
> 'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE',
> 'www.lafitness.com', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS',
> 'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com',
> 'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE &
> PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS &
> GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS',
> '5', '4', '3', '2', '1', 'Yellow Pages', 'About Us', 'Contact Us',
> 'Support', 'Terms of Use', 'Privacy Policy', 'Advertise With Us',
> 'Add/Update Listing', 'Business Profile Login', 'F.A.Q.']
>
> down to
>
> ['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS',
> 'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE', 'GYMNASIUMS',
> 'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS', 'PERSONAL FITNESS
> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS
> PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS
> & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS']

Sometimes regular expressions are the tool to do the job:

Given:

>>> input = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)',
...     'Health Clubs & Gymnasiums (42)', 'Health Fitness Clubs', 'Name',
...     'Atlanta city guide', 'edit address', 'Tweet',
...     'PHYSICAL FITNESS CONSULTANTS & TRAINERS',
...     'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS',
...     'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE',
...     'www.lafitness.com', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS',
...     'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com',
...     'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS',
...     'EXERCISE & PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS',
...     'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS',
...     'PERSONAL FITNESS TRAINERS', '5', '4', '3', '2', '1',
...     'Yellow Pages', 'About Us', 'Contact Us', 'Support',
...     'Terms of Use', 'Privacy Policy', 'Advertise With Us',
...     'Add/Update Listing', 'Business Profile Login', 'F.A.Q.']

Then:

>>> import re
>>> pattern = re.compile(r"^[A-Z\s&]+$")
>>> output = [x for x in input if pattern.match(x)]
>>> output
['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS',
'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE', 'GYMNASIUMS',
'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS',
'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS',
'EXERCISE & PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS',
'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS',
'PERSONAL FITNESS TRAINERS']

--
Stephen Hansen
m e @ i x o k a i . i o
| From | DFS <nospam@dfs.com> |
|---|---|
| Date | 2016-05-05 08:44 -0400 |
| Message-ID | <ngff1f$649$1@dont-email.me> |
| In reply to | #108160 |
On 5/5/2016 1:39 AM, Stephen Hansen wrote:
> pattern = re.compile(r"^[A-Z\s&]+$")
> output = [x for x in list if pattern.match(x)]

Holy Shr"^[A-Z\s&]+$"

One line of parsing! I was figuring a few list comprehensions would do
it - this is better.

(Note: the reason I specified 'spaces around ampersand' is so it would
remove 'Q&A' if that ever came up - but some people write 'Q & A', so
I'll live with that exception, or try to tweak it myself.)

You're the man, man. Thank you!
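
The "spaces around ampersand" requirement mentioned above can be expressed in the pattern itself. The following is only a sketch of one way to do it: the names `STRICT` and `keep` are made up for illustration, and the pattern assumes entries are upper-case ASCII words separated by single spaces or by " & ".

```python
import re

# Hypothetical stricter pattern (not from the thread): upper-case words
# joined either by one space or by " & ". \Z anchors at the very end.
STRICT = re.compile(r"[A-Z]+(?:(?: & | )[A-Z]+)*\Z")

def keep(item):
    """Return True if item is all-caps words with properly spaced ampersands."""
    return STRICT.match(item) is not None

samples = ['HEALTH CLUBS & GYMNASIUMS', 'GYMNASIUMS', 'Q&A', 'Q & A',
           'Tweet', '5', 'F.A.Q.']
kept = [s for s in samples if keep(s)]
print(kept)
```

With this pattern 'Q&A' is dropped while 'Q & A' still passes, which matches the exception described in the post.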
| From | DFS <nospam@dfs.com> |
|---|---|
| Date | 2016-05-05 19:31 -0400 |
| Message-ID | <nggku4$p6n$1@dont-email.me> |
| In reply to | #108160 |
On 5/5/2016 1:39 AM, Stephen Hansen wrote:
> Given:
>
> >>> input = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)',
> ...     'Health Clubs & Gymnasiums (42)', 'Health Fitness Clubs', 'Name',
> ...     'Atlanta city guide', 'edit address', 'Tweet',
> ...     'PHYSICAL FITNESS CONSULTANTS & TRAINERS',
> ...     'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS',
> ...     'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE',
> ...     'www.lafitness.com', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS',
> ...     'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com',
> ...     'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS',
> ...     'EXERCISE & PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS',
> ...     'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS',
> ...     'PERSONAL FITNESS TRAINERS', '5', '4', '3', '2', '1',
> ...     'Yellow Pages', 'About Us', 'Contact Us', 'Support',
> ...     'Terms of Use', 'Privacy Policy', 'Advertise With Us',
> ...     'Add/Update Listing', 'Business Profile Login', 'F.A.Q.']
>
> Then:
>
> >>> pattern = re.compile(r"^[A-Z\s&]+$")
> >>> output = [x for x in list if pattern.match(x)]
> >>> output
> ['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS',
> 'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE', 'GYMNASIUMS',
> 'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS', 'PERSONAL FITNESS
> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS
> PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS
> & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS']

Should've looked earlier. Their master list of categories
http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes,
and the ampersands we talked about.

"OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the comma.

"AUTOMOBILE - DEALERS" gets removed because of the dash.

I updated your regex and it seems to have fixed it.

orig: (r"^[A-Z\s&]+$")
new : (r"^[A-Z\s&,-]+$")

Thanks again.
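
A quick way to check the effect of the updated character class is to run both patterns over the two category names quoted in the post. This is a small sketch; the variable names `old`, `new` and `names` are chosen here for illustration.

```python
import re

old = re.compile(r"^[A-Z\s&]+$")
# A "-" placed last inside [...] is a literal hyphen, not a range.
new = re.compile(r"^[A-Z\s&,-]+$")

names = ["OFFICE SERVICES, SUPPLIES & EQUIPMENT",
         "AUTOMOBILE - DEALERS",
         "www.lafitness.com"]

for name in names:
    # Columns: entry, accepted by the old pattern, accepted by the new one.
    print(name, bool(old.match(name)), bool(new.match(name)))
```

The first two names are rejected by the original pattern and accepted by the new one; the URL is rejected by both.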
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2016-05-06 09:45 +0200 |
| Message-ID | <mailman.428.1462520743.32212.python-list@python.org> |
| In reply to | #108202 |
DFS wrote:
> On 5/5/2016 1:39 AM, Stephen Hansen wrote:
>
>> Given:
>>
>>>>> input = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs &
>>>>> Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city
>>>>> guide', 'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS &
>>>>> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS',
>>>>> 'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE',
>>>>> 'www.lafitness.com', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS',
>>>>> 'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com',
>>>>> 'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE &
>>>>> PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS &
>>>>> GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS',
>>>>> '5', '4', '3', '2', '1', 'Yellow Pages', 'About Us', 'Contact Us',
>>>>> 'Support', 'Terms of Use', 'Privacy Policy', 'Advertise With Us', 'Add
> /Update Listing', 'Business Profile Login', 'F.A.Q.']
>>
>> Then:
>>
>>>>> pattern = re.compile(r"^[A-Z\s&]+$")
>>>>> output = [x for x in list if pattern.match(x)]
>>>>> output
>
>> ['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS',
>> 'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE', 'GYMNASIUMS',
>> 'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS', 'PERSONAL FITNESS
>> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS
>> PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS
>> & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS']
>
>
> Should've looked earlier. Their master list of categories
> http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes,
> and the ampersands we talked about.
>
> "OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the comma.
>
> "AUTOMOBILE - DEALERS" gets removed because of the dash.
>
> I updated your regex and it seems to have fixed it.
>
> orig: (r"^[A-Z\s&]+$")
> new : (r"^[A-Z\s&,-]+$")
>
>
> Thanks again.
If there is a "master list", compare your candidates against it instead of
using a heuristic, i.e.

categories = set(master_list)
output = [category for category in input if category in categories]

You can find the categories with

>>> import urllib.request
>>> import bs4
>>> soup = bs4.BeautifulSoup(urllib.request.urlopen("http://www.usdirectory.com/cat/g0").read())
>>> categories = set()
>>> for li in soup.find_all("li"):
...     assert li.parent.parent["class"][0].startswith("category_items")
...     categories.add(li.text)
...
>>> print("\n".join(sorted(categories)[:10]))
Accounting & Bookkeeping Services
Adoption Services
Adult Entertainment
Advertising
Agricultural Equipment & Supplies
Agricultural Production
Agricultural Services
Aids Resources
Aircraft Charters & Rentals
Aircraft Dealers & Services
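
One detail worth noting: the category names printed above are in title case, while the entries DFS wants to keep are in upper case, so a direct membership test might not find them. A minimal sketch of one way around that, assuming the two spellings differ only in case; `master_list` and `candidates` below are hypothetical stand-ins, not data scraped from the site.

```python
# Hypothetical stand-in for the scraped master list (title case).
master_list = ["Health Clubs & Gymnasiums", "Fitness Centers", "Advertising"]

# Hypothetical scraped candidates, upper case as in the original post.
candidates = ["HEALTH CLUBS & GYMNASIUMS", "www.lafitness.com",
              "FITNESS CENTERS", "Tweet"]

# Upper-case the master list once; the filter is then a plain set lookup
# that only accepts candidates which are already upper case.
categories = {name.upper() for name in master_list}
output = [item for item in candidates if item in categories]
print(output)
```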
| From | DFS <nospam@dfs.com> |
|---|---|
| Date | 2016-05-06 09:58 -0400 |
| Message-ID | <ngi7nr$6iu$1@dont-email.me> |
| In reply to | #108218 |
On 5/6/2016 3:45 AM, Peter Otten wrote:
> DFS wrote:
>> Should've looked earlier. Their master list of categories
>> http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes,
>> and the ampersands we talked about.
>>
>> "OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the comma.
>>
>> "AUTOMOBILE - DEALERS" gets removed because of the dash.
>>
>> I updated your regex and it seems to have fixed it.
>>
>> orig: (r"^[A-Z\s&]+$")
>> new : (r"^[A-Z\s&,-]+$")
>>
>>
>> Thanks again.
>
> If there is a "master list" compare your candidates against it instead of
> using a heuristic, i. e.
>
> categories = set(master_list)
> output = [category for category in input if category in categories]
>
> You can find the categories with
>
>>>> import urllib.request
>>>> import bs4
>>>> soup =
> bs4.BeautifulSoup(urllib.request.urlopen("http://www.usdirectory.com/cat/g0").read())
>>>> categories = set()
>>>> for li in soup.find_all("li"):
> ... assert li.parent.parent["class"][0].startswith("category_items")
> ... categories.add(li.text)
> ...
>>>> print("\n".join(sorted(categories)[:10]))
"import urllib.request
ImportError: No module named request"
I'm on python 2.7.11
> Accounting & Bookkeeping Services
> Adoption Services
> Adult Entertainment
> Advertising
> Agricultural Equipment & Supplies
> Agricultural Production
> Agricultural Services
> Aids Resources
> Aircraft Charters & Rentals
> Aircraft Dealers & Services
Yeah, I actually did something like that last night. Was trying to get
their full tree structure, which goes 4 levels deep, i.e.
Arts & Entertainment
  Newspapers
    News Dealers
      Prepress Services
What I referred to as their 'master list' is actually just 2 levels
deep. My bad.
So far I haven't come across one that had anything in it but letters,
dashes, commas or ampersands.
Thanks
| From | DFS <nospam@dfs.com> |
|---|---|
| Date | 2016-05-06 10:41 -0400 |
| Message-ID | <ngia8b$g1l$1@dont-email.me> |
| In reply to | #108226 |
On 5/6/2016 9:58 AM, DFS wrote:
> On 5/6/2016 3:45 AM, Peter Otten wrote:
>> DFS wrote:
>
>>> Should've looked earlier. Their master list of categories
>>> http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes,
>>> and the ampersands we talked about.
>>>
>>> "OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the
>>> comma.
>>>
>>> "AUTOMOBILE - DEALERS" gets removed because of the dash.
>>>
>>> I updated your regex and it seems to have fixed it.
>>>
>>> orig: (r"^[A-Z\s&]+$")
>>> new : (r"^[A-Z\s&,-]+$")
>>>
>>>
>>> Thanks again.
>>
>> If there is a "master list" compare your candidates against it instead of
>> using a heuristic, i. e.
>>
>> categories = set(master_list)
>> output = [category for category in input if category in categories]
>>
>> You can find the categories with
>>
>>>>> import urllib.request
>>>>> import bs4
>>>>> soup =
>> bs4.BeautifulSoup(urllib.request.urlopen("http://www.usdirectory.com/cat/g0").read())
>>
>>>>> categories = set()
>>>>> for li in soup.find_all("li"):
>> ... assert li.parent.parent["class"][0].startswith("category_items")
>> ... categories.add(li.text)
>> ...
>>>>> print("\n".join(sorted(categories)[:10]))
>
>
>
> "import urllib.request
> ImportError: No module named request"
Figured it out using urllib2. Your code returns 411 categories from
that first page.
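
For readers hitting the same ImportError: `urllib.request` is the Python 3 location, and Python 2.7 offers `urlopen` in `urllib2` instead. A small compatibility shim, sketched here with a hypothetical helper named `fetch`, lets the same script run on either version.

```python
# Pick whichever urlopen the running interpreter provides.
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

def fetch(url):
    """Return the raw bytes of the page at url (hypothetical helper)."""
    return urlopen(url).read()
```

The result of `fetch("http://www.usdirectory.com/cat/g0")` could then be passed to `bs4.BeautifulSoup` as in the quoted code.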
There are up to 4 levels of categorization:
Level 1: Arts & Entertainment
Level 2: Newspapers
Level 3: Newspaper Brokers
Level 3: Newspaper Dealers Back Number
Level 3: Newspaper Delivery
Level 3: Newspaper Distributors
Level 3: Newsracks
Level 3: Printers Newspapers
Level 3: Newspaper Dealers
Level 3: News Dealers
Level 4: News Dealers Wholesale
Level 4: Shoppers News Publications
Level 3: News Service
Level 4: Newspaper Feature Syndicates
Level 4: Prepress Services
http://www.usdirectory.com/cat/g0 shows 21 Level 1 categories, and 390
Level 2. To get the Level 3 and 4 you have to drill-down using the
hyperlinks.
How to do it in python code is beyond my skills at this point. Get the
hrefs and load them and parse, then get the next level and load them and
parse, etc.?
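
The drill-down described above (get the hrefs, load them, parse, repeat) can be sketched as a recursive generator. This is an illustration under stated assumptions, not a scraper for the actual site: it is Python 3 only, it treats every link on a page as a sub-category, and `LinkCollector`, `crawl` and the in-memory `pages` dictionary are hypothetical names. The `fetch` argument is any callable that returns a page's HTML as text.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def crawl(url, fetch, max_depth, depth=1, seen=None):
    """Yield (depth, name) per link, following links down to max_depth."""
    seen = set() if seen is None else seen
    if url in seen or depth > max_depth:
        return
    seen.add(url)
    parser = LinkCollector()
    parser.feed(fetch(url))
    parser.close()
    for href, name in parser.links:
        yield depth, name
        yield from crawl(urljoin(url, href), fetch, max_depth, depth + 1, seen)

# Offline demo: a tiny fake site instead of real network access.
pages = {
    "http://example.invalid/cat/g0": '<a href="/cat/news">Newspapers</a>',
    "http://example.invalid/cat/news":
        '<a href="/cat/dealers">News Dealers</a>'
        '<a href="/cat/service">News Service</a>',
    "http://example.invalid/cat/dealers": "",
    "http://example.invalid/cat/service":
        '<a href="/cat/prepress">Prepress Services</a>',
    "http://example.invalid/cat/prepress": "",
}
result = list(crawl("http://example.invalid/cat/g0", pages.__getitem__, 4))
for depth, name in result:
    print("  " * (depth - 1) + name)
```

On a real site the link filter would need to be narrower (for example, only links inside the category list), and a polite crawler would add delays between requests.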
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2016-05-06 17:44 +0200 |
| Message-ID | <mailman.434.1462549484.32212.python-list@python.org> |
| In reply to | #108229 |
DFS wrote:
> There are up to 4 levels of categorization:
>
> http://www.usdirectory.com/cat/g0 shows 21 Level 1 categories, and 390
> Level 2. To get the Level 3 and 4 you have to drill-down using the
> hyperlinks.
>
> How to do it in python code is beyond my skills at this point. Get the
> hrefs and load them and parse, then get the next level and load them and
> parse, etc.?

Yes, that should work ;)
| From | DFS <nospam@dfs.com> |
|---|---|
| Date | 2016-05-06 18:43 -0400 |
| Message-ID | <ngj6fc$r00$1@dont-email.me> |
| In reply to | #108232 |
On 5/6/2016 11:44 AM, Peter Otten wrote:
> DFS wrote:
>
>> There are up to 4 levels of categorization:
>
>> http://www.usdirectory.com/cat/g0 shows 21 Level 1 categories, and 390
>> Level 2. To get the Level 3 and 4 you have to drill-down using the
>> hyperlinks.
>>
>> How to do it in python code is beyond my skills at this point. Get the
>> hrefs and load them and parse, then get the next level and load them and
>> parse, etc.?
>
> Yes, that should work ;)

How about you do it, and I'll tell you if you did it right?

ha!
| From | alister <alister.ware@ntlworld.com> |
|---|---|
| Date | 2016-05-06 10:01 +0000 |
| Message-ID | <AXZWy.264624$GG.250375@fx36.am4> |
| In reply to | #108202 |
On Thu, 05 May 2016 19:31:33 -0400, DFS wrote:
> On 5/5/2016 1:39 AM, Stephen Hansen wrote:
>
>> Given:
>>
>> >>> input = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs
>> >>> & Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city
>> >>> guide', 'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS &
>> >>> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS',
>> >>> 'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE',
>> >>> 'www.lafitness.com', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS',
>> >>> 'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com',
>> >>> 'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE
>> >>> & PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS &
>> >>> GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS
>> >>> TRAINERS', '5', '4', '3', '2', '1', 'Yellow Pages', 'About Us',
>> >>> 'Contact Us', 'Support', 'Terms of Use', 'Privacy Policy',
>> >>> 'Advertise With Us', 'Add/Update Listing', 'Business Profile Login',
>> >>> 'F.A.Q.']
>>
>> Then:
>>
>> >>> pattern = re.compile(r"^[A-Z\s&]+$")
>> >>> output = [x for x in list if pattern.match(x)]
>> >>> output
>
>> ['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS',
>> 'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE', 'GYMNASIUMS',
>> 'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS', 'PERSONAL FITNESS
>> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS
>> PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH
>> CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS']
>
>
> Should've looked earlier. Their master list of categories
> http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes,
> and the ampersands we talked about.
>
> "OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the
> comma.
>
> "AUTOMOBILE - DEALERS" gets removed because of the dash.
>
> I updated your regex and it seems to have fixed it.
>
> orig: (r"^[A-Z\s&]+$")
> new : (r"^[A-Z\s&,-]+$")
>
>
> Thanks again.

It looks to me like this system is trying to prevent SQL injection
attacks by blacklisting certain characters.

This is not the correct way to block such attacks & is probably not a
good indicator of the quality of the rest of the application.

--
When love is gone, there's always justice.
And when justice is gone, there's always force.
And when force is gone, there's always Mom.
Hi, Mom!
-- Laurie Anderson
| From | Jussi Piitulainen <jussi.piitulainen@helsinki.fi> |
|---|---|
| Date | 2016-05-05 08:53 +0300 |
| Message-ID | <lf5mvo5dp8t.fsf@ling.helsinki.fi> |
| In reply to | #108158 |
DFS writes:

. .
> Want to keep all elements containing only upper case letters or upper
> case letters and ampersand (where ampersand is surrounded by spaces)
>
> Is it easier to extract elements meeting those conditions, or remove
> elements meeting the following conditions:
>
> * elements with a lower-case letter in them
> * elements with a number in them
> * elements with a period in them
>
> ?
>
>
> So far all I figured out is remove items with a period:
> newlist = [ x for x in oldlist if "." not in x ]
>

Either way is easy to approximate with a regex:

import re
upper = re.compile(r'[A-Z &]+')
lower = re.compile(r'[^A-Z &]')
print([datum for datum in data if upper.fullmatch(datum)])
print([datum for datum in data if not lower.search(datum)])

I've skipped testing that the ampersand is between spaces, and I've
skipped the period. Adjust.

This considers only ASCII upper case letters. You can add individual
letters that matter to you, or you can reach for the documentation to
find if there is some generic notation for all upper case letters.

The newer regex package on PyPI supports POSIX character classes like
[:upper:], I think, and there may or may not be notation for Unicode
character categories in re or regex - LU would be Letter, Uppercase.
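
For the "all upper case letters" question, one option that stays inside the standard library is to test each character's Unicode category directly instead of using a regex. A sketch assuming Python 3 (where strings are Unicode); the function name `is_upper_label` and the sample data are made up for illustration. Note also that `fullmatch`, used above, is only available from Python 3.4 onwards.

```python
import unicodedata

def is_upper_label(text):
    """True if text contains at least one letter and every character is an
    uppercase letter (Unicode category Lu), a space, or an ampersand."""
    has_letter = False
    for ch in text:
        if unicodedata.category(ch) == "Lu":
            has_letter = True
        elif ch not in " &":
            return False
    return has_letter

data = ["FITNESS CENTERS", "ΔΣΘΛ", "Español", "HEALTH & FITNESS CLUBS",
        "&&&", "5"]
selected = [datum for datum in data if is_upper_label(datum)]
print(selected)
```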
| From | DFS <nospam@dfs.com> |
|---|---|
| Date | 2016-05-05 08:57 -0400 |
| Message-ID | <ngffpa$8lb$1@dont-email.me> |
| In reply to | #108161 |
On 5/5/2016 1:53 AM, Jussi Piitulainen wrote:
> Either way is easy to approximate with a regex:
>
> import re
> upper = re.compile(r'[A-Z &]+')
> lower = re.compile(r'[^A-Z &]')
> print([datum for datum in data if upper.fullmatch(datum)])
> print([datum for datum in data if not lower.search(datum)])

This is similar to Hansen's solution.

> I've skipped testing that the ampersand is between spaces, and I've
> skipped the period. Adjust.

Will do.

> This considers only ASCII upper case letters. You can add individual
> letters that matter to you, or you can reach for the documentation to
> find if there is some generic notation for all upper case letters.
>
> The newer regex package on PyPI supports POSIX character classes like
> [:upper:], I think, and there may or may not be notation for Unicode
> character categories in re or regex - LU would be Letter, Uppercase.

Thanks.
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2016-05-05 16:04 +1000 |
| Message-ID | <572ae25f$0$2821$c3e8da3$76491128@news.astraweb.com> |
| In reply to | #108158 |
On Thursday 05 May 2016 14:58, DFS wrote:
> Want to whittle a list like this:
[...]
> Want to keep all elements containing only upper case letters or upper
> case letters and ampersand (where ampersand is surrounded by spaces)
Start by writing a function or a regex that will distinguish strings that
match your conditions from those that don't. A regex might be faster, but
here's a function version.
def isupperalpha(string):
    return string.isalpha() and string.isupper()

def check(string):
    if isupperalpha(string):
        return True
    parts = string.split("&")
    if len(parts) < 2:
        return False
    # Don't strip leading spaces from the start of the string.
    parts[0] = parts[0].rstrip(" ")
    # Or trailing spaces from the end of the string.
    parts[-1] = parts[-1].lstrip(" ")
    # But strip leading and trailing spaces from the middle parts
    # (if any).
    for i in range(1, len(parts)-1):
        parts[i] = parts[i].strip(" ")
    return all(isupperalpha(part) for part in parts)
Now you have two ways of filtering this. The obvious way is to extract
elements which meet the condition. Here are two ways:
# List comprehension.
newlist = [item for item in oldlist if check(item)]
# Filter, Python 2 version
newlist = filter(check, oldlist)
# Filter, Python 3 version
newlist = list(filter(check, oldlist))
In practice, this is the best (fastest, simplest) way. But if you fear that
you will run out of memory dealing with absolutely humongous lists with
hundreds of millions or billions of strings, you can remove items in place:
def remove(func, alist):
    for i in range(len(alist)-1, -1, -1):
        if not func(alist[i]):
            del alist[i]
Note the magic incantation to iterate from the end of the list towards the
front. If you do it the other way, Bad Things happen. Note that this will
use less memory than extracting the items, but it will be much slower.
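
To make the "Bad Things" concrete, here is a small sketch comparing a deliberately buggy forward loop with the backward loop from the post. The function names `remove_forward` and `remove_backward` are chosen here for illustration; skipped elements are one possible failure, and an index-based forward loop can also raise IndexError.

```python
def remove_forward(func, alist):
    """Buggy on purpose: deleting while walking forward skips elements."""
    for i, item in enumerate(alist):
        if not func(item):
            del alist[i]

def remove_backward(func, alist):
    """Same idea as remove() in the post: walk from the end to the front."""
    for i in range(len(alist) - 1, -1, -1):
        if not func(alist[i]):
            del alist[i]

forward = ["a", "b", "C", "d"]
backward = list(forward)
remove_forward(str.isupper, forward)
remove_backward(str.isupper, backward)
print(forward)   # "b" survives because the loop stepped over it
print(backward)
```

After deleting "a", every later element shifts one place left, so the forward loop never examines "b".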
You can combine the best of both worlds. Here is a version that uses a
temporary list to modify the original in place:
# works in both Python 2 and 3
def remove(func, alist):
    # Modify list in place, the fast way.
    alist[:] = filter(func, alist)
--
Steve
| From | Stephen Hansen <me+python@ixokai.io> |
|---|---|
| Date | 2016-05-04 23:46 -0700 |
| Message-ID | <mailman.398.1462430769.32212.python-list@python.org> |
| In reply to | #108162 |
On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote:
> Start by writing a function or a regex that will distinguish strings that
> match your conditions from those that don't. A regex might be faster, but
> here's a function version.
> ... snip ...

Yikes. I'm all for the idea that one shouldn't go to regex when Python's
powerful string type can answer the problem more clearly, but this seems
to go out of its way to do otherwise.

I don't even care about faster: It's overly complicated. Sometimes a
regular expression really is the clearest way to solve a problem.

--
Stephen Hansen
m e @ i x o k a i . i o
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2016-05-05 17:04 +1000 |
| Message-ID | <572af09d$0$1508$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #108163 |
On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
> On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote:
>> Start by writing a function or a regex that will distinguish strings that
>> match your conditions from those that don't. A regex might be faster, but
>> here's a function version.
>> ... snip ...
>
> Yikes. I'm all for the idea that one shouldn't go to regex when Python's
> powerful string type can answer the problem more clearly, but this seems
> to go out of its way to do otherwise.
>
> I don't even care about faster: Its overly complicated. Sometimes a
> regular expression really is the clearest way to solve a problem.
You're probably right, but I find it easier to reason about matching in
Python rather than the overly terse, cryptic regular expression mini-
language.
I haven't tested my function version, but I'm 95% sure that it is correct.
The trickiest part of it is the logic about splitting around ampersands. And
I'll cheerfully admit that it isn't easy to extend to (say) "ampersand, or
at signs". But your regex solution:
r"^[A-Z\s&]+$"
is much smaller and more compact, but *wrong*. For instance, your regex
wrongly accepts both "&&&&&" and " " as valid strings, and wrongly
rejects "ΔΣΘΛ". Your Greek customers will be sad...
Oh, I just realised, I should have looked more closely at the examples
given, because the specification given by DFS does not match the examples.
DFS says that only uppercase letters and ampersands are allowed, but their
examples include strings with spaces, e.g. 'FITNESS CENTERS' despite the
lack of ampersands. (I read the spec literally as spaces only allowed if
they surround an ampersand.) Oops, mea culpa. That makes the check function
much simpler and easier to extend:
def check(string):
    string = string.replace("&", "").replace(" ", "")
    return string.isalpha() and string.isupper()
and now I'm 95% confident it is correct without testing, this time for sure!
;-)
--
Steve
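
The disagreement between the regex and the simplified check function can be seen by running both over the edge cases named in the post. A sketch assuming Python 3; the list of samples is chosen here for illustration.

```python
import re

pattern = re.compile(r"^[A-Z\s&]+$")

def check(string):
    # Simplified function from the post: ignore "&" and spaces, then
    # require the remainder to be upper-case letters.
    string = string.replace("&", "").replace(" ", "")
    return string.isalpha() and string.isupper()

samples = ["FITNESS CENTERS", "&&&&&", " ", "ΔΣΘΛ", "Tweet"]
results = [(s, bool(pattern.match(s)), check(s)) for s in samples]
for sample, by_regex, by_check in results:
    print(repr(sample), by_regex, by_check)
```

The two agree on 'FITNESS CENTERS' and 'Tweet' and disagree on the other three: the regex accepts '&&&&&' and ' ' and rejects 'ΔΣΘΛ', while the function does the opposite.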
| From | Stephen Hansen <me+python@ixokai.io> |
|---|---|
| Date | 2016-05-05 00:34 -0700 |
| Message-ID | <mailman.401.1462433672.32212.python-list@python.org> |
| In reply to | #108164 |
On Thu, May 5, 2016, at 12:04 AM, Steven D'Aprano wrote:
> On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
>
> > On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote:
> >> Start by writing a function or a regex that will distinguish strings that
> >> match your conditions from those that don't. A regex might be faster, but
> >> here's a function version.
> >> ... snip ...
> >
> > Yikes. I'm all for the idea that one shouldn't go to regex when Python's
> > powerful string type can answer the problem more clearly, but this seems
> > to go out of its way to do otherwise.
> >
> > I don't even care about faster: Its overly complicated. Sometimes a
> > regular expression really is the clearest way to solve a problem.
>
> You're probably right, but I find it easier to reason about matching in
> Python rather than the overly terse, cryptic regular expression mini-
> language.
>
> I haven't tested my function version, but I'm 95% sure that it is correct.
> It trickiest part of it is the logic about splitting around ampersands. And
> I'll cheerfully admit that it isn't easy to extend to (say) "ampersand, or
> at signs". But your regex solution:
>
> r"^[A-Z\s&]+$"
>
> is much smaller and more compact, but *wrong*. For instance, your regex
> wrongly accepts both "&&&&&" and " " as valid strings, and wrongly
> rejects "ΔΣΘΛ". Your Greek customers will be sad...

Meh. You have a pedantic definition of wrong. Given the inputs, it
produced right output. Very often that's enough. Perfect is the enemy of
good, it's said.

There's no situation where "&&&&&" and " " will exist in the given
dataset, and recognizing that is important. You don't have to account
for every bit of nonsense.

If the OP needs a unicode-aware solution, that redefines "A-Z" as perhaps
"\w" with an isupper call. It's still far simpler than you're suggesting.

--
Stephen Hansen
m e @ i x o k a i . i o
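
The "\w with an isupper call" idea might look roughly like the sketch below. It is one interpretation rather than code from the thread, and it assumes Python 3, where `\w` is Unicode-aware for str patterns. Plain `\w` would also admit digits and underscores (and "GYM 24".isupper() is True), so the character class here is narrowed to letters; the names `ALLOWED` and `check` are hypothetical.

```python
import re

# [^\W\d_] means "\w minus digits and underscore", i.e. letters of any
# script. Whitespace and "&" are allowed alongside them.
ALLOWED = re.compile(r"(?:[^\W\d_]|[\s&])+\Z")

def check(text):
    """True if text uses only letters, whitespace and "&", all upper case."""
    return ALLOWED.match(text) is not None and text.isupper()

samples = ["HEALTH & FITNESS CLUBS", "ΔΣΘΛ", "Tweet", "&&&&&", " ", "GYM 24"]
accepted = [s for s in samples if check(s)]
print(accepted)
```

Because `str.isupper()` requires at least one cased character, strings made only of ampersands or spaces are rejected as a side effect.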
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2016-05-05 18:41 +1000 |
| Message-ID | <572b073e$0$1611$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #108165 |
On Thursday 05 May 2016 17:34, Stephen Hansen wrote:
> Meh. You have a pedantic definition of wrong. Given the inputs, it
> produced right output. Very often that's enough. Perfect is the enemy of
> good, it's said.
And this is a *perfect* example of why we have things like this:
http://www.bbc.com/future/story/20160325-the-names-that-break-computer-systems
"Nobody will ever be called Null."
"Nobody has quotation marks in their name."
"Nobody will have a + sign in their email address."
"Nobody has a legal gender other than Male or Female."
"Nobody will lean on the keyboard and enter gobbledygook into our form."
"Nobody will try to write more data than the space they allocated for it."
> There's no situation where "&&&&&" and " " will exist in the given
> dataset, and recognizing that is important. You don't have to account
> for every bit of nonsense.
Whenever a programmer says "This case will never happen", ten thousand
computers crash.
http://www.kr41.net/2016/05-03-shit_driven_development.html
--
Steven D'Aprano
| From | Random832 <random832@fastmail.com> |
|---|---|
| Date | 2016-05-05 09:13 -0400 |
| Message-ID | <mailman.404.1462454017.32212.python-list@python.org> |
| In reply to | #108168 |
On Thu, May 5, 2016, at 04:41, Steven D'Aprano wrote:
> > There's no situation where "&&&&&" and " " will exist in the given
> > dataset, and recognizing that is important. You don't have to account
> > for every bit of nonsense.
>
> Whenever a programmer says "This case will never happen", ten thousand
> computers crash.
What crash can including such an entry in the output list cause?
Should the regex also ensure that the data only includes *english words*
separated by space-ampersand-space?
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2016-05-06 03:13 +1000 |
| Message-ID | <572b7f41$0$1598$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #108175 |
On Thu, 5 May 2016 11:13 pm, Random832 wrote:
> On Thu, May 5, 2016, at 04:41, Steven D'Aprano wrote:
>> > There's no situation where "&&&&&" and " " will exist in the given
>> > dataset, and recognizing that is important. You don't have to account
>> > for every bit of nonsense.
>>
>> Whenever a programmer says "This case will never happen", ten thousand
>> computers crash.
>
> What crash can including such an entry in the output list cause?
How do I know? It depends what you do with that list. But if you assume that
your list contains alphabetical strings, and pass it on to code that expects
alphabetical strings, why is it so hard to believe that it might choke when
it receives a non-alphabetical string?
> Should the regex also ensure that the data only includes *english words*
> separated by space-ampersand-space?
That wasn't part of the specification. But for some applications, yes, you
should ensure the data includes only English words.
--
Steven
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2016-05-05 17:36 +1000 |
| Message-ID | <572af811$0$1608$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #108163 |
Oh, a further thought...
On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
> On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote:
>> Start by writing a function or a regex that will distinguish strings that
>> match your conditions from those that don't. A regex might be faster, but
>> here's a function version.
>> ... snip ...
>
> Yikes. I'm all for the idea that one shouldn't go to regex when Python's
> powerful string type can answer the problem more clearly, but this seems
> to go out of its way to do otherwise.
>
> I don't even care about faster: It's overly complicated. Sometimes a
> regular expression really is the clearest way to solve a problem.
Putting non-ASCII letters aside for the moment, how would you match these
specs as a regular expression?
- All uppercase ASCII letters (A to Z only), optionally separated into words
by either a bare ampersand (e.g. "AAA&AAA") or an ampersand with leading and
trailing spaces (spaces only, not arbitrary whitespace): "AAA & AAA".
- The number of spaces on either side of the ampersands need not be the
same: "AAA& BBB & CCC" should match.
- Leading or trailing spaces, or spaces not surrounding an ampersand, must
not match: "AAA BBB" must be rejected.
- Leading or trailing ampersands must also be rejected. This includes the
case where the string is nothing but ampersands.
- Consecutive ampersands "AAA&&&BBB" and the empty string must be rejected.
I get something like this:
r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"
but it fails on strings like "AA & A & A". What am I doing wrong?
For the record, here's my brief test suite:
import re
def test(pat):
    # These strings must all be rejected.
    for s in ("", " ", "&", "A A", "A&", "&A", "A&&A", "A& &A"):
        assert re.match(pat, s) is None
    # These strings must all be accepted.
    for s in ("A", "A & A", "AA&A", "AA & A & A"):
        assert re.match(pat, s)
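One possible answer, offered as a sketch rather than a reply from the thread: in the pattern above, every repetition of the group needs letters on both sides of its ampersand, so in "AA & A & A" the middle "A" would have to be consumed twice. Matching one leading word and then repeating "ampersand plus word" avoids that. The names `SPEC_PATTERN` and `is_valid`, and the choice of `re.fullmatch` over `^...$`, are assumptions made for illustration.

```python
import re

# Hypothetical pattern: one word, then zero or more "ampersand + word" groups.
# " *" is a literal space repeated, so tabs and newlines are not accepted.
SPEC_PATTERN = re.compile(r"[A-Z]+(?: *& *[A-Z]+)*")


def is_valid(s):
    """Return True if the whole string satisfies the specs listed above."""
    return SPEC_PATTERN.fullmatch(s) is not None


rejected = ["", " ", "&", "&&&&&", "A A", "A&", "&A", "A&&A", "A& &A",
            "AAA BBB", " A", "A "]
accepted = ["A", "A & A", "AA&A", "AA & A & A", "AAA& BBB & CCC"]

assert not any(is_valid(s) for s in rejected)
assert all(is_valid(s) for s in accepted)
```

Using `fullmatch` also rejects a string with a trailing newline, which `$` with `re.match` would let through.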
--
Steve