Groups > comp.lang.python > #108158 > unrolled thread

Whittle it on down

Started by	DFS <nospam@dfs.com>
First post	2016-05-05 00:58 -0400
Last post	2016-05-05 17:45 -0400
Articles	20 on this page of 41 — 8 participants


  Whittle it on down DFS <nospam@dfs.com> - 2016-05-05 00:58 -0400
    Re: Whittle it on down Stephen Hansen <me+python@ixokai.io> - 2016-05-04 22:39 -0700
      Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-05 08:44 -0400
      Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-05 19:31 -0400
        Re: Whittle it on down Peter Otten <__peter__@web.de> - 2016-05-06 09:45 +0200
          Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-06 09:58 -0400
            Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-06 10:41 -0400
              Re: Whittle it on down Peter Otten <__peter__@web.de> - 2016-05-06 17:44 +0200
                Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-06 18:43 -0400
        Re: Whittle it on down alister <alister.ware@ntlworld.com> - 2016-05-06 10:01 +0000
    Re: Whittle it on down Jussi Piitulainen <jussi.piitulainen@helsinki.fi> - 2016-05-05 08:53 +0300
      Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-05 08:57 -0400
    Re: Whittle it on down Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-05-05 16:04 +1000
      Re: Whittle it on down Stephen Hansen <me+python@ixokai.io> - 2016-05-04 23:46 -0700
        Re: Whittle it on down Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-05-05 17:04 +1000
          Re: Whittle it on down Stephen Hansen <me+python@ixokai.io> - 2016-05-05 00:34 -0700
            Re: Whittle it on down Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-05-05 18:41 +1000
              Re: Whittle it on down Random832 <random832@fastmail.com> - 2016-05-05 09:13 -0400
                Re: Whittle it on down Steven D'Aprano <steve@pearwood.info> - 2016-05-06 03:13 +1000
        Re: Whittle it on down Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-05-05 17:36 +1000
          Re: Whittle it on down Peter Otten <__peter__@web.de> - 2016-05-05 10:17 +0200
            Re: Whittle it on down Steven D'Aprano <steve@pearwood.info> - 2016-05-06 01:39 +1000
          Re: Whittle it on down Random832 <random832@fastmail.com> - 2016-05-05 09:21 -0400
            Re: Whittle it on down Steven D'Aprano <steve@pearwood.info> - 2016-05-06 04:03 +1000
              Re: Whittle it on down Random832 <random832@fastmail.com> - 2016-05-05 14:52 -0400
              Re: Whittle it on down Stephen Hansen <me+python@ixokai.io> - 2016-05-05 12:09 -0700
          Re: Whittle it on down Stephen Hansen <me+python@ixokai.io> - 2016-05-05 06:32 -0700
            Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-05 10:36 -0400
            Re: Whittle it on down Steven D'Aprano <steve@pearwood.info> - 2016-05-06 03:43 +1000
              Re: Whittle it on down Stephen Hansen <me+python@ixokai.io> - 2016-05-05 11:55 -0700
          Re: Whittle it on down Jussi Piitulainen <jussi.piitulainen@helsinki.fi> - 2016-05-05 20:49 +0300
            Re: Whittle it on down Steven D'Aprano <steve@pearwood.info> - 2016-05-06 04:14 +1000
              Re: Whittle it on down Jussi Piitulainen <jussi.piitulainen@helsinki.fi> - 2016-05-05 21:27 +0300
                Re: Whittle it on down Random832 <random832@fastmail.com> - 2016-05-05 14:54 -0400
                Re: Whittle it on down Steven D'Aprano <steve@pearwood.info> - 2016-05-06 10:57 +1000
                  Re: Whittle it on down Jussi Piitulainen <jussi.piitulainen@helsinki.fi> - 2016-05-06 07:19 +0300
      Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-05 08:31 -0400
        Re: Whittle it on down Steven D'Aprano <steve@pearwood.info> - 2016-05-06 03:54 +1000
          Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-05 17:36 -0400
        Re: Whittle it on down Stephen Hansen <me+python@ixokai.io> - 2016-05-05 11:56 -0700
          Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-05 17:45 -0400


#108158 — Whittle it on down

From	DFS <nospam@dfs.com>
Date	2016-05-05 00:58 -0400
Subject	Whittle it on down
Message-ID	<ngejmj$gc4$1@dont-email.me>

Want to whittle a list like this:

[u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs & 
Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city guide', 
'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS & TRAINERS', 
'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 
'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE', 
'www.lafitness.com', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS', 
'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com', 
'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & 
PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & 
GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS', 
'5', '4', '3', '2', '1', 'Yellow Pages', 'About Us', 'Contact Us', 
'Support', 'Terms of Use', 'Privacy Policy', 'Advertise With Us', 
'Add/Update Listing', 'Business Profile Login', 'F.A.Q.']

down to

['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 
'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE', 'GYMNASIUMS', 
'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS',  'PERSONAL FITNESS 
TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS 
PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS 
& GYMNASIUMS', 'PERSONAL FITNESS TRAINERS']



Want to keep all elements containing only upper case letters or upper 
case letters and ampersand (where ampersand is surrounded by spaces)

Is it easier to extract elements meeting those conditions, or remove 
elements meeting the following conditions:

* elements with a lower-case letter in them
* elements with a number in them
* elements with a period in them

?


So far all I figured out is remove items with a period:
newlist = [ x for x in oldlist if "." not in x ]


Thanks for help, python gurus.


#108160

From	Stephen Hansen <me+python@ixokai.io>
Date	2016-05-04 22:39 -0700
Message-ID	<mailman.397.1462426759.32212.python-list@python.org>
In reply to	#108158

On Wed, May 4, 2016, at 09:58 PM, DFS wrote:
> Want to whittle a list like this:
> 
> [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs & 
> Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city guide', 
> 'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS & TRAINERS', 
> 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 
> 'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE', 
> 'www.lafitness.com', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS', 
> 'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com', 
> 'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & 
> PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & 
> GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS', 
> '5', '4', '3', '2', '1', 'Yellow Pages', 'About Us', 'Contact Us', 
> 'Support', 'Terms of Use', 'Privacy Policy', 'Advertise With Us', 
> 'Add/Update Listing', 'Business Profile Login', 'F.A.Q.']
> 
> down to
> 
> ['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 
> 'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE', 'GYMNASIUMS', 
> 'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS',  'PERSONAL FITNESS 
> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS 
> PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS 
> & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS']

Sometimes regular expressions are the tool to do the job:

Given:

>>> input = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs & Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city guide', 'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE', 'www.lafitness.com', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com', 'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS', '5', '4', '3', '2', '1', 'Yellow Pages', 'About Us', 'Contact Us', 'Support', 'Terms of Use', 'Privacy Policy', 'Advertise With Us', 'Add/Update Listing', 'Business Profile Login', 'F.A.Q.']

Then:

>>> import re
>>> pattern = re.compile(r"^[A-Z\s&]+$")
>>> output = [x for x in input if pattern.match(x)]
>>> output
['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS',
'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE', 'GYMNASIUMS',
'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS', 'PERSONAL FITNESS
TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS
PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS
& GYMNASIUMS', 'PERSONAL FITNESS TRAINERS']
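
That pattern does not check what surrounds the ampersand, so a string such as 'Q&A' would also get through. If that matters, one possible tightening is sketched below. The name `strict` and the sample strings are only for illustration, and the sketch assumes words are separated by a single whitespace character.

```python
import re

# Words of A-Z separated by one whitespace character, with an optional
# "& " between words, so every ampersand has a space on both sides.
strict = re.compile(r"^[A-Z]+(?:\s(?:&\s)?[A-Z]+)*$")

samples = ['HEALTH CLUBS & GYMNASIUMS', 'GYMNASIUMS', 'Q&A', 'Q & A',
           '&&&&&', 'Yellow Pages', '5']
kept = [x for x in samples if strict.match(x)]
print(kept)
```

Unlike the character-class version, this also rejects strings made up only of ampersands or only of spaces.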

-- 
Stephen Hansen
  m e @ i x o k a i . i o


#108173

From	DFS <nospam@dfs.com>
Date	2016-05-05 08:44 -0400
Message-ID	<ngff1f$649$1@dont-email.me>
In reply to	#108160

On 5/5/2016 1:39 AM, Stephen Hansen wrote:

> pattern = re.compile(r"^[A-Z\s&]+$")

> output = [x for x in list if pattern.match(x)]

Holy Shr"^[A-Z\s&]+$"  One line of parsing!

I was figuring a few list comprehensions would do it - this is better.

(note: the reason I specified 'spaces around ampersand' is so it would
remove 'Q&A' if that ever came up - but some people write 'Q & A', so
I'll live with that exception, or try to tweak it myself.)

You're the man, man.

Thank you!


#108202

From	DFS <nospam@dfs.com>
Date	2016-05-05 19:31 -0400
Message-ID	<nggku4$p6n$1@dont-email.me>
In reply to	#108160

On 5/5/2016 1:39 AM, Stephen Hansen wrote:

> Given:
>
>>>> input = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs & Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city guide', 'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE', 'www.lafitness.com', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com', 'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS', '5', '4', '3', '2', '1', 'Yellow Pages', 'About Us', 'Contact Us', 'Support', 'Terms of Use', 'Privacy Policy', 'Advertise With Us', 'Add/Update Listing', 'Business Profile Login', 'F.A.Q.']
>
> Then:
>
>>>> pattern = re.compile(r"^[A-Z\s&]+$")
>>>> output = [x for x in list if pattern.match(x)]
>>>> output

> ['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS',
> 'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE', 'GYMNASIUMS',
> 'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS', 'PERSONAL FITNESS
> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS
> PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS
> & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS']


Should've looked earlier.  Their master list of categories 
http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes, 
and the ampersands we talked about.

"OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the comma.

"AUTOMOBILE - DEALERS" gets removed because of the dash.

I updated your regex and it seems to have fixed it.

orig: (r"^[A-Z\s&]+$")
new : (r"^[A-Z\s&,-]+$")
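
A quick check of the two patterns against those names, as a sketch using only the two strings quoted above plus one URL from the original list:

```python
import re

orig = re.compile(r"^[A-Z\s&]+$")
# A "-" placed last inside a character class is a literal dash, not a range.
new = re.compile(r"^[A-Z\s&,-]+$")

names = ["OFFICE SERVICES, SUPPLIES & EQUIPMENT", "AUTOMOBILE - DEALERS",
         "www.lafitness.com"]
for name in names:
    print(name, bool(orig.match(name)), bool(new.match(name)))
```

The URL is still rejected by both patterns, since lower-case letters and periods are not in the class.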


Thanks again.


#108218

From	Peter Otten <__peter__@web.de>
Date	2016-05-06 09:45 +0200
Message-ID	<mailman.428.1462520743.32212.python-list@python.org>
In reply to	#108202

DFS wrote:

> On 5/5/2016 1:39 AM, Stephen Hansen wrote:
> 
>> Given:
>>
>>>>> input = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs &
>>>>> Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city
>>>>> guide', 'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS &
>>>>> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS',
>>>>> 'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE',
>>>>> 'www.lafitness.com', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS',
>>>>> 'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com',
>>>>> 'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE &
>>>>> PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS &
>>>>> GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS',
>>>>> '5', '4', '3', '2', '1', 'Yellow Pages', 'About Us', 'Contact Us',
>>>>> 'Support', 'Terms of Use', 'Privacy Policy', 'Advertise With Us',
>>>>> 'Add/Update Listing', 'Business Profile Login', 'F.A.Q.']
>>
>> Then:
>>
>>>>> pattern = re.compile(r"^[A-Z\s&]+$")
>>>>> output = [x for x in list if pattern.match(x)]
>>>>> output
> 
>> ['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS',
>> 'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE', 'GYMNASIUMS',
>> 'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS', 'PERSONAL FITNESS
>> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS
>> PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS
>> & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS']
> 
> 
> Should've looked earlier.  Their master list of categories
> http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes,
> and the ampersands we talked about.
> 
> "OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the comma.
> 
> "AUTOMOBILE - DEALERS" gets removed because of the dash.
> 
> I updated your regex and it seems to have fixed it.
> 
> orig: (r"^[A-Z\s&]+$")
> new : (r"^[A-Z\s&,-]+$")
> 
> 
> Thanks again.

If there is a "master list" compare your candidates against it instead of 
using a heuristic, i. e.

categories = set(master_list)
output = [category for category in input if category in categories]
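
One thing to watch is letter case: if the master list is stored in a different case than the candidates, an exact comparison finds nothing. A minimal sketch of normalizing first, where `master_list` and `candidates` are hypothetical stand-ins rather than real data from the site:

```python
# Hypothetical stand-ins: the real master list would come from the site.
master_list = ["Health Clubs & Gymnasiums", "Fitness Centers", "Gymnasiums"]
candidates = ["HEALTH CLUBS & GYMNASIUMS", "www.lafitness.com", "Tweet",
              "FITNESS CENTERS", "5"]

# Upper-case the master list once so 'FITNESS CENTERS' matches
# 'Fitness Centers'; set lookup keeps each membership test cheap.
categories = {name.upper() for name in master_list}
output = [c for c in candidates if c in categories]
print(output)
```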

You can find the categories with

>>> import urllib.request
>>> import bs4
>>> soup = bs4.BeautifulSoup(urllib.request.urlopen("http://www.usdirectory.com/cat/g0").read())
>>> categories = set()
>>> for li in soup.find_all("li"):
...     assert li.parent.parent["class"][0].startswith("category_items")
...     categories.add(li.text)
... 
>>> print("\n".join(sorted(categories)[:10]))
Accounting & Bookkeeping Services
Adoption Services
Adult Entertainment
Advertising
Agricultural Equipment & Supplies
Agricultural Production
Agricultural Services
Aids Resources
Aircraft Charters & Rentals
Aircraft Dealers & Services


#108226

From	DFS <nospam@dfs.com>
Date	2016-05-06 09:58 -0400
Message-ID	<ngi7nr$6iu$1@dont-email.me>
In reply to	#108218

On 5/6/2016 3:45 AM, Peter Otten wrote:
> DFS wrote:

>> Should've looked earlier.  Their master list of categories
>> http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes,
>> and the ampersands we talked about.
>>
>> "OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the comma.
>>
>> "AUTOMOBILE - DEALERS" gets removed because of the dash.
>>
>> I updated your regex and it seems to have fixed it.
>>
>> orig: (r"^[A-Z\s&]+$")
>> new : (r"^[A-Z\s&,-]+$")
>>
>>
>> Thanks again.
>
> If there is a "master list" compare your candidates against it instead of
> using a heuristic, i. e.
>
> categories = set(master_list)
> output = [category for category in input if category in categories]
>
> You can find the categories with
>
>>>> import urllib.request
>>>> import bs4
>>>> soup =
> bs4.BeautifulSoup(urllib.request.urlopen("http://www.usdirectory.com/cat/g0").read())
>>>> categories = set()
>>>> for li in soup.find_all("li"):
> ...     assert li.parent.parent["class"][0].startswith("category_items")
> ...     categories.add(li.text)
> ...
>>>> print("\n".join(sorted(categories)[:10]))



"import urllib.request
ImportError: No module named request"


I'm on python 2.7.11





> Accounting & Bookkeeping Services
> Adoption Services
> Adult Entertainment
> Advertising
> Agricultural Equipment & Supplies
> Agricultural Production
> Agricultural Services
> Aids Resources
> Aircraft Charters & Rentals
> Aircraft Dealers & Services




Yeah, I actually did something like that last night.  Was trying to get
their full tree structure, which goes 4 levels deep, e.g.

Arts & Entertainment
   Newspapers
    News Dealers
     Prepress Services


What I referred to as their 'master list' is actually just 2 levels 
deep.  My bad.

So far I haven't come across one that had anything in it but letters, 
dashes, commas or ampersands.

Thanks


#108229

From	DFS <nospam@dfs.com>
Date	2016-05-06 10:41 -0400
Message-ID	<ngia8b$g1l$1@dont-email.me>
In reply to	#108226

On 5/6/2016 9:58 AM, DFS wrote:
> On 5/6/2016 3:45 AM, Peter Otten wrote:
>> DFS wrote:
>
>>> Should've looked earlier.  Their master list of categories
>>> http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes,
>>> and the ampersands we talked about.
>>>
>>> "OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the
>>> comma.
>>>
>>> "AUTOMOBILE - DEALERS" gets removed because of the dash.
>>>
>>> I updated your regex and it seems to have fixed it.
>>>
>>> orig: (r"^[A-Z\s&]+$")
>>> new : (r"^[A-Z\s&,-]+$")
>>>
>>>
>>> Thanks again.
>>
>> If there is a "master list" compare your candidates against it instead of
>> using a heuristic, i. e.
>>
>> categories = set(master_list)
>> output = [category for category in input if category in categories]
>>
>> You can find the categories with
>>
>>>>> import urllib.request
>>>>> import bs4
>>>>> soup =
>> bs4.BeautifulSoup(urllib.request.urlopen("http://www.usdirectory.com/cat/g0").read())
>>
>>>>> categories = set()
>>>>> for li in soup.find_all("li"):
>> ...     assert li.parent.parent["class"][0].startswith("category_items")
>> ...     categories.add(li.text)
>> ...
>>>>> print("\n".join(sorted(categories)[:10]))
>
>
>
> "import urllib.request
> ImportError: No module named request"


Figured it out using urllib2.  Your code returns 411 categories from 
that first page.
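
For anyone else stuck on Python 2, an import fallback along these lines is one way to make the same code run on either version. This is only a sketch; nothing but the import changes.

```python
try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen           # Python 2

# After this, urlopen("http://www.usdirectory.com/cat/g0").read()
# can be passed to bs4.BeautifulSoup on either version.
```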

There are up to 4 levels of categorization:


Level 1: Arts & Entertainment
Level 2:   Newspapers

Level 3:     Newspaper Brokers
Level 3:     Newspaper Dealers Back Number
Level 3:     Newspaper Delivery
Level 3:     Newspaper Distributors
Level 3:     Newsracks
Level 3:     Printers Newspapers
Level 3:     Newspaper Dealers

Level 3:     News Dealers
Level 4:       News Dealers Wholesale
Level 4:       Shoppers News Publications

Level 3:     News Service
Level 4:       Newspaper Feature Syndicates
Level 4:       Prepress Services




http://www.usdirectory.com/cat/g0 shows 21 Level 1 categories, and 390 
Level 2.  To get the Level 3 and 4 you have to drill-down using the 
hyperlinks.

How to do it in python code is beyond my skills at this point.  Get the 
hrefs and load them and parse, then get the next level and load them and 
parse, etc.?


#108232

From	Peter Otten <__peter__@web.de>
Date	2016-05-06 17:44 +0200
Message-ID	<mailman.434.1462549484.32212.python-list@python.org>
In reply to	#108229

DFS wrote:

> There are up to 4 levels of categorization:
 
> http://www.usdirectory.com/cat/g0 shows 21 Level 1 categories, and 390
> Level 2.  To get the Level 3 and 4 you have to drill-down using the
> hyperlinks.
> 
> How to do it in python code is beyond my skills at this point.  Get the
> hrefs and load them and parse, then get the next level and load them and
> parse, etc.?

Yes, that should work ;)


#108242

From	DFS <nospam@dfs.com>
Date	2016-05-06 18:43 -0400
Message-ID	<ngj6fc$r00$1@dont-email.me>
In reply to	#108232

On 5/6/2016 11:44 AM, Peter Otten wrote:
> DFS wrote:
>
>> There are up to 4 levels of categorization:
>
>> http://www.usdirectory.com/cat/g0 shows 21 Level 1 categories, and 390
>> Level 2.  To get the Level 3 and 4 you have to drill-down using the
>> hyperlinks.
>>
>> How to do it in python code is beyond my skills at this point.  Get the
>> hrefs and load them and parse, then get the next level and load them and
>> parse, etc.?
>
> Yes, that should work ;)


How about you do it, and I'll tell you if you did it right?

ha!


#108222

From	alister <alister.ware@ntlworld.com>
Date	2016-05-06 10:01 +0000
Message-ID	<AXZWy.264624$GG.250375@fx36.am4>
In reply to	#108202

On Thu, 05 May 2016 19:31:33 -0400, DFS wrote:

> On 5/5/2016 1:39 AM, Stephen Hansen wrote:
> 
>> Given:
>>
>>>>> input = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs
>>>>> & Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city
>>>>> guide', 'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS &
>>>>> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS',
>>>>> 'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE',
>>>>> 'www.lafitness.com', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS',
>>>>> 'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com',
>>>>> 'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE
>>>>> & PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS &
>>>>> GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS
>>>>> TRAINERS', '5', '4', '3', '2', '1', 'Yellow Pages', 'About Us',
>>>>> 'Contact Us', 'Support', 'Terms of Use', 'Privacy Policy',
>>>>> 'Advertise With Us', 'Add/Update Listing', 'Business Profile Login',
>>>>> 'F.A.Q.']
>>
>> Then:
>>
>>>>> pattern = re.compile(r"^[A-Z\s&]+$")
>>>>> output = [x for x in list if pattern.match(x)]
>>>>> output
> 
>> ['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS &
>> GYMNASIUMS',
>> 'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE',
>> 'GYMNASIUMS',
>> 'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS', 'PERSONAL FITNESS
>> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS
>> PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH
>> CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS']
> 
> 
> Should've looked earlier.  Their master list of categories
> http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes,
> and the ampersands we talked about.
> 
> "OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the
> comma.
> 
> "AUTOMOBILE - DEALERS" gets removed because of the dash.
> 
> I updated your regex and it seems to have fixed it.
> 
> orig: (r"^[A-Z\s&]+$")
> new : (r"^[A-Z\s&,-]+$")
> 
> 
> Thanks again.

It looks to me like this system is trying to prevent SQL injection 
attacks by blacklisting certain characters.
This is not the correct way to block such attacks and is probably not a 
good indicator of the quality of the rest of the application.



-- 
When love is gone, there's always justice.
And when justice is gone, there's always force.
And when force is gone, there's always Mom.
Hi, Mom!
		-- Laurie Anderson


#108161

From	Jussi Piitulainen <jussi.piitulainen@helsinki.fi>
Date	2016-05-05 08:53 +0300
Message-ID	<lf5mvo5dp8t.fsf@ling.helsinki.fi>
In reply to	#108158

DFS writes:

. .

> Want to keep all elements containing only upper case letters or upper
> case letters and ampersand (where ampersand is surrounded by spaces)
>
> Is it easier to extract elements meeting those conditions, or remove
> elements meeting the following conditions:
>
> * elements with a lower-case letter in them
> * elements with a number in them
> * elements with a period in them
>
> ?
>
>
> So far all I figured out is remove items with a period:
> newlist = [ x for x in oldlist if "." not in x ]
>

Either way is easy to approximate with a regex:

import re
upper = re.compile(r'[A-Z &]+')
lower = re.compile(r'[^A-Z &]')
print([datum for datum in data if upper.fullmatch(datum)])
print([datum for datum in data if not lower.search(datum)])

I've skipped testing that the ampersand is between spaces, and I've
skipped the period. Adjust.

This considers only ASCII upper case letters. You can add individual
letters that matter to you, or you can reach for the documentation to
find if there is some generic notation for all upper case letters.

The newer regex package on PyPI supports POSIX character classes like
[:upper:], I think, and there may or may not be notation for Unicode
character categories in re or regex - LU would be Letter, Uppercase.
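
Without leaving the standard library, the same category test can be done per character with unicodedata. A rough sketch for Python 3; the helper name `only_upper` and the sample data are made up for illustration:

```python
import unicodedata

def only_upper(text, extra=" &"):
    """True if text has at least one character outside `extra` and every
    such character is an uppercase letter (Unicode category Lu)."""
    letters = [ch for ch in text if ch not in extra]
    return bool(letters) and all(
        unicodedata.category(ch) == "Lu" for ch in letters)

data = ["HEALTH & FITNESS CLUBS", "ΔΣΘΛ", "ESPA\u00d1OL", "Tweet", "5", " & "]
result = [datum for datum in data if only_upper(datum)]
print(result)
```

This still does not test that the ampersand sits between spaces.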


#108174

From	DFS <nospam@dfs.com>
Date	2016-05-05 08:57 -0400
Message-ID	<ngffpa$8lb$1@dont-email.me>
In reply to	#108161

On 5/5/2016 1:53 AM, Jussi Piitulainen wrote:


> Either way is easy to approximate with a regex:
>
> import re
> upper = re.compile(r'[A-Z &]+')
> lower = re.compile(r'[^A-Z &]')
> print([datum for datum in data if upper.fullmatch(datum)])
> print([datum for datum in data if not lower.search(datum)])

This is similar to Hansen's solution.



> I've skipped testing that the ampersand is between spaces, and I've
> skipped the period. Adjust.

Will do.


> This considers only ASCII upper case letters. You can add individual
> letters that matter to you, or you can reach for the documentation to
> find if there is some generic notation for all upper case letters.
>
> The newer regex package on PyPI supports POSIX character classes like
> [:upper:], I think, and there may or may not be notation for Unicode
> character categories in re or regex - LU would be Letter, Uppercase.

Thanks.


#108162

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2016-05-05 16:04 +1000
Message-ID	<572ae25f$0$2821$c3e8da3$76491128@news.astraweb.com>
In reply to	#108158

On Thursday 05 May 2016 14:58, DFS wrote:

> Want to whittle a list like this:
[...]
> Want to keep all elements containing only upper case letters or upper
> case letters and ampersand (where ampersand is surrounded by spaces)

Start by writing a function or a regex that will distinguish strings that 
match your conditions from those that don't. A regex might be faster, but 
here's a function version.

def isupperalpha(string):
    return string.isalpha() and string.isupper()

def check(string):
    if isupperalpha(string):
        return True
    parts = string.split("&")
    if len(parts) < 2:
        return False
    # Don't strip leading spaces from the start of the string.
    parts[0] = parts[0].rstrip(" ")
    # Or trailing spaces from the end of the string.
    parts[-1] = parts[-1].lstrip(" ")
    # But strip leading and trailing spaces from the middle parts
    # (if any).
    for i in range(1, len(parts)-1):
        parts[i] = parts[i].strip(" ")
    return all(isupperalpha(part) for part in parts)

Now you have two ways of filtering this. The obvious way is to extract 
elements which meet the condition. Here are two ways:

# List comprehension.
newlist = [item for item in oldlist if check(item)]

# Filter, Python 2 version
newlist = filter(check, oldlist)

# Filter, Python 3 version
newlist = list(filter(check, oldlist))

In practice, this is the best (fastest, simplest) way. But if you fear that 
you will run out of memory dealing with absolutely humongous lists with 
hundreds of millions or billions of strings, you can remove items in place:

def remove(func, alist):
    for i in range(len(alist)-1, -1, -1):
        if not func(alist[i]):
            del alist[i]

Note the magic incantation to iterate from the end of the list towards the 
front. If you do it the other way, Bad Things happen. Note that this will 
use less memory than extracting the items, but it will be much slower.
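
To see what the Bad Things look like, here is a small sketch. `remove_forward` is a deliberately broken variant written only for this illustration, and the sample list is made up.

```python
def remove_forward(func, alist):
    # Deliberately broken: each deletion shifts later items one slot
    # left, so the item right after a deleted one is never examined.
    for i, item in enumerate(alist):
        if not func(item):
            del alist[i]

def remove_backward(func, alist):
    # Walking from the end, a deletion only shifts items that have
    # already been checked.
    for i in range(len(alist) - 1, -1, -1):
        if not func(alist[i]):
            del alist[i]

forward = ['x', 'y', 'A', 'z', 'w', 'B']
backward = list(forward)
remove_forward(str.isupper, forward)
remove_backward(str.isupper, backward)
print(forward)   # ['y', 'A', 'w', 'B']  (two lowercase items survive)
print(backward)  # ['A', 'B']
```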

You can combine the best of both worlds. Here is a version that uses a 
temporary list to modify the original in place:

# works in both Python 2 and 3
def remove(func, alist):
    # Modify list in place, the fast way.
    alist[:] = filter(func, alist)
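
A short sketch of why the slice assignment matters: every other reference to the same list sees the change, which plain rebinding would not give you. The names `names` and `alias` and the inline predicate are only for illustration.

```python
def remove(func, alist):
    # Slice assignment replaces the contents of the existing list
    # object instead of creating a new, unrelated list.
    alist[:] = filter(func, alist)

names = ["GYMNASIUMS", "Tweet", "FITNESS CENTERS", "5"]
alias = names  # second reference to the same list object
remove(lambda s: s.replace(" ", "").isalpha() and s.isupper(), names)
print(alias)
```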

-- 
Steve


#108163

From	Stephen Hansen <me+python@ixokai.io>
Date	2016-05-04 23:46 -0700
Message-ID	<mailman.398.1462430769.32212.python-list@python.org>
In reply to	#108162

On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote:
> Start by writing a function or a regex that will distinguish strings that 
> match your conditions from those that don't. A regex might be faster, but 
> here's a function version.
> ... snip ...

Yikes. I'm all for the idea that one shouldn't go to regex when Python's
powerful string type can answer the problem more clearly, but this seems
to go out of its way to do otherwise.

I don't even care about faster: It's overly complicated. Sometimes a
regular expression really is the clearest way to solve a problem.

-- 
Stephen Hansen
  m e @ i x o k a i . i o


#108164

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2016-05-05 17:04 +1000
Message-ID	<572af09d$0$1508$c3e8da3$5496439d@news.astraweb.com>
In reply to	#108163

On Thursday 05 May 2016 16:46, Stephen Hansen wrote:

> On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote:
>> Start by writing a function or a regex that will distinguish strings that
>> match your conditions from those that don't. A regex might be faster, but
>> here's a function version.
>> ... snip ...
> 
> Yikes. I'm all for the idea that one shouldn't go to regex when Python's
> powerful string type can answer the problem more clearly, but this seems
> to go out of its way to do otherwise.
> 
> I don't even care about faster: It's overly complicated. Sometimes a
> regular expression really is the clearest way to solve a problem.

You're probably right, but I find it easier to reason about matching in 
Python rather than the overly terse, cryptic regular expression 
mini-language.

I haven't tested my function version, but I'm 95% sure that it is correct. 
The trickiest part of it is the logic about splitting around ampersands. And 
I'll cheerfully admit that it isn't easy to extend to (say) "ampersand, or 
at signs". But your regex solution:

r"^[A-Z\s&]+$"

is much smaller and more compact, but *wrong*. For instance, your regex 
wrongly accepts both "&&&&&" and "      " as valid strings, and wrongly 
rejects "ΔΣΘΛ". Your Greek customers will be sad...

Oh, I just realised, I should have looked more closely at the examples 
given, because the specification given by DFS does not match the examples. 
DFS says that only uppercase letters and ampersands are allowed, but their 
examples include strings with spaces, e.g. 'FITNESS CENTERS' despite the 
lack of ampersands. (I read the spec literally as spaces only allowed if 
they surround an ampersand.) Oops, mea culpa. That makes the check function 
much simpler and easier to extend:

def check(string):
    string = string.replace("&", "").replace(" ", "")
    return string.isalpha() and string.isupper()

and now I'm 95% confident it is correct without testing, this time for sure!

;-)
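
For the record, a side-by-side of the two approaches on the strings mentioned above. This is a sketch for Python 3, where string literals are Unicode:

```python
import re

pattern = re.compile(r"^[A-Z\s&]+$")

def check(string):
    string = string.replace("&", "").replace(" ", "")
    return string.isalpha() and string.isupper()

# Each row is (string, accepted by the regex, accepted by check).
results = [(s, bool(pattern.match(s)), check(s))
           for s in ["FITNESS CENTERS", "&&&&&", "      ", "ΔΣΘΛ", "Q&A"]]
for row in results:
    print(row)
```

Note that both versions still accept "Q&A", since neither looks at what surrounds the ampersand.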

-- 
Steve


#108165

From	Stephen Hansen <me+python@ixokai.io>
Date	2016-05-05 00:34 -0700
Message-ID	<mailman.401.1462433672.32212.python-list@python.org>
In reply to	#108164

On Thu, May 5, 2016, at 12:04 AM, Steven D'Aprano wrote:
> On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
> > > On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote:
> >> Start by writing a function or a regex that will distinguish strings that
> >> match your conditions from those that don't. A regex might be faster, but
> >> here's a function version.
> >> ... snip ...
> > 
> > Yikes. I'm all for the idea that one shouldn't go to regex when Python's
> > powerful string type can answer the problem more clearly, but this seems
> > to go out of its way to do otherwise.
> > 
> > I don't even care about faster: It's overly complicated. Sometimes a
> > regular expression really is the clearest way to solve a problem.
> 
> You're probably right, but I find it easier to reason about matching in 
> Python rather than the overly terse, cryptic regular expression mini-
> language.
> 
> I haven't tested my function version, but I'm 95% sure that it is correct.
> The trickiest part of it is the logic about splitting around ampersands. And
> I'll cheerfully admit that it isn't easy to extend to (say) "ampersand, or
> at signs". But your regex solution:
> 
> r"^[A-Z\s&]+$"
> 
> is much smaller and more compact, but *wrong*. For instance, your regex 
> wrongly accepts both "&&&&&" and "      " as valid strings, and wrongly 
> rejects "ΔΣΘΛ". Your Greek customers will be sad...

Meh. You have a pedantic definition of wrong. Given the inputs, it
produced right output. Very often that's enough. Perfect is the enemy of
good, it's said. 

There's no situation where "&&&&&" and "     " will exist in the given
dataset, and recognizing that is important. You don't have to account
for every bit of nonsense. 

If the OP needs a Unicode-aware solution, that means redefining "A-Z" as perhaps
"\w" with an isupper call. It's still far simpler than you're suggesting.
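
One possible reading of that suggestion, as a rough sketch rather than a tested solution. The names WORDISH and check_unicode are hypothetical. Note that \w also matches digits and underscores, so this is looser than "letters only".

```python
import re

# Assumption: "A-Z" is swapped for \w, which matches any Unicode word
# character (letters, but also digits and the underscore).
WORDISH = re.compile(r"[\w\s&]+")

def check_unicode(s):
    # fullmatch requires the whole string to fit the character class;
    # isupper() then requires at least one cased character and no
    # lowercase ones.
    return bool(WORDISH.fullmatch(s)) and s.isupper()

for s in ("FITNESS CENTERS", "ΔΣΘΛ", "Fitness", "&&&&&", "      ", "A1 & B_"):
    print(repr(s), check_unicode(s))
```

The isupper call happens to reject the ampersand-only and space-only strings, since they contain no cased characters, but "A1 & B_" still gets through.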

-- 
Stephen Hansen
  m e @ i x o k a i . i o

#108168

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2016-05-05 18:41 +1000
Message-ID	<572b073e$0$1611$c3e8da3$5496439d@news.astraweb.com>
In reply to	#108165

On Thursday 05 May 2016 17:34, Stephen Hansen wrote:


> Meh. You have a pedantic definition of wrong. Given the inputs, it
> produced right output. Very often that's enough. Perfect is the enemy of
> good, it's said.

And this is a *perfect* example of why we have things like this:

http://www.bbc.com/future/story/20160325-the-names-that-break-computer-systems

"Nobody will ever be called Null."

"Nobody has quotation marks in their name."

"Nobody will have a + sign in their email address."

"Nobody has a legal gender other than Male or Female."

"Nobody will lean on the keyboard and enter gobbledygook into our form."

"Nobody will try to write more data than the space they allocated for it."


> There's no situation where "&&&&&" and "     " will exist in the given
> dataset, and recognizing that is important. You don't have to account
> for every bit of nonsense.

Whenever a programmer says "This case will never happen", ten thousand 
computers crash.

http://www.kr41.net/2016/05-03-shit_driven_development.html


-- 
Steven D'Aprano

#108175

From	Random832 <random832@fastmail.com>
Date	2016-05-05 09:13 -0400
Message-ID	<mailman.404.1462454017.32212.python-list@python.org>
In reply to	#108168

On Thu, May 5, 2016, at 04:41, Steven D'Aprano wrote:
> > There's no situation where "&&&&&" and "     " will exist in the given
> > dataset, and recognizing that is important. You don't have to account
> > for every bit of nonsense.
> 
> Whenever a programmer says "This case will never happen", ten thousand 
> computers crash.

What crash can including such an entry in the output list cause?

Should the regex also ensure that the data only includes *english words*
separated by space-ampersand-space?

#108181

From	Steven D'Aprano <steve@pearwood.info>
Date	2016-05-06 03:13 +1000
Message-ID	<572b7f41$0$1598$c3e8da3$5496439d@news.astraweb.com>
In reply to	#108175

On Thu, 5 May 2016 11:13 pm, Random832 wrote:

> On Thu, May 5, 2016, at 04:41, Steven D'Aprano wrote:
>> > There's no situation where "&&&&&" and "     " will exist in the given
>> > dataset, and recognizing that is important. You don't have to account
>> > for every bit of nonsense.
>> 
>> Whenever a programmer says "This case will never happen", ten thousand
>> computers crash.
> 
> What crash can including such an entry in the output list cause?

How do I know? It depends what you do with that list.

But if you assume that your list contains alphabetical strings, and pass it
on to code that expects alphabetical strings, why is it so hard to believe
that it might choke when it receives a non-alphabetical string?
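
A purely hypothetical illustration of that kind of choking. The helper first_word is invented for this sketch and appears nowhere in the thread; it stands in for any later code that assumes every entry contains at least one word.

```python
def first_word(entry):
    # Hypothetical downstream helper: assumes the entry has at least
    # one whitespace-separated word.
    return entry.split()[0]

for entry in ("FITNESS CENTERS", "      "):
    try:
        print(repr(entry), "->", first_word(entry))
    except IndexError as exc:
        # An all-space entry splits into an empty list, so [0] fails.
        print(repr(entry), "-> IndexError:", exc)
```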

> Should the regex also ensure that the data only includes *english words*
> separated by space-ampersand-space?

That wasn't part of the specification. But for some applications, yes, you
should ensure the data includes only English words.

-- 
Steven

#108166

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2016-05-05 17:36 +1000
Message-ID	<572af811$0$1608$c3e8da3$5496439d@news.astraweb.com>
In reply to	#108163

Oh, a further thought...


On Thursday 05 May 2016 16:46, Stephen Hansen wrote:

> On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote:
>> Start by writing a function or a regex that will distinguish strings that
>> match your conditions from those that don't. A regex might be faster, but
>> here's a function version.
>> ... snip ...
> 
> Yikes. I'm all for the idea that one shouldn't go to regex when Python's
> powerful string type can answer the problem more clearly, but this seems
> to go out of its way to do otherwise.
> 
> I don't even care about faster: It's overly complicated. Sometimes a
> regular expression really is the clearest way to solve a problem.

Putting non-ASCII letters aside for the moment, how would you match these 
specs as a regular expression?

- All uppercase ASCII letters (A to Z only), optionally separated into words 
by either a bare ampersand (e.g. "AAA&AAA") or an ampersand with leading and 
trailing spaces (spaces only, not arbitrary whitespace): "AAA   & AAA".

- The number of spaces on either side of the ampersands need not be the 
same: "AAA&   BBB &       CCC" should match.

- Leading or trailing spaces, or spaces not surrounding an ampersand, must 
not match: "AAA BBB" must be rejected.

- Leading or trailing ampersands must also be rejected. This includes the 
case where the string is nothing but ampersands.

- Consecutive ampersands "AAA&&&BBB" and the empty string must be rejected.


I get something like this:

r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"


but it fails on strings like "AA   &  A &  A". What am I doing wrong?


For the record, here's my brief test suite:


import re

def test(pat):
    # Strings that must be rejected.
    for s in ("", " ", "&", "A A", "A&", "&A", "A&&A", "A& &A"):
        assert re.match(pat, s) is None
    # Strings that must be accepted.
    for s in ("A", "A & A", "AA&A", "AA   &  A &  A"):
        assert re.match(pat, s)
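
One plausible explanation, offered as a sketch rather than a definitive answer: every repetition of the group needs its own word on both sides of the ampersand, so a word between two ampersands has to be split across two repetitions, and a single-letter word cannot be split. The pattern REVISED below is an assumption of mine, not something posted in the thread: one word, then zero or more "ampersand plus word" groups.

```python
import re

# The pattern from the post above, unchanged.
ORIGINAL = r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"

# Sketch of an alternative: a word, then any number of "& word" groups,
# each ampersand optionally padded with plain spaces.
REVISED = r"^[A-Z]+(?: *& *[A-Z]+)*$"

reject = ("", " ", "&", "A A", "A&", "&A", "A&&A", "A& &A", "AAA&&&BBB")
accept = ("A", "A & A", "AA&A", "AA   &  A &  A", "AAA&   BBB &       CCC")

def failures(pat):
    """Return the test strings that the pattern classifies wrongly."""
    wrong = [s for s in reject if re.match(pat, s) is not None]
    wrong += [s for s in accept if re.match(pat, s) is None]
    return wrong

print("original:", failures(ORIGINAL))
print("revised: ", failures(REVISED))
```

The original pattern does accept "AAA&   BBB &       CCC", because the regex engine can give "BB" to one repetition and "B" to the next, which is why the problem only shows up with one-letter middle words.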




-- 
Steve
