Best way to clean up list items?

Started by	DFS <nospam@dfs.com>
First post	2016-05-02 12:33 -0400
Last post	2016-05-02 19:30 +0200
Articles	10 — 5 participants

  Best way to clean up list items? DFS <nospam@dfs.com> - 2016-05-02 12:33 -0400
    Re: Best way to clean up list items? Jussi Piitulainen <jussi.piitulainen@helsinki.fi> - 2016-05-02 19:57 +0300
      Re: Best way to clean up list items? justin walters <walters.justin01@gmail.com> - 2016-05-02 10:10 -0700
      Re: Best way to clean up list items? DFS <nospam@dfs.com> - 2016-05-02 14:06 -0400
        Re: Best way to clean up list items? Jussi Piitulainen <jussi.piitulainen@helsinki.fi> - 2016-05-02 21:27 +0300
          Re: Best way to clean up list items? DFS <nospam@dfs.com> - 2016-05-02 15:04 -0400
    Re: Best way to clean up list items? Stephen Hansen <me+python@ixokai.io> - 2016-05-02 10:25 -0700
      Re: Best way to clean up list items? DFS <nospam@dfs.com> - 2016-05-02 14:09 -0400
        Re: Best way to clean up list items? Stephen Hansen <me+python@ixokai.io> - 2016-05-02 11:23 -0700
    Re: Best way to clean up list items? Peter Otten <__peter__@web.de> - 2016-05-02 19:30 +0200

#108018 — Best way to clean up list items?

From	DFS <nospam@dfs.com>
Date	2016-05-02 12:33 -0400
Subject	Best way to clean up list items?
Message-ID	<ng7v9d$ld8$1@dont-email.me>

Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']
Want: list1 = ['Item 1','Item 2']


I wrote this, which works fine, but maybe it can be tidier?

1. list2 = [t.replace("\r\n", "") for t in list1]   #remove \r\n
2. list3 = [t.strip(' ') for t in list2]            #trim whitespace
3. list1  = filter(None, list3)                     #remove empty items


After each step:

1. list2 = ['   Item 1  ','  Item 2  ','  ']   #remove \r\n
2. list3 = ['Item 1','Item 2','']              #trim whitespace
3. list1 = ['Item 1','Item 2']                 #remove empty items


Thanks!

#108020

From	Jussi Piitulainen <jussi.piitulainen@helsinki.fi>
Date	2016-05-02 19:57 +0300
Message-ID	<lf5shy05rfg.fsf@ling.helsinki.fi>
In reply to	#108018

DFS writes:

> Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']
> Want: list1 = ['Item 1','Item 2']
>
>
> I wrote this, which works fine, but maybe it can be tidier?
>
> 1. list2 = [t.replace("\r\n", "") for t in list1]   #remove \r\n
> 2. list3 = [t.strip(' ') for t in list2]            #trim whitespace
> 3. list1  = filter(None, list3)                     #remove empty items
>
> After each step:
>
> 1. list2 = ['   Item 1  ','  Item 2  ','  ']   #remove \r\n
> 2. list3 = ['Item 1','Item 2','']              #trim whitespace
> 3. list1 = ['Item 1','Item 2']                 #remove empty items

Try filter(None, (t.strip() for t in list1)). Called with no argument,
strip() defaults to removing all leading and trailing whitespace,
'\r\n' included.
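
A minimal sketch of that suggestion, assuming Python 3, where filter()
returns a lazy iterator and has to be wrapped in list() to get a list
back (on Python 2 the list() call is harmless). The name `cleaned` is
only illustrative:

```python
list1 = ['\r\n   Item 1  ', '  Item 2  ', '\r\n  ']

# strip() with no argument trims every kind of leading and trailing
# whitespace, so '\r\n' and the spaces are removed in a single pass.
stripped = (t.strip() for t in list1)

# filter(None, ...) keeps only truthy values, which drops the empty strings.
cleaned = list(filter(None, stripped))

print(cleaned)  # ['Item 1', 'Item 2']
```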

Funny-looking data you have.

#108021

From	justin walters <walters.justin01@gmail.com>
Date	2016-05-02 10:10 -0700
Message-ID	<mailman.324.1462209011.32212.python-list@python.org>
In reply to	#108020

On May 2, 2016 10:03 AM, "Jussi Piitulainen" <jussi.piitulainen@helsinki.fi>
wrote:
>
> DFS writes:
>
> > Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']
> > Want: list1 = ['Item 1','Item 2']
> >
> >
> > I wrote this, which works fine, but maybe it can be tidier?
> >
> > 1. list2 = [t.replace("\r\n", "") for t in list1]   #remove \r\n
> > 2. list3 = [t.strip(' ') for t in list2]            #trim whitespace
> > 3. list1  = filter(None, list3)                     #remove empty items
> >
> > After each step:
> >
> > 1. list2 = ['   Item 1  ','  Item 2  ','  ']   #remove \r\n
> > 2. list3 = ['Item 1','Item 2','']              #trim whitespace
> > 3. list1 = ['Item 1','Item 2']                 #remove empty items

You could also try a compiled regex to remove unwanted characters.

Then loop through the list and do a replace for each item.
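
A sketch of what the regex route might look like, assuming the unwanted
characters are just leading and trailing whitespace. The pattern and the
name EDGE_WHITESPACE are illustrative, not something posted in the
thread:

```python
import re

# Matches a run of whitespace at the very start or the very end of a string.
# \s covers spaces, tabs, '\r' and '\n'.
EDGE_WHITESPACE = re.compile(r'^\s+|\s+$')

list1 = ['\r\n   Item 1  ', '  Item 2  ', '\r\n  ']

# Loop through the list and do a replace on each item ...
replaced = [EDGE_WHITESPACE.sub('', t) for t in list1]

# ... then drop the items that ended up empty.
cleaned = [t for t in replaced if t]

print(cleaned)  # ['Item 1', 'Item 2']
```

For this particular data str.strip() does the same job with less
machinery; a regex probably only pays off if whitespace inside the items
has to be normalised as well.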

#108025

From	DFS <nospam@dfs.com>
Date	2016-05-02 14:06 -0400
Message-ID	<ng84oq$dsu$1@dont-email.me>
In reply to	#108020

On 5/2/2016 12:57 PM, Jussi Piitulainen wrote:
> DFS writes:
>
>> Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']
>> Want: list1 = ['Item 1','Item 2']
>>
>>
>> I wrote this, which works fine, but maybe it can be tidier?
>>
>> 1. list2 = [t.replace("\r\n", "") for t in list1]   #remove \r\n
>> 2. list3 = [t.strip(' ') for t in list2]            #trim whitespace
>> 3. list1  = filter(None, list3)                     #remove empty items
>>
>> After each step:
>>
>> 1. list2 = ['   Item 1  ','  Item 2  ','  ']   #remove \r\n
>> 2. list3 = ['Item 1','Item 2','']              #trim whitespace
>> 3. list1 = ['Item 1','Item 2']                 #remove empty items
>
> Try filter(None, (t.strip() for t in list1)). The default.

Works and drops a line of code.  Thx.



> Funny-looking data you have.

I know - sadly, it's actual data:

--------------------------------------------------------------------
from lxml import html
import requests

webpage = "http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=TN&wqhqn=2&qc=Nashville&rg=30&qhqn=restaurant&sb=zipdisc&ap=2"

page  = requests.get(webpage)
tree  = html.fromstring(page.content)
addr1 = tree.xpath('//span[@class="text3"]/text()')
print 'Addresses: ', addr1
--------------------------------------------------------------------

I couldn't figure out a better way to extract it from the HTML (maybe 
XML and DOM?)

#108028

From	Jussi Piitulainen <jussi.piitulainen@helsinki.fi>
Date	2016-05-02 21:27 +0300
Message-ID	<lf5oa8o5n96.fsf@ling.helsinki.fi>
In reply to	#108025

DFS writes:

> On 5/2/2016 12:57 PM, Jussi Piitulainen wrote:
>> DFS writes:
>>
>>> Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']
>>> Want: list1 = ['Item 1','Item 2']

. .

>> Funny-looking data you have.
>
> I know - sadly, it's actual data:
>
> --------------------------------------------------------------------
> from lxml import html
> import requests
>
> webpage = "http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=TN&wqhqn=2&qc=Nashville&rg=30&qhqn=restaurant&sb=zipdisc&ap=2"
>
> page  = requests.get(webpage)
> tree  = html.fromstring(page.content)
> addr1 = tree.xpath('//span[@class="text3"]/text()')
> print 'Addresses: ', addr1
> --------------------------------------------------------------------
>
> I couldn't figure out a better way to extract it from the HTML (maybe
> XML and DOM?)

I should have guessed :) But now I'm a bit worried about those spaces
inside your items. Can it happen that item text is split into strings in
the middle? Then the above sanitation does the wrong thing.

If someone has the right solution, I'm watching, too.
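
One possible way around the split-text worry is to collect text per
span element rather than per text node, so that a <br> or other child
tag inside a span cannot cut an item in two. The sketch below uses only
the standard library and assumes Python 3. The class name
SpanTextCollector and the sample markup are made up for illustration;
the real page structure is not shown in this thread. With lxml the same
idea would be to select the span elements themselves and call
text_content() on each one.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text of every <span class="text3">, one string per span."""

    def __init__(self):
        super().__init__()
        self.items = []    # finished, cleaned-up strings
        self._depth = 0    # > 0 while inside a matching span
        self._parts = []   # text fragments of the current span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:
            self._depth += 1                 # nested span inside a match
        elif dict(attrs).get("class") == "text3":
            self._depth = 1
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Join the fragments and collapse every whitespace run.
                text = " ".join("".join(self._parts).split())
                if text:                     # skip whitespace-only spans
                    self.items.append(text)

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


# Hypothetical markup: one address broken by <br>, one empty span.
sample = (
    '<span class="text3">\r\n   1918 W End Ave,<br>\r\n'
    '   Nashville, TN 37203  </span>'
    '<span class="text3">\r\n   </span>'
)

collector = SpanTextCollector()
collector.feed(sample)
collector.close()
print(collector.items)  # ['1918 W End Ave, Nashville, TN 37203']
```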

#108029

From	DFS <nospam@dfs.com>
Date	2016-05-02 15:04 -0400
Message-ID	<ng884q$sf8$1@dont-email.me>
In reply to	#108028

On 5/2/2016 2:27 PM, Jussi Piitulainen wrote:
> DFS writes:
>
>> On 5/2/2016 12:57 PM, Jussi Piitulainen wrote:
>>> DFS writes:
>>>
>>>> Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']
>>>> Want: list1 = ['Item 1','Item 2']
>
> . .
>
>>> Funny-looking data you have.
>>
>> I know - sadly, it's actual data:
>>
>> --------------------------------------------------------------------
>> from lxml import html
>> import requests
>>
>> webpage = "http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=TN&wqhqn=2&qc=Nashville&rg=30&qhqn=restaurant&sb=zipdisc&ap=2"
>>
>> page  = requests.get(webpage)
>> tree  = html.fromstring(page.content)
>> addr1 = tree.xpath('//span[@class="text3"]/text()')
>> print 'Addresses: ', addr1
>> --------------------------------------------------------------------
>>
>> I couldn't figure out a better way to extract it from the HTML (maybe
>> XML and DOM?)
>
> I should have guessed :) But now I'm a bit worried about those spaces
> inside your items. Can it happen that item text is split into strings in
> the middle?

Meaning split by me, or comes 'malformed' from the data source?


> Then the above sanitation does the wrong thing.
>
> If someone has the right solution, I'm watching, too.


Here's the raw data as stored in the tree:

---------------------------------------------------------------------------
1st page

['\r\n                        ', '\r\n                        1918 W End Ave, Nashville, TN 37203',
 '\r\n                        ', '\r\n                        1806 Hayes St, Nashville, TN 37203',
 '\r\n                        ', '\r\n                        1701 Broadway, Nashville, TN 37203',
 '\r\n                        ', '\r\n                        209 10th Ave S, Nashville, TN 37203',
 '\r\n                        ', '\r\n                        907 20th Ave S, Nashville, TN 37212',
 '\r\n                        ', '\r\n                        911 20th Ave S, Nashville, TN 37212',
 '\r\n                        ', '\r\n                        1722 W End Ave, Nashville, TN 37203',
 '\r\n                        ', '\r\n                        1905 Hayes St, Nashville, TN 37203',
 '\r\n                        ', '\r\n                        2000 W End Ave, Nashville, TN 37203']

---------------------------------------------------------------------------

Next page

['\r\n                        ', '\r\n                        120 19th Ave N, Nashville, TN 37203',
 '\r\n                        ', '\r\n                        1719 W End Ave Ste 101, Nashville, TN 37203',
 '\r\n                        ', '\r\n                        1922 W End Ave, Nashville, TN 37203',
 '\r\n                        ', '\r\n                        909 20th Ave S, Nashville, TN 37212',
 '\r\n                        ', '\r\n                        1807 Church St, Nashville, TN 37203',
 '\r\n                        ', '\r\n                        1721 Church St, Nashville, TN 37203',
 '\r\n                        ', '\r\n                        718 Division St, Nashville, TN 37203',
 '\r\n                        ', '\r\n                        907 12th Ave S, Nashville, TN 37203',
 '\r\n                        ', '\r\n                        204 21st Ave S, Nashville, TN 37203',
 '\r\n                        ', '\r\n                        1811 Division St, Nashville, TN 37203',
 '\r\n                        ', '\r\n                        903 Gleaves St, Nashville, TN 37203',
 '\r\n                        ', '\r\n                        1720 W End Ave Ste 530, Nashville, TN 37203',
 '\r\n                        ', '\r\n                        1200 Division St Ste 100-A, Nashville, TN 37203',
 '\r\n                        ', '\r\n                        422 7th Ave S, Nashville, TN 37203',
 '\r\n                        ', '\r\n                        605 8th Ave S, Nashville, TN 37203']

and so on
---------------------------------------------------------------------------

I've checked a couple hundred addresses visually, and so far I've only 
seen 2 formats:

1. '\r\n            '
2. '\r\n   address  '
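
A rough sketch of how that two-format observation could be checked by
the code instead of by eye, assuming every item really does start with
'\r\n'. The function name clean_addresses and the sample list are
illustrative only:

```python
def clean_addresses(raw_items):
    """Strip each item and drop blanks, rejecting any unexpected format."""
    cleaned = []
    for item in raw_items:
        # Both formats seen so far begin with '\r\n'; anything else is
        # a surprise worth hearing about rather than silently cleaning.
        if not item.startswith('\r\n'):
            raise ValueError('unexpected item format: %r' % (item,))
        text = item.strip()
        if text:                  # format 2: an address
            cleaned.append(text)
        # format 1 (whitespace only) is simply skipped
    return cleaned


sample = ['\r\n            ', '\r\n   1918 W End Ave, Nashville, TN 37203']
print(clean_addresses(sample))  # ['1918 W End Ave, Nashville, TN 37203']
```

If a third format ever shows up, this raises ValueError instead of
passing the item through unnoticed.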

#108022

From	Stephen Hansen <me+python@ixokai.io>
Date	2016-05-02 10:25 -0700
Message-ID	<mailman.325.1462209932.32212.python-list@python.org>
In reply to	#108018

On Mon, May 2, 2016, at 09:33 AM, DFS wrote:
> Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']

I'm curious how you got to this point; it seems like you could solve the
problem where the list is generated.

> Want: list1 = ['Item 1','Item 2']

That said:

list1 = [t.strip() for t in list1 if t and not t.isspace()]

-- 
Stephen Hansen
  m e @ i x o k a i . i o

#108026

From	DFS <nospam@dfs.com>
Date	2016-05-02 14:09 -0400
Message-ID	<ng84uj$emf$1@dont-email.me>
In reply to	#108022

On 5/2/2016 1:25 PM, Stephen Hansen wrote:
> On Mon, May 2, 2016, at 09:33 AM, DFS wrote:
>> Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']
>
> I'm curious how you got to this point, it seems like you can solve the
> problem in how this is generated.

--------------------------------------------------------------------
from lxml import html
import requests

webpage = "http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=TN&wqhqn=2&qc=Nashville&rg=30&qhqn=restaurant&sb=zipdisc&ap=2"

page  = requests.get(webpage)
tree  = html.fromstring(page.content)
addr1 = tree.xpath('//span[@class="text3"]/text()')
print 'Addresses: ', addr1
--------------------------------------------------------------------

I'd prefer to get clean data in the first place, but I don't know a 
better way to extract it from the HTML.

#108027

From	Stephen Hansen <me+python@ixokai.io>
Date	2016-05-02 11:23 -0700
Message-ID	<mailman.328.1462213433.32212.python-list@python.org>
In reply to	#108026

On Mon, May 2, 2016, at 11:09 AM, DFS wrote:
> I'd prefer to get clean data in the first place, but I don't know a 
> better way to extract it from the HTML.

Ah, right. I didn't know you were scraping HTML. Scraping HTML is rarely
clean so you have to do a lot of cleanup.

-- 
Stephen Hansen
  m e @ i x o k a i . i o

#108023

From	Peter Otten <__peter__@web.de>
Date	2016-05-02 19:30 +0200
Message-ID	<mailman.326.1462210254.32212.python-list@python.org>
In reply to	#108018

DFS wrote:

> Have: list1 = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']
> Want: list1 = ['Item 1','Item 2']
> 
> 
> I wrote this, which works fine, but maybe it can be tidier?
> 
> 1. list2 = [t.replace("\r\n", "") for t in list1]   #remove \r\n
> 2. list3 = [t.strip(' ') for t in list2]            #trim whitespace
> 3. list1  = filter(None, list3)                     #remove empty items
> 
> 
> After each step:
> 
> 1. list2 = ['   Item 1  ','  Item 2  ','  ']   #remove \r\n
> 2. list3 = ['Item 1','Item 2','']              #trim whitespace
> 3. list1 = ['Item 1','Item 2']                 #remove empty items
> 
> 
> Thanks!

s.strip() strips all whitespace, so you can combine steps 1 and 2:

>>> items = ['\r\n   Item 1  ','  Item 2  ','\r\n  ']
>>> stripped = (s.strip() for s in items)

The (...) instead of [...] denotes a generator expression, so the iteration 
has not started yet. The final step uses a list comprehension instead of 
filter():

>>> [s for s in stripped if s]
['Item 1', 'Item 2']

That way the same code works with both Python 2 and Python 3. Note that you 
can iterate over the generator expression only once; if you try it again 
you'll end up empty-handed:

>>> [s for s in stripped if s]
[]

If you want to do it in one step here are two options that both involve some 
duplicate work:

>>> [s.strip() for s in items if s and not s.isspace()]
['Item 1', 'Item 2']
>>> [s.strip() for s in items if s.strip()]
['Item 1', 'Item 2']
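
If the duplicate work matters, one way to keep a single expression
without stripping or testing twice is to nest the stripping step inside
the comprehension. This is only a sketch; the names nested and mapped
are illustrative:

```python
items = ['\r\n   Item 1  ', '  Item 2  ', '\r\n  ']

# The inner generator strips each string exactly once; the outer
# comprehension keeps only the non-empty results.
nested = [s for s in (t.strip() for t in items) if s]

# Same idea with map(); str.strip is applied once per item.
mapped = [s for s in map(str.strip, items) if s]

print(nested)  # ['Item 1', 'Item 2']
print(mapped)  # ['Item 1', 'Item 2']
```

Both forms should behave the same on Python 2 and Python 3.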
