filter a list of strings

Started by: <c.buhtz@posteo.jp>
First post: 2015-12-03 02:15 +0100
Last post: 2015-12-03 13:17 +0100
Articles: 13 (10 participants)


Contents

  filter a list of strings <c.buhtz@posteo.jp> - 2015-12-03 02:15 +0100
    Re: filter a list of strings Jussi Piitulainen <harvesting@is.invalid> - 2015-12-03 08:32 +0200
      Re: filter a list of strings <c.buhtz@posteo.jp> - 2015-12-03 10:27 +0100
        Re: filter a list of strings Jussi Piitulainen <harvesting@is.invalid> - 2015-12-03 13:53 +0200
        Re: filter a list of strings Peter Pearson <pkpearson@nowhere.invalid> - 2015-12-05 19:42 +0000
      Re: filter a list of strings Chris Angelico <rosuav@gmail.com> - 2015-12-03 20:40 +1100
      Re: filter a list of strings Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> - 2015-12-03 10:46 +0100
      Re: filter a list of strings Laura Creighton <lac@openend.se> - 2015-12-03 10:53 +0100
      Re: filter a list of strings jmp <jeanmichel@sequans.com> - 2015-12-03 11:03 +0100
      Re: filter a list of strings Peter Otten <__peter__@web.de> - 2015-12-03 11:13 +0100
      Re: filter a list of strings Denis McMahon <denismfmcmahon@gmail.com> - 2015-12-03 14:16 +0000
        Re: filter a list of strings Jussi Piitulainen <harvesting@is.invalid> - 2015-12-03 17:02 +0200
    Re: filter a list of strings Grobu <snailcoder@retrosite.invalid> - 2015-12-03 13:17 +0100

#99930 — filter a list of strings

From: <c.buhtz@posteo.jp>
Date: 2015-12-03 02:15 +0100
Subject: filter a list of strings
Message-ID: <mailman.155.1449122975.14615.python-list@python.org>

I would like to know how this could be done in a more elegant/Pythonic
way.

I have a big list (over 10,000 items) of strings (each 100 to 300
chars long) and want to filter them.

list = .....

for item in list[:]:
  if 'Banana' in item:
     list.remove(item)
  if 'Car' in item:
     list.remove(item)

There are a lot more conditions, of course. This is just example code.
It doesn't look nice to me. Too much redundancy.

Btw: Is it correct to iterate over a copy (list[:]) of that string list
and not the original one?
-- 
GnuPGP-Key ID 0751A8EC



#99934

From: Jussi Piitulainen <harvesting@is.invalid>
Date: 2015-12-03 08:32 +0200
Message-ID: <lf51tb459f2.fsf@ling.helsinki.fi>
In reply to: #99930

<c.buhtz@posteo.jp> writes:

> I would like to know how this could be done more elegant/pythonic.
>
> I have a big list (over 10.000 items) with strings (each 100 to 300
> chars long) and want to filter them.
>
> list = .....
>
> for item in list[:]:
>   if 'Banana' in item:
>      list.remove(item)
>   if 'Car' in item:
>      list.remove(item)
>
> There are a lot of more conditions of course. This is just example
> code.  It doesn't look nice to me. To much redundance.

Yes. The initial copy is redundant and the repeated .remove calls are
not only expensive (quadratic time loop that could have been linear),
they are also incorrect if there are duplicates in the list. You want to
copy and filter in one go:

list = ...
list = [ item for item in list
         if ( 'Banana' not in item and
              'Car' not in item ) ]

It's better to use another name, since "list" is the name of a built-in
function. It may be a good idea to define a complex condition as a
separate function:

def isbad(item):
    return ( 'Banana' in item or
             'Car' in item )

def isgood(item):
    return not isbad(item)

items = ...
items = [ item for item in items if isgood(item) ]

Then there's also filter, which is easy to use now that the condition is
already a named function:

items = list(filter(isgood, items))

> btw: Is it correct to iterate over a copy (list[:]) of that string
> list and not the original one?

I think it's a good idea to iterate over a copy if you are modifying the
original during the iteration, but the above suggestions are better for
other reasons.
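
To make the point about copies concrete, here is a small sketch with
made-up sample data (the strings and variable names are illustrative
assumptions, not from the original post). It suggests that removing
from a list while iterating over that same list skips elements, that
iterating over a copy avoids the skipping, and that two separate
.remove calls fail with ValueError when a single item matches both
conditions:

```python
# Hypothetical sample data, for illustration only.
sample = ['Banana split', 'Banana bread', 'Apple pie']

# 1) Removing while iterating over the same list: after each removal the
#    remaining items shift left, so the item that follows is skipped.
skipping = list(sample)
for item in skipping:
    if 'Banana' in item:
        skipping.remove(item)
# 'Banana bread' was skipped and is still in the list.

# 2) Iterating over a copy visits every item.
complete = list(sample)
for item in complete[:]:
    if 'Banana' in item:
        complete.remove(item)

# 3) Two independent 'if' tests try to remove the same item twice when it
#    matches both conditions; the second .remove raises ValueError.
both = ['Banana Car', 'Apple pie']
error = None
try:
    for item in both[:]:
        if 'Banana' in item:
            both.remove(item)
        if 'Car' in item:
            both.remove(item)
except ValueError as exc:
    error = exc
```

The comprehension and filter versions above sidestep all three issues
because they build a new list instead of mutating the old one.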



#99946

From: <c.buhtz@posteo.jp>
Date: 2015-12-03 10:27 +0100
Message-ID: <mailman.165.1449134847.14615.python-list@python.org>
In reply to: #99934

Thank you for your suggestion. This will help a lot.

On 2015-12-03 08:32 Jussi Piitulainen <harvesting@is.invalid> wrote:
> list = [ item for item in list
>          if ( 'Banana' not in item and
>               'Car' not in item ) ]

I have often seen constructions like this
  x for x in y if ...
But I don't understand that combination of the Python keywords (for,
in, if) I already know. It is too complex for me to imagine what really
happens there.

I understand this
  for x in y:
    if ...

But what about the 'x' in front of all that?
-- 
GnuPGP-Key ID 0751A8EC



#99953

From: Jussi Piitulainen <harvesting@is.invalid>
Date: 2015-12-03 13:53 +0200
Message-ID: <lf5twnzsq8m.fsf@ling.helsinki.fi>
In reply to: #99946

<c.buhtz@posteo.jp> writes:

> Thank you for your suggestion. This will help a lot.
>
> On 2015-12-03 08:32 Jussi Piitulainen wrote:
>> list = [ item for item in list
>>          if ( 'Banana' not in item and
>>               'Car' not in item ) ]
>
> I often saw constructions like this
>   x for x in y if ...
> But I don't understand that combination of the Python keywords (for,
> in, if) I allready know. It is to complex to imagine what there really
> happen.

Others have given the crucial search word, "list comprehension".

The brackets are part of the notation. Without brackets, or grouped in
parentheses, it would be a generator expression, whose value would yield
the items on demand. Curly braces would make it a set or dict
comprehension; the latter also uses a colon.

> I understand this
>   for x in y:
>     if ...
>
> But what is about the 'x' in front of all that?

You can understand the notation as collecting the values from nested
for-loops and conditions, just like you are attempting here, together
with a fresh list that will be the result. The "x" in front can be any
expression involving the loop variables; it corresponds to a
result.append(x) inside the nested loops and conditions. Roughly:

result = []
for x in xs:
    for y in ys:
        if x != y:
            result.append((x,y))
==>

result = [(x,y) for x in xs for y in ys if x != y]

On python.org, this information seems to be in the tutorial but not in
the language reference.
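
As a rough illustration of the bracket variants described above, the
following sketch applies the same "expression for variable in iterable"
shape with different delimiters. The sample words and variable names
are assumptions chosen for the example:

```python
# Hypothetical sample data, for illustration only.
words = ['apple', 'Banana', 'cherry', 'apple']

# Square brackets: a list comprehension, built immediately.
as_list = [w.upper() for w in words]

# Parentheses: a generator expression, yielding items on demand.
as_generator = (w.upper() for w in words)

# Curly braces: a set comprehension (duplicates collapse).
as_set = {w.upper() for w in words}

# Curly braces with a colon: a dict comprehension.
as_dict = {w: len(w) for w in words}

# Nothing is computed by the generator until an item is requested.
first = next(as_generator)
```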



#100044

From: Peter Pearson <pkpearson@nowhere.invalid>
Date: 2015-12-05 19:42 +0000
Message-ID: <dcgt0cF9d3fU2@mid.individual.net>
In reply to: #99946

On Thu, 3 Dec 2015 10:27:19 +0100, <c.buhtz@posteo.jp> wrote:
[snip]
> I often saw constructions like this
>   x for x in y if ...
> But I don't understand that combination of the Python keywords (for,
> in, if) I allready know. It is to complex to imagine what there really
> happen.

Don't give up!  List comprehensions are one of the coolest things
in Python.  Maybe this simple example will make it click for you:

>>> [x**2 for x in [1,2,3,4] if x != 2]
[1, 9, 16]

-- 
To email me, substitute nowhere->runbox, invalid->com.



#99947

From: Chris Angelico <rosuav@gmail.com>
Date: 2015-12-03 20:40 +1100
Message-ID: <mailman.166.1449135660.14615.python-list@python.org>
In reply to: #99934

On Thu, Dec 3, 2015 at 8:27 PM,  <c.buhtz@posteo.jp> wrote:
> Thank you for your suggestion. This will help a lot.
>
> On 2015-12-03 08:32 Jussi Piitulainen <harvesting@is.invalid> wrote:
>> list = [ item for item in list
>>          if ( 'Banana' not in item and
>>               'Car' not in item ) ]
>
> I often saw constructions like this
>   x for x in y if ...
> But I don't understand that combination of the Python keywords (for,
> in, if) I allready know. It is to complex to imagine what there really
> happen.
>
> I understand this
>   for x in y:
>     if ...
>
> But what is about the 'x' in front of all that?

It's called a *list comprehension*. The code Jussi posted is broadly
equivalent to this:

new_list = []
for item in list:
    if ( 'Banana' not in item and
            'Car' not in item ):
        new_list.append(item)
list = new_list  # rebind the name only after the loop has finished

I recently came across this blog post, which visualizes comprehensions
fairly well.

http://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/

The bit at the beginning (before the first 'for') goes inside a
list.append(...) call, and then everything else is basically the same.

ChrisA



#99948

From: Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de>
Date: 2015-12-03 10:46 +0100
Message-ID: <mailman.167.1449136035.14615.python-list@python.org>
In reply to: #99934

On 03.12.2015 10:27, c.buhtz@posteo.jp wrote:
 >
 > I often saw constructions like this
 >    x for x in y if ...
 > But I don't understand that combination of the Python keywords (for,
 > in, if) I allready know. It is to complex to imagine what there really
 > happen.
 >
 > I understand this
 >    for x in y:
 >      if ...
 >
 > But what is about the 'x' in front of all that?
 >

The leading x states which value you want to put in the new list. This
may seem obvious in the simple case, but quite often it's not the
original x's found in y that you want to store, but some
transformation of them, e.g.:

[x**2 for x in y]

is equivalent to:

squares = []
for x in y:
     squares.append(x**2)



#99949

From: Laura Creighton <lac@openend.se>
Date: 2015-12-03 10:53 +0100
Message-ID: <mailman.168.1449136443.14615.python-list@python.org>
In reply to: #99934

In a message of Thu, 03 Dec 2015 10:27:19 +0100, c.buhtz@posteo.jp writes:
>Thank you for your suggestion. This will help a lot.
>
>On 2015-12-03 08:32 Jussi Piitulainen <harvesting@is.invalid> wrote:
>> list = [ item for item in list
>>          if ( 'Banana' not in item and
>>               'Car' not in item ) ]
>
>I often saw constructions like this
>  x for x in y if ...
>But I don't understand that combination of the Python keywords (for,
>in, if) I allready know. It is to complex to imagine what there really
>happen.
>
>I understand this
>  for x in y:
>    if ...
>
>But what is about the 'x' in front of all that?

This is a list comprehension.
see: https://docs.python.org/2/tutorial/datastructures.html#list-comprehensions

But I would solve your problem like this:

things_I_do_not_want = ['Car', 'Banana']  # add all of them here
things_I_want = []

for item in list_of_everything_I_started_with:
    if item not in things_I_do_not_want:
        things_I_want.append(item)

Laura



#99951

From: jmp <jeanmichel@sequans.com>
Date: 2015-12-03 11:03 +0100
Message-ID: <mailman.169.1449137041.14615.python-list@python.org>
In reply to: #99934

On 12/03/2015 10:27 AM, c.buhtz@posteo.jp wrote:
> I often saw constructions like this
>    x for x in y if ...
> But I don't understand that combination of the Python keywords (for,
> in, if) I allready know. It is to complex to imagine what there really
> happen.
>
> I understand this
>    for x in y:
>      if ...
>
> But what is about the 'x' in front of all that?
>

I'd advise you to insist on understanding this construct, as it is a
very common (and useful) one in Python. It's a list comprehension; you
can google it to get some clues about it.

consider this example
[2*i for i in [0,1,2,3,4] if i%2] == [2,6]

you can split it in 3 parts:
1/ for i in [0,1,2,3,4]
2/ if i%2
3/ 2*i

1/ I'm assuming you understand this one
2/ this is the filter part
3/ this is the mapping part, it applies a function to each element


To go back to your question "what is about the 'x' in front of all
that": the x is the mapping part, but the function applied is the
identity function, which simply keeps the element as is.

# map each element, no filter
[2*i for i in [0,1,2,3,4]] == [0, 2, 4, 6, 8]

# no mapping, keeping only odd elements
[i for i in [0,1,2,3,4] if i%2] == [1,3]
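
If it helps, the filter part and the mapping part can also be spelled
out with the built-in filter() and map() functions. This is only a
sketch for comparison (the variable names are made up for the
example); the comprehension is usually the more readable form:

```python
numbers = [0, 1, 2, 3, 4]

# Comprehension: filter (if i % 2) and mapping (2 * i) in one expression.
with_comprehension = [2 * i for i in numbers if i % 2]

# Same result, with the two parts as separate built-in calls.
# filter() keeps the odd elements, map() doubles each of them.
with_builtins = list(map(lambda i: 2 * i, filter(lambda i: i % 2, numbers)))
```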

JM



#99952

From: Peter Otten <__peter__@web.de>
Date: 2015-12-03 11:13 +0100
Message-ID: <mailman.170.1449137650.14615.python-list@python.org>
In reply to: #99934

Laura Creighton wrote:

> In a message of Thu, 03 Dec 2015 10:27:19 +0100, c.buhtz@posteo.jp writes:
>>Thank you for your suggestion. This will help a lot.
>>
>>On 2015-12-03 08:32 Jussi Piitulainen <harvesting@is.invalid> wrote:
>>> list = [ item for item in list
>>>          if ( 'Banana' not in item and
>>>               'Car' not in item ) ]
>>
>>I often saw constructions like this
>>  x for x in y if ...
>>But I don't understand that combination of the Python keywords (for,
>>in, if) I allready know. It is to complex to imagine what there really
>>happen.
>>
>>I understand this
>>  for x in y:
>>    if ...
>>
>>But what is about the 'x' in front of all that?
> 
> This is a list comprehension.
> see:
> https://docs.python.org/2/tutorial/datastructures.html#list-comprehensions
> 
> But I would solve your problem like this:
> 
> things_I_do_not_want = ['Car', 'Banana', <add all of them here>]
> things_I_want = []
> 
> for item in list_of_everything_I_started_with:
>     if item not in things_I_do_not_want:
>        things_I_want.append(item)

Note that unlike the original code your variant will not reject
"Blue Banana". If the OP wants to preserve the '"Banana" in item' test he 
can use

for item in list_of_everything_I_started_with:
    for unwanted in things_I_do_not_want:
        if unwanted in item:
            break
    else: # executed unless the for loop exits with break
        things_I_want.append(item)

or

things_I_want = [
    item for item in list_of_everything_I_started_with
    if not any(unwanted in item for unwanted in things_I_do_not_want)
]
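
For anyone who wants to try both variants, here is a self-contained
sketch with invented sample data (the strings are assumptions, chosen
to show that the test is a case-sensitive substring match):

```python
# Hypothetical sample data, for illustration only.
things_I_do_not_want = ['Car', 'Banana']
list_of_everything_I_started_with = [
    'Blue Banana', 'Red Car', 'Green Apple', 'Carpet', 'car wash',
]

# Variant 1: for/else. The else branch runs only if no break happened.
wanted_loop = []
for item in list_of_everything_I_started_with:
    for unwanted in things_I_do_not_want:
        if unwanted in item:
            break
    else:
        wanted_loop.append(item)

# Variant 2: comprehension with any().
wanted_any = [
    item for item in list_of_everything_I_started_with
    if not any(unwanted in item for unwanted in things_I_do_not_want)
]
```

Note that 'Carpet' is rejected because it contains 'Car', while
'car wash' is kept because the match is case-sensitive.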



#99959

From: Denis McMahon <denismfmcmahon@gmail.com>
Date: 2015-12-03 14:16 +0000
Message-ID: <n3pis2$aa6$1@dont-email.me>
In reply to: #99934

On Thu, 03 Dec 2015 08:32:49 +0200, Jussi Piitulainen wrote:

> def isbad(item):
>     return ( 'Banana' in item or
>              'Car' in item )
>
> def isgood(item)
>     return not isbad(item)

badthings = [ 'Banana', 'Car', ........]

def isgood(item):
    for thing in badthings:
        if thing in item:
            return False
    return True

-- 
Denis McMahon, denismfmcmahon@gmail.com



#99960

From: Jussi Piitulainen <harvesting@is.invalid>
Date: 2015-12-03 17:02 +0200
Message-ID: <lf5oae7r2xc.fsf@ling.helsinki.fi>
In reply to: #99959

Denis McMahon writes:

> On Thu, 03 Dec 2015 08:32:49 +0200, Jussi Piitulainen wrote:
>
>> def isbad(item):
>>     return ( 'Banana' in item or
>>              'Car' in item )
>>
>> def isgood(item)
>>     return not isbad(item)
>
> badthings = [ 'Banana', 'Car', ........]
>
> def isgood(item)
>     for thing in badthings:
>         if thing in item:
>             return False
>     return True

As long as all conditions are of that shape.
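
If the conditions are not all of the "substring in item" shape, one
possible generalization is a list of predicate functions instead of a
list of strings. This is only a sketch; the particular conditions and
the sample items below are invented for illustration:

```python
import re

# Hypothetical conditions of different shapes; each takes an item and
# returns a truthy value when the item should be rejected.
bad_conditions = [
    lambda item: 'Banana' in item,       # substring test
    lambda item: item.startswith('Car'), # prefix test
    lambda item: len(item) > 20,         # length test
    re.compile(r'\d{4}').search,         # regular expression test
]

def isgood(item):
    # An item is good if none of the conditions flags it.
    return not any(condition(item) for condition in bad_conditions)

# Hypothetical sample data.
items = [
    'Banana bread',
    'Carrot cake',
    'Apple pie',
    'Model 2015 edition',
    'A very long description of nothing',
]
good_items = [item for item in items if isgood(item)]
```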



#99954

From: Grobu <snailcoder@retrosite.invalid>
Date: 2015-12-03 13:17 +0100
Message-ID: <n3pbns$i4e$1@dont-email.me>
In reply to: #99930

On 03/12/15 02:15, c.buhtz@posteo.jp wrote:
> I would like to know how this could be done more elegant/pythonic.
>
> I have a big list (over 10.000 items) with strings (each 100 to 300
> chars long) and want to filter them.
>
> list = .....
>
> for item in list[:]:
>    if 'Banana' in item:
>       list.remove(item)
>    if 'Car' in item:
>       list.remove(item)
>
> There are a lot of more conditions of course. This is just example code.
> It doesn't look nice to me. To much redundance.
>
> btw: Is it correct to iterate over a copy (list[:]) of that string list
> and not the original one?
>

No idea how 'Pythonic' this would be considered, but you could use a
combination of filter() with a regular expression:

# ------------------------------------------------------------------
import re

list = ...

pattern = re.compile( r'banana|car', re.I )
filtered_list = filter( lambda line: not pattern.search(line), list )
# ------------------------------------------------------------------
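
Two caveats may be worth keeping in mind with this approach. In
Python 3, filter() returns a lazy iterator rather than a list, and a
variable named "list" shadows the built-in needed to convert it.
Also, re.I makes the match case-insensitive, which the original
'Banana' in item test is not. A possible variant, with assumed names
(unwanted, lines) and invented sample data, builds the pattern from a
list of words and escapes them so they are matched literally:

```python
import re

# Hypothetical word list and sample data, for illustration only.
unwanted = ['Banana', 'Car']
lines = ['Blue banana', 'Red CAR', 'Green apple']

# re.escape guards against words containing regex metacharacters.
# re.I keeps the case-insensitive matching of the snippet above;
# drop it to mimic the original case-sensitive substring test.
pattern = re.compile('|'.join(re.escape(word) for word in unwanted), re.I)

# In Python 3 this is a lazy filter object, not a list.
lazy = filter(lambda line: not pattern.search(line), lines)

# Materialize it with the built-in list (not shadowed in this sketch).
filtered_lines = list(lazy)
```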

HTH


