

Groups > comp.lang.python > #34226 > unrolled thread

Good use for itertools.dropwhile and itertools.takewhile

Started by: Nick Mellor <thebalancepro@gmail.com>
First post: 2012-12-04 05:57 -0800
Last post: 2012-12-06 13:29 -0800
Articles: 20 on this page of 38, 12 participants



Contents

  Good use for itertools.dropwhile and itertools.takewhile Nick Mellor <thebalancepro@gmail.com> - 2012-12-04 05:57 -0800
    Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-04 14:23 +0000
      Re: Good use for itertools.dropwhile and itertools.takewhile Nick Mellor <thebalancepro@gmail.com> - 2012-12-04 06:47 -0800
        Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-04 15:17 +0000
    Re: Good use for itertools.dropwhile and itertools.takewhile Vlastimil Brom <vlastimil.brom@gmail.com> - 2012-12-04 15:31 +0100
      Re: Good use for itertools.dropwhile and itertools.takewhile Nick Mellor <thebalancepro@gmail.com> - 2012-12-04 07:24 -0800
        Re: Good use for itertools.dropwhile and itertools.takewhile Vlastimil Brom <vlastimil.brom@gmail.com> - 2012-12-04 22:08 +0100
      Re: Good use for itertools.dropwhile and itertools.takewhile Nick Mellor <thebalancepro@gmail.com> - 2012-12-04 07:24 -0800
        Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-04 18:26 +0000
    Re: Good use for itertools.dropwhile and itertools.takewhile Alexander Blinne <news@blinne.net> - 2012-12-04 18:18 +0100
      Re: Good use for itertools.dropwhile and itertools.takewhile DJC <djc@news.invalid> - 2012-12-04 18:28 +0000
        Re: Good use for itertools.dropwhile and itertools.takewhile Alexander Blinne <news@blinne.net> - 2012-12-04 19:48 +0100
          Re: Good use for itertools.dropwhile and itertools.takewhile Ian Kelly <ian.g.kelly@gmail.com> - 2012-12-04 12:37 -0700
            Re: Good use for itertools.dropwhile and itertools.takewhile Alexander Blinne <news@blinne.net> - 2012-12-04 21:33 +0100
            Re: Good use for itertools.dropwhile and itertools.takewhile Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-12-04 21:13 +0000
          Re: Good use for itertools.dropwhile and itertools.takewhile MRAB <python@mrabarnett.plus.com> - 2012-12-04 20:17 +0000
    Re: Good use for itertools.dropwhile and itertools.takewhile Terry Reedy <tjreedy@udel.edu> - 2012-12-04 15:44 -0500
      Re: Good use for itertools.dropwhile and itertools.takewhile Nick Mellor <thebalancepro@gmail.com> - 2012-12-04 17:17 -0800
        Re: Good use for itertools.dropwhile and itertools.takewhile Chris Angelico <rosuav@gmail.com> - 2012-12-06 00:45 +1100
          Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-05 14:34 +0000
            Re: Good use for itertools.dropwhile and itertools.takewhile Ian Kelly <ian.g.kelly@gmail.com> - 2012-12-05 08:33 -0700
              Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-05 16:11 +0000
        Re: Good use for itertools.dropwhile and itertools.takewhile Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-12-05 15:32 +0000
        Re: Good use for itertools.dropwhile and itertools.takewhile Ian Kelly <ian.g.kelly@gmail.com> - 2012-12-05 09:16 -0700
        Re: Good use for itertools.dropwhile and itertools.takewhile MRAB <python@mrabarnett.plus.com> - 2012-12-05 17:57 +0000
      Re: Good use for itertools.dropwhile and itertools.takewhile Nick Mellor <thebalancepro@gmail.com> - 2012-12-04 17:17 -0800
        Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-05 13:29 +0000
          Re: Good use for itertools.dropwhile and itertools.takewhile Nick Mellor <thebalancepro@gmail.com> - 2012-12-05 09:04 -0800
            Re: Good use for itertools.dropwhile and itertools.takewhile MRAB <python@mrabarnett.plus.com> - 2012-12-05 17:57 +0000
            Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-05 18:16 +0000
              Re: Good use for itertools.dropwhile and itertools.takewhile Nick Mellor <thebalancepro@gmail.com> - 2012-12-05 11:01 -0800
                Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-05 20:13 +0000
                Re: Good use for itertools.dropwhile and itertools.takewhile Vlastimil Brom <vlastimil.brom@gmail.com> - 2012-12-05 22:36 +0100
                  Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-06 13:06 +0000
                    Re: Good use for itertools.dropwhile and itertools.takewhile Vlastimil Brom <vlastimil.brom@gmail.com> - 2012-12-06 15:12 +0100
            Re: Good use for itertools.dropwhile and itertools.takewhile Alexander Blinne <news@blinne.net> - 2012-12-06 14:40 +0100
    Re: Good use for itertools.dropwhile and itertools.takewhile Terry Reedy <tjreedy@udel.edu> - 2012-12-04 17:21 -0500
    Re: Good use for itertools.dropwhile and itertools.takewhile Paul Rubin <no.email@nospam.invalid> - 2012-12-06 13:29 -0800



#34226 — Good use for itertools.dropwhile and itertools.takewhile

From: Nick Mellor <thebalancepro@gmail.com>
Date: 2012-12-04 05:57 -0800
Subject: Good use for itertools.dropwhile and itertools.takewhile
Message-ID: <b80f3ab3-ef81-4806-86db-efd5800d4bb3@googlegroups.com>

Hi,

I came across itertools.dropwhile only today, then shortly afterwards found Raymond Hettinger wondering, in 2007, whether to drop [sic] dropwhile and takewhile from the itertools module.

Fate of itertools.dropwhile() and itertools.takewhile() - Python
bytes.com
http://bit.ly/Vi2PqP

Almost nobody else of the 18 respondents seemed to be using them.

And then 2 hours later, a use case came along. I think. Anyone have any better solutions?

I have a file full of things like this:

"CAPSICUM RED fresh from Queensland"

Product names (all caps, at start of string) and descriptions (mixed case, to end of string) all muddled up in the same field. And I need to split them into two fields. Note that if the text had said:

"CAPSICUM RED fresh from QLD"

I would want QLD in the description, not shunted forwards and put in the product name. So (uncontrived) list comprehensions and regex's are out.

I want to split the above into:

("CAPSICUM RED", "fresh from QLD")

Enter dropwhile and takewhile. 6 lines later:

from itertools import takewhile, dropwhile
def split_product_itertools(s):
    words = s.split()
    allcaps = lambda word: word == word.upper()
    product, description = takewhile(allcaps, words), dropwhile(allcaps, words)
    return " ".join(product), " ".join(description)


When I tried to refactor this code to use while or for loops, I couldn't find any way that felt shorter or more pythonic:

(9 lines: using for)

def split_product_1(s):
    words = s.split()
    product = []
    for word in words:
        if word == word.upper():
            product.append(word)
        else:
            break
    return " ".join(product), " ".join(words[len(product):])


(12 lines: using while)

def split_product_2(s):
    words = s.split()
    i = 0
    product = []
    while 1:
        word = words[i]
        if word == word.upper():
            product.append(word)
            i += 1
        else:
            break
    return " ".join(product), " ".join(words[i:])


Any thoughts?

Nick



#34228

From: Neil Cerutti <neilc@norwich.edu>
Date: 2012-12-04 14:23 +0000
Message-ID: <ai6fb6Fk3vkU7@mid.individual.net>
In reply to: #34226

On 2012-12-04, Nick Mellor <thebalancepro@gmail.com> wrote:
> I have a file full of things like this:
>
> "CAPSICUM RED fresh from Queensland"
>
> Product names (all caps, at start of string) and descriptions
> (mixed case, to end of string) all muddled up in the same
> field. And I need to split them into two fields. Note that if
> the text had said:
>
> "CAPSICUM RED fresh from QLD"
>
> I would want QLD in the description, not shunted forwards and
> put in the product name. So (uncontrived) list comprehensions
> and regex's are out.
>
> I want to split the above into:
>
> ("CAPSICUM RED", "fresh from QLD")
>
> Enter dropwhile and takewhile. 6 lines later:
>
> from itertools import takewhile, dropwhile
> def split_product_itertools(s):
>     words = s.split()
>     allcaps = lambda word: word == word.upper()
>     product, description = takewhile(allcaps, words), dropwhile(allcaps, words)
>     return " ".join(product), " ".join(description)
>
> When I tried to refactor this code to use while or for loops, I
> couldn't find any way that felt shorter or more pythonic:

I'm really tempted to import re, and that means takewhile and
dropwhile need to stay. ;)

But seriously, this is a quick implementation of my first thought.

import string
description = s.lstrip(string.ascii_uppercase + ' ')
product = s[:-len(description)-1]

-- 
Neil Cerutti



#34232

From: Nick Mellor <thebalancepro@gmail.com>
Date: 2012-12-04 06:47 -0800
Message-ID: <2152911e-50a0-42aa-8956-5eb96803c7c1@googlegroups.com>
In reply to: #34228

Hi Neil,

Nice! But fails if the first word of the description starts with a capital letter.
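
A quick check of that failure, as a sketch: it reuses the two lines from the earlier post unchanged, and the input string with a capitalised "Fresh" is made up for illustration.

```python
import string

# Hypothetical input: the description starts with a capital letter.
s = "CAPSICUM RED Fresh from Queensland"

# lstrip() removes every leading character found in the given set,
# so it also eats the "F" of "Fresh".
description = s.lstrip(string.ascii_uppercase + " ")
product = s[:-len(description) - 1]

print(repr(product), repr(description))
```

The description loses its first letter, and the product name keeps a trailing space because the slice assumed the description started right after a single space.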

Nick




#34233

From: Neil Cerutti <neilc@norwich.edu>
Date: 2012-12-04 15:17 +0000
Message-ID: <ai6ifcFlm5qU1@mid.individual.net>
In reply to: #34232

On 2012-12-04, Nick Mellor <thebalancepro@gmail.com> wrote:
> Hi Neil,
>
> Nice! But fails if the first word of the description starts
> with a capital letter.

Darn edge cases.

-- 
Neil Cerutti



#34229

From: Vlastimil Brom <vlastimil.brom@gmail.com>
Date: 2012-12-04 15:31 +0100
Message-ID: <mailman.461.1354631511.29569.python-list@python.org>
In reply to: #34226

2012/12/4 Nick Mellor <thebalancepro@gmail.com>:
> Hi,
>
> I came across itertools.dropwhile only today, then shortly afterwards found Raymond Hettinger wondering, in 2007, whether to drop [sic] dropwhile and takewhile from the itertools module.
>
> Fate of itertools.dropwhile() and itertools.takewhile() - Python
> bytes.com
> http://bit.ly/Vi2PqP
>
> Almost nobody else of the 18 respondents seemed to be using them.
>
> And then 2 hours later, a use case came along. I think. Anyone have any better solutions?
>
> I have a file full of things like this:
>
> "CAPSICUM RED fresh from Queensland"
>
> Product names (all caps, at start of string) and descriptions (mixed case, to end of string) all muddled up in the same field. And I need to split them into two fields. Note that if the text had said:
>
> "CAPSICUM RED fresh from QLD"
>
> I would want QLD in the description, not shunted forwards and put in the product name. So (uncontrived) list comprehensions and regex's are out.
>
> I want to split the above into:
>
> ("CAPSICUM RED", "fresh from QLD")
>
> Enter dropwhile and takewhile. 6 lines later:
>
> from itertools import takewhile, dropwhile
> def split_product_itertools(s):
>     words = s.split()
>     allcaps = lambda word: word == word.upper()
>     product, description = takewhile(allcaps, words), dropwhile(allcaps, words)
>     return " ".join(product), " ".join(description)
>
>
> When I tried to refactor this code to use while or for loops, I couldn't find any way that felt shorter or more pythonic:
>
> (9 lines: using for)
>
> def split_product_1(s):
>     words = s.split()
>     product = []
>     for word in words:
>         if word == word.upper():
>             product.append(word)
>         else:
>             break
>     return " ".join(product), " ".join(words[len(product):])
>
>
> (12 lines: using while)
>
> def split_product_2(s):
>     words = s.split()
>     i = 0
>     product = []
>     while 1:
>         word = words[i]
>         if word == word.upper():
>             product.append(word)
>             i += 1
>         else:
>             break
>     return " ".join(product), " ".join(words[i:])
>
>
> Any thoughts?
>
> Nick
> --
> http://mail.python.org/mailman/listinfo/python-list

Hi,
the regex approach doesn't actually seem to be very complex, given the
mentioned specification, e.g.

>>> import re
>>> re.findall(r"(?m)^([A-Z\s]+) (.+)$", "CAPSICUM RED fresh from QLD\nCAPSICUM RED fresh from Queensland")
[('CAPSICUM RED', 'fresh from QLD'), ('CAPSICUM RED', 'fresh from Queensland')]
>>>

(It might be necessary to account for some punctuation, whitespace etc. too.)

hth,
  vbr



#34235

From: Nick Mellor <thebalancepro@gmail.com>
Date: 2012-12-04 07:24 -0800
Message-ID: <d06616b9-20f8-4390-ac28-1ad0e49ee018@googlegroups.com>
In reply to: #34229

I love the way you guys can write a line of code that does the same as 20 of mine :)

I can turn up the heat on your regex by feeding it a null description or multiple white space (both occur in the original file). I'm sure you'd adjust, but at the cost of a more complex regex.

Meanwhile takewhile and dropwhile are behaving themselves impeccably but my while loop has fallen over.
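
Both edge cases can be checked directly. The sketch below copies split_product_itertools and split_product_2 from the first post unchanged; the two test strings are made up for illustration.

```python
from itertools import dropwhile, takewhile


def split_product_itertools(s):
    words = s.split()  # split() with no argument collapses runs of white space
    allcaps = lambda word: word == word.upper()
    product, description = takewhile(allcaps, words), dropwhile(allcaps, words)
    return " ".join(product), " ".join(description)


def split_product_2(s):
    words = s.split()
    i = 0
    product = []
    while 1:
        word = words[i]  # runs past the end when every word is all caps
        if word == word.upper():
            product.append(word)
            i += 1
        else:
            break
    return " ".join(product), " ".join(words[i:])


# Null description: the itertools version returns an empty second field.
print(split_product_itertools("CAPSICUM RED"))

# Multiple white space: collapsed to single spaces by split()/join().
print(split_product_itertools("CAPSICUM    RED   fresh from    QLD"))

# The while-loop version raises IndexError on the null description.
try:
    split_product_2("CAPSICUM RED")
except IndexError as exc:
    print("while loop fails:", exc)
```

Note that the itertools version normalises white space rather than preserving it, which may or may not be wanted.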

Best,

Nick



#34256

From: Vlastimil Brom <vlastimil.brom@gmail.com>
Date: 2012-12-04 22:08 +0100
Message-ID: <mailman.477.1354655318.29569.python-list@python.org>
In reply to: #34235

2012/12/4 Nick Mellor <thebalancepro@gmail.com>:
> I love the way you guys can write a line of code that does the same as 20 of mine :)
> I can turn up the heat on your regex by feeding it a null description or multiple white space (both in the original file.) I'm sure you'd adjust, but at the cost of a more complex regex.
> Meanwhile takewhile and dropwhile are behaving themselves impeccably but my while loop has fallen over.
>
> Best,
> Nick
>> [...]
> --

Hi,
well, for what it's worth, both cases could be addressed quite
easily, with little added complexity - e.g.: make the description part
optional, allow multiple whitespace and enforce a word boundary after
the product name in order to get rid of the trailing whitespace in it:

>>> re.findall(r"(?m)^([A-Z\s]+\b)(?:\s+(.*))?$", "CAPSICUM RED fresh from QLD\nCAPSICUM    RED   fresh from    Queensland\nCAPSICUM RED")
[('CAPSICUM RED', 'fresh from QLD'), ('CAPSICUM    RED', 'fresh from    Queensland'), ('CAPSICUM RED', '')]
>>>

However, it's certainly preferable to use a solution you are more
comfortable with, e.g. the itertools one...

regards,
   vbr



#34237

From: Nick Mellor <thebalancepro@gmail.com>
Date: 2012-12-04 07:24 -0800
Message-ID: <mailman.465.1354635214.29569.python-list@python.org>
In reply to: #34229

I love the way you guys can write a line of code that does the same as 20 of mine :)

I can turn up the heat on your regex by feeding it a null description or multiple white space (both occur in the original file). I'm sure you'd adjust, but at the cost of a more complex regex.

Meanwhile takewhile and dropwhile are behaving themselves impeccably but my while loop has fallen over.

Best,

Nick



#34243

From: Neil Cerutti <neilc@norwich.edu>
Date: 2012-12-04 18:26 +0000
Message-ID: <ai6thvFo9u6U1@mid.individual.net>
In reply to: #34237

On 2012-12-04, Nick Mellor <thebalancepro@gmail.com> wrote:
> I love the way you guys can write a line of code that does the
> same as 20 of mine :)
>
> I can turn up the heat on your regex by feeding it a null
> description or multiple white space (both in the original
> file.) I'm sure you'd adjust, but at the cost of a more complex
> regex.

A re.split should be able to handle this without too much hassle.
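
A sketch of how a re.split version might look; this is not code from the thread. The function name split_product_re and the pattern are assumptions: the split happens at the first run of white space followed by a word containing a lowercase letter, the lowercase test is ASCII-only ([a-z]), and the input is assumed to start with at least one product word.

```python
import re


def split_product_re(s):
    # Split once, at the first white-space run whose following word
    # contains a lowercase ASCII letter (assumed start of the description).
    parts = re.split(r"\s+(?=\S*[a-z])", s.strip(), maxsplit=1)
    product = parts[0]
    # No split point found means there is no description.
    description = parts[1] if len(parts) > 1 else ""
    return product, description


print(split_product_re("CAPSICUM RED fresh from QLD"))
print(split_product_re("CAPSICUM RED"))
print(split_product_re("CAPSICUM    RED   Fresh from    Queensland"))
```

Unlike the split()/join() approaches, this keeps any unusual spacing inside each field intact.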

The simplicity of my two-line version will evaporate pretty
quickly to compensate for edge cases.

Here's one that can handle one of the edge cases you mention, but
it's hardly any shorter than what you had, and it doesn't
preserve non-standard white space, like double spaces.

def prod_desc(s):
    """split s into product name and product description. Product
    name is a series of one or more capitalized words followed
    by white space. Everything after the trailing white space is
    the product description.

    >>> prod_desc("CAR FIFTY TWO Chrysler LeBaron.")
    ['CAR FIFTY TWO', 'Chrysler LeBaron.']
    """
    prod = []
    desc = []
    target = prod
    for word in s.split():
        if target is prod and not word.isupper():
            target = desc
        target.append(word)
    return [' '.join(prod), ' '.join(desc)]

When str methods fail I'll usually write my own parser before
turning to re. The following is no longer nice looking at all.

def prod_desc(s):
    """split s into product name and product description. Product
    name is a series of one or more capitalized words followed
    by white space. Everything after the trailing white space is
    the product description.

    >>> prod_desc("CAR FIFTY TWO Chrysler LeBaron.")
    ['CAR FIFTY TWO', 'Chrysler LeBaron.']

    >>> prod_desc("MR.  JONESEY   Saskatchewan's finest")
    ['MR.  JONESEY', "Saskatchewan's finest"]
    """
    i = 0
    while not s[i].islower():
        i += 1
    i -= 1
    while not s[i].isspace():
        i -= 1
    start_desc = i+1
    while s[i].isspace():
        i -= 1
    end_prod = i+1
    return [s[:end_prod], s[start_desc:]]

-- 
Neil Cerutti



#34240

From: Alexander Blinne <news@blinne.net>
Date: 2012-12-04 18:18 +0100
Message-ID: <50be3049$0$9517$9b4e6d93@newsspool1.arcor-online.net>
In reply to: #34226

Another neat solution with a little help from

http://stackoverflow.com/questions/1701211/python-return-the-index-of-the-first-element-of-a-list-which-makes-a-passed-fun

>>> def split_product(p):
...     w = p.split(" ")
...     j = (i for i,v in enumerate(w) if v.upper() != v).next()
...     return " ".join(w[:j]), " ".join(w[j:])

Greetings



#34244

From: DJC <djc@news.invalid>
Date: 2012-12-04 18:28 +0000
Message-ID: <k9lfd0$evp$1@dont-email.me>
In reply to: #34240

On 04/12/12 17:18, Alexander Blinne wrote:
> Another neat solution with a little help from
>
> http://stackoverflow.com/questions/1701211/python-return-the-index-of-the-first-element-of-a-list-which-makes-a-passed-fun
>
> >>> def split_product(p):
> ...     w = p.split(" ")
> ...     j = (i for i,v in enumerate(w) if v.upper() != v).next()
> ...     return " ".join(w[:j]), " ".join(w[j:])
>
Python 2.7.3 (default, Sep 26 2012, 21:51:14)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
 >>> w1 = "CAPSICUM RED Fresh from Queensland"
 >>> w1.split()
['CAPSICUM', 'RED', 'Fresh', 'from', 'Queensland']
 >>> w = w1.split()

 >>> (i for i,v in enumerate(w) if v.upper() != v)
<generator object <genexpr> at 0x18b1910>
 >>> (i for i,v in enumerate(w) if v.upper() != v).next()
2

Python 3.2.3 (default, Oct 19 2012, 19:53:16)

 >>> (i for i,v in enumerate(w) if v.upper() != v).next()
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
AttributeError: 'generator' object has no attribute 'next'



#34245

From: Alexander Blinne <news@blinne.net>
Date: 2012-12-04 19:48 +0100
Message-ID: <50be4566$0$9507$9b4e6d93@newsspool1.arcor-online.net>
In reply to: #34244

Am 04.12.2012 19:28, schrieb DJC:
> >>> (i for i,v in enumerate(w) if v.upper() != v).next()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> AttributeError: 'generator' object has no attribute 'next'

Yeah, I saw this problem right after I sent the post. It is now
supposed to read like this:

>>> def split_product(p):
...     w = p.split(" ")
...     j = next(i for i,v in enumerate(w) if v.upper() != v)
...     return " ".join(w[:j]), " ".join(w[j:])

Greetings



#34248

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2012-12-04 12:37 -0700
Message-ID: <mailman.470.1354649891.29569.python-list@python.org>
In reply to: #34245


On Tue, Dec 4, 2012 at 11:48 AM, Alexander Blinne <news@blinne.net> wrote:

> Am 04.12.2012 19:28, schrieb DJC:
> >>>> (i for i,v in enumerate(w) if v.upper() != v).next()
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in <module>
> > AttributeError: 'generator' object has no attribute 'next'
>
> Yeah, i saw this problem right after i sent the posting. It now is
> supposed to read like this
>
> >>> def split_product(p):
> ...     w = p.split(" ")
> ...     j = next(i for i,v in enumerate(w) if v.upper() != v)
> ...     return " ".join(w[:j]), " ".join(w[j:])
>

It still fails if the product description is empty.

>>> split_product("CAPSICUM RED")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in split_product
StopIteration

I'm not meaning to pick on you; some of the other solutions in this thread
also fail in that case.

>>> re.findall(r"(?m)^([A-Z\s]+) (.+)$", "CAPSICUM RED")
[('CAPSICUM', 'RED')]

>>> prod_desc("CAPSICUM RED")  # the second version from Neil's post
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 14, in prod_desc
IndexError: string index out of range



#34251

From: Alexander Blinne <news@blinne.net>
Date: 2012-12-04 21:33 +0100
Message-ID: <50be5e30$0$9512$9b4e6d93@newsspool1.arcor-online.net>
In reply to: #34248

Am 04.12.2012 20:37, schrieb Ian Kelly:
>     >>> def split_product(p):
>     ...     w = p.split(" ")
>     ...     j = next(i for i,v in enumerate(w) if v.upper() != v)
>     ...     return " ".join(w[:j]), " ".join(w[j:])
> 
> 
> It still fails if the product description is empty.

That's true... let's see, next() takes a default value in case the
iterator is empty, and then we could use some special value and test for
it. But I think it would be more elegant to just handle the exception
ourselves, so:

>>> def split_product(p):
...     w = p.split(" ")
...     try:
...         j = next(i for i,v in enumerate(w) if v.upper() != v)
...     except StopIteration:
...         return p, ''
...     return " ".join(w[:j]), " ".join(w[j:])
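
The other option mentioned above, giving next() a default, could be sketched as follows. Using len(w) as the default is an assumption made here for illustration; it avoids testing for a special value, because slicing at len(w) simply leaves an empty description. The name split_product_default is hypothetical.

```python
def split_product_default(p):
    w = p.split(" ")
    # Index of the first word that is not all caps; if there is none,
    # fall back to len(w) so the description slice is empty.
    j = next((i for i, v in enumerate(w) if v.upper() != v), len(w))
    return " ".join(w[:j]), " ".join(w[j:])


print(split_product_default("CAPSICUM RED fresh from QLD"))
print(split_product_default("CAPSICUM RED"))
```

Note the extra parentheses around the generator expression: they are required once next() is given a second argument.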

> I'm not meaning to pick on you; some of the other solutions in this
> thread also fail in that case.

It's ok, keeping an eye out for edge cases is always a good idea :)

Greetings



#34257

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2012-12-04 21:13 +0000
Message-ID: <50be675c$0$29994$c3e8da3$5496439d@news.astraweb.com>
In reply to: #34248

Ian,

For the sanity of those of us reading this via Usenet using the Pan 
newsreader, could you please turn off HTML emailing? It's very 
distracting.

Thanks,

Steven


On Tue, 04 Dec 2012 12:37:38 -0700, Ian Kelly wrote:

[...]
> <div class="gmail_quote">On Tue,
> Dec 4, 2012 at 11:48 AM, Alexander Blinne <span dir="ltr">&lt;<a
> href="mailto:news@blinne.net"
> target="_blank">news@blinne.net</a>&gt;</span> wrote:<br><blockquote
> class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc
> solid;padding-left:1ex">
> 
> Am 04.12.2012 19:28, schrieb DJC:<br> <div class="im">&gt;&gt;&gt;&gt;
> (i for i,v in enumerate(w) if v.upper() != v).next()<br> &gt; Traceback
> (most recent call last):<br> &gt;   File &quot;&lt;stdin&gt;&quot;, line
> 1, in &lt;module&gt;<br> &gt; AttributeError: &#39;generator&#39; object
> has no attribute &#39;next&#39;<br> <br>
> </div>Yeah, i saw this problem right after i sent the posting. It now
> is<br> supposed to read like this<br>
> <div class="im"><br>
> &gt;&gt;&gt; def split_product(p):<br> ...     w = p.split(&quot;
> &quot;)<br> </div>...     j = next(i for i,v in enumerate(w) if
> v.upper() != v)<br> <div class="im">...     return &quot;
> &quot;.join(w[:j]), &quot;
> &quot;.join(w[j:])<br></div></blockquote></div><br>It still fails if the
> product description is empty.<br><br>&gt;&gt;&gt;
> split_product(&quot;CAPSICUM RED&quot;)<br>
> 
> Traceback (most recent call last):<br>  File &quot;&lt;stdin&gt;&quot;,
> line 1, in &lt;module&gt;<br>  File &quot;&lt;stdin&gt;&quot;, line 3,
> in split_product<br>StopIteration<br><br>I&#39;m not meaning to pick on
> you; some of the other solutions in this thread also fail in that
> case.<br>
> 
> <br>&gt;&gt;&gt; re.findall(r&quot;(?m)^([A-Z\s]+) (.+)$&quot;,
> &quot;CAPSICUM RED&quot;)<br>[(&#39;CAPSICUM&#39;,
> &#39;RED&#39;)]<br><br>&gt;&gt;&gt; prod_desc(&quot;CAPSICUM RED&quot;) 
> # the second version from Neil&#39;s post<br>
> 
> Traceback (most recent call last):<br>  File &quot;&lt;stdin&gt;&quot;,
> line 1, in &lt;module&gt;<br>  File &quot;&lt;stdin&gt;&quot;, line 14,
> in prod_desc<br>IndexError: string index out of range<br><br>


-- 
Steven



#34250

From: MRAB <python@mrabarnett.plus.com>
Date: 2012-12-04 20:17 +0000
Message-ID: <mailman.473.1354652248.29569.python-list@python.org>
In reply to: #34245
On 2012-12-04 19:37, Ian Kelly wrote:
> On Tue, Dec 4, 2012 at 11:48 AM, Alexander Blinne <news@blinne.net
> <mailto:news@blinne.net>> wrote:
>
>     Am 04.12.2012 19:28, schrieb DJC:
>      >>>> (i for i,v in enumerate(w) if v.upper() != v).next()
>      > Traceback (most recent call last):
>      >   File "<stdin>", line 1, in <module>
>      > AttributeError: 'generator' object has no attribute 'next'
>
>     Yeah, i saw this problem right after i sent the posting. It now is
>     supposed to read like this
>
>      >>> def split_product(p):
>     ...     w = p.split(" ")
>     ...     j = next(i for i,v in enumerate(w) if v.upper() != v)
>     ...     return " ".join(w[:j]), " ".join(w[j:])
>
>
> It still fails if the product description is empty.
>
>  >>> split_product("CAPSICUM RED")
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
>    File "<stdin>", line 3, in split_product
> StopIteration
>
> I'm not meaning to pick on you; some of the other solutions in this
> thread also fail in that case.
>
>  >>> re.findall(r"(?m)^([A-Z\s]+) (.+)$", "CAPSICUM RED")
> [('CAPSICUM', 'RED')]
>
That's easily fixed:

 >>> re.findall(r"(?m)^([A-Z\s]+)(?: (.*))?$", "CAPSICUM RED")
[('CAPSICUM RED', '')]
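
The pattern above can be wrapped in a small helper. This is only a sketch: the name split_product_re is hypothetical, and returning an empty product name for a string that does not start with capitals is an assumption, not something stated in the thread. The optional group is None when there is no description, so it is mapped to an empty string.

```python
import re

# Same pattern as above, without the (?m) flag since one string is handled at a time.
PATTERN = re.compile(r"^([A-Z\s]+)(?: (.*))?$")


def split_product_re(s):
    m = PATTERN.match(s)
    if m is None:
        # Assumption: no leading capitals means there is no product name.
        return "", s
    # group(2) is None when the optional description part did not take part in the match.
    return m.group(1), m.group(2) or ""


print(split_product_re("CAPSICUM RED"))
print(split_product_re("CAPSICUM RED fresh from QLD"))
```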

>  >>> prod_desc("CAPSICUM RED")  # the second version from Neil's post
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
>    File "<stdin>", line 14, in prod_desc
> IndexError: string index out of range
>



#34252

From: Terry Reedy <tjreedy@udel.edu>
Date: 2012-12-04 15:44 -0500
Message-ID: <mailman.474.1354653865.29569.python-list@python.org>
In reply to: #34226
On 12/4/2012 8:57 AM, Nick Mellor wrote:

> I have a file full of things like this:
>
> "CAPSICUM RED fresh from Queensland"
>
> Product names (all caps, at start of string) and descriptions (mixed
> case, to end of string) all muddled up in the same field. And I need
> to split them into two fields. Note that if the text had said:
>
> "CAPSICUM RED fresh from QLD"
>
> I would want QLD in the description, not shunted forwards and put in
> the product name. So (uncontrived) list comprehensions and regex's
> are out.
>
> I want to split the above into:
>
> ("CAPSICUM RED", "fresh from QLD")
>
> Enter dropwhile and takewhile. 6 lines later:
>
> from itertools import takewhile, dropwhile
> def split_product_itertools(s):
>     words = s.split()
>     allcaps = lambda word: word == word.upper()
>     product, description =\
>         takewhile(allcaps, words), dropwhile(allcaps, words)
>     return " ".join(product), " ".join(description)

If the original string has no excess whitespace, description is what 
remains of s after product prefix is omitted. (Py 3 code)

from itertools import takewhile
def allcaps(word): return word == word.upper()

def split_product_itertools(s):
     product = ' '.join(takewhile(allcaps, s.split()))
     return product, s[len(product)+1:]

print(split_product_itertools("CAPSICUM RED fresh from QLD"))
 >>>
('CAPSICUM RED', 'fresh from QLD')

Without that assumption, the same idea applies to the split list.

def split_product_itertools(s):
     words = s.split()
     product = list(takewhile(allcaps, words))
     return ' '.join(product), ' '.join(words[len(product):])
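
To illustrate the whitespace assumption, here is a sketch that runs both versions on a string with a doubled space. The names split_by_slicing and split_by_words are hypothetical, chosen only so the two functions can coexist; the bodies are the ones given above.

```python
from itertools import takewhile


def allcaps(word):
    return word == word.upper()


def split_by_slicing(s):
    # First version: slices the original string, so it relies on single spaces.
    product = ' '.join(takewhile(allcaps, s.split()))
    return product, s[len(product)+1:]


def split_by_words(s):
    # Second version: works on the split list, so extra whitespace is normalised away.
    words = s.split()
    product = list(takewhile(allcaps, words))
    return ' '.join(product), ' '.join(words[len(product):])


messy = "CAPSICUM  RED fresh from QLD"  # note the two spaces after CAPSICUM
print(split_by_slicing(messy))  # description keeps a stray leading space
print(split_by_words(messy))
```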

-- 
Terry Jan Reedy



#34266

From: Nick Mellor <thebalancepro@gmail.com>
Date: 2012-12-04 17:17 -0800
Message-ID: <05bca175-2077-4fb8-917e-baee1a43a47d@googlegroups.com>
In reply to: #34252
Hi Terry,

For my money, and especially in your versions, despite several expert solutions using other features, itertools has it. It seems to me to need less nutting out than the other approaches. It's short, robust, has a minimum of symbols, uses simple expressions and is not overly clever. If we could just get used to using takewhile.

takewhile mines for gold at the start of a sequence, dropwhile drops the dross at the start of a sequence.
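
A minimal sketch of that behaviour, using the sample string from the thread. Both functions only look at the leading run: once the predicate fails, takewhile stops and dropwhile passes everything else through, so the trailing "QLD" stays in the remainder even though it is all caps.

```python
from itertools import takewhile, dropwhile


def allcaps(word):
    return word == word.upper()


words = "CAPSICUM RED fresh from QLD".split()

# Leading all-caps words only.
head = list(takewhile(allcaps, words))
# Everything from the first non-all-caps word onwards, including later all-caps words.
tail = list(dropwhile(allcaps, words))

print(head)
print(tail)
```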

Thanks all for your interest and your help,

Best,

Nick

Terry's implementations:

> from itertools import takewhile
> def allcaps(word): return word == word.upper()
>
> def split_product_itertools(s):
>      product = ' '.join(takewhile(allcaps, s.split()))
>      return product, s[len(product)+1:]
>
> print(split_product_itertools("CAPSICUM RED fresh from QLD"))
>  >>>
> ('CAPSICUM RED', 'fresh from QLD')
>
> [if there could be surplus whitespace], the same idea applies to the split list.
>
> def split_product_itertools(s):
>      words = s.split()
>      product = list(takewhile(allcaps, words))
>      return ' '.join(product), ' '.join(words[len(product):])



#34280

From: Chris Angelico <rosuav@gmail.com>
Date: 2012-12-06 00:45 +1100
Message-ID: <mailman.490.1354715109.29569.python-list@python.org>
In reply to: #34266
On Wed, Dec 5, 2012 at 12:17 PM, Nick Mellor <thebalancepro@gmail.com> wrote:
>
> takewhile mines for gold at the start of a sequence, dropwhile drops the dross at the start of a sequence.

When you're using both over the same sequence and with the same
condition, it seems odd that you need to iterate over it twice.
Perhaps a partitioning iterator would be cleaner - something like
this:

def partitionwhile(predicate, iterable):
    iterable = iter(iterable)
    while True:
        val = next(iterable)
        if not predicate(val): break
        yield val
    raise StopIteration # Signal the end of Phase 1
    for val in iterable: yield val # or just "yield from iterable", I think

Only the cold hard boot of reality just stomped out the spark of an
idea. Once StopIteration has been raised, that's it, there's no
"resuming" the iterator. Is there a way around that? Is there a clean
way to say "Done for now, but next time you ask, there'll be more"?

I tested it on Python 3.2 (yeah, time I upgraded, I know).
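
One possible way around the problem, sketched under assumptions that are not from the post: instead of a single generator with two phases, return the leading run as a list together with an iterator over the remainder. itertools.chain puts the first failing item back in front of the rest, so the input is traversed only once. The name partitionwhile is reused for illustration, but the return shape differs from the generator above.

```python
from itertools import chain


def partitionwhile(predicate, iterable):
    """Return (head, rest): head is a list of the leading items that satisfy
    predicate, rest is an iterator over everything after them."""
    it = iter(iterable)
    head = []
    for val in it:
        if not predicate(val):
            # Put the first failing item back in front of the remainder.
            return head, chain([val], it)
        head.append(val)
    # Every item satisfied the predicate, so nothing remains.
    return head, iter(())


words = "CAPSICUM RED fresh from QLD".split()
product, description = partitionwhile(lambda w: w == w.upper(), words)
print(" ".join(product), "|", " ".join(description))
```

The head is built eagerly here, which is a trade-off: it avoids the "resume after StopIteration" question entirely, at the cost of holding the leading run in memory.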

ChrisA



#34281

From: Neil Cerutti <neilc@norwich.edu>
Date: 2012-12-05 14:34 +0000
Message-ID: <ai94btF9hoaU1@mid.individual.net>
In reply to: #34281
On 2012-12-05, Chris Angelico <rosuav@gmail.com> wrote:
> On Wed, Dec 5, 2012 at 12:17 PM, Nick Mellor <thebalancepro@gmail.com> wrote:
>>
>> takewhile mines for gold at the start of a sequence, dropwhile
>> drops the dross at the start of a sequence.
>
> When you're using both over the same sequence and with the same
> condition, it seems odd that you need to iterate over it twice.
> Perhaps a partitioning iterator would be cleaner - something
> like this:
>
> def partitionwhile(predicate, iterable):
>     iterable = iter(iterable)
>     while True:
>         val = next(iterable)
>         if not predicate(val): break
>         yield val
>     raise StopIteration # Signal the end of Phase 1
>     for val in iterable: yield val # or just "yield from iterable", I think
>
> Only the cold hard boot of reality just stomped out the spark
> of an idea. Once StopIteration has been raised, that's it,
> there's no "resuming" the iterator. Is there a way around that?
> Is there a clean way to say "Done for now, but next time you
> ask, there'll be more"?
>
> I tested it on Python 3.2 (yeah, time I upgraded, I know).

Well, shoot! Then this is a job for groupby, not takewhile.

import itertools

def prod_desc(s):
    """split s into product name and product description.

    >>> prod_desc("CAR FIFTY TWO Chrysler LeBaron.")
    ['CAR FIFTY TWO', 'Chrysler LeBaron.']

    >>> prod_desc("MR. JONESEY Saskatchewan's finest")
    ['MR. JONESEY', "Saskatchewan's finest"]

    >>> prod_desc("no product name?")
    ['', 'no product name?']

    >>> prod_desc("NO DESCRIPTION")
    ['NO DESCRIPTION', '']
    """
    prod = ''
    desc = ''
    for k, g in itertools.groupby(s.split(),
            key=lambda w: any(c.islower() for c in w)):
        a = ' '.join(g)
        if k:
            desc = a 
        else:
            prod = a
    return [prod, desc]

This has no way to preserve odd white space which could break
evil product name differences.
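
Note that the loop above keeps the last all-caps group it sees, so an input such as "CAPSICUM RED fresh from QLD" (the case from the original post) comes back as ['QLD', 'fresh from']. A sketch of a variant that looks only at the first group and treats everything after it as the description follows; prod_desc_first_group and has_lower are hypothetical names.

```python
import itertools


def has_lower(word):
    return any(c.islower() for c in word)


def prod_desc_first_group(s):
    words = s.split()
    prod_words = []
    for is_desc, group in itertools.groupby(words, key=has_lower):
        if not is_desc:
            # The string starts with an all-caps run: that is the product name.
            prod_words = list(group)
        break  # only the first group decides where the product name ends
    # Everything after the product name is description, all-caps words included.
    return [' '.join(prod_words), ' '.join(words[len(prod_words):])]


print(prod_desc_first_group("CAPSICUM RED fresh from QLD"))
```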

-- 
Neil Cerutti


