Date: Tue, 4 Dec 2012 15:31:48 +0100
Subject: Re: Good use for itertools.dropwhile and itertools.takewhile
From: Vlastimil Brom <vlastimil.brom@gmail.com>
To: python-list@python.org
Newsgroups: comp.lang.python

2012/12/4 Nick Mellor <thebalancepro@gmail.com>:
> Hi,
>
> I came across itertools.dropwhile only today, then shortly afterwards found Raymond Hettinger wondering, in 2007, whether to drop [sic] dropwhile and takewhile from the itertools module.
>
> Fate of itertools.dropwhile() and itertools.takewhile() - Python
> bytes.com
> http://bit.ly/Vi2PqP
>
> Almost nobody else of the 18 respondents seemed to be using them.
>
> And then 2 hours later, a use case came along. I think. Anyone have any better solutions?
>
> I have a file full of things like this:
>
> "CAPSICUM RED fresh from Queensland"
>
> Product names (all caps, at start of string) and descriptions (mixed case, to end of string) all muddled up in the same field. And I need to split them into two fields. Note that if the text had said:
>
> "CAPSICUM RED fresh from QLD"
>
> I would want QLD in the description, not shunted forwards and put in the product name. So (uncontrived) list comprehensions and regexes are out.
>
> I want to split the above into:
>
> ("CAPSICUM RED", "fresh from QLD")
>
> Enter dropwhile and takewhile. 6 lines later:
>
> from itertools import takewhile, dropwhile
> def split_product_itertools(s):
>     words = s.split()
>     allcaps = lambda word: word == word.upper()
>     product, description = takewhile(allcaps, words), dropwhile(allcaps, words)
>     return " ".join(product), " ".join(description)
>
>
> When I tried to refactor this code to use while or for loops, I couldn't find any way that felt shorter or more pythonic:
>
> (9 lines: using for)
>
> def split_product_1(s):
>     words = s.split()
>     product = []
>     for word in words:
>         if word == word.upper():
>             product.append(word)
>         else:
>             break
>     return " ".join(product), " ".join(words[len(product):])
>
>
> (12 lines: using while)
>
> def split_product_2(s):
>     words = s.split()
>     i = 0
>     product = []
>     while i < len(words):  # stop at the end when every word is all caps
>         word = words[i]
>         if word == word.upper():
>             product.append(word)
>             i += 1
>         else:
>             break
>     return " ".join(product), " ".join(words[i:])
>
>
> Any thoughts?
>
> Nick

Hi,
the regex approach doesn't seem very complex for the specification
you describe, e.g.

>>> import re
>>> re.findall(r"(?m)^([A-Z\s]+) (.+)$", "CAPSICUM RED fresh from QLD\nCAPSICUM RED fresh from Queensland")
[('CAPSICUM RED', 'fresh from QLD'), ('CAPSICUM RED', 'fresh from Queensland')]
>>>

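(The pattern might also need to account for punctuation, digits or
irregular whitespace in the product names. Note too that \s inside the
character class matches newlines, so a line consisting only of capitals
would run on into the following line.)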
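
For comparison with the split_product_* functions quoted above, here is a rough sketch of the same idea wrapped in a function that handles one string at a time. The name split_product_regex and the pattern are made up for illustration, and the sketch assumes that a "product" word is any whitespace-delimited token containing no lowercase ASCII letter, so digits and punctuation stay in the product name.

```python
import re

# Hypothetical helper, not from the thread. Assumption: a product-name word
# is any whitespace-delimited token with no lowercase ASCII letter, so
# digits and punctuation such as "(ROMA)" or "2KG" stay in the product name.
_PRODUCT_PREFIX = re.compile(r"((?:[^a-z\s]+(?:\s+|$))*)(.*)", re.DOTALL)


def split_product_regex(s):
    """Split the leading all-caps words from the rest of the string."""
    # Both groups may match the empty string, so match() always succeeds.
    match = _PRODUCT_PREFIX.match(s.strip())
    # Collapse whitespace runs, as the split()/join() versions do.
    product = " ".join(match.group(1).split())
    description = " ".join(match.group(2).split())
    return product, description


print(split_product_regex("CAPSICUM RED fresh from QLD"))
print(split_product_regex("TOMATOES (ROMA) 2KG vine ripened"))
```

Only the leading run of words is taken, so a trailing QLD stays in the description. Unlike the word == word.upper() test, the [^a-z] class only knows about ASCII letters, so a word made up solely of accented lowercase characters would be treated as part of the product name here.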

hth,
  vbr