

Groups > comp.lang.python > #34226 > unrolled thread

Good use for itertools.dropwhile and itertools.takewhile

Started by: Nick Mellor <thebalancepro@gmail.com>
First post: 2012-12-04 05:57 -0800
Last post: 2012-12-06 13:29 -0800
Articles 18 on this page of 38 — 12 participants



Contents

  Good use for itertools.dropwhile and itertools.takewhile Nick Mellor <thebalancepro@gmail.com> - 2012-12-04 05:57 -0800
    Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-04 14:23 +0000
      Re: Good use for itertools.dropwhile and itertools.takewhile Nick Mellor <thebalancepro@gmail.com> - 2012-12-04 06:47 -0800
        Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-04 15:17 +0000
    Re: Good use for itertools.dropwhile and itertools.takewhile Vlastimil Brom <vlastimil.brom@gmail.com> - 2012-12-04 15:31 +0100
      Re: Good use for itertools.dropwhile and itertools.takewhile Nick Mellor <thebalancepro@gmail.com> - 2012-12-04 07:24 -0800
        Re: Good use for itertools.dropwhile and itertools.takewhile Vlastimil Brom <vlastimil.brom@gmail.com> - 2012-12-04 22:08 +0100
      Re: Good use for itertools.dropwhile and itertools.takewhile Nick Mellor <thebalancepro@gmail.com> - 2012-12-04 07:24 -0800
        Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-04 18:26 +0000
    Re: Good use for itertools.dropwhile and itertools.takewhile Alexander Blinne <news@blinne.net> - 2012-12-04 18:18 +0100
      Re: Good use for itertools.dropwhile and itertools.takewhile DJC <djc@news.invalid> - 2012-12-04 18:28 +0000
        Re: Good use for itertools.dropwhile and itertools.takewhile Alexander Blinne <news@blinne.net> - 2012-12-04 19:48 +0100
          Re: Good use for itertools.dropwhile and itertools.takewhile Ian Kelly <ian.g.kelly@gmail.com> - 2012-12-04 12:37 -0700
            Re: Good use for itertools.dropwhile and itertools.takewhile Alexander Blinne <news@blinne.net> - 2012-12-04 21:33 +0100
            Re: Good use for itertools.dropwhile and itertools.takewhile Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-12-04 21:13 +0000
          Re: Good use for itertools.dropwhile and itertools.takewhile MRAB <python@mrabarnett.plus.com> - 2012-12-04 20:17 +0000
    Re: Good use for itertools.dropwhile and itertools.takewhile Terry Reedy <tjreedy@udel.edu> - 2012-12-04 15:44 -0500
      Re: Good use for itertools.dropwhile and itertools.takewhile Nick Mellor <thebalancepro@gmail.com> - 2012-12-04 17:17 -0800
        Re: Good use for itertools.dropwhile and itertools.takewhile Chris Angelico <rosuav@gmail.com> - 2012-12-06 00:45 +1100
          Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-05 14:34 +0000
            Re: Good use for itertools.dropwhile and itertools.takewhile Ian Kelly <ian.g.kelly@gmail.com> - 2012-12-05 08:33 -0700
              Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-05 16:11 +0000
        Re: Good use for itertools.dropwhile and itertools.takewhile Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-12-05 15:32 +0000
        Re: Good use for itertools.dropwhile and itertools.takewhile Ian Kelly <ian.g.kelly@gmail.com> - 2012-12-05 09:16 -0700
        Re: Good use for itertools.dropwhile and itertools.takewhile MRAB <python@mrabarnett.plus.com> - 2012-12-05 17:57 +0000
      Re: Good use for itertools.dropwhile and itertools.takewhile Nick Mellor <thebalancepro@gmail.com> - 2012-12-04 17:17 -0800
        Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-05 13:29 +0000
          Re: Good use for itertools.dropwhile and itertools.takewhile Nick Mellor <thebalancepro@gmail.com> - 2012-12-05 09:04 -0800
            Re: Good use for itertools.dropwhile and itertools.takewhile MRAB <python@mrabarnett.plus.com> - 2012-12-05 17:57 +0000
            Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-05 18:16 +0000
              Re: Good use for itertools.dropwhile and itertools.takewhile Nick Mellor <thebalancepro@gmail.com> - 2012-12-05 11:01 -0800
                Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-05 20:13 +0000
                Re: Good use for itertools.dropwhile and itertools.takewhile Vlastimil Brom <vlastimil.brom@gmail.com> - 2012-12-05 22:36 +0100
                  Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-06 13:06 +0000
                    Re: Good use for itertools.dropwhile and itertools.takewhile Vlastimil Brom <vlastimil.brom@gmail.com> - 2012-12-06 15:12 +0100
            Re: Good use for itertools.dropwhile and itertools.takewhile Alexander Blinne <news@blinne.net> - 2012-12-06 14:40 +0100
    Re: Good use for itertools.dropwhile and itertools.takewhile Terry Reedy <tjreedy@udel.edu> - 2012-12-04 17:21 -0500
    Re: Good use for itertools.dropwhile and itertools.takewhile Paul Rubin <no.email@nospam.invalid> - 2012-12-06 13:29 -0800



#34285

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2012-12-05 08:33 -0700
Message-ID: <mailman.493.1354721626.29569.python-list@python.org>
In reply to: #34281

On Wed, Dec 5, 2012 at 7:34 AM, Neil Cerutti <neilc@norwich.edu> wrote:
> Well, shoot! Then this is a job for groupby, not takewhile.

The problem with groupby is that you can't just limit it to two groups.

>>> prod_desc("CAPSICUM RED fresh from QLD")
['QLD', 'fresh from']

Once you've got a false key from the groupby, you would need to
pretend that any subsequent groups are part of the false group and
tack them on.
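
A minimal sketch of the point above, assuming an all-caps test like the one used elsewhere in the thread (the helper name `is_all_caps` and the folding loop are illustrative, not code from the post). groupby yields a new group every time the key changes, so the trailing "QLD" comes out as a third group and has to be folded back into the description by hand:

```python
from itertools import groupby

def is_all_caps(word):
    return word == word.upper()

words = "CAPSICUM RED fresh from QLD".split()

# groupby starts a new group whenever the key flips, so this gives three groups.
groups = [(key, list(group)) for key, group in groupby(words, key=is_all_caps)]

# Fold every group after the first non-caps group into the description.
product, description = [], []
for key, group in groups:
    if key and not description:
        product.extend(group)
    else:
        description.extend(group)

print(groups)
print([' '.join(product), ' '.join(description)])
```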



#34290

From: Neil Cerutti <neilc@norwich.edu>
Date: 2012-12-05 16:11 +0000
Message-ID: <ai9a1aFaup4U1@mid.individual.net>
In reply to: #34285

On 2012-12-05, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> On Wed, Dec 5, 2012 at 7:34 AM, Neil Cerutti <neilc@norwich.edu> wrote:
>> Well, shoot! Then this is a job for groupby, not takewhile.
>
> The problem with groupby is that you can't just limit it to two groups.
>
>>>> prod_desc("CAPSICUM RED fresh from QLD")
> ['QLD', 'fresh from']
>
> Once you've got a false key from the groupby, you would need to
> pretend that any subsequent groups are part of the false group
> and tack them on.

Whoops! Yep, that was from the very beginning of the thread.

-- 
Neil Cerutti



#34286

From: Mark Lawrence <breamoreboy@yahoo.co.uk>
Date: 2012-12-05 15:32 +0000
Message-ID: <mailman.494.1354721806.29569.python-list@python.org>
In reply to: #34266

On 05/12/2012 13:45, Chris Angelico wrote:
>
> I tested it on Python 3.2 (yeah, time I upgraded, I know).

Bad move, fancy wanting to go to the completely useless version of 
Python that simply can't handle unicode properly :)

-- 
Cheers.

Mark Lawrence.



#34292

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2012-12-05 09:16 -0700
Message-ID: <mailman.499.1354724202.29569.python-list@python.org>
In reply to: #34266

On Wed, Dec 5, 2012 at 6:45 AM, Chris Angelico <rosuav@gmail.com> wrote:
> On Wed, Dec 5, 2012 at 12:17 PM, Nick Mellor <thebalancepro@gmail.com> wrote:
>>
>> takewhile mines for gold at the start of a sequence, dropwhile drops the dross at the start of a sequence.
>
> When you're using both over the same sequence and with the same
> condition, it seems odd that you need to iterate over it twice.
> Perhaps a partitioning iterator would be cleaner - something like
> this:
>
> def partitionwhile(predicate, iterable):
>     iterable = iter(iterable)
>     while True:
>         val = next(iterable)
>         if not predicate(val): break
>         yield val
>     raise StopIteration # Signal the end of Phase 1
>     for val in iterable: yield val # or just "yield from iterable", I think
>
> Only the cold hard boot of reality just stomped out the spark of an
> idea. Once StopIteration has been raised, that's it, there's no
> "resuming" the iterator. Is there a way around that? Is there a clean
> way to say "Done for now, but next time you ask, there'll be more"?

Return two separate iterators, with the contract that the second
iterator can't be used until the first has completed.  Combined with
Neil's groupby suggestion, we end up with something like this:

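import itertools

def partitionwhile(predicate, iterable):
    it = itertools.groupby(iterable, lambda x: bool(predicate(x)))
    # 'missing' means the first iterator has not finished yet.
    pushback = missing = object()
    def first():
        nonlocal pushback
        try:
            pred, subit = next(it)
        except StopIteration:
            # Empty input. Letting StopIteration escape a generator
            # is a RuntimeError on Python 3.7+ (PEP 479).
            pushback = None
            return
        if pred:
            yield from subit
            pushback = None
        else:
            # First group fails the predicate: hand it to second().
            pushback = subit
    def second():
        if pushback is missing:
            raise TypeError("can't yield from second iterator before "
                            "first iterator completes")
        elif pushback is not None:
            yield from pushback
        yield from itertools.chain.from_iterable(subit for key, subit in it)
    return first(), second()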

>>> list(map(' '.join, partitionwhile(lambda x: x.upper() == x, "CAPSICUM RED fresh from QLD".split())))
['CAPSICUM RED', 'fresh from QLD']



#34300

From: MRAB <python@mrabarnett.plus.com>
Date: 2012-12-05 17:57 +0000
Message-ID: <mailman.508.1354730246.29569.python-list@python.org>
In reply to: #34266

On 2012-12-05 13:45, Chris Angelico wrote:
> On Wed, Dec 5, 2012 at 12:17 PM, Nick Mellor <thebalancepro@gmail.com> wrote:
>>
>> takewhile mines for gold at the start of a sequence, dropwhile drops the dross at the start of a sequence.
>
> When you're using both over the same sequence and with the same
> condition, it seems odd that you need to iterate over it twice.
> Perhaps a partitioning iterator would be cleaner - something like
> this:
>
> def partitionwhile(predicate, iterable):
>      iterable = iter(iterable)
>      while True:
>          val = next(iterable)
>          if not predicate(val): break
>          yield val
>      raise StopIteration # Signal the end of Phase 1
>      for val in iterable: yield val # or just "yield from iterable", I think
>
> Only the cold hard boot of reality just stomped out the spark of an
> idea. Once StopIteration has been raised, that's it, there's no
> "resuming" the iterator. Is there a way around that? Is there a clean
> way to say "Done for now, but next time you ask, there'll be more"?
>
Perhaps you could have some kind of partitioner object:

class Partitioner:
    _SENTINEL = object()

    def __init__(self, iterable):
        self._iterable = iter(iterable)
        self._unused_item = self._SENTINEL

    def takewhile(self, condition):
        # An item left over from an earlier takewhile() is tested first.
        if self._unused_item is not self._SENTINEL:
            if not condition(self._unused_item):
                # 'return' ends a generator; 'raise StopIteration' here
                # is a RuntimeError on Python 3.7+ (PEP 479).
                return

            yield self._unused_item
            self._unused_item = self._SENTINEL

        for item in self._iterable:
            if not condition(item):
                # Keep the first failing item for the next caller.
                self._unused_item = item
                break

            yield item

    def remainder(self):
        if self._unused_item is not self._SENTINEL:
            yield self._unused_item
            self._unused_item = self._SENTINEL

        for item in self._iterable:
            yield item

def is_all_caps(word):
    return word == word.upper()

part = Partitioner("CAPSICUM RED fresh from QLD".split())
product = " ".join(part.takewhile(is_all_caps))
description = " ".join(part.remainder())
print([product, description])



#34267

From: Nick Mellor <thebalancepro@gmail.com>
Date: 2012-12-04 17:17 -0800
Message-ID: <mailman.484.1354670286.29569.python-list@python.org>
In reply to: #34252

Hi Terry,

For my money, and especially in your versions, despite several expert solutions using other features, itertools has it. It seems to me to need less nutting out than the other approaches. It's short, robust, has a minimum of symbols, uses simple expressions and is not overly clever. If we could just get used to using takewhile.

takewhile mines for gold at the start of a sequence, dropwhile drops the dross at the start of a sequence.
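
As a small illustration of that sentence (a sketch only; `is_all_caps` is a helper name assumed here), the two functions split the same word list at the same point when given the same predicate:

```python
from itertools import dropwhile, takewhile

def is_all_caps(word):
    return word == word.upper()

# A list, not a one-shot iterator: each function makes its own pass over it.
words = "CAPSICUM RED fresh from QLD".split()

product = list(takewhile(is_all_caps, words))      # leading all-caps words
description = list(dropwhile(is_all_caps, words))  # everything after them

print(product, description)
```

Note that this relies on `words` being re-iterable. With a one-shot iterator, takewhile would consume the first failing item and dropwhile would not see it.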

Thanks all for your interest and your help,

Best,

Nick

Terry's implementations:

> from itertools import takewhile
>
> def allcaps(word): return word == word.upper()
>
> def split_product_itertools(s):
>      product = ' '.join(takewhile(allcaps, s.split()))
>      return product, s[len(product)+1:]
>
> print(split_product_itertools("CAPSICUM RED fresh from QLD"))
>  >>>
> ('CAPSICUM RED', 'fresh from QLD')
>
> [if there could be surplus whitespace], the same idea applies to the split list.
>
> def split_product_itertools(s):
>      words = s.split()
>      product = list(takewhile(allcaps, words))
>      return ' '.join(product), ' '.join(words[len(product):])



#34279

From: Neil Cerutti <neilc@norwich.edu>
Date: 2012-12-05 13:29 +0000
Message-ID: <ai90h9F8mm8U4@mid.individual.net>
In reply to: #34267

On 2012-12-05, Nick Mellor <thebalancepro@gmail.com> wrote:
> Hi Terry,
>
> For my money, and especially in your versions, despite several
> expert solutions using other features, itertools has it. It
> seems to me to need less nutting out than the other approaches.
> It's short, robust, has a minimum of symbols, uses simple
> expressions and is not overly clever. If we could just get used
> to using takewhile.

The main reason most of the solutions posted failed is lack of
complete specification to work with while simultaneously trying
to make as tiny and simplistic a solution as possible.

I'm struggling with the empty description bug right now. ;)

-- 
Neil Cerutti



#34295

From: Nick Mellor <thebalancepro@gmail.com>
Date: 2012-12-05 09:04 -0800
Message-ID: <26781aa9-b4a2-4308-8db2-5a150da2128f@googlegroups.com>
In reply to: #34279

Hi Neil,

Here's some sample data. The live data is about 300 minor variations on the sample data, about 20,000 lines.

Nick

Notes:

1. Whitespace is only used for word boundaries. Surplus whitespace is not significant and can be stripped

2. Retain punctuation and parentheses

3. Product is zero or more words in all caps at start of line

4. Description is zero or more words beginning with first word that is not all caps. Description continues to the end of the line

5. Return tuple of strings (product, description)
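
One way to read the five notes as code, sketched along the lines of the list-based takewhile version quoted earlier in the thread. The function name `split_product` and the mixed-case test line "Red CABBAGE" are assumptions made for illustration; they are not part of the sample data:

```python
from itertools import takewhile

def is_all_caps(word):
    return word == word.upper()

def split_product(line):
    # Notes 1 and 2: split() drops surplus whitespace but keeps punctuation.
    words = line.split()
    # Note 3: the product is the leading run of all-caps words (possibly empty).
    product = list(takewhile(is_all_caps, words))
    # Notes 4 and 5: the rest is the description; return a tuple of strings.
    return ' '.join(product), ' '.join(words[len(product):])

for line in ["BOK CHOY (bunch)", " LETTUCE iceberg", "CARROTS", "Red CABBAGE"]:
    print(split_product(line))
```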


Sample data
---

BEANS hand picked
BEETROOT certified organic
BOK CHOY (bunch)
BROCCOLI Mornington Peninsula
BRUSSEL  SPROUTS
CABBAGE green
CABBAGE Red
CAPSICUM RED
CARROTS
CARROTS loose
CARROTS juicing, certified organic
CARROTS Trentham, large seconds, certified organic
CARROTS Trentham, firsts, certified organic
CAULIFLOWER
CELERY Mornington Peninsula IPM grower 
CELERY Mornington Peninsula IPM grower 
CUCUMBER
EGGPLANT
FENNEL
GARLIC (from Argentina)
GINGER fresh uncured
KALE (bunch)
KOHL RABI certified organic
LEEKS
 LETTUCE iceberg
MUSHROOM cup or flat
MUSHROOM Swiss brown
ONION brown
ONION red
ONION spring (bunch)
PARSNIP, certified organic
POTATOES certified organic
POTATOES Sebago
POTATOES Desiree
POTATOES Bullarto chemical free
POTATOES Dutch Cream
POTATOES Nicola
POTATOES Pontiac
POTATOES Otway Red
POTATOES teardrop
PUMPKIN certified organic
SCHALLOTS brown
SNOW PEAS
SPINACH I'll try to get certified organic (bunch)
SWEET POTATO gold certified organic 
SWEET POTATO red small
SWEDE certified organic
TOMATOES  Qld
TURMERIC fresh certified organic
ZUCCHINI
APPLES Harcourt  Pink Lady, Fuji, Granny Smith
APPLES Harcourt 2 kg bags, Pink Lady or Fuji (bag)
AVOCADOS
AVOCADOS certified organic, seconds
BANANAS Qld, organic
GRAPEFRUIT
GRAPES crimson seedless
KIWI FRUIT Qld certified organic
LEMONS
LIMES
MANDARINS
ORANGES Navel
PEARS Beurre Bosc Harcourt new season
PEARS Packham, Harcourt new season
SULTANAS 350g pre-packed bags
EGGS Melita free range, Barker's Creek
BASIL (bunch)
CORIANDER (bunch)
DILL (bunch)
MINT (bunch)
PARSLEY (bunch)


On Thursday, 6 December 2012 00:29:13 UTC+11, Neil Cerutti wrote:
> On 2012-12-05, Nick Mellor <thebalancepro@gmail.com> wrote:
> > Hi Terry,
> >
> > For my money, and especially in your versions, despite several
> > expert solutions using other features, itertools has it. It
> > seems to me to need less nutting out than the other approaches.
> > It's short, robust, has a minimum of symbols, uses simple
> > expressions and is not overly clever. If we could just get used
> > to using takewhile.
>
> The main reason most of the solutions posted failed is lack of
> complete specification to work with while simultaneously trying
> to make as tiny and simplistic a solution as possible.
>
> I'm struggling with the empty description bug right now. ;)
>
> --
> Neil Cerutti



#34301

From: MRAB <python@mrabarnett.plus.com>
Date: 2012-12-05 17:57 +0000
Message-ID: <mailman.509.1354730246.29569.python-list@python.org>
In reply to: #34295

On 2012-12-05 17:04, Nick Mellor wrote:
> Hi Neil,
>
> Here's some sample data. The live data is about 300 minor variations on the sample data, about 20,000 lines.
>
[snip]
You have a duplicate:

> CELERY Mornington Peninsula IPM grower
> CELERY Mornington Peninsula IPM grower



#34304

From: Neil Cerutti <neilc@norwich.edu>
Date: 2012-12-05 18:16 +0000
Message-ID: <ai9hb4FclvgU1@mid.individual.net>
In reply to: #34295

On 2012-12-05, Nick Mellor <thebalancepro@gmail.com> wrote:
> Hi Neil,
>
> Here's some sample data. The live data is about 300 minor
> variations on the sample data, about 20,000 lines.

Thanks, Nick.

This slight variation on my first groupby try seems to work for
the test data.

def prod_desc(s):
    prod = []
    desc = []
    for k, g in itertools.groupby(s.split(),
            key=lambda w: any(c.islower() for c in w)):
        if prod or k:
            desc.extend(g)
        else:
            prod.extend(g)
    return [' '.join(prod), ' '.join(desc)]

-- 
Neil Cerutti



#34313

From: Nick Mellor <thebalancepro@gmail.com>
Date: 2012-12-05 11:01 -0800
Message-ID: <945048d8-961e-4894-89fc-3b7fd9b7965b@googlegroups.com>
In reply to: #34304

Neil,

Further down the data, found another edge case:

"Spring ONION from QLD"

Following the spec, the whole line should be description (description starts at first word that is not all caps.) This case breaks the latest groupby.

N



#34318

From: Neil Cerutti <neilc@norwich.edu>
Date: 2012-12-05 20:13 +0000
Message-ID: <ai9o7lFe2chU1@mid.individual.net>
In reply to: #34313

On 2012-12-05, Nick Mellor <thebalancepro@gmail.com> wrote:
> Neil,
>
> Further down the data, found another edge case:
>
> "Spring ONION from QLD"
>
> Following the spec, the whole line should be description
> (description starts at first word that is not all caps.) This
> case breaks the latest groupby.

A-ha! I did check your samples for the case of an empty product
name and, not finding any, started to think it couldn't happen.

Change

   if prod or k:

to

   if desc or prod or k:

If this data file gets any weirder, let me know. ;)
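
For reference, here is how the groupby version from earlier in the thread reads with that one-line change applied. This is a sketch assembled from the two posts, with the `import itertools` line added so it runs on its own:

```python
import itertools

def prod_desc(s):
    prod = []
    desc = []
    # Key is True for any word that contains a lowercase letter.
    for k, g in itertools.groupby(s.split(),
            key=lambda w: any(c.islower() for c in w)):
        # Once the description has started, every later group belongs to it.
        if desc or prod or k:
            desc.extend(g)
        else:
            prod.extend(g)
    return [' '.join(prod), ' '.join(desc)]

print(prod_desc("Spring ONION from QLD"))
print(prod_desc("CAPSICUM RED fresh from QLD"))
```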

-- 
Neil Cerutti



#34368

From: Vlastimil Brom <vlastimil.brom@gmail.com>
Date: 2012-12-05 22:36 +0100
Message-ID: <mailman.548.1354782133.29569.python-list@python.org>
In reply to: #34313

2012/12/5 Nick Mellor <thebalancepro@gmail.com>:
> Neil,
>
> Further down the data, found another edge case:
>
> "Spring ONION from QLD"
>
> Following the spec, the whole line should be description (description starts at first word that is not all caps.) This case breaks the latest groupby.
>
> N
> --
> http://mail.python.org/mailman/listinfo/python-list

Hi,
Just for completeness: it can (likely) be done using a regex (given
the current specification), but if the data are even more complex and
varying, tools like pyparsing or dedicated parsing functions might
be more appropriate;

hth,
   vbr:


>>> import re
>>> test_product_data = """BEANS hand picked
... BEETROOT certified organic
... BOK CHOY (bunch)
... BROCCOLI Mornington Peninsula
... BRUSSEL  SPROUTS
... CABBAGE green
... CABBAGE Red
... CAPSICUM RED
... CARROTS
... CARROTS loose
... CARROTS juicing, certified organic
... CARROTS Trentham, large seconds, certified organic
... CARROTS Trentham, firsts, certified organic
... CAULIFLOWER
... CELERY Mornington Peninsula IPM grower
... CELERY Mornington Peninsula IPM grower
... CUCUMBER
... EGGPLANT
... FENNEL
... GARLIC (from Argentina)
... GINGER fresh uncured
... KALE (bunch)
... KOHL RABI certified organic
... LEEKS
...  LETTUCE iceberg
... MUSHROOM cup or flat
... MUSHROOM Swiss brown
... ONION brown
... ONION red
... ONION spring (bunch)
... PARSNIP, certified organic
... POTATOES certified organic
... POTATOES Sebago
... POTATOES Desiree
... POTATOES Bullarto chemical free
... POTATOES Dutch Cream
... POTATOES Nicola
... POTATOES Pontiac
... POTATOES Otway Red
... POTATOES teardrop
... PUMPKIN certified organic
... SCHALLOTS brown
... SNOW PEAS
... SPINACH I'll try to get certified organic (bunch)
... SWEET POTATO gold certified organic
... SWEET POTATO red small
... SWEDE certified organic
... TOMATOES  Qld
... TURMERIC fresh certified organic
... ZUCCHINI
... APPLES Harcourt  Pink Lady, Fuji, Granny Smith
... APPLES Harcourt 2 kg bags, Pink Lady or Fuji (bag)
... AVOCADOS
... AVOCADOS certified organic, seconds
... BANANAS Qld, organic
... GRAPEFRUIT
... GRAPES crimson seedless
... KIWI FRUIT Qld certified organic
... LEMONS
... LIMES
... MANDARINS
... ORANGES Navel
... PEARS Beurre Bosc Harcourt new season
... PEARS Packham, Harcourt new season
... SULTANAS 350g pre-packed bags
... EGGS Melita free range, Barker's Creek
... BASIL (bunch)
... CORIANDER (bunch)
... DILL (bunch)
... MINT (bunch)
... PARSLEY (bunch)
... Spring ONION from QLD"""
>>>
>>> len(test_product_data.splitlines())
72
>>>
>>> for prod_item in re.findall(r"(?m)(?=^.+$)^ *(?:([A-Z ]+\b(?<! )(?=[\s,]|$)))?(?: *(.*))?$", test_product_data): print(prod_item)
...
('BEANS', 'hand picked')
('BEETROOT', 'certified organic')
('BOK CHOY', '(bunch)')
('BROCCOLI', 'Mornington Peninsula')
('BRUSSEL  SPROUTS', '')
('CABBAGE', 'green')
('CABBAGE', 'Red')
('CAPSICUM RED', '')
('CARROTS', '')
('CARROTS', 'loose')
('CARROTS', 'juicing, certified organic')
('CARROTS', 'Trentham, large seconds, certified organic')
('CARROTS', 'Trentham, firsts, certified organic')
('CAULIFLOWER', '')
('CELERY', 'Mornington Peninsula IPM grower')
('CELERY', 'Mornington Peninsula IPM grower')
('CUCUMBER', '')
('EGGPLANT', '')
('FENNEL', '')
('GARLIC', '(from Argentina)')
('GINGER', 'fresh uncured')
('KALE', '(bunch)')
('KOHL RABI', 'certified organic')
('LEEKS', '')
('LETTUCE', 'iceberg')
('MUSHROOM', 'cup or flat')
('MUSHROOM', 'Swiss brown')
('ONION', 'brown')
('ONION', 'red')
('ONION', 'spring (bunch)')
('PARSNIP', ', certified organic')
('POTATOES', 'certified organic')
('POTATOES', 'Sebago')
('POTATOES', 'Desiree')
('POTATOES', 'Bullarto chemical free')
('POTATOES', 'Dutch Cream')
('POTATOES', 'Nicola')
('POTATOES', 'Pontiac')
('POTATOES', 'Otway Red')
('POTATOES', 'teardrop')
('PUMPKIN', 'certified organic')
('SCHALLOTS', 'brown')
('SNOW PEAS', '')
('SPINACH', "I'll try to get certified organic (bunch)")
('SWEET POTATO', 'gold certified organic')
('SWEET POTATO', 'red small')
('SWEDE', 'certified organic')
('TOMATOES', 'Qld')
('TURMERIC', 'fresh certified organic')
('ZUCCHINI', '')
('APPLES', 'Harcourt  Pink Lady, Fuji, Granny Smith')
('APPLES', 'Harcourt 2 kg bags, Pink Lady or Fuji (bag)')
('AVOCADOS', '')
('AVOCADOS', 'certified organic, seconds')
('BANANAS', 'Qld, organic')
('GRAPEFRUIT', '')
('GRAPES', 'crimson seedless')
('KIWI FRUIT', 'Qld certified organic')
('LEMONS', '')
('LIMES', '')
('MANDARINS', '')
('ORANGES', 'Navel')
('PEARS', 'Beurre Bosc Harcourt new season')
('PEARS', 'Packham, Harcourt new season')
('SULTANAS', '350g pre-packed bags')
('EGGS', "Melita free range, Barker's Creek")
('BASIL', '(bunch)')
('CORIANDER', '(bunch)')
('DILL', '(bunch)')
('MINT', '(bunch)')
('PARSLEY', '(bunch)')
('', 'Spring ONION from QLD')
>>> len(re.findall(r"(?m)(?=^.+$)^ *(?:([A-Z ]+\b(?<! )(?=[\s,]|$)))?(?: *(.*))?$", test_product_data))
72
>>>



#34380

From: Neil Cerutti <neilc@norwich.edu>
Date: 2012-12-06 13:06 +0000
Message-ID: <aibjjaFqt9uU2@mid.individual.net>
In reply to: #34368

On 2012-12-05, Vlastimil Brom <vlastimil.brom@gmail.com> wrote:
> ... PARSNIP, certified organic

I'm not sure on this one.

> ('PARSNIP', ', certified organic')

-- 
Neil Cerutti



#34389

From: Vlastimil Brom <vlastimil.brom@gmail.com>
Date: 2012-12-06 15:12 +0100
Message-ID: <mailman.558.1354803171.29569.python-list@python.org>
In reply to: #34380

2012/12/6 Neil Cerutti <neilc@norwich.edu>:
> On 2012-12-05, Vlastimil Brom <vlastimil.brom@gmail.com> wrote:
>> ... PARSNIP, certified organic
>
> I'm not sure on this one.
>
>> ('PARSNIP', ', certified organic')
>
> --
> Neil Cerutti
> --

Well, I wasn't either, when I noticed this item, but given the specification:
"2. Retain punctuation and parentheses"
in one of the previous OP's messages, I figured, the punctuation would
better be a part of the description rather than the name in this case.

regards,
   vbr



#34383

From: Alexander Blinne <news@blinne.net>
Date: 2012-12-06 14:40 +0100
Message-ID: <50c0a051$0$9514$9b4e6d93@newsspool1.arcor-online.net>
In reply to: #34295

Am 05.12.2012 18:04, schrieb Nick Mellor:
> Sample data

Well let's see what

def split_product(p):
    p = p.strip()
    w = p.split(" ")
    try:
        j = next(i for i,v in enumerate(w) if v.upper() != v)
    except StopIteration:
        return p, ''
    return " ".join(w[:j]), " ".join(w[j:])

(which I still find a very elegant solution) has to say about those
sample data:

>>> for line in open('test.dat', 'r'):
...     print(split_product(line))
('BEANS', 'hand picked')
('BEETROOT', 'certified organic')
('BOK CHOY', '(bunch)')
('BROCCOLI', 'Mornington Peninsula')
('BRUSSEL  SPROUTS', '')
('CABBAGE', 'green')
('CABBAGE', 'Red')
('CAPSICUM RED', '')
('CARROTS', '')
('CARROTS', 'loose')
('CARROTS', 'juicing, certified organic')
('CARROTS', 'Trentham, large seconds, certified organic')
('CARROTS', 'Trentham, firsts, certified organic')
('CAULIFLOWER', '')
('CELERY', 'Mornington Peninsula IPM grower')
('CELERY', 'Mornington Peninsula IPM grower')
('CUCUMBER', '')
('EGGPLANT', '')
('FENNEL', '')
('GARLIC', '(from Argentina)')
('GINGER', 'fresh uncured')
('KALE', '(bunch)')
('KOHL RABI', 'certified organic')
('LEEKS', '')
('LETTUCE', 'iceberg')
('MUSHROOM', 'cup or flat')
('MUSHROOM', 'Swiss brown')
('ONION', 'brown')
('ONION', 'red')
('ONION', 'spring (bunch)')
('PARSNIP,', 'certified organic')
('POTATOES', 'certified organic')
('POTATOES', 'Sebago')
('POTATOES', 'Desiree')
('POTATOES', 'Bullarto chemical free')
('POTATOES', 'Dutch Cream')
('POTATOES', 'Nicola')
('POTATOES', 'Pontiac')
('POTATOES', 'Otway Red')
('POTATOES', 'teardrop')
('PUMPKIN', 'certified organic')
('SCHALLOTS', 'brown')
('SNOW PEAS', '')
('SPINACH', "I'll try to get certified organic (bunch)")
('SWEET POTATO', 'gold certified organic')
('SWEET POTATO', 'red small')
('SWEDE', 'certified organic')
('TOMATOES ', 'Qld')
('TURMERIC', 'fresh certified organic')
('ZUCCHINI', '')
('APPLES', 'Harcourt  Pink Lady, Fuji, Granny Smith')
('APPLES', 'Harcourt 2 kg bags, Pink Lady or Fuji (bag)')
('AVOCADOS', '')
('AVOCADOS', 'certified organic, seconds')
('BANANAS', 'Qld, organic')
('GRAPEFRUIT', '')
('GRAPES', 'crimson seedless')
('KIWI FRUIT', 'Qld certified organic')
('LEMONS', '')
('LIMES', '')
('MANDARINS', '')
('ORANGES', 'Navel')
('PEARS', 'Beurre Bosc Harcourt new season')
('PEARS', 'Packham, Harcourt new season')
('SULTANAS', '350g pre-packed bags')
('EGGS', "Melita free range, Barker's Creek")
('BASIL', '(bunch)')
('CORIANDER', '(bunch)')
('DILL', '(bunch)')
('MINT', '(bunch)')
('PARSLEY', '(bunch)')
('', 'Spring ONION from QLD')
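
Two rows in the output above, ('BRUSSEL  SPROUTS', '') and ('TOMATOES ', 'Qld'), still carry the surplus whitespace from the input. That appears to come from splitting on a single space, which produces empty strings between consecutive spaces; an empty string equals its own upper(), so it counts as part of the product. A short sketch of the difference (variable names are illustrative):

```python
line = "TOMATOES  Qld"  # two spaces, as in the sample data

# Splitting on one space keeps an empty string between the two spaces.
words_single = line.split(" ")

# Splitting with no argument collapses any run of whitespace.
words_any = line.split()

# The empty string passes the all-caps test, so it is joined into the product.
product_single = " ".join(words_single[:2])

print(words_single, words_any, repr(product_single))
```

Using `p.split()` in split_product would strip that whitespace, which the sample-data notes say is allowed.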

I think the only thing one is left to think about is the
('PARSNIP,', 'certified organic')
case. What about that extra comma? Perhaps it could even be considered
an "error" in the original data? I don't see a good general way to deal
with those which does not have to handle trailing punctuation on the
product name explicitly as a special case.
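
A minimal sketch of what such an explicit special case might look like, assuming the trailing punctuation should move to the front of the description (the behaviour the regex version in this thread shows for this line). The name `split_product_punct` and the set `TRAILING` are hypothetical choices, not part of any posted solution:

```python
from itertools import takewhile

TRAILING = ",;:."  # assumed set of punctuation to peel off the product name

def split_product_punct(line):
    words = line.split()
    product_words = list(takewhile(lambda w: w == w.upper(), words))
    product = " ".join(product_words)
    description = " ".join(words[len(product_words):])

    # Special case: move punctuation at the end of the product name
    # to the start of the description.
    stripped = product.rstrip(TRAILING)
    trailing = product[len(stripped):]
    if trailing:
        description = (trailing + " " + description).strip()
    return stripped, description

print(split_product_punct("PARSNIP, certified organic"))
```

Whether the comma belongs in the description or should be dropped altogether is a decision about the data, not something the code can settle.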

Greetings



#34261

From: Terry Reedy <tjreedy@udel.edu>
Date: 2012-12-04 17:21 -0500
Message-ID: <mailman.481.1354659896.29569.python-list@python.org>
In reply to: #34226

On 12/4/2012 3:44 PM, Terry Reedy wrote:

> If the original string has no excess whitespace, description is what
> remains of s after product prefix is omitted. (Py 3 code)
>
> from itertools import takewhile
> def allcaps(word): return word == word.upper()
>
> def split_product_itertools(s):
>      product = ' '.join(takewhile(allcaps, s.split()))
>      return product, s[len(product)+1:]
>
> print(split_product_itertools("CAPSICUM RED fresh from QLD"))
>  >>>
> ('CAPSICUM RED', 'fresh from QLD')
>
> Without that assumption, the same idea applies to the split list.
>
> def split_product_itertools(s):
>      words = s.split()
>      product = list(takewhile(allcaps, words))
>      return ' '.join(product), ' '.join(words[len(product):])

Because these slice rather than index, either works trivially on an 
empty description.

print(split_product_itertools("CAPSICUM RED"))
 >>>
('CAPSICUM RED', '')
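
The slice-versus-index point can be checked in isolation. A short sketch (variable names are illustrative): slicing past the end of a list or string quietly returns an empty result, while indexing the same position raises IndexError.

```python
words = "CAPSICUM RED".split()
n_product = 2  # both words are all caps, so nothing is left over

# Slicing past the end gives an empty list, so the join yields ''.
description = " ".join(words[n_product:])

# Indexing the same position raises instead.
try:
    words[n_product]
    indexing_failed = False
except IndexError:
    indexing_failed = True

# The string-slicing variant behaves the same way.
s = "CAPSICUM RED"
tail = s[len("CAPSICUM RED") + 1:]

print(repr(description), indexing_failed, repr(tail))
```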



-- 
Terry Jan Reedy



#34433

From: Paul Rubin <no.email@nospam.invalid>
Date: 2012-12-06 13:29 -0800
Message-ID: <7xhany27kc.fsf@ruckus.brouhaha.com>
In reply to: #34226

Nick Mellor <thebalancepro@gmail.com> writes:
> I came across itertools.dropwhile only today, then shortly afterwards
> found Raymond Hettinger wondering, in 2007, whether to drop [sic]
> dropwhile and takewhile from the itertools module....
> Almost nobody else of the 18 respondents seemed to be using them.

What?  I'm amazed by that.  I didn't bother reading the old thread, but
I use those functions fairly frequently.  I just used takewhile the
other day, processing a timestamped log file where I wanted to look at
certain clusters of events.  I won't post the actual code here, but
takewhile was a handy way to pull out intervals of interest after an
event was seen.
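
The actual code was not posted, so the following is only a guess at the general shape of such a use. The log entries, the `window_after` function and its parameters are all invented for illustration: dropwhile skips ahead to the triggering event, and takewhile then collects the entries that fall within a time window after it.

```python
from itertools import dropwhile, takewhile

# Hypothetical log: (timestamp in seconds, message)
log = [
    (0, "start"), (5, "heartbeat"), (12, "ERROR disk full"),
    (13, "retry"), (15, "retry"), (40, "heartbeat"), (41, "shutdown"),
]

def window_after(events, is_trigger, seconds):
    # Skip everything before the first triggering event.
    it = dropwhile(lambda e: not is_trigger(e), events)
    first = next(it, None)
    if first is None:
        return []
    start = first[0]
    # Keep the following events while they are still inside the window.
    return [first] + list(takewhile(lambda e: e[0] - start <= seconds, it))

cluster = window_after(log, lambda e: e[1].startswith("ERROR"), 10)
print(cluster)
```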


