Groups > comp.lang.python > #34226 > unrolled thread
| Started by | Nick Mellor <thebalancepro@gmail.com> |
|---|---|
| First post | 2012-12-04 05:57 -0800 |
| Last post | 2012-12-06 13:29 -0800 |
| Articles | 18 on this page of 38 — 12 participants |
Good use for itertools.dropwhile and itertools.takewhile Nick Mellor <thebalancepro@gmail.com> - 2012-12-04 05:57 -0800
Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-04 14:23 +0000
Re: Good use for itertools.dropwhile and itertools.takewhile Nick Mellor <thebalancepro@gmail.com> - 2012-12-04 06:47 -0800
Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-04 15:17 +0000
Re: Good use for itertools.dropwhile and itertools.takewhile Vlastimil Brom <vlastimil.brom@gmail.com> - 2012-12-04 15:31 +0100
Re: Good use for itertools.dropwhile and itertools.takewhile Nick Mellor <thebalancepro@gmail.com> - 2012-12-04 07:24 -0800
Re: Good use for itertools.dropwhile and itertools.takewhile Vlastimil Brom <vlastimil.brom@gmail.com> - 2012-12-04 22:08 +0100
Re: Good use for itertools.dropwhile and itertools.takewhile Nick Mellor <thebalancepro@gmail.com> - 2012-12-04 07:24 -0800
Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-04 18:26 +0000
Re: Good use for itertools.dropwhile and itertools.takewhile Alexander Blinne <news@blinne.net> - 2012-12-04 18:18 +0100
Re: Good use for itertools.dropwhile and itertools.takewhile DJC <djc@news.invalid> - 2012-12-04 18:28 +0000
Re: Good use for itertools.dropwhile and itertools.takewhile Alexander Blinne <news@blinne.net> - 2012-12-04 19:48 +0100
Re: Good use for itertools.dropwhile and itertools.takewhile Ian Kelly <ian.g.kelly@gmail.com> - 2012-12-04 12:37 -0700
Re: Good use for itertools.dropwhile and itertools.takewhile Alexander Blinne <news@blinne.net> - 2012-12-04 21:33 +0100
Re: Good use for itertools.dropwhile and itertools.takewhile Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-12-04 21:13 +0000
Re: Good use for itertools.dropwhile and itertools.takewhile MRAB <python@mrabarnett.plus.com> - 2012-12-04 20:17 +0000
Re: Good use for itertools.dropwhile and itertools.takewhile Terry Reedy <tjreedy@udel.edu> - 2012-12-04 15:44 -0500
Re: Good use for itertools.dropwhile and itertools.takewhile Nick Mellor <thebalancepro@gmail.com> - 2012-12-04 17:17 -0800
Re: Good use for itertools.dropwhile and itertools.takewhile Chris Angelico <rosuav@gmail.com> - 2012-12-06 00:45 +1100
Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-05 14:34 +0000
Re: Good use for itertools.dropwhile and itertools.takewhile Ian Kelly <ian.g.kelly@gmail.com> - 2012-12-05 08:33 -0700
Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-05 16:11 +0000
Re: Good use for itertools.dropwhile and itertools.takewhile Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-12-05 15:32 +0000
Re: Good use for itertools.dropwhile and itertools.takewhile Ian Kelly <ian.g.kelly@gmail.com> - 2012-12-05 09:16 -0700
Re: Good use for itertools.dropwhile and itertools.takewhile MRAB <python@mrabarnett.plus.com> - 2012-12-05 17:57 +0000
Re: Good use for itertools.dropwhile and itertools.takewhile Nick Mellor <thebalancepro@gmail.com> - 2012-12-04 17:17 -0800
Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-05 13:29 +0000
Re: Good use for itertools.dropwhile and itertools.takewhile Nick Mellor <thebalancepro@gmail.com> - 2012-12-05 09:04 -0800
Re: Good use for itertools.dropwhile and itertools.takewhile MRAB <python@mrabarnett.plus.com> - 2012-12-05 17:57 +0000
Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-05 18:16 +0000
Re: Good use for itertools.dropwhile and itertools.takewhile Nick Mellor <thebalancepro@gmail.com> - 2012-12-05 11:01 -0800
Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-05 20:13 +0000
Re: Good use for itertools.dropwhile and itertools.takewhile Vlastimil Brom <vlastimil.brom@gmail.com> - 2012-12-05 22:36 +0100
Re: Good use for itertools.dropwhile and itertools.takewhile Neil Cerutti <neilc@norwich.edu> - 2012-12-06 13:06 +0000
Re: Good use for itertools.dropwhile and itertools.takewhile Vlastimil Brom <vlastimil.brom@gmail.com> - 2012-12-06 15:12 +0100
Re: Good use for itertools.dropwhile and itertools.takewhile Alexander Blinne <news@blinne.net> - 2012-12-06 14:40 +0100
Re: Good use for itertools.dropwhile and itertools.takewhile Terry Reedy <tjreedy@udel.edu> - 2012-12-04 17:21 -0500
Re: Good use for itertools.dropwhile and itertools.takewhile Paul Rubin <no.email@nospam.invalid> - 2012-12-06 13:29 -0800
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-12-05 08:33 -0700 |
| Message-ID | <mailman.493.1354721626.29569.python-list@python.org> |
| In reply to | #34281 |
On Wed, Dec 5, 2012 at 7:34 AM, Neil Cerutti <neilc@norwich.edu> wrote:
> Well, shoot! Then this is a job for groupby, not takewhile.
The problem with groupby is that you can't just limit it to two groups.
>>> prod_desc("CAPSICUM RED fresh from QLD")
['QLD', 'fresh from']
Once you've got a false key from the groupby, you would need to
pretend that any subsequent groups are part of the false group and
tack them on.
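To make the point above concrete, here is a minimal sketch of what groupby produces for that input. The helper name `is_all_caps` is only an illustration, mirroring the all-caps test used elsewhere in this thread; it is not taken from this post.

```python
from itertools import groupby

def is_all_caps(word):
    # Illustrative helper: same test as the thread's allcaps()
    return word == word.upper()

words = "CAPSICUM RED fresh from QLD".split()

# groupby starts a new group every time the key changes,
# so the trailing all-caps word becomes a third group.
groups = [(key, list(group)) for key, group in groupby(words, key=is_all_caps)]
print(groups)
# [(True, ['CAPSICUM', 'RED']), (False, ['fresh', 'from']), (True, ['QLD'])]
```

Anything after the first False group has to be merged back into the description by hand, which is the extra bookkeeping described above.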
| From | Neil Cerutti <neilc@norwich.edu> |
|---|---|
| Date | 2012-12-05 16:11 +0000 |
| Message-ID | <ai9a1aFaup4U1@mid.individual.net> |
| In reply to | #34285 |
On 2012-12-05, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> On Wed, Dec 5, 2012 at 7:34 AM, Neil Cerutti <neilc@norwich.edu> wrote:
>> Well, shoot! Then this is a job for groupby, not takewhile.
>
> The problem with groupby is that you can't just limit it to two groups.
>
>>>> prod_desc("CAPSICUM RED fresh from QLD")
> ['QLD', 'fresh from']
>
> Once you've got a false key from the groupby, you would need to
> pretend that any subsequent groups are part of the false group
> and tack them on.
Whoops! Yep, that was from the very beginning of the thread.
--
Neil Cerutti
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2012-12-05 15:32 +0000 |
| Message-ID | <mailman.494.1354721806.29569.python-list@python.org> |
| In reply to | #34266 |
On 05/12/2012 13:45, Chris Angelico wrote:
>
> I tested it on Python 3.2 (yeah, time I upgraded, I know).
Bad move, fancy wanting to go to the completely useless version of Python that simply can't handle unicode properly :)
--
Cheers.
Mark Lawrence.
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-12-05 09:16 -0700 |
| Message-ID | <mailman.499.1354724202.29569.python-list@python.org> |
| In reply to | #34266 |
On Wed, Dec 5, 2012 at 6:45 AM, Chris Angelico <rosuav@gmail.com> wrote:
> On Wed, Dec 5, 2012 at 12:17 PM, Nick Mellor <thebalancepro@gmail.com> wrote:
>>
>> takewhile mines for gold at the start of a sequence, dropwhile drops the dross at the start of a sequence.
>
> When you're using both over the same sequence and with the same
> condition, it seems odd that you need to iterate over it twice.
> Perhaps a partitioning iterator would be cleaner - something like
> this:
>
> def partitionwhile(predicate, iterable):
>     iterable = iter(iterable)
>     while True:
>         val = next(iterable)
>         if not predicate(val): break
>         yield val
>     raise StopIteration # Signal the end of Phase 1
>     for val in iterable: yield val # or just "yield from iterable", I think
>
> Only the cold hard boot of reality just stomped out the spark of an
> idea. Once StopIteration has been raised, that's it, there's no
> "resuming" the iterator. Is there a way around that? Is there a clean
> way to say "Done for now, but next time you ask, there'll be more"?
Return two separate iterators, with the contract that the second
iterator can't be used until the first has completed. Combined with
Neil's groupby suggestion, we end up with something like this:
import itertools
def partitionwhile(predicate, iterable):
    it = itertools.groupby(iterable, lambda x: bool(predicate(x)))
    # pushback holds a leading non-matching group; "missing" means first() has not finished
    pushback = missing = object()
    def first():
        nonlocal pushback
        # The default keeps StopIteration from escaping the generator on empty input
        pred, subit = next(it, (True, ()))
        if pred:
            yield from subit
            pushback = None
        else:
            pushback = subit
    def second():
        if pushback is missing:
            raise TypeError("can't yield from second iterator before "
                            "first iterator completes")
        elif pushback is not None:
            yield from pushback
        yield from itertools.chain.from_iterable(subit for key, subit in it)
    return first(), second()
>>> list(map(' '.join, partitionwhile(lambda x: x.upper() == x, "CAPSICUM RED fresh from QLD".split())))
['CAPSICUM RED', 'fresh from QLD']
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2012-12-05 17:57 +0000 |
| Message-ID | <mailman.508.1354730246.29569.python-list@python.org> |
| In reply to | #34266 |
On 2012-12-05 13:45, Chris Angelico wrote:
> On Wed, Dec 5, 2012 at 12:17 PM, Nick Mellor <thebalancepro@gmail.com> wrote:
>>
>> takewhile mines for gold at the start of a sequence, dropwhile drops the dross at the start of a sequence.
>
> When you're using both over the same sequence and with the same
> condition, it seems odd that you need to iterate over it twice.
> Perhaps a partitioning iterator would be cleaner - something like
> this:
>
> def partitionwhile(predicate, iterable):
>     iterable = iter(iterable)
>     while True:
>         val = next(iterable)
>         if not predicate(val): break
>         yield val
>     raise StopIteration # Signal the end of Phase 1
>     for val in iterable: yield val # or just "yield from iterable", I think
>
> Only the cold hard boot of reality just stomped out the spark of an
> idea. Once StopIteration has been raised, that's it, there's no
> "resuming" the iterator. Is there a way around that? Is there a clean
> way to say "Done for now, but next time you ask, there'll be more"?
>
Perhaps you could have some kind of partitioner object:
class Partitioner:
    _SENTINEL = object()
    def __init__(self, iterable):
        self._iterable = iter(iterable)
        # Holds the one item that ended a takewhile() run, if any
        self._unused_item = self._SENTINEL
    def takewhile(self, condition):
        if self._unused_item is not self._SENTINEL:
            if not condition(self._unused_item):
                # "return" rather than "raise StopIteration": raising it inside
                # a generator is a RuntimeError since Python 3.7 (PEP 479)
                return
            yield self._unused_item
            self._unused_item = self._SENTINEL
        for item in self._iterable:
            if not condition(item):
                self._unused_item = item
                break
            yield item
    def remainder(self):
        if self._unused_item is not self._SENTINEL:
            yield self._unused_item
            self._unused_item = self._SENTINEL
        for item in self._iterable:
            yield item
def is_all_caps(word):
    return word == word.upper()
part = Partitioner("CAPSICUM RED fresh from QLD".split())
product = " ".join(part.takewhile(is_all_caps))
description = " ".join(part.remainder())
print([product, description])
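As a rough illustration of why a partitioner needs to remember the "unused item": itertools.takewhile has to read the first failing element in order to stop, and on a shared iterator that element is then gone. The sketch below only demonstrates that behaviour; the variable names are illustrative and not from the post above.

```python
from itertools import takewhile

def is_all_caps(word):
    return word == word.upper()

# One shared iterator, consumed in two phases.
words = iter("CAPSICUM RED fresh from QLD".split())

product = list(takewhile(is_all_caps, words))
# takewhile already pulled 'fresh' off the iterator to test it,
# so it is missing from whatever is read afterwards.
rest = list(words)

print(product)  # ['CAPSICUM', 'RED']
print(rest)     # ['from', 'QLD']
```

Stashing the rejected item, as the Partitioner class does, is one way to avoid losing it.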
| From | Nick Mellor <thebalancepro@gmail.com> |
|---|---|
| Date | 2012-12-04 17:17 -0800 |
| Message-ID | <mailman.484.1354670286.29569.python-list@python.org> |
| In reply to | #34252 |
Hi Terry,
For my money, and especially in your versions, despite several expert solutions using other features, itertools has it. It seems to me to need less nutting out than the other approaches. It's short, robust, has a minimum of symbols, uses simple expressions and is not overly clever. If we could just get used to using takewhile.
takewhile mines for gold at the start of a sequence, dropwhile drops the dross at the start of a sequence.
Thanks all for your interest and your help,
Best,
Nick
Terry's implementations:
> from itertools import takewhile
>
> def allcaps(word): return word == word.upper()
>
> def split_product_itertools(s):
>     product = ' '.join(takewhile(allcaps, s.split()))
>     return product, s[len(product)+1:]
>
> print(split_product_itertools("CAPSICUM RED fresh from QLD"))
> >>>
> ('CAPSICUM RED', 'fresh from QLD')
>
> [if there could be surplus whitespace], the same idea applies to the split list.
>
> def split_product_itertools(s):
>     words = s.split()
>     product = list(takewhile(allcaps, words))
>     return ' '.join(product), ' '.join(words[len(product):])
| From | Neil Cerutti <neilc@norwich.edu> |
|---|---|
| Date | 2012-12-05 13:29 +0000 |
| Message-ID | <ai90h9F8mm8U4@mid.individual.net> |
| In reply to | #34267 |
On 2012-12-05, Nick Mellor <thebalancepro@gmail.com> wrote:
> Hi Terry,
>
> For my money, and especially in your versions, despite several
> expert solutions using other features, itertools has it. It
> seems to me to need less nutting out than the other approaches.
> It's short, robust, has a minimum of symbols, uses simple
> expressions and is not overly clever. If we could just get used
> to using takewhile.
The main reason most of the solutions posted failed is the lack of a complete specification to work with, while simultaneously trying to make as tiny and simplistic a solution as possible.
I'm struggling with the empty description bug right now. ;)
--
Neil Cerutti
| From | Nick Mellor <thebalancepro@gmail.com> |
|---|---|
| Date | 2012-12-05 09:04 -0800 |
| Message-ID | <26781aa9-b4a2-4308-8db2-5a150da2128f@googlegroups.com> |
| In reply to | #34279 |
Hi Neil,
Here's some sample data. The live data is about 300 minor variations on the sample data, about 20,000 lines.
Nick
Notes:
1. Whitespace is only used for word boundaries. Surplus whitespace is not significant and can be stripped
2. Retain punctuation and parentheses
3. Product is zero or more words in all caps at start of line
4. Description is zero or more words beginning with first word that is not all caps. Description continues to the end of the line
5. Return tuple of strings (product, description)
Sample data
---
BEANS hand picked
BEETROOT certified organic
BOK CHOY (bunch)
BROCCOLI Mornington Peninsula
BRUSSEL SPROUTS
CABBAGE green
CABBAGE Red
CAPSICUM RED
CARROTS
CARROTS loose
CARROTS juicing, certified organic
CARROTS Trentham, large seconds, certified organic
CARROTS Trentham, firsts, certified organic
CAULIFLOWER
CELERY Mornington Peninsula IPM grower
CELERY Mornington Peninsula IPM grower
CUCUMBER
EGGPLANT
FENNEL
GARLIC (from Argentina)
GINGER fresh uncured
KALE (bunch)
KOHL RABI certified organic
LEEKS
LETTUCE iceberg
MUSHROOM cup or flat
MUSHROOM Swiss brown
ONION brown
ONION red
ONION spring (bunch)
PARSNIP, certified organic
POTATOES certified organic
POTATOES Sebago
POTATOES Desiree
POTATOES Bullarto chemical free
POTATOES Dutch Cream
POTATOES Nicola
POTATOES Pontiac
POTATOES Otway Red
POTATOES teardrop
PUMPKIN certified organic
SCHALLOTS brown
SNOW PEAS
SPINACH I'll try to get certified organic (bunch)
SWEET POTATO gold certified organic
SWEET POTATO red small
SWEDE certified organic
TOMATOES Qld
TURMERIC fresh certified organic
ZUCCHINI
APPLES Harcourt Pink Lady, Fuji, Granny Smith
APPLES Harcourt 2 kg bags, Pink Lady or Fuji (bag)
AVOCADOS
AVOCADOS certified organic, seconds
BANANAS Qld, organic
GRAPEFRUIT
GRAPES crimson seedless
KIWI FRUIT Qld certified organic
LEMONS
LIMES
MANDARINS
ORANGES Navel
PEARS Beurre Bosc Harcourt new season
PEARS Packham, Harcourt new season
SULTANAS 350g pre-packed bags
EGGS Melita free range, Barker's Creek
BASIL (bunch)
CORIANDER (bunch)
DILL (bunch)
MINT (bunch)
PARSLEY (bunch)
On Thursday, 6 December 2012 00:29:13 UTC+11, Neil Cerutti wrote:
> On 2012-12-05, Nick Mellor <thebalancepro@gmail.com> wrote:
> > Hi Terry,
> >
> > For my money, and especially in your versions, despite several
> > expert solutions using other features, itertools has it. It
> > seems to me to need less nutting out than the other approaches.
> > It's short, robust, has a minimum of symbols, uses simple
> > expressions and is not overly clever. If we could just get used
> > to using takewhile.
>
> The main reason most of the solutions posted failed is lack of
> complete specification to work with while simultaneously trying
> to make as tiny and simplistic a solution as possible.
>
> I'm struggling with the empty description bug right now. ;)
>
> --
> Neil Cerutti
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2012-12-05 17:57 +0000 |
| Message-ID | <mailman.509.1354730246.29569.python-list@python.org> |
| In reply to | #34295 |
On 2012-12-05 17:04, Nick Mellor wrote:
> Hi Neil,
>
> Here's some sample data. The live data is about 300 minor variations on the sample data, about 20,000 lines.
>
[snip]
You have a duplicate:
> CELERY Mornington Peninsula IPM grower
> CELERY Mornington Peninsula IPM grower
| From | Neil Cerutti <neilc@norwich.edu> |
|---|---|
| Date | 2012-12-05 18:16 +0000 |
| Message-ID | <ai9hb4FclvgU1@mid.individual.net> |
| In reply to | #34295 |
On 2012-12-05, Nick Mellor <thebalancepro@gmail.com> wrote:
> Hi Neil,
>
> Here's some sample data. The live data is about 300 minor
> variations on the sample data, about 20,000 lines.
Thanks, Nick.
This slight variation on my first groupby try seems to work for
the test data.
import itertools
def prod_desc(s):
    prod = []
    desc = []
    # Key is True for any word containing a lowercase letter
    for k, g in itertools.groupby(s.split(),
                                  key=lambda w: any(c.islower() for c in w)):
        if prod or k:
            desc.extend(g)
        else:
            prod.extend(g)
    return [' '.join(prod), ' '.join(desc)]
--
Neil Cerutti
| From | Nick Mellor <thebalancepro@gmail.com> |
|---|---|
| Date | 2012-12-05 11:01 -0800 |
| Message-ID | <945048d8-961e-4894-89fc-3b7fd9b7965b@googlegroups.com> |
| In reply to | #34304 |
Neil,
Further down the data, found another edge case:
"Spring ONION from QLD"
Following the spec, the whole line should be description (description starts at first word that is not all caps.) This case breaks the latest groupby.
N
| From | Neil Cerutti <neilc@norwich.edu> |
|---|---|
| Date | 2012-12-05 20:13 +0000 |
| Message-ID | <ai9o7lFe2chU1@mid.individual.net> |
| In reply to | #34313 |
On 2012-12-05, Nick Mellor <thebalancepro@gmail.com> wrote:
> Neil,
>
> Further down the data, found another edge case:
>
> "Spring ONION from QLD"
>
> Following the spec, the whole line should be description
> (description starts at first word that is not all caps.) This
> case breaks the latest groupby.
A-ha! I did check your samples for the case of an empty product name and, not finding any, started to think it couldn't happen.
Change
    if prod or k:
to
    if desc or prod or k:
If this data file gets any weirder, let me know. ;)
--
Neil Cerutti
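For reference, this is roughly what the groupby version looks like with that one-line change applied. It is a sketch assembled from the earlier prod_desc post plus the change described here, with only the import added.

```python
import itertools

def prod_desc(s):
    prod = []
    desc = []
    # Key is True for any word containing a lowercase letter.
    for k, g in itertools.groupby(s.split(),
                                  key=lambda w: any(c.islower() for c in w)):
        # Once either list has content, every later group is description;
        # only a leading all-caps group can become the product.
        if desc or prod or k:
            desc.extend(g)
        else:
            prod.extend(g)
    return [' '.join(prod), ' '.join(desc)]

print(prod_desc("Spring ONION from QLD"))
print(prod_desc("CAPSICUM RED fresh from QLD"))
```

With the extra `desc` check, an all-caps word that follows a lowercase first word can no longer be mistaken for the product.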
| From | Vlastimil Brom <vlastimil.brom@gmail.com> |
|---|---|
| Date | 2012-12-05 22:36 +0100 |
| Message-ID | <mailman.548.1354782133.29569.python-list@python.org> |
| In reply to | #34313 |
2012/12/5 Nick Mellor <thebalancepro@gmail.com>:
> Neil,
>
> Further down the data, found another edge case:
>
> "Spring ONION from QLD"
>
> Following the spec, the whole line should be description (description starts at first word that is not all caps.) This case breaks the latest groupby.
>
> N
> --
> http://mail.python.org/mailman/listinfo/python-list
Hi,
Just for completeness..., it (likely) can be done using regex (given
the current specification), but if the data are even more complex and
varying, tools like pyparsing or dedicated parsing functions might
be more appropriate;
hth,
vbr:
>>> import re
>>> test_product_data = """BEANS hand picked
... BEETROOT certified organic
... BOK CHOY (bunch)
... BROCCOLI Mornington Peninsula
... BRUSSEL SPROUTS
... CABBAGE green
... CABBAGE Red
... CAPSICUM RED
... CARROTS
... CARROTS loose
... CARROTS juicing, certified organic
... CARROTS Trentham, large seconds, certified organic
... CARROTS Trentham, firsts, certified organic
... CAULIFLOWER
... CELERY Mornington Peninsula IPM grower
... CELERY Mornington Peninsula IPM grower
... CUCUMBER
... EGGPLANT
... FENNEL
... GARLIC (from Argentina)
... GINGER fresh uncured
... KALE (bunch)
... KOHL RABI certified organic
... LEEKS
... LETTUCE iceberg
... MUSHROOM cup or flat
... MUSHROOM Swiss brown
... ONION brown
... ONION red
... ONION spring (bunch)
... PARSNIP, certified organic
... POTATOES certified organic
... POTATOES Sebago
... POTATOES Desiree
... POTATOES Bullarto chemical free
... POTATOES Dutch Cream
... POTATOES Nicola
... POTATOES Pontiac
... POTATOES Otway Red
... POTATOES teardrop
... PUMPKIN certified organic
... SCHALLOTS brown
... SNOW PEAS
... SPINACH I'll try to get certified organic (bunch)
... SWEET POTATO gold certified organic
... SWEET POTATO red small
... SWEDE certified organic
... TOMATOES Qld
... TURMERIC fresh certified organic
... ZUCCHINI
... APPLES Harcourt Pink Lady, Fuji, Granny Smith
... APPLES Harcourt 2 kg bags, Pink Lady or Fuji (bag)
... AVOCADOS
... AVOCADOS certified organic, seconds
... BANANAS Qld, organic
... GRAPEFRUIT
... GRAPES crimson seedless
... KIWI FRUIT Qld certified organic
... LEMONS
... LIMES
... MANDARINS
... ORANGES Navel
... PEARS Beurre Bosc Harcourt new season
... PEARS Packham, Harcourt new season
... SULTANAS 350g pre-packed bags
... EGGS Melita free range, Barker's Creek
... BASIL (bunch)
... CORIANDER (bunch)
... DILL (bunch)
... MINT (bunch)
... PARSLEY (bunch)
... Spring ONION from QLD"""
>>>
>>> len(test_product_data.splitlines())
72
>>>
>>> for prod_item in re.findall(r"(?m)(?=^.+$)^ *(?:([A-Z ]+\b(?<! )(?=[\s,]|$)))?(?: *(.*))?$", test_product_data): print(prod_item)
...
('BEANS', 'hand picked')
('BEETROOT', 'certified organic')
('BOK CHOY', '(bunch)')
('BROCCOLI', 'Mornington Peninsula')
('BRUSSEL SPROUTS', '')
('CABBAGE', 'green')
('CABBAGE', 'Red')
('CAPSICUM RED', '')
('CARROTS', '')
('CARROTS', 'loose')
('CARROTS', 'juicing, certified organic')
('CARROTS', 'Trentham, large seconds, certified organic')
('CARROTS', 'Trentham, firsts, certified organic')
('CAULIFLOWER', '')
('CELERY', 'Mornington Peninsula IPM grower')
('CELERY', 'Mornington Peninsula IPM grower')
('CUCUMBER', '')
('EGGPLANT', '')
('FENNEL', '')
('GARLIC', '(from Argentina)')
('GINGER', 'fresh uncured')
('KALE', '(bunch)')
('KOHL RABI', 'certified organic')
('LEEKS', '')
('LETTUCE', 'iceberg')
('MUSHROOM', 'cup or flat')
('MUSHROOM', 'Swiss brown')
('ONION', 'brown')
('ONION', 'red')
('ONION', 'spring (bunch)')
('PARSNIP', ', certified organic')
('POTATOES', 'certified organic')
('POTATOES', 'Sebago')
('POTATOES', 'Desiree')
('POTATOES', 'Bullarto chemical free')
('POTATOES', 'Dutch Cream')
('POTATOES', 'Nicola')
('POTATOES', 'Pontiac')
('POTATOES', 'Otway Red')
('POTATOES', 'teardrop')
('PUMPKIN', 'certified organic')
('SCHALLOTS', 'brown')
('SNOW PEAS', '')
('SPINACH', "I'll try to get certified organic (bunch)")
('SWEET POTATO', 'gold certified organic')
('SWEET POTATO', 'red small')
('SWEDE', 'certified organic')
('TOMATOES', 'Qld')
('TURMERIC', 'fresh certified organic')
('ZUCCHINI', '')
('APPLES', 'Harcourt Pink Lady, Fuji, Granny Smith')
('APPLES', 'Harcourt 2 kg bags, Pink Lady or Fuji (bag)')
('AVOCADOS', '')
('AVOCADOS', 'certified organic, seconds')
('BANANAS', 'Qld, organic')
('GRAPEFRUIT', '')
('GRAPES', 'crimson seedless')
('KIWI FRUIT', 'Qld certified organic')
('LEMONS', '')
('LIMES', '')
('MANDARINS', '')
('ORANGES', 'Navel')
('PEARS', 'Beurre Bosc Harcourt new season')
('PEARS', 'Packham, Harcourt new season')
('SULTANAS', '350g pre-packed bags')
('EGGS', "Melita free range, Barker's Creek")
('BASIL', '(bunch)')
('CORIANDER', '(bunch)')
('DILL', '(bunch)')
('MINT', '(bunch)')
('PARSLEY', '(bunch)')
('', 'Spring ONION from QLD')
>>> len(re.findall(r"(?m)(?=^.+$)^ *(?:([A-Z ]+\b(?<! )(?=[\s,]|$)))?(?: *(.*))?$", test_product_data))
72
>>>
| From | Neil Cerutti <neilc@norwich.edu> |
|---|---|
| Date | 2012-12-06 13:06 +0000 |
| Message-ID | <aibjjaFqt9uU2@mid.individual.net> |
| In reply to | #34368 |
On 2012-12-05, Vlastimil Brom <vlastimil.brom@gmail.com> wrote:
> ... PARSNIP, certified organic
I'm not sure on this one.
> ('PARSNIP', ', certified organic')
--
Neil Cerutti
| From | Vlastimil Brom <vlastimil.brom@gmail.com> |
|---|---|
| Date | 2012-12-06 15:12 +0100 |
| Message-ID | <mailman.558.1354803171.29569.python-list@python.org> |
| In reply to | #34380 |
2012/12/6 Neil Cerutti <neilc@norwich.edu>:
> On 2012-12-05, Vlastimil Brom <vlastimil.brom@gmail.com> wrote:
>> ... PARSNIP, certified organic
>
> I'm not sure on this one.
>
>> ('PARSNIP', ', certified organic')
>
> --
> Neil Cerutti
> --
Well, I wasn't either, when I noticed this item, but given the specification:
"2. Retain punctuation and parentheses"
in one of the previous OP's messages, I figured, the punctuation would
better be a part of the description rather than the name in this case.
regards,
vbr
| From | Alexander Blinne <news@blinne.net> |
|---|---|
| Date | 2012-12-06 14:40 +0100 |
| Message-ID | <50c0a051$0$9514$9b4e6d93@newsspool1.arcor-online.net> |
| In reply to | #34295 |
On 05.12.2012 18:04, Nick Mellor wrote:
> Sample data
Well let's see what
def split_product(p):
    p = p.strip()
    w = p.split(" ")
    try:
        j = next(i for i,v in enumerate(w) if v.upper() != v)
    except StopIteration:
        return p, ''
    return " ".join(w[:j]), " ".join(w[j:])
(which I still find a very elegant solution) has to say about those
sample data:
>>> for line in open('test.dat', 'r'):
...     print(split_product(line))
('BEANS', 'hand picked')
('BEETROOT', 'certified organic')
('BOK CHOY', '(bunch)')
('BROCCOLI', 'Mornington Peninsula')
('BRUSSEL SPROUTS', '')
('CABBAGE', 'green')
('CABBAGE', 'Red')
('CAPSICUM RED', '')
('CARROTS', '')
('CARROTS', 'loose')
('CARROTS', 'juicing, certified organic')
('CARROTS', 'Trentham, large seconds, certified organic')
('CARROTS', 'Trentham, firsts, certified organic')
('CAULIFLOWER', '')
('CELERY', 'Mornington Peninsula IPM grower')
('CELERY', 'Mornington Peninsula IPM grower')
('CUCUMBER', '')
('EGGPLANT', '')
('FENNEL', '')
('GARLIC', '(from Argentina)')
('GINGER', 'fresh uncured')
('KALE', '(bunch)')
('KOHL RABI', 'certified organic')
('LEEKS', '')
('LETTUCE', 'iceberg')
('MUSHROOM', 'cup or flat')
('MUSHROOM', 'Swiss brown')
('ONION', 'brown')
('ONION', 'red')
('ONION', 'spring (bunch)')
('PARSNIP,', 'certified organic')
('POTATOES', 'certified organic')
('POTATOES', 'Sebago')
('POTATOES', 'Desiree')
('POTATOES', 'Bullarto chemical free')
('POTATOES', 'Dutch Cream')
('POTATOES', 'Nicola')
('POTATOES', 'Pontiac')
('POTATOES', 'Otway Red')
('POTATOES', 'teardrop')
('PUMPKIN', 'certified organic')
('SCHALLOTS', 'brown')
('SNOW PEAS', '')
('SPINACH', "I'll try to get certified organic (bunch)")
('SWEET POTATO', 'gold certified organic')
('SWEET POTATO', 'red small')
('SWEDE', 'certified organic')
('TOMATOES ', 'Qld')
('TURMERIC', 'fresh certified organic')
('ZUCCHINI', '')
('APPLES', 'Harcourt Pink Lady, Fuji, Granny Smith')
('APPLES', 'Harcourt 2 kg bags, Pink Lady or Fuji (bag)')
('AVOCADOS', '')
('AVOCADOS', 'certified organic, seconds')
('BANANAS', 'Qld, organic')
('GRAPEFRUIT', '')
('GRAPES', 'crimson seedless')
('KIWI FRUIT', 'Qld certified organic')
('LEMONS', '')
('LIMES', '')
('MANDARINS', '')
('ORANGES', 'Navel')
('PEARS', 'Beurre Bosc Harcourt new season')
('PEARS', 'Packham, Harcourt new season')
('SULTANAS', '350g pre-packed bags')
('EGGS', "Melita free range, Barker's Creek")
('BASIL', '(bunch)')
('CORIANDER', '(bunch)')
('DILL', '(bunch)')
('MINT', '(bunch)')
('PARSLEY', '(bunch)')
('', 'Spring ONION from QLD')
I think the only thing one is left to think about is the
('PARSNIP,', 'certified organic')
case. What about that extra comma? Perhaps it could even be considered
an "error" in the original data? I don't see a good general way to deal
with those which does not have to handle trailing punctuation on the
product name explicitly as a special case.
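One possible shape for that explicit special case is sketched below. It is not from the post above: the function name `split_product_punct`, the `TRAILING` character set, and the choice to move the punctuation into the description (as the regex version earlier in the thread does) are all assumptions made for illustration.

```python
# Assumed set of punctuation to peel off the end of a product name.
TRAILING = ",;:."

def split_product_punct(line):
    words = line.split()
    # Index of the first word that is not all caps (or len(words) if none).
    j = next((i for i, w in enumerate(words) if w.upper() != w), len(words))
    product = " ".join(words[:j])
    description = " ".join(words[j:])
    # Special case: trailing punctuation on the product moves to the description.
    stripped = product.rstrip(TRAILING)
    trailing = product[len(stripped):]
    if trailing:
        product = stripped
        description = (trailing + " " + description).strip()
    return product, description

print(split_product_punct("PARSNIP, certified organic"))
print(split_product_punct("BOK CHOY (bunch)"))
```

Whether the comma belongs in the description, or should simply be dropped as a data error, is a decision the specification in this thread leaves open.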
Greetings
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2012-12-04 17:21 -0500 |
| Message-ID | <mailman.481.1354659896.29569.python-list@python.org> |
| In reply to | #34226 |
On 12/4/2012 3:44 PM, Terry Reedy wrote:
> If the original string has no excess whitespace, description is what
> remains of s after product prefix is omitted. (Py 3 code)
>
> from itertools import takewhile
> def allcaps(word): return word == word.upper()
>
> def split_product_itertools(s):
>     product = ' '.join(takewhile(allcaps, s.split()))
>     return product, s[len(product)+1:]
>
> print(split_product_itertools("CAPSICUM RED fresh from QLD"))
> >>>
> ('CAPSICUM RED', 'fresh from QLD')
>
> Without that assumption, the same idea applies to the split list.
>
> def split_product_itertools(s):
>     words = s.split()
>     product = list(takewhile(allcaps, words))
>     return ' '.join(product), ' '.join(words[len(product):])
Because these slice rather than index, either works trivially on an
empty description.
print(split_product_itertools("CAPSICUM RED"))
>>>
('CAPSICUM RED', '')
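The slice-versus-index point can be shown in a few lines. This is just a small illustration of standard Python behaviour; the variable names are made up for the example.

```python
words = "CAPSICUM RED".split()
product = words[:2]

# Slicing past the end of a list quietly yields an empty list...
rest = words[len(product):]
description = " ".join(rest)

# ...and the same holds for slicing past the end of a string.
s = "CAPSICUM RED"
tail = s[len(s) + 1:]

# Indexing the same position raises instead.
try:
    first_desc_word = words[len(product)]
except IndexError:
    first_desc_word = None

print(repr(description), repr(tail), first_desc_word)
```

That is why neither version needs a special branch for lines that consist of a product name only.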
--
Terry Jan Reedy
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-12-06 13:29 -0800 |
| Message-ID | <7xhany27kc.fsf@ruckus.brouhaha.com> |
| In reply to | #34226 |
Nick Mellor <thebalancepro@gmail.com> writes:
> I came across itertools.dropwhile only today, then shortly afterwards
> found Raymond Hettinger wondering, in 2007, whether to drop [sic]
> dropwhile and takewhile from the itertools module....
> Almost nobody else of the 18 respondents seemed to be using them.
What? I'm amazed by that. I didn't bother reading the old thread, but I use those functions fairly frequently. I just used takewhile the other day, processing a timestamped log file where I wanted to look at certain clusters of events. I won't post the actual code here, but takewhile was a handy way to pull out intervals of interest after an event was seen.
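The actual code was not posted, so the following is purely a hypothetical sketch of the kind of use described: skip ahead to an event with dropwhile, then collect the records that follow it within a time window with takewhile. The record layout, the `LOG` data, and the `events_after` function are all invented for illustration.

```python
from itertools import dropwhile, takewhile

# Hypothetical (seconds, message) records, assumed sorted by time.
LOG = [
    (0, "boot"),
    (5, "heartbeat"),
    (12, "ERROR disk"),
    (13, "retry"),
    (15, "retry"),
    (40, "heartbeat"),
]

def events_after(records, is_trigger, window):
    """Return the trigger record plus the records within `window` seconds of it."""
    # Skip everything before the first trigger event.
    it = dropwhile(lambda rec: not is_trigger(rec), records)
    first = next(it, None)
    if first is None:
        return []
    start = first[0]
    # Keep records while they stay inside the time window.
    return [first] + list(takewhile(lambda rec: rec[0] - start <= window, it))

cluster = events_after(LOG, lambda rec: rec[1].startswith("ERROR"), window=10)
print(cluster)
```

Both functions work lazily, so the same pattern would apply to a file object read line by line, with no need to load the whole log first.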