From: Vlastimil Brom
To: python-list@python.org
Newsgroups: comp.lang.python
Subject: Re: Good use for itertools.dropwhile and itertools.takewhile
Date: Wed, 5 Dec 2012 22:36:36 +0100

2012/12/5 Nick Mellor:
> Neil,
>
> Further down the data, found another edge case:
>
> "Spring ONION from QLD"
>
> Following the spec, the whole line should be description (description starts at first word that is not all caps.) This case breaks the latest groupby.
>
> N

Hi,
Just for completeness..., it (likely) can be done using regex (given the current specification), but if the data are even more complex and varying, tools like pyparsing or dedicated parsing functions might be more appropriate;

hth,
vbr:

>>> import re
>>> test_product_data = """BEANS hand picked
... BEETROOT certified organic
... BOK CHOY (bunch)
... BROCCOLI Mornington Peninsula
... BRUSSEL SPROUTS
... CABBAGE green
... CABBAGE Red
... CAPSICUM RED
... CARROTS
... CARROTS loose
... CARROTS juicing, certified organic
... CARROTS Trentham, large seconds, certified organic
... CARROTS Trentham, firsts, certified organic
... CAULIFLOWER
... CELERY Mornington Peninsula IPM grower
... CELERY Mornington Peninsula IPM grower
... CUCUMBER
... EGGPLANT
... FENNEL
... GARLIC (from Argentina)
... GINGER fresh uncured
... KALE (bunch)
... KOHL RABI certified organic
... LEEKS
... LETTUCE iceberg
... MUSHROOM cup or flat
... MUSHROOM Swiss brown
... ONION brown
... ONION red
... ONION spring (bunch)
... PARSNIP, certified organic
... POTATOES certified organic
... POTATOES Sebago
... POTATOES Desiree
... POTATOES Bullarto chemical free
... POTATOES Dutch Cream
... POTATOES Nicola
... POTATOES Pontiac
... POTATOES Otway Red
... POTATOES teardrop
... PUMPKIN certified organic
... SCHALLOTS brown
... SNOW PEAS
... SPINACH I'll try to get certified organic (bunch)
... SWEET POTATO gold certified organic
... SWEET POTATO red small
... SWEDE certified organic
... TOMATOES Qld
... TURMERIC fresh certified organic
... ZUCCHINI
... APPLES Harcourt Pink Lady, Fuji, Granny Smith
... APPLES Harcourt 2 kg bags, Pink Lady or Fuji (bag)
... AVOCADOS
... AVOCADOS certified organic, seconds
... BANANAS Qld, organic
... GRAPEFRUIT
... GRAPES crimson seedless
... KIWI FRUIT Qld certified organic
... LEMONS
... LIMES
... MANDARINS
... ORANGES Navel
... PEARS Beurre Bosc Harcourt new season
... PEARS Packham, Harcourt new season
... SULTANAS 350g pre-packed bags
... EGGS Melita free range, Barker's Creek
... BASIL (bunch)
... CORIANDER (bunch)
... DILL (bunch)
... MINT (bunch)
... PARSLEY (bunch)
... Spring ONION from QLD"""
>>>
>>> len(test_product_data.splitlines())
72
>>>
>>> for prod_item in re.findall(r"(?m)(?=^.+$)^ *(?:([A-Z ]+\b(?>> len(re.findall(r"(?m)(?=^.+$)^ *(?:([A-Z ]+\b(?>>
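
Since the pattern in the session above did not survive intact, here is a minimal sketch of the same idea under the spec quoted in this thread (the product name is the leading run of all-caps words; the description starts at the first word that is not all caps). The pattern PRODUCT_RE and the helper split_product are hypothetical names written for this illustration; they are not a reconstruction of the regex from the post, and they assume one product per line, with commas between name and description treated as separators.

```python
import re

# Hypothetical pattern for illustration only; NOT the truncated regex from the post.
# Group 1: leading run of ALL-CAPS words, optionally separated by spaces or commas.
#          The lookahead rejects a capital run that continues with a letter or an
#          apostrophe, so "Spring", "Qld" and "I'll" are not taken as name words.
# Group 2: everything after that run, i.e. the description.
PRODUCT_RE = re.compile(r"^\s*((?:[A-Z]+(?![A-Za-z'])[ ,]*)*)(.*)$")


def split_product(line):
    """Return (name, description) for a single line of product data."""
    match = PRODUCT_RE.match(line)  # always matches a single line (both groups may be empty)
    name, description = match.groups()
    return name.strip(" ,"), description.strip()


if __name__ == "__main__":
    samples = [
        "BEANS hand picked",
        "CAPSICUM RED",
        "PARSNIP, certified organic",
        "SPINACH I'll try to get certified organic (bunch)",
        "Spring ONION from QLD",
    ]
    for sample in samples:
        print(split_product(sample))
```

With this sketch, "Spring ONION from QLD" yields an empty name and the whole line as description, which is the behaviour the spec asks for in that edge case. If the data grow more irregular, a dedicated parsing function or pyparsing is probably easier to maintain than extending the pattern.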