From: Stephen Hansen
Newsgroups: comp.lang.python
Subject: Re: Whittle it on down
Date: Thu, 05 May 2016 06:32:24 -0700

On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote:
> Oh, a further thought...
>
> On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
> > I don't even care about faster: Its overly complicated. Sometimes a
> > regular expression really is the clearest way to solve a problem.
>
> Putting non-ASCII letters aside for the moment, how would you match these
> specs as a regular expression?

I don't know, but mostly because I wouldn't even try. The requirements
are over-specified.

If you look at the OP's data (and based on previous conversation), he's
doing web scraping and trying to pull out good data. There's no
absolutely perfect way to do that, because the system he's scraping
isn't meant for data processing. The data isn't cleanly articulated.

Instead, he wants a heuristic to pull out what look like section titles.

The OP looked at the data and came up with a simple set of rules that
identify these section titles:

>> Want to keep all elements containing only upper case letters or upper
>> case letters and ampersand (where ampersand is surrounded by spaces)

This translates naturally into a simple regular expression: an uppercase
string with spaces and &'s. Now, that expression doesn't 100% encode
every detail of that rule (it allows both Q&A and Q & A), but from my
own look at the data, I suspect it's good enough. The titles are clearly
set apart from the other scraped data by being uppercase. We just need
to expand our allowed character range to include spaces and &'s.

Nothing in the OP's request demands the kind of rigorous matching that
your scenario does. It's a practical problem with a simple, practical
answer.

--
Stephen Hansen
m e @ i x o k a i . i o
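
For readers following along, here is a minimal sketch of the kind of expression described above. The post does not spell out the pattern, so the character class `[A-Z &]+`, the helper name `looks_like_title`, and the sample strings are all assumptions made for illustration, not the OP's data or anyone's exact code.

```python
import re

# Assumed pattern: one or more uppercase ASCII letters, spaces, or ampersands.
# This is a heuristic, not a strict encoding of the OP's rule.
TITLE_RE = re.compile(r"[A-Z &]+")


def looks_like_title(text):
    """Return True if text consists only of A-Z, spaces, and '&'."""
    return TITLE_RE.fullmatch(text) is not None


# Hypothetical sample data, not taken from the OP's scrape.
samples = ["PLUMBING & HEATING", "Q&A", "Q & A", "Joe's Plumbing", "ROOFING"]
titles = [s for s in samples if looks_like_title(s)]
print(titles)
```

As the post says, this accepts both Q&A and Q & A. It would also accept a string made up only of spaces or ampersands, which may or may not matter for the data at hand.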