Re: Whittle it on down

From Stephen Hansen <me+python@ixokai.io>
Newsgroups comp.lang.python
Subject Re: Whittle it on down
Date 2016-05-05 06:32 -0700
Message-ID <mailman.406.1462455146.32212.python-list@python.org>
References (1 earlier) <572ae25f$0$2821$c3e8da3$76491128@news.astraweb.com> <1462430766.25079.598726825.1B90C7A1@webmail.messagingengine.com> <mailman.398.1462430769.32212.python-list@python.org> <572af811$0$1608$c3e8da3$5496439d@news.astraweb.com> <1462455144.93995.599007201.4350517C@webmail.messagingengine.com>


On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote:
> Oh, a further thought...
> 
> On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
> > I don't even care about faster: It's overly complicated. Sometimes a
> > regular expression really is the clearest way to solve a problem.
> 
> Putting non-ASCII letters aside for the moment, how would you match these 
> specs as a regular expression?

I don't know, but mostly because I wouldn't even try. The requirements
are over-specified. If you look at the OP's data (and based on previous
conversation), he's doing web scraping and trying to pull out good data.
There's no absolutely perfect way to do that because the system he's
scraping isn't meant for data processing. The data isn't cleanly
articulated.

Instead, he wants a heuristic to pull out what look like section titles. 

The OP looked at the data and came up with a simple set of rules that
identify these section titles:

>> Want to keep all elements containing only upper case letters or upper
>> case letters and ampersand (where ampersand is surrounded by spaces)

This translates naturally into a simple regular expression: an uppercase
string with spaces and &'s. Now, that expression doesn't 100% encode
every detail of that rule-- it allows both Q&A and Q & A-- but from my own
look at the data, I suspect it's good enough. The titles are clearly
set apart from the other scraped data by being upper cased. We just
need to expand our allowed character range to include spaces and &'s.
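
As a minimal sketch of that heuristic (the exact pattern, the helper name
looks_like_title, and the sample strings are my own assumptions for
illustration, not the OP's actual data), something like this would do:

```python
import re

# Assumed pattern: one or more uppercase ASCII letters, spaces, or ampersands.
# It deliberately does not check that each & is surrounded by spaces.
TITLE_RE = re.compile(r"[A-Z &]+")

def looks_like_title(text):
    """Return True if text contains only uppercase letters, spaces and &'s."""
    return TITLE_RE.fullmatch(text) is not None

# Hypothetical scraped items, invented for this example.
items = [
    "ACCOUNTING & FINANCE",
    "Q&A",
    "Q & A",
    "Smith, John",
    "PLUMBING",
    "123 Main St",
]

titles = [item for item in items if looks_like_title(item)]
print(titles)
```

Both "Q&A" and "Q & A" pass, which is the looseness mentioned above. A
string made up of nothing but spaces or &'s would pass too, so this is
only suitable where the data makes that unlikely.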

Nothing in the OP's request demands the kind of rigorous matching that
your scenario does. It's a practical problem with a simple, practical
answer.

-- 
Stephen Hansen
  m e @ i x o k a i . i o
