Re: Whittle it on down

From Jussi Piitulainen <jussi.piitulainen@helsinki.fi>
Newsgroups comp.lang.python
Subject Re: Whittle it on down
Date 2016-05-05 20:49 +0300
Organization A noiseless patient Spider
Message-ID <lf5y47os8d5.fsf@ling.helsinki.fi> (permalink)
References <ngejmj$gc4$1@dont-email.me> <572ae25f$0$2821$c3e8da3$76491128@news.astraweb.com> <1462430766.25079.598726825.1B90C7A1@webmail.messagingengine.com> <mailman.398.1462430769.32212.python-list@python.org> <572af811$0$1608$c3e8da3$5496439d@news.astraweb.com>


Steven D'Aprano writes:

> I get something like this:
>
> r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"
>
>
> but it fails on strings like "AA   &  A &  A". What am I doing wrong?

It cannot split the string as (LETTERS & LETTERS)(LETTERS & LETTERS)
when the middle part is just one LETTER. That's something of a
misanalysis anyway. I notice that the correct pattern has already been
posted at least thrice and you have acknowledged one of them.
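
To make the failure concrete, here is a small sketch. The first pattern is the one quoted above. The second, "chained", is only my illustration of the shape LETTERS (& LETTERS)*; it is an assumption for this sketch, not a quote of any pattern posted in the thread.

```python
import re

# The pattern quoted above, unchanged.
quoted = re.compile(r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)")

# Hypothetical alternative: one run of letters, then any number of
# "& LETTERS" continuations. Used with fullmatch, so no anchors.
chained = re.compile(r"[A-Z]+( *& *[A-Z]+)*")

# The middle "A" cannot be shared by two (LETTERS & LETTERS) groups.
print(quoted.match("AA   &  A &  A"))             # None

# Here the middle "BC" is long enough to be cut in two: (A & B)(C & D).
print(bool(quoted.match("A & BC & D")))           # True

print(bool(chained.fullmatch("AA   &  A &  A")))  # True
```

The second result is why I call the grouping a misanalysis: it succeeds only by cutting a word in half.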

But I think you are also trying to do too much with a single regex. A
more promising start is to think of the whole string as "parts" joined
with "glue", then split with a glue pattern and test the parts:

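import re
# Glue: an ampersand with optional spaces around it, or a run of spaces.
glue = re.compile(" *& *| +")
keep, drop = [], []
for datum in data:  # data: the strings to be filtered
    items = glue.split(datum)
    # Keep the datum only if every part is upper case.
    if all(map(str.isupper, items)):
        keep.append(datum)
    else:
        drop.append(datum)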

That will cope with Greek, by the way.

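It's annoying that the order of the branches of the glue pattern above
matters. Alternation takes the first branch that matches at a position,
not the longest, so with " +" first the space before an ampersand would
be consumed on its own. One _does_ have problems when one uses the usual
regex engines.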
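
A quick sketch of what goes wrong when the branches are swapped. The names "good" and "bad" are mine, for illustration only.

```python
import re

good = re.compile(" *& *| +")   # ampersand branch first, as above
bad = re.compile(" +| *& *")    # the same two branches, swapped

print(good.split("AA & B"))  # ['AA', 'B']
print(bad.split("AA & B"))   # ['AA', '', 'B']
```

With the swapped order the space before the ampersand is taken as a separate piece of glue, so an empty part appears between the two matches, and the empty string is not upper case.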

Capturing groups in the glue pattern would produce glue items in the
split output. Either avoid them or deal with them: one could split with
the underspecific "([ &]+)" and then check that each glue item contains
at most one ampersand. One could also allow other punctuation, and then
check afterwards.
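
A minimal sketch of the "deal with them" option, using only the standard re module. The helper name is_wanted is made up for this example.

```python
import re

# Underspecific glue, captured so that split() returns the glue too.
loose_glue = re.compile("([ &]+)")

def is_wanted(datum):
    pieces = loose_glue.split(datum)
    # With one capturing group, parts sit at even indices
    # and glue items at odd indices.
    parts, glues = pieces[0::2], pieces[1::2]
    return (all(map(str.isupper, parts))
            and all(g.count("&") <= 1 for g in glues))

print(loose_glue.split("AA & A"))   # ['AA', ' & ', 'A']
print(is_wanted("AA   &  A &  A"))  # True
print(is_wanted("A && B"))          # False
```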

One can use _another_ regex to test individual parts. The code above
used str.isupper to test a part. The improved regex package (from PyPI,
to cope with Greek) can do the same:

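import regex  # third-party package from PyPI
part = regex.compile("[[:upper:]]+")
# This glue can match the empty string. That is needed when it is used
# with fullmatch further down, but re.split in Python 3.7+ also splits
# on empty matches, so split with a glue that consumes something.
glue = regex.compile(" *& *| *")
splitting_glue = regex.compile(" *& *| +")

keep, drop = [], []
for datum in data:
    items = splitting_glue.split(datum)
    if all(map(part.fullmatch, items)):
        keep.append(datum)
    else:
        drop.append(datum)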

Just "[A-Z]+" suffices for ASCII letters, and "[A-ZÄÖ]+" copes with most
of Finnish; the [:upper:] class is nicer and there's much more that is
nicer in the newer regex package.

The point of using a regex for this is that the part pattern can then be
generalized to allow some punctuation or digits in a part, for example.
Anything that the glue pattern doesn't consume. (Nothing wrong with
using other techniques for this, either; str.isupper worked nicely
above.)
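
As a sketch of such a generalization, with the standard re module and ASCII only: the particular part pattern below (a capital letter followed by capitals, digits, full stops, apostrophes or hyphens) is purely my assumption for illustration, as is the name is_wanted.

```python
import re

glue = re.compile(" *& *| +")

# Hypothetical relaxed part: starts with a capital letter, then allows
# capitals, digits and a little punctuation that the glue never consumes.
part = re.compile(r"[A-Z][A-Z0-9.'-]*")

def is_wanted(datum):
    # Match objects are truthy and None is falsy, so all() works directly.
    return all(map(part.fullmatch, glue.split(datum)))

print(is_wanted("ST. JOHN'S & SONS"))  # True
print(is_wanted("B2B & CO"))           # True
print(is_wanted("Foo & BAR"))          # False
```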

It's also possible to swap the roles of the patterns. Split with a part
pattern. Then check that the text between such parts is glue:

keep, drop = [], []
for datum in data:
    items = part.split(datum)
    if all(map(glue.fullmatch, items)):
        keep.append(datum)
    else:
        drop.append(datum)
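
To see what the swapped roles produce, here is a self-contained sketch with the standard re module and ASCII letters only. The names gap and is_wanted are mine, and the len(items) > 1 guard is my addition: without it a datum with no parts at all, such as the empty string, would pass.

```python
import re

part = re.compile("[A-Z]+")   # ASCII only, for the sake of the sketch
gap = re.compile(" *& *| *")  # glue, or nothing at all (string ends)

def is_wanted(datum):
    items = part.split(datum)
    # More than one item means at least one part was found;
    # every item between parts must then be glue (or empty).
    return len(items) > 1 and all(map(gap.fullmatch, items))

print(part.split("AA   &  A &  A"))  # ['', '   &  ', ' &  ', '']
print(is_wanted("AA   &  A &  A"))   # True
print(is_wanted("Aa & B"))           # False
```

The empty strings at both ends are why the glue pattern has to accept an empty match here.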

The point is to keep the patterns simple by making them more local, or
more relaxed, followed by a further test. This way they can be made to
do more, but not more than they reasonably can.

Note also the use of re.fullmatch instead of re.match (let alone
re.search) when a full match is required! This gets rid of all anchors
in the pattern, which may in turn allow fewer parentheses inside the
pattern.
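
A short sketch of the difference, using the standard re module. The pattern names are made up for the example.

```python
import re

upper = re.compile("[A-Z]+")      # no anchors at all
anchored = re.compile("[A-Z]+$")  # the anchored style, for comparison

print(bool(upper.match("ABC def")))      # True: match only anchors the start
print(bool(upper.fullmatch("ABC def")))  # False: the whole string must match

# A trap with anchors: $ also matches just before a final newline.
print(bool(anchored.match("ABC\n")))     # True
print(bool(upper.fullmatch("ABC\n")))    # False
```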

The usual regex engines are not perfect, but parts of them are
fantastic.
