

Groups > comp.lang.python > #108158 > unrolled thread

Whittle it on down

Started by: DFS <nospam@dfs.com>
First post: 2016-05-05 00:58 -0400
Last post: 2016-05-05 17:45 -0400
Articles 20 on this page of 41 — 8 participants



Contents

  Whittle it on down DFS <nospam@dfs.com> - 2016-05-05 00:58 -0400
    Re: Whittle it on down Stephen Hansen <me+python@ixokai.io> - 2016-05-04 22:39 -0700
      Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-05 08:44 -0400
      Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-05 19:31 -0400
        Re: Whittle it on down Peter Otten <__peter__@web.de> - 2016-05-06 09:45 +0200
          Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-06 09:58 -0400
            Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-06 10:41 -0400
              Re: Whittle it on down Peter Otten <__peter__@web.de> - 2016-05-06 17:44 +0200
                Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-06 18:43 -0400
        Re: Whittle it on down alister <alister.ware@ntlworld.com> - 2016-05-06 10:01 +0000
    Re: Whittle it on down Jussi Piitulainen <jussi.piitulainen@helsinki.fi> - 2016-05-05 08:53 +0300
      Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-05 08:57 -0400
    Re: Whittle it on down Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-05-05 16:04 +1000
      Re: Whittle it on down Stephen Hansen <me+python@ixokai.io> - 2016-05-04 23:46 -0700
        Re: Whittle it on down Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-05-05 17:04 +1000
          Re: Whittle it on down Stephen Hansen <me+python@ixokai.io> - 2016-05-05 00:34 -0700
            Re: Whittle it on down Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-05-05 18:41 +1000
              Re: Whittle it on down Random832 <random832@fastmail.com> - 2016-05-05 09:13 -0400
                Re: Whittle it on down Steven D'Aprano <steve@pearwood.info> - 2016-05-06 03:13 +1000
        Re: Whittle it on down Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-05-05 17:36 +1000
          Re: Whittle it on down Peter Otten <__peter__@web.de> - 2016-05-05 10:17 +0200
            Re: Whittle it on down Steven D'Aprano <steve@pearwood.info> - 2016-05-06 01:39 +1000
          Re: Whittle it on down Random832 <random832@fastmail.com> - 2016-05-05 09:21 -0400
            Re: Whittle it on down Steven D'Aprano <steve@pearwood.info> - 2016-05-06 04:03 +1000
              Re: Whittle it on down Random832 <random832@fastmail.com> - 2016-05-05 14:52 -0400
              Re: Whittle it on down Stephen Hansen <me+python@ixokai.io> - 2016-05-05 12:09 -0700
          Re: Whittle it on down Stephen Hansen <me+python@ixokai.io> - 2016-05-05 06:32 -0700
            Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-05 10:36 -0400
            Re: Whittle it on down Steven D'Aprano <steve@pearwood.info> - 2016-05-06 03:43 +1000
              Re: Whittle it on down Stephen Hansen <me+python@ixokai.io> - 2016-05-05 11:55 -0700
          Re: Whittle it on down Jussi Piitulainen <jussi.piitulainen@helsinki.fi> - 2016-05-05 20:49 +0300
            Re: Whittle it on down Steven D'Aprano <steve@pearwood.info> - 2016-05-06 04:14 +1000
              Re: Whittle it on down Jussi Piitulainen <jussi.piitulainen@helsinki.fi> - 2016-05-05 21:27 +0300
                Re: Whittle it on down Random832 <random832@fastmail.com> - 2016-05-05 14:54 -0400
                Re: Whittle it on down Steven D'Aprano <steve@pearwood.info> - 2016-05-06 10:57 +1000
                  Re: Whittle it on down Jussi Piitulainen <jussi.piitulainen@helsinki.fi> - 2016-05-06 07:19 +0300
      Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-05 08:31 -0400
        Re: Whittle it on down Steven D'Aprano <steve@pearwood.info> - 2016-05-06 03:54 +1000
          Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-05 17:36 -0400
        Re: Whittle it on down Stephen Hansen <me+python@ixokai.io> - 2016-05-05 11:56 -0700
          Re: Whittle it on down DFS <nospam@dfs.com> - 2016-05-05 17:45 -0400



#108167

From: Peter Otten <__peter__@web.de>
Date: 2016-05-05 10:17 +0200
Message-ID: <mailman.402.1462436282.32212.python-list@python.org>
In reply to: #108166
Steven D'Aprano wrote:

> Oh, a further thought...
> 
> 
> On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
> 
>> On Wed, May 4, 2016, at 11:04 PM, Steven D'Aprano wrote:
>>> Start by writing a function or a regex that will distinguish strings
>>> that match your conditions from those that don't. A regex might be
>>> faster, but here's a function version.
>>> ... snip ...
>> 
>> Yikes. I'm all for the idea that one shouldn't go to regex when Python's
>> powerful string type can answer the problem more clearly, but this seems
>> to go out of its way to do otherwise.
>> 
>> I don't even care about faster: It's overly complicated. Sometimes a
>> regular expression really is the clearest way to solve a problem.
> 
> Putting non-ASCII letters aside for the moment, how would you match these
> specs as a regular expression?
> 
> - All uppercase ASCII letters (A to Z only), optionally separated into
> words by either a bare ampersand (e.g. "AAA&AAA") or an ampersand with
> leading and trailing spaces (spaces only, not arbitrary whitespace):
> "AAA   & AAA".
> 
> - The number of spaces on either side of the ampersands need not be the
> same: "AAA&   BBB &       CCC" should match.
> 
> - Leading or trailing spaces, or spaces not surrounding an ampersand, must
> not match: "AAA BBB" must be rejected.
> 
> - Leading or trailing ampersands must also be rejected. This includes the
> case where the string is nothing but ampersands.
> 
> - Consecutive ampersands "AAA&&&BBB" and the empty string must be
> rejected.
> 
> 
> I get something like this:
> 
> r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"
> 
> 
> but it fails on strings like "AA   &  A &  A". What am I doing wrong?
> 
> 
> For the record, here's my brief test suite:
> 
> 
> import re
> 
> def test(pat):
>     for s in ("", " ", "&", "A A", "A&", "&A", "A&&A", "A& &A"):
>         assert re.match(pat, s) is None
>     for s in ("A", "A & A", "AA&A", "AA   &  A &  A"):
>         assert re.match(pat, s)

>>> import re
>>> def test(pat):
...     for s in ("", " ", "&", "A A", "A&", "&A", "A&&A", "A& &A"):
...         assert re.match(pat, s) is None
...     for s in ("A", "A & A", "AA&A", "AA   &  A &  A"):
...         assert re.match(pat, s)
... 
>>> test("^A+( *& *A+)*$")
>>> 



#108179

From: Steven D'Aprano <steve@pearwood.info>
Date: 2016-05-06 01:39 +1000
Message-ID: <572b693d$0$1599$c3e8da3$5496439d@news.astraweb.com>
In reply to: #108167
On Thu, 5 May 2016 06:17 pm, Peter Otten wrote:

>> I get something like this:
>> 
>> r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"
>> 
>> 
>> but it fails on strings like "AA   &  A &  A". What am I doing wrong?

> test("^A+( *& *A+)*$")

Thanks Peter, that's nice!


-- 
Steven



#108176

From: Random832 <random832@fastmail.com>
Date: 2016-05-05 09:21 -0400
Message-ID: <mailman.405.1462454501.32212.python-list@python.org>
In reply to: #108166
On Thu, May 5, 2016, at 03:36, Steven D'Aprano wrote:
> Putting non-ASCII letters aside for the moment, how would you match these 
> specs as a regular expression?

Well, obviously *your* language (not the OP's), given the cases you
reject, is "one or more sequences of letters separated by
space*-ampersand-space*", and that is actually one of the easiest kinds
of regex to write: "[A-Z]+( *& *[A-Z]+)*".

However, your spec is wrong:

> - Leading or trailing spaces, or spaces not surrounding an ampersand,
> must not match: "AAA BBB" must be rejected.

The *very first* item in OP's list of good outputs is 'PHYSICAL FITNESS
CONSULTANTS & TRAINERS'.

If you want something that's extremely conservative (except for the
*very odd in context* choice of allowing arbitrary numbers of spaces -
why would you allow this but reject leading or trailing space?) and
accepts all of OP's input:

[A-Z]+(( *& *| +)[A-Z]+)*
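As a quick check of the two patterns quoted in this post, here is a minimal sketch. The names STRICT, RELAXED and accepts are made up for illustration, and re.fullmatch is used as an assumed way of supplying the anchoring that the bare patterns leave out.

```python
import re

# Both patterns are copied from the post above; the names are hypothetical.
STRICT = re.compile(r"[A-Z]+( *& *[A-Z]+)*")
RELAXED = re.compile(r"[A-Z]+(( *& *| +)[A-Z]+)*")

def accepts(pattern, text):
    """Return True when the whole of text matches pattern."""
    return pattern.fullmatch(text) is not None

title = "PHYSICAL FITNESS CONSULTANTS & TRAINERS"
print(accepts(STRICT, title))    # False: plain spaces between words are rejected
print(accepts(RELAXED, title))   # True: a run of spaces may also join words
print(accepts(RELAXED, " FOO"))  # False: leading space is still rejected
```

The only difference between the two is the extra " +" branch in the separator, which is what lets multi-word titles through.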



#108186

From: Steven D'Aprano <steve@pearwood.info>
Date: 2016-05-06 04:03 +1000
Message-ID: <572b8aee$0$1589$c3e8da3$5496439d@news.astraweb.com>
In reply to: #108176
On Thu, 5 May 2016 11:21 pm, Random832 wrote:

> On Thu, May 5, 2016, at 03:36, Steven D'Aprano wrote:
>> Putting non-ASCII letters aside for the moment, how would you match these
>> specs as a regular expression?
> 
> Well, obviously *your* language (not the OP's), given the cases you
> reject, is "one or more sequences of letters separated by
> space*-ampersand-space*", and that is actually one of the easiest kinds
> of regex to write: "[A-Z]+( *& *[A-Z]+)*".

One of the easiest kinds of regex to write incorrectly:

py> re.match("[A-Z]+( *& *[A-Z]+)*", "A----")
<_sre.SRE_Match object at 0xb7bf4aa0>


It doesn't even get the "all uppercase" part of the specification:

py> re.match("[A-Z]+( *& *[A-Z]+)*", "Azzz")
<_sre.SRE_Match object at 0xb7bf4aa0>

You failed to anchor the pattern at the beginning and end of the string, an
easy mistake to make, but that's the point. It's easy to make mistakes with
regexes because the syntax is so overly terse and unforgiving.

But I think I just learned something important today. I learned that it's
not actually regexes that I dislike, it's regex culture that I dislike.
What I learned from this thread:


- Nobody could possibly want to support non-ASCII text. (Apart from the
approximately 6.5 billion people in the world that don't speak English of
course, an utterly insignificant majority.)

- Data validity doesn't matter, because there's no possible way that you
might accidentally scrape data from the wrong part of a HTML file and end
up with junk input.

- Even if you do somehow end up with junk, there couldn't possibly be any
real consequences to that.

- It doesn't matter if you match too much, or too little, that just means the
specs are too pedantic.


Hence the famous quote:

    Some people, when confronted with a problem, think 
    "I know, I'll use regular expressions." Now they 
    have two problems.


It's not really regexes that are the problem.


> However, your spec is wrong:

How can you say that? It's *my* spec, I can specify anything I want.


>> - Leading or trailing spaces, or spaces not surrounding an ampersand,
>> must not match: "AAA BBB" must be rejected.
> 
> The *very first* item in OP's list of good outputs is 'PHYSICAL FITNESS
> CONSULTANTS & TRAINERS'.

That's very nice, but irrelevant. I'm not talking about the OP's outputs.
I'm giving my own.




-- 
Steven



#108190

From: Random832 <random832@fastmail.com>
Date: 2016-05-05 14:52 -0400
Message-ID: <mailman.409.1462474328.32212.python-list@python.org>
In reply to: #108186
On Thu, May 5, 2016, at 14:03, Steven D'Aprano wrote:
> You failed to anchor the string at the beginning and end of the string,
> an easy mistake to make, but that's the point.

I don't think anchoring is properly a concern of the regex itself -
.match is anchored implicitly at the beginning, and one could easily
imagine an API that implicitly anchors at the end - or you can simply
check that the match length == the string length.
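The options mentioned here can be compared side by side. This is only a sketch: the function names are hypothetical, and re.fullmatch (available since Python 3.4) is used as one example of an API that anchors at both ends.

```python
import re

PATTERN = re.compile(r"[A-Z]+( *& *[A-Z]+)*")  # unanchored, as posted

def prefix_only(text):
    # .match anchors at the start only, so a valid prefix is enough.
    return PATTERN.match(text) is not None

def whole_by_length(text):
    # Compare where the match ended with the length of the string.
    m = PATTERN.match(text)
    return m is not None and m.end() == len(text)

def whole_by_fullmatch(text):
    # fullmatch requires the match to cover the entire string.
    return PATTERN.fullmatch(text) is not None

for s in ("A----", "Azzz", "AA & A"):
    print(repr(s), prefix_only(s), whole_by_length(s), whole_by_fullmatch(s))
```

For this pattern the length check and fullmatch agree. With patterns whose first successful match is not the longest possible one, the length check can reject strings that fullmatch would accept, so fullmatch is the safer of the two.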

> - Data validity doesn't matter, because there's no possible way that you
> might accidentally scrape data from the wrong part of a HTML file and end
> up with junk input.

If you've scraped data from the wrong part of the file, then nothing you
do to your regex can prevent the junk input from coincidentally matching
the input format.



#108195

From: Stephen Hansen <me+python@ixokai.io>
Date: 2016-05-05 12:09 -0700
Message-ID: <mailman.414.1462475384.32212.python-list@python.org>
In reply to: #108186
On Thu, May 5, 2016, at 11:03 AM, Steven D'Aprano wrote:
> - Nobody could possibly want to support non-ASCII text. (Apart from the
> approximately 6.5 billion people in the world that don't speak English of
> course, an utterly insignificant majority.)

Oh, I'd absolutely want to support non-ASCII text. If I have unicode
input, though, I unfortunately have to rely on
https://pypi.python.org/pypi/regex as 're' doesn't support matching on
character properties. 

I keep hoping it'll replace "re", then we could do:

pattern = regex.compile(r"^[\p{Lu}\s&]+$")

where \p{property} matches against character properties in the unicode
database.
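For readers without the third-party regex package, the same idea can be approximated with the standard library. The sketch below is an assumption-laden stand-in, not the regex module's behaviour: it uses unicodedata.category to test for the Lu (uppercase letter) property, and the function name is hypothetical.

```python
import unicodedata

def is_upper_title(text):
    """True when text is non-empty and every character is an uppercase
    letter (Unicode category Lu), whitespace, or an ampersand."""
    if not text:
        return False
    return all(
        unicodedata.category(ch) == "Lu" or ch.isspace() or ch == "&"
        for ch in text
    )

print(is_upper_title("Q & A"))      # True
print(is_upper_title("ΑΒΓ & ΔΕ"))   # True: Greek capitals are category Lu
print(is_upper_title("Abc"))        # False
```

Like the character-class approach it mirrors, this accepts strings made only of spaces or ampersands, so any stricter validation would still have to happen elsewhere.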

> - Data validity doesn't matter, because there's no possible way that you
> might accidentally scrape data from the wrong part of a HTML file and end
> up with junk input.

Um, no one said that. I was arguing that the *regular expression*
doesn't need to be responsible for validation.

> - Even if you do somehow end up with junk, there couldn't possibly be any
> real consequences to that.

No one said that either...

> - It doesn't matter if you match too much, or too little, that just means
> the specs are too pedantic.

Or that...

-- 
Stephen Hansen
  m e @ i x o k a i . i o



#108177

From: Stephen Hansen <me+python@ixokai.io>
Date: 2016-05-05 06:32 -0700
Message-ID: <mailman.406.1462455146.32212.python-list@python.org>
In reply to: #108166
On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote:
> Oh, a further thought...
> 
> On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
> > I don't even care about faster: It's overly complicated. Sometimes a
> > regular expression really is the clearest way to solve a problem.
> 
> Putting non-ASCII letters aside for the moment, how would you match these 
> specs as a regular expression?

I don't know, but mostly because I wouldn't even try. The requirements
are over-specified. If you look at the OP's data (and based on previous
conversation), he's doing web scraping and trying to pull out good data.
There's no absolutely perfect way to do that because the system he's
scraping isn't meant for data processing. The data isn't cleanly
articulated.

Instead, he wants a heuristic to pull out what look like section titles. 

The OP looked at the data and came up with a simple set of rules that
identify these section titles:

>> Want to keep all elements containing only upper case letters or upper
>> case letters and ampersand (where ampersand is surrounded by spaces)

This translates naturally into a simple regular expression: an uppercase
string with spaces and &'s. Now, that expression doesn't 100% encode
every detail of that rule-- it allows both Q&A and Q & A-- but on my own
looking at the data, I suspect it's good enough. The titles are clearly
separate from the other data scraped by their being upper cased. We just
need to expand our allowed character range into spaces and &'s.

Nothing in the OP's request demands the kind of rigorous matching that
your scenario does. It's a practical problem with a simple, practical
answer.

-- 
Stephen Hansen
  m e @ i x o k a i . i o



#108178

From: DFS <nospam@dfs.com>
Date: 2016-05-05 10:36 -0400
Message-ID: <ngfliq$u51$1@dont-email.me>
In reply to: #108177
On 5/5/2016 9:32 AM, Stephen Hansen wrote:
> On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote:
>> Oh, a further thought...
>>
>> On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
>>> I don't even care about faster: It's overly complicated. Sometimes a
>>> regular expression really is the clearest way to solve a problem.
>>
>> Putting non-ASCII letters aside for the moment, how would you match these
>> specs as a regular expression?
>
> I don't know, but mostly because I wouldn't even try. The requirements
> are over-specified. If you look at the OP's data (and based on previous
> conversation), he's doing web scraping and trying to pull out good data.
> There's no absolutely perfect way to do that because the system he's
> scraping isn't meant for data processing. The data isn't cleanly
> articulated.
>
> Instead, he wants a heuristic to pull out what look like section titles.


Assigned by a company named localeze, apparently.

http://www.usdirectory.com/cat/g0

https://www.neustarlocaleze.biz/welcome/



> The OP looked at the data and came up with a simple set of rules that
> identify these section titles:
>
>>> Want to keep all elements containing only upper case letters or upper
>>> case letters and ampersand (where ampersand is surrounded by spaces)
>
> This translates naturally into a simple regular expression: an uppercase
> string with spaces and &'s. Now, that expression doesn't 100% encode
> every detail of that rule-- it allows both Q&A and Q & A-- but on my own
> looking at the data, I suspect it's good enough. The titles are clearly
> separate from the other data scraped by their being upper cased. We just
> need to expand our allowed character range into spaces and &'s.
>
> Nothing in the OP's request demands the kind of rigorous matching that
> your scenario does. It's a practical problem with a simple, practical
> answer.


Yes.  And simplicity + practicality = successfulality.

And I do a sanity check before using the data anyway: after parse and 
cleanup and regex matching, I make sure all lists have the same number 
of elements:

lenData = [len(title), len(names), len(addr), len(street),
           len(city), len(state), len(zip)]

if len(set(lenData)) != 1:
    raise ValueError("list lengths differ: %r" % lenData)  # alert the media
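The same sanity check can be wrapped in a small helper. This is a sketch only: lengths_agree is a hypothetical name, and the sample lists are invented stand-ins for the scraped data.

```python
def lengths_agree(*columns):
    """Return True when all columns have the same number of elements."""
    return len({len(col) for col in columns}) <= 1

# Hypothetical sample data standing in for the scraped lists.
title = ["DENTISTS", "PLUMBERS"]
names = ["A. Smith DDS", "Pipe Pros"]
city = ["Springfield"]  # one entry short on purpose

print(lengths_agree(title, names))        # True
print(lengths_agree(title, names, city))  # False
```

A check like this catches a dropped or extra element, but not two lists that are the same length and misaligned.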



#108183

From: Steven D'Aprano <steve@pearwood.info>
Date: 2016-05-06 03:43 +1000
Message-ID: <572b864e$0$1622$c3e8da3$5496439d@news.astraweb.com>
In reply to: #108177
On Thu, 5 May 2016 11:32 pm, Stephen Hansen wrote:

> On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote:
>> Oh, a further thought...
>> 
>> On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
>> > I don't even care about faster: It's overly complicated. Sometimes a
>> > regular expression really is the clearest way to solve a problem.
>> 
>> Putting non-ASCII letters aside for the moment, how would you match these
>> specs as a regular expression?
> 
> I don't know, but mostly because I wouldn't even try. 

Really? Peter Otten seems to have found a solution, and Random832 almost
found it too.


> The requirements 
> are over-specified. If you look at the OP's data (and based on previous
> conversation), he's doing web scraping and trying to pull out good data.

I'm not talking about the OP's data. I'm talking about *my* requirements.

I thought that this was a friendly discussion about regexes, but perhaps I
was mistaken. Because I sure am feeling a lot of hostility to the ideas
that regexes are not necessarily the only way to solve this, and that data
validation is a good thing.


> There's no absolutely perfect way to do that because the system he's
> scraping isn't meant for data processing. The data isn't cleanly
> articulated.

Right. Which makes it *more*, not less, important to be sure that your regex
doesn't match too much, because your data is likely to be contaminated by
junk strings that don't belong in the data and shouldn't be accepted. I've
done enough web scraping to realise just how easy it is to start grabbing
data from the wrong part of the file.


> Instead, he wants a heuristic to pull out what look like section titles.

Good for him. I asked a different question. Does my question not count?


> The OP looked at the data and came up with a simple set of rules that
> identify these section titles:
> 
>>> Want to keep all elements containing only upper case letters or upper
>>> case letters and ampersand (where ampersand is surrounded by spaces)

That simple rule doesn't match his examples, as I know too well because I
made the silly mistake of writing to the written spec as written without
reading the examples as well. As I already admitted. That was a silly
mistake because I know very well that people are really bad at writing
detailed specs that neither match too much nor too little.

But you know, I was more focused on the rest of his question, namely whether
it was better to extract the matching strings into a new list, or delete the
non-matches from the existing list, and just got carried away writing the
match function. I didn't actually expect anyone to use it. It was untested,
and I hinted that a regex would probably be better.

I was trying to teach DFS a generic programming technique, not solve his
stupid web scraping problem for him. What happens next time when he's
trying to filter a list of floats, or Widgets? Should he convert them to
strings so he can use a regex to match them, or should he learn about
general filtering techniques?


> This translates naturally into a simple regular expression: an uppercase
> string with spaces and &'s. Now, that expression doesn't 100% encode
> every detail of that rule-- it allows both Q&A and Q & A-- but on my own
> looking at the data, I suspect it's good enough. The titles are clearly
> separate from the other data scraped by their being upper cased. We just
> need to expand our allowed character range into spaces and &'s.
> 
> Nothing in the OP's request demands the kind of rigorous matching that
> your scenario does. It's a practical problem with a simple, practical
> answer.

Yes, and that practical answer needs to reject:

- the empty string, because it is easy to mistakenly get empty strings when
scraping data, especially if you post-process the data;

- strings that are all spaces, because "       " cannot possibly be a title;

- strings that are all ampersands, because "&&&&&" is not a title, and it
almost surely indicates that your scraping has gone wrong and you're
reading junk from somewhere;

- even leading and trailing spaces are suspect: "  FOO  " doesn't match any
of the examples given, and it seems unlikely to be a title. Presumably the
strings have already been filtered or post-processed to have leading and
trailing spaces removed, in which case "  FOO  " reveals a bug.

 

-- 
Steven



#108192

From: Stephen Hansen <me+python@ixokai.io>
Date: 2016-05-05 11:55 -0700
Message-ID: <mailman.411.1462474513.32212.python-list@python.org>
In reply to: #108183
On Thu, May 5, 2016, at 10:43 AM, Steven D'Aprano wrote:
> On Thu, 5 May 2016 11:32 pm, Stephen Hansen wrote:
> 
> > On Thu, May 5, 2016, at 12:36 AM, Steven D'Aprano wrote:
> >> Oh, a further thought...
> >> 
> >> On Thursday 05 May 2016 16:46, Stephen Hansen wrote:
> >> > I don't even care about faster: It's overly complicated. Sometimes a
> >> > regular expression really is the clearest way to solve a problem.
> >> 
> >> Putting non-ASCII letters aside for the moment, how would you match these
> >> specs as a regular expression?
> > 
> > I don't know, but mostly because I wouldn't even try. 
> 
> Really? Peter Otten seems to have found a solution, and Random832 almost
> found it too.
> 
> 
> > The requirements 
> > are over-specified. If you look at the OP's data (and based on previous
> > conversation), he's doing web scraping and trying to pull out good data.
> 
> I'm not talking about the OP's data. I'm talking about *my* requirements.
> 
> I thought that this was a friendly discussion about regexes, but perhaps
> I was mistaken. Because I sure am feeling a lot of hostility to the ideas
> that regexes are not necessarily the only way to solve this, and that
> data validation is a good thing.

Umm, what? Hostility? I have no idea where you're getting that.

I didn't say that regexes are the only way to solve problems; in fact
they're something I avoid using in most cases. In the OP's case, though,
I did say I thought they were a natural fit. Usually, I'd go for
startswith/endswith, "in", slicing and such string primitives before I
go for a regular expression.

"Find all upper cased phrases that may have &'s in them" is something
just specific enough that the built in string primitives are awkward
tools.

In my experience, most of the problems with regexes come from people
thinking they're the hammer and every problem is a nail: they then get
into ever more convoluted expressions that become brittle. Being more
specific in a regular expression is not necessarily a virtue. In fact
it's exactly the opposite a lot of the time.

> > There's no absolutely perfect way to do that because the system he's
> > scraping isn't meant for data processing. The data isn't cleanly
> > articulated.
> 
> Right. Which makes it *more*, not less, important to be sure that your
> regex doesn't match too much, because your data is likely to be
> contaminated by junk strings that don't belong in the data and shouldn't
> be accepted. I've done enough web scraping to realise just how easy it
> is to start grabbing data from the wrong part of the file.

I have nothing against data validation: I don't think it belongs in
regular expressions, though. That can be a step done afterwards.

> > Instead, he wants a heuristic to pull out what look like section titles.
> 
> Good for him. I asked a different question. Does my question not count?

Sure it counts, but I don't want to engage in your theoretical exercise.
That's not being hostile, that's me not wanting to think about a complex
set of constraints for a regular expression for purely intellectual
reasons.

> I was trying to teach DFS a generic programming technique, not solve his
> stupid web scraping problem for him. What happens next time when he's
> trying to filter a list of floats, or Widgets? Should he convert them to
> strings so he can use a regex to match them, or should he learn about
> general filtering techniques?

Come on. This is a bit presumptuous, don't you think?

> > This translates naturally into a simple regular expression: an uppercase
> > string with spaces and &'s. Now, that expression doesn't 100% encode
> > every detail of that rule-- it allows both Q&A and Q & A-- but on my own
> > looking at the data, I suspect it's good enough. The titles are clearly
> > separate from the other data scraped by their being upper cased. We just
> > need to expand our allowed character range into spaces and &'s.
> > 
> > Nothing in the OP's request demands the kind of rigorous matching that
> > your scenario does. It's a practical problem with a simple, practical
> > answer.
> 
> Yes, and that practical answer needs to reject:
> 
> - the empty string, because it is easy to mistakenly get empty strings
> when scraping data, especially if you post-process the data;
> 
> - strings that are all spaces, because "       " cannot possibly be a
> title;
> 
> - strings that are all ampersands, because "&&&&&" is not a title, and it
> almost surely indicates that your scraping has gone wrong and you're
> reading junk from somewhere;
> 
> - even leading and trailing spaces are suspect: "  FOO  " doesn't match
> any of the examples given, and it seems unlikely to be a title.
> Presumably the strings have already been filtered or post-processed to
> have leading and trailing spaces removed, in which case "  FOO  "
> reveals a bug.

We're going to have to agree to disagree. I find all of that
unnecessary. Any validation can easily be done before or after
matching; you don't need to over-complicate the regular expression
itself. The urge to find an ever more perfect regular expression that
encapsulates precisely what is correct and what is not itself leads
to over-complicated expressions.

And regular expressions are ugly. I'd rather keep them simple and
straightforward and deal with the rest in Python.
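One way to picture the split described here, a loose pattern plus checks in ordinary Python, is sketched below. The exact pattern used earlier in the thread is not quoted on this page, so LOOSE is an assumed stand-in, and looks_like_title and its particular checks are hypothetical.

```python
import re

# Assumed stand-in for the "simple" pattern: uppercase ASCII letters,
# spaces and ampersands, nothing else.
LOOSE = re.compile(r"[A-Z &]+")

def looks_like_title(text):
    """Loose regex match followed by plain-Python sanity checks."""
    if LOOSE.fullmatch(text) is None:
        return False                  # empty, or contains other characters
    if text != text.strip():
        return False                  # leading or trailing spaces
    words = text.replace("&", " ").split()
    return bool(words)                # at least one run of letters

for s in ("Q & A", "Q&A", "", "   ", "&&&&&", "  FOO  "):
    print(repr(s), looks_like_title(s))
```

The regex stays short, and each extra rule is a separate line that can be added or dropped without touching the pattern. This version still accepts a trailing ampersand such as "A&"; rejecting that would be one more Python check.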

-- 
Stephen Hansen
  m e @ i x o k a i . i o



#108184

From: Jussi Piitulainen <jussi.piitulainen@helsinki.fi>
Date: 2016-05-05 20:49 +0300
Message-ID: <lf5y47os8d5.fsf@ling.helsinki.fi>
In reply to: #108166
Steven D'Aprano writes:

> I get something like this:
>
> r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"
>
>
> but it fails on strings like "AA   &  A &  A". What am I doing wrong?

It cannot split the string as (LETTERS & LETTERS)(LETTERS & LETTERS)
when the middle part is just one LETTER. That's something of a
misanalysis anyway. I notice that the correct pattern has already been
posted at least thrice and you have acknowledged one of them.
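The failure described above can be reproduced directly. In this sketch ORIGINAL is the pattern quoted at the top of this post, and FIXED is the pattern posted elsewhere in the thread with A generalised to [A-Z]; both variable names are made up for illustration.

```python
import re

ORIGINAL = r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"
FIXED = r"^[A-Z]+( *& *[A-Z]+)*$"

for s in ("AA & AB & C", "AA   &  A &  A"):
    print(repr(s), bool(re.match(ORIGINAL, s)), bool(re.match(FIXED, s)))
# 'AA & AB & C'     True  True
# 'AA   &  A &  A'  False True
```

The first string passes the original pattern only because the middle word "AB" can be shared out as (AA & A)(B & C). With a one-letter middle word there is nothing left to start the second group, so the match fails.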

But I think you are also trying to do too much with a single regex. A
more promising start is to think of the whole string as "parts" joined
with "glue", then split with a glue pattern and test the parts:

import re
glue = re.compile(" *& *| +")
keep, drop = [], []
for datum in data:
    items = glue.split(datum)
    if all(map(str.isupper, items)):
        keep.append(datum)
    else:
        drop.append(datum)

That will cope with Greek, by the way.

It's annoying that the order of the branches of the glue pattern above
matters. One _does_ have problems when one uses the usual regex engines.

Capturing groups in the glue pattern would produce glue items in the
split output. Either avoid them or deal with them: one could split with
the underspecific "([ &]+)" and then check that each glue item contains
at most one ampersand. One could also allow other punctuation, and then
check afterwards.
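A minimal sketch of the capturing-group variant just described, using the standard re module. The function name keep_datum is hypothetical; the slicing relies on re.split returning parts and captured glue alternately when the pattern has exactly one capturing group.

```python
import re

GLUE = re.compile(r"([ &]+)")  # capturing group: split() keeps the glue items

def keep_datum(datum):
    pieces = GLUE.split(datum)
    parts, glues = pieces[0::2], pieces[1::2]   # parts and glue alternate
    return (all(p.isupper() for p in parts)
            and all(g.count("&") <= 1 for g in glues))

for s in ("AAA & BBB", "AAA BBB", "A&&A", "A& &A", " AAA", ""):
    print(repr(s), keep_datum(s))
```

Leading or trailing glue shows up as an empty part, and str.isupper is False for the empty string, so such strings are dropped without any extra test.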

One can use _another_ regex to test individual parts. Code above used
str.isupper to test a part. The improved regex package (from PyPI, to
cope with Greek) can do the same:

import regex
part = regex.compile("[[:upper:]]+")
glue = regex.compile(" *& *| *")

keep, drop = [], []
for datum in data:
    items = glue.split(datum)
    if all(map(part.fullmatch, items)):
        keep.append(datum)
    else:
        drop.append(datum)

Just "[A-Z]+" suffices for ASCII letters, and "[A-ZÄÖ]+" copes with most
of Finnish; the [:upper:] class is nicer and there's much more that is
nicer in the newer regex package.

The point of using a regex for this is that the part pattern can then be
generalized to allow some punctuation or digits in a part, for example.
Anything that the glue pattern doesn't consume. (Nothing wrong with
using other techniques for this, either; str.isupper worked nicely
above.)

It's also possible to swap the roles of the patterns. Split with a part
pattern. Then check that the text between such parts is glue:

keep, drop = [], []
for datum in data:
    items = part.split(datum)
    if all(map(glue.fullmatch, items)):
        keep.append(datum)
    else:
        drop.append(datum)

The point is to keep the patterns simple by making them more local, or
more relaxed, followed by a further test. This way they can be made to
do more, but not more than they reasonably can.

Note also the use of re.fullmatch instead of re.match (let alone
re.search) when a full match is required! This gets rid of all anchors
in the pattern, which may in turn allow fewer parentheses inside the
pattern.

The usual regex engines are not perfect, but parts of them are
fantastic.



#108187

From: Steven D'Aprano <steve@pearwood.info>
Date: 2016-05-06 04:14 +1000
Message-ID: <572b8d6f$0$1602$c3e8da3$5496439d@news.astraweb.com>
In reply to: #108184
On Fri, 6 May 2016 03:49 am, Jussi Piitulainen wrote:

> Steven D'Aprano writes:
> 
>> I get something like this:
>>
>> r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"
>>
>>
>> but it fails on strings like "AA   &  A &  A". What am I doing wrong?
> 
> It cannot split the string as (LETTERS & LETTERS)(LETTERS & LETTERS)
> when the middle part is just one LETTER. That's something of a
> misanalysis anyway. I notice that the correct pattern has already been
> posted at least thrice and you have acknowledged one of them.

Thrice? I've seen Peter's response (he made the trivial and obvious
simplification of just using A instead of [A-Z], but that was easy to
understand), and Random832 almost got it, missing only that you need to
match the entire string, not just a substring. If there was a third
response, I missed it.


> But I think you are also trying to do too much with a single regex. A
> more promising start is to think of the whole string as "parts" joined
> with "glue", then split with a glue pattern and test the parts:
> 
> import re
> glue = re.compile(" *& *| +")
> keep, drop = [], []
> for datum in data:
>     items = glue.split(datum)
>     if all(map(str.isupper, items)):
>         keep.append(datum)
>     else:
>         drop.append(datum)

Ah, the penny drops! For a while I thought you were suggesting using this to
assemble a regex, and it just wasn't making sense to me. Then I realised
you were using this as a matcher: feed in the list of strings, and it
splits it into strings to keep and strings to discard. Nicely done, that is
a good technique to remember.
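
For reference, a runnable version of the quoted snippet could look like this. The contents of `data` are made up for the example, since the real list is not shown in this part of the thread.

```python
import re

# Glue pattern as quoted above: an ampersand with optional spaces
# around it, or a plain run of spaces.
glue = re.compile(" *& *| +")

# Hypothetical sample data.
data = ["AA & B", "Ab & C", "ABC", "A  B", "AA &"]

keep, drop = [], []
for datum in data:
    items = glue.split(datum)
    # Every piece between the glue must be upper case. An empty piece
    # (from leading or trailing glue) fails, since "".isupper() is False.
    if all(map(str.isupper, items)):
        keep.append(datum)
    else:
        drop.append(datum)
```

Note that `str.isupper` only looks at cased characters, so a piece such as "A1" would also pass; a stricter test would be needed if digits must be rejected.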

Thanks for the analysis!



-- 
Steven



#108188

From: Jussi Piitulainen <jussi.piitulainen@helsinki.fi>
Date: 2016-05-05 21:27 +0300
Message-ID: <lf5twics6me.fsf@ling.helsinki.fi>
In reply to: #108187

Steven D'Aprano writes:

> On Fri, 6 May 2016 03:49 am, Jussi Piitulainen wrote:
>
>> Steven D'Aprano writes:
>> 
>>> I get something like this:
>>>
>>> r"(^[A-Z]+$)|(^([A-Z]+[ ]*\&[ ]*[A-Z]+)+$)"
>>>
>>>
>>> but it fails on strings like "AA   &  A &  A". What am I doing wrong?
>> 
>> It cannot split the string as (LETTERS & LETTERS)(LETTERS & LETTERS)
>> when the middle part is just one LETTER. That's something of a
>> misanalysis anyway. I notice that the correct pattern has already been
>> posted at least thrice and you have acknowledged one of them.
>
> Thrice? I've seen Peter's response (he made the trivial and obvious
> simplification of just using A instead of [A-Z], but that was easy to
> understand), and Random832 almost got it, missing only that you need to
> match the entire string, not just a substring. If there was a third
> response, I missed it.

I think I saw another. I may be mistaken.

Random832's pattern is fine. You need to use re.fullmatch with it.

. .



#108191

From: Random832 <random832@fastmail.com>
Date: 2016-05-05 14:54 -0400
Message-ID: <mailman.410.1462474499.32212.python-list@python.org>
In reply to: #108188

On Thu, May 5, 2016, at 14:27, Jussi Piitulainen wrote:
> Random832's pattern is fine. You need to use re.fullmatch with it.

Heh, in my previous post I said "and one could easily imagine an API
that implicitly anchors at the end". So easy to imagine that someone
already did, as it turns out. Batteries included indeed.



#108204

From: Steven D'Aprano <steve@pearwood.info>
Date: 2016-05-06 10:57 +1000
Message-ID: <572bebf3$0$1585$c3e8da3$5496439d@news.astraweb.com>
In reply to: #108188

On Fri, 6 May 2016 04:27 am, Jussi Piitulainen wrote:

> Random832's pattern is fine. You need to use re.fullmatch with it.

py> re.fullmatch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'fullmatch'



-- 
Steven



#108216

From: Jussi Piitulainen <jussi.piitulainen@helsinki.fi>
Date: 2016-05-06 07:19 +0300
Message-ID: <lf5inyrn7ig.fsf@ling.helsinki.fi>
In reply to: #108204

Steven D'Aprano writes:

> On Fri, 6 May 2016 04:27 am, Jussi Piitulainen wrote:
>
>> Random832's pattern is fine. You need to use re.fullmatch with it.
>
> py> re.fullmatch
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> AttributeError: 'module' object has no attribute 'fullmatch'

It's new in version 3.4 (of Python).
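
On an older Python, one possible workaround is to wrap the pattern in a non-capturing group and append `\Z`, which matches only at the very end of the string. The helper name `emulated_fullmatch` is made up for this sketch, and it assumes the pattern is given as a string rather than a compiled object.

```python
import re

def emulated_fullmatch(pattern, string, flags=0):
    """Approximate re.fullmatch for Python versions before 3.4."""
    # The (?:...) group keeps alternations such as "A|AB" intact
    # when the end-of-string anchor is appended.
    return re.match(r"(?:%s)\Z" % pattern, string, flags)

# Use the real function where it exists, the emulation otherwise.
fullmatch = getattr(re, "fullmatch", emulated_fullmatch)
```

After this, `fullmatch("[A-Z]+", "ABC")` returns a match object and `fullmatch("[A-Z]+", "ABC123")` returns None on either kind of Python.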



#108172

From: DFS <nospam@dfs.com>
Date: 2016-05-05 08:31 -0400
Message-ID: <ngfe7i$2pp$2@dont-email.me>
In reply to: #108162

On 5/5/2016 2:04 AM, Steven D'Aprano wrote:
> On Thursday 05 May 2016 14:58, DFS wrote:
>
>> Want to whittle a list like this:
> [...]
>> Want to keep all elements containing only upper case letters or upper
>> case letters and ampersand (where ampersand is surrounded by spaces)
>
>
> Start by writing a function or a regex that will distinguish strings that
> match your conditions from those that don't. A regex might be faster, but
> here's a function version.
>
> def isupperalpha(string):
>     return string.isalpha() and string.isupper()
>
> def check(string):
>     if isupperalpha(string):
>         return True
>     parts = string.split("&")
>     if len(parts) < 2:
>         return False
>     # Don't strip leading spaces from the start of the string.
>     parts[0] = parts[0].rstrip(" ")
>     # Or trailing spaces from the end of the string.
>     parts[-1] = parts[-1].lstrip(" ")
>     # But strip leading and trailing spaces from the middle parts
>     # (if any).
>     for i in range(1, len(parts)-1):
>         parts[i] = parts[i].strip(" ")
>     return all(isupperalpha(part) for part in parts)
>
>
> Now you have two ways of filtering this. The obvious way is to extract
> elements which meet the condition. Here are two ways:
>
> # List comprehension.
> newlist = [item for item in oldlist if check(item)]
>
> # Filter, Python 2 version
> newlist = filter(check, oldlist)
>
> # Filter, Python 3 version
> newlist = list(filter(check, oldlist))
>
>
> In practice, this is the best (fastest, simplest) way. But if you fear that
> you will run out of memory dealing with absolutely humongous lists with
> hundreds of millions or billions of strings, you can remove items in place:
>
>
> def remove(func, alist):
>     for i in range(len(alist)-1, -1, -1):
>         if not func(alist[i]):
>             del alist[i]
>
>
> Note the magic incantation to iterate from the end of the list towards the
> front. If you do it the other way, Bad Things happen. Note that this will
> use less memory than extracting the items, but it will be much slower.
>
> You can combine the best of both worlds. Here is a version that uses a
> temporary list to modify the original in place:
>
> # works in both Python 2 and 3
> def remove(func, alist):
>     # Modify list in place, the fast way.
>     alist[:] = filter(func, alist)
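
The in-place variants quoted above can be compared in a short sketch. The function names and the sample list are made up for the example, and `str.isupper` stands in for the `check` function. The forward version is deliberately wrong, to show the kind of Bad Thing the quoted text warns about.

```python
def remove_backwards(func, alist):
    # Walk from the last index to the first, so that deleting an item
    # never shifts the items that are still to be visited.
    for i in range(len(alist) - 1, -1, -1):
        if not func(alist[i]):
            del alist[i]

def remove_forwards_buggy(func, alist):
    # Deliberately wrong: removing while iterating forwards makes the
    # loop skip the item that slides into the freed position.
    for item in alist:
        if not func(item):
            alist.remove(item)

def remove_by_slice(func, alist):
    # Build the filtered result, then overwrite the contents of the
    # same list object through slice assignment.
    alist[:] = filter(func, alist)

backwards = ["AA", "b", "c", "DD"]
remove_backwards(str.isupper, backwards)

forwards = ["AA", "b", "c", "DD"]
remove_forwards_buggy(str.isupper, forwards)

sliced = ["AA", "b", "c", "DD"]
alias = sliced
remove_by_slice(str.isupper, sliced)
```

Here `backwards` and `sliced` both end up as `["AA", "DD"]`, and `alias` still refers to the same modified list, while `forwards` is left as `["AA", "c", "DD"]` because "c" was skipped.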


You are out of your mind.






#108185

From: Steven D'Aprano <steve@pearwood.info>
Date: 2016-05-06 03:54 +1000
Message-ID: <572b88c5$0$1601$c3e8da3$5496439d@news.astraweb.com>
In reply to: #108172

On Thu, 5 May 2016 10:31 pm, DFS wrote:

> You are out of your mind.

That's twice you've tried to put me down, first by dismissing my comments
about text processing with "Linguist much", and now an outright insult. The
first time I laughed it off and made a joke about it. I won't do that
again.

You asked whether it was better to extract the matching strings into a new
list, or remove them in place in the existing list. I not only showed you
how to do both, but I tried to give you the mental tools to understand when
you should pick one answer over the other. And your response is to insult
me and question my sanity.

Well, DFS, I might be crazy, but I'm not stupid. If that's really how you
feel about my answers, I won't make the mistake of wasting my time
answering your questions in the future.

Over to you now.


-- 
Steven



#108200

From: DFS <nospam@dfs.com>
Date: 2016-05-05 17:36 -0400
Message-ID: <ngge5q$142$1@dont-email.me>
In reply to: #108185

On 5/5/2016 1:54 PM, Steven D'Aprano wrote:
> On Thu, 5 May 2016 10:31 pm, DFS wrote:
>
>> You are out of your mind.
>
> That's twice you've tried to put me down, first by dismissing my comments
> about text processing with "Linguist much", and now an outright insult. The
> first time I laughed it off and made a joke about it. I won't do that
> again.
>
> You asked whether it was better to extract the matching strings into a new
> list, or remove them in place in the existing list. I not only showed you
> how to do both, but I tried to give you the mental tools to understand when
> you should pick one answer over the other. And your response is to insult
> me and question my sanity.
>
> Well, DFS, I might be crazy, but I'm not stupid. If that's really how you
> feel about my answers, I won't make the mistake of wasting my time
> answering your questions in the future.
>
> Over to you now.


heh!  Relax, pal.

I was just trying to be funny - no insult intended either time, of 
course.  Look for similar responses from me in the future.  Usenet 
brings out the smart-aleck in me.

Actually, you should've accepted the 'Linguist much?' as a compliment, 
because I seriously thought you were.

But you ARE out of your mind if you prefer that convoluted "function" 
method over a simple 1-line regex method (as per S. Hansen).

def isupperalpha(string):
     return string.isalpha() and string.isupper()

def check(string):
     if isupperalpha(string):
         return True
     parts = string.split("&")
     if len(parts) < 2:
         return False
     parts[0] = parts[0].rstrip(" ")
     parts[-1] = parts[-1].lstrip(" ")
     for i in range(1, len(parts)-1):
         parts[i] = parts[i].strip(" ")
     return all(isupperalpha(part) for part in parts)


I'm sure it does the job well, but that style brings back [bad] memories 
of the VBA I used to write.  I expected something very concise and 
'pythonic' (which I'm learning is everyone's favorite mantra here in 
python-land).

Anyway, I appreciate ALL replies to my queries.  So thank you for taking 
the time.

Whenever I'm able, I'll try to contribute to clp as well.





#108193

From: Stephen Hansen <me+python@ixokai.io>
Date: 2016-05-05 11:56 -0700
Message-ID: <mailman.412.1462474615.32212.python-list@python.org>
In reply to: #108172

On Thu, May 5, 2016, at 05:31 AM, DFS wrote:
> You are out of your mind.

Whoa, now. I might disagree with Steven D'Aprano about how to approach
this problem, but there's no need to be rude. Everyone's trying to help
you, after all.

-- 
Stephen Hansen
  m e @ i x o k a i . i o
