
match point

Started by	Thierry <no@mail.com>
First post	2015-12-22 11:56 +0100
Last post	2015-12-22 18:18 +0100
Articles	4 — 4 participants


  match point Thierry <no@mail.com> - 2015-12-22 11:56 +0100
    Re: match point Chris Angelico <rosuav@gmail.com> - 2015-12-22 22:07 +1100
    Re: match point Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2015-12-22 13:01 +0100
    Re: match point Thierry Closen <no@mail.com> - 2015-12-22 18:18 +0100

#100720 — match point

From	Thierry <no@mail.com>
Date	2015-12-22 11:56 +0100
Subject	match point
Message-ID	<20151222115648.1222c992@eeearch>

Hi,

Reading the docs about regular expressions, I am under the impression
that calling
	re.match(pattern, string)
is exactly the same as
	re.search(r'\A'+pattern, string)

Same for fullmatch, that amounts to
	re.search(r'\A'+pattern+r'\Z', string)
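A minimal sketch to check the equivalence claimed above. The helper name, the pattern and the sample strings are illustrative assumptions, and the check only covers a simple pattern with no alternation or inline flags:

```python
import re

def same_outcome(pattern, string):
    # Compare re.match() with re.search() on the \A-prefixed pattern.
    m1 = re.match(pattern, string)
    m2 = re.search(r'\A' + pattern, string)
    if m1 is None or m2 is None:
        return m1 is None and m2 is None
    return m1.span() == m2.span()

# Illustrative samples: digits at the start, digits later, empty string.
for sample in ['123abc', 'abc123', '']:
    print(repr(sample), same_outcome(r'\d+', sample))
```

For this simple pattern both calls agree on every sample; that says nothing yet about patterns containing '|' or inline flags.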

The docs devote a chapter to "6.2.5.3. search() vs. match()", but they
only discuss how match() is different from search() with '^', completely
ignoring the case of search() with r'\A'.
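To make the '^' versus r'\A' distinction concrete, here is a small sketch; the sample text is an assumption chosen so that the interesting word starts on the second line:

```python
import re

text = 'first\nsecond'

# In MULTILINE mode '^' also matches right after each newline.
caret = re.search(r'^second', text, flags=re.MULTILINE)

# '\A' matches only at the very start of the string, whatever the flags.
anchor = re.search(r'\Asecond', text, flags=re.MULTILINE)

# re.match() behaves like the '\A' version here, not like the '^' version.
matched = re.match(r'second', text, flags=re.MULTILINE)

print(caret.span())   # (6, 12)
print(anchor)         # None
print(matched)        # None
```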

At first I thought those functions could have been introduced at a time
when r'\A' and r'\Z' did not exist, but then I noticed that re.fullmatch
is a recent addition (Python 3.4).

Surely the python devs are not cluttering the interface of the re module
with useless functions for no reason, so what am I missing?

Maybe re.match has an implementation that makes it more efficient? But
then why would I ever use r'\A', since that anchor makes a pattern match
in only a single position, and is therefore useless in functions like
re.findall, re.finditer or re.split?
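A short sketch of the "single position" remark, using a made-up sample string: with r'\A' in front, re.findall() can report at most one hit.

```python
import re

text = 'ab ab ab'

unanchored = re.findall(r'ab', text)    # every occurrence
anchored = re.findall(r'\Aab', text)    # only a hit at position 0 can count

print(unanchored)  # ['ab', 'ab', 'ab']
print(anchored)    # ['ab']
```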

Thanks,

Thierry


#100721

From	Chris Angelico <rosuav@gmail.com>
Date	2015-12-22 22:07 +1100
Message-ID	<mailman.55.1450782472.2237.python-list@python.org>
In reply to	#100720

On Tue, Dec 22, 2015 at 9:56 PM, Thierry <no@mail.com> wrote:
> Maybe re.match has an implementation that makes it more efficient? But
> then why would I ever use r'\A', since that anchor makes a pattern match
> in only a single position, and is therefore useless in functions like
> re.findall, re.finditer or re.split?

Much of the value of regular expressions is that they are NOT string
literals (just strings). Effectively, someone who has no authority to
change the code of the program can cause it to change from re.search
to re.match, simply by putting \A at the beginning of the search
string.
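A sketch of that situation, assuming a hypothetical program whose code always calls re.search() and takes the pattern from outside (a config file or user input, say). The helper name and the sample line are made up for illustration:

```python
import re

def first_hit(user_pattern, text):
    # Hypothetical helper: the code is fixed and always uses re.search().
    m = re.search(user_pattern, text)
    return m.group(0) if m else None

log_line = 'warning: disk error'

# The person supplying the pattern decides whether it is anchored.
print(first_hit(r'error', log_line))      # 'error' (found anywhere)
print(first_hit(r'\Aerror', log_line))    # None (must be at the start)
print(first_hit(r'\Awarning', log_line))  # 'warning'
```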

ChrisA


#100723

From	Thomas 'PointedEars' Lahn <PointedEars@web.de>
Date	2015-12-22 13:01 +0100
Message-ID	<5716668.JdziER6HFD@PointedEars.de>
In reply to	#100720

Thierry wrote:

> Reading the docs about regular expressions, I am under the impression
> that calling
> re.match(pattern, string)
> is exactly the same as
> re.search(r'\A'+pattern, string)

Correct.

> Same for fullmatch, that amounts to
> re.search(r'\A'+pattern+r'\Z', string)

Correct.

> The docs devote a chapter to "6.2.5.3. search() vs. match()", but they
> only discuss how match() is different from search() with '^', completely
> ignoring the case of search() with r'\A'.
> 
> At first I thought those functions could have been introduced at a time
> when r'\A' and r'\Z' did not exist, but then I noticed that re.fullmatch
> is a recent addition (Python 3.4).
> 
> Maybe re.match has an implementation that makes it more efficient? But
> then why would I ever use r'\A', since that anchor makes a pattern match
> in only a single position, and is therefore useless in functions like
> re.findall, re.finditer or re.split?

(Thank you for pointing out “\A” and “\Z”; this strongly suggests that even 
in raw mode you should always match literal “\” with the regular expression 
“\\”, or IOW that you should always use re.escape() when constructing 
regular expressions from arbitrary strings for matching WinDOS/UNC paths, 
for example.)

If you used

  re.search(r'\Afoo.*^bar$.*baz\Z', string, flags=re.DOTALL | re.MULTILINE)

you could match only strings that start with “foo”, have a line following 
that which contains only “bar”, and end with “baz”.  (In multi-line mode, 
the meanings of “^” and “$” change to start-of-line and end-of-line, 
respectively.)

Presumably, re.fullmatch() was introduced in Python 3.4 so that you can 
write

  re.fullmatch(r'foo.*^bar$.*baz', string, flags=re.DOTALL | re.MULTILINE)

instead, since you are not actually searching; it also makes sure that you 
*always* match against the whole string, regardless of the expression.
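A runnable sketch of the two calls above, with made-up sample strings, to show that for this expression they accept and reject the same inputs:

```python
import re

FLAGS = re.DOTALL | re.MULTILINE

good = 'foo\nbar\nbaz'   # "bar" sits alone on its own line
bad = 'foo bar baz'      # no line consists of just "bar"

# Explicit anchors with re.search()
s_good = re.search(r'\Afoo.*^bar$.*baz\Z', good, flags=FLAGS)
s_bad = re.search(r'\Afoo.*^bar$.*baz\Z', bad, flags=FLAGS)

# Same expression without \A and \Z, anchored by re.fullmatch() (Python 3.4+)
f_good = re.fullmatch(r'foo.*^bar$.*baz', good, flags=FLAGS)
f_bad = re.fullmatch(r'foo.*^bar$.*baz', bad, flags=FLAGS)

print(s_good is not None, f_good is not None)  # True True
print(s_bad, f_bad)                            # None None
```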

| Note that even in MULTILINE mode, re.match() will only match at the 
| beginning of the string and not at the beginning of each line.

and that

| re.search(pattern, string, flags=0)
|   Scan through string looking for the first location where the regular 
|   expression pattern produces a match […]

So with both re.search() and re.fullmatch(), you are more flexible should 
the expression be dynamically constructed: you can always use re.search().

<https://docs.python.org/3/library/re.html#re.search>

Please add your last name, Thierry #1701.

-- 
PointedEars

Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.


#100738

From	Thierry Closen <no@mail.com>
Date	2015-12-22 18:18 +0100
Message-ID	<20151222181825.67b87012@eeearch>
In reply to	#100720

I found the story behind the creation of re.fullmatch(). 

I had no luck before because I was searching under "www.python.org/dev",
while in reality it sprang out of a bug report:
https://bugs.python.org/issue16203

In summary, there were repeated bugs where during maintenance of code
the $ symbol disappeared from patterns, hence the decision to create a
function that anchors the pattern to the end of the string independently
of the presence of that symbol.

I am perplexed by what I discovered, as I would never have thought that
such prominent functions can be created to scratch such a minor itch:
The creation of fullmatch() might address this very specific issue, but 
I would tend to think that if really certain symbols disappear from
patterns inside a code base, this should be seen as the sign of more
profound problems in the code maintenance processes.

Anyway, the discussion around that bug inspired another argument that
I find more satisfying:

When I was saying that
        re.fullmatch(pattern, string)
is exactly the same as
        re.search(r'\A'+pattern+r'\Z', string)
I was wrong.

For example if pattern starts with an inline flag like (?i), we cannot
simply stick \A in front of it.

Another example: consider the pattern 'a|b'. We end up with:
        re.search(r'\Aa|b\Z', string)
which is not what we want.
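A small sketch of why this goes wrong: '|' binds more loosely than the anchors, so the naive pattern reads as "\Aa" or "b\Z". The sample strings are assumptions picked to expose the difference:

```python
import re

pattern = 'a|b'
naive = r'\A' + pattern + r'\Z'   # r'\Aa|b\Z'

# The naive search accepts strings that merely start with 'a' or end with 'b'.
print(re.search(naive, 'ab'))        # matches 'a' at the start
print(re.search(naive, 'xb'))        # matches 'b' at the end

# re.fullmatch() requires the whole string to be 'a' or 'b'.
print(re.fullmatch(pattern, 'ab'))   # None
print(re.fullmatch(pattern, 'b'))    # matches 'b'
```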

To avoid that problem we need to add parentheses:
        re.search(r'\A('+pattern+r')\Z', string)
But now we have created a group, and if the pattern already contained
groups and backreferences we may just have broken it.
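A sketch of that breakage, using a made-up pattern with two groups and one backreference. Wrapping it in a capturing group shifts every group number by one, so \2 silently points at a different group:

```python
import re

pattern = r'(a)(b)\2'                   # \2 refers to (b): matches 'abb'
wrapped = r'\A(' + pattern + r')\Z'     # now \2 refers to (a)

original = re.fullmatch(pattern, 'abb')
broken_abb = re.search(wrapped, 'abb')  # no longer matches
broken_aba = re.search(wrapped, 'aba')  # matches instead

print(original.group(1))   # 'a'
print(broken_abb)          # None
print(broken_aba.group(1)) # 'aba' (group 1 is now the wrapper)
```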

So we need to use a non-capturing group:
        re.search(r'\A(?:'+pattern+r')\Z', string)
...and now I think we can say we are at a level of complexity where we
cannot reasonably expect the average user to always remember to write
exactly this, so it makes sense to add an easy-to-use fullmatch function
to the re namespace.
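For comparison, a sketch of that hand-written form next to re.fullmatch(). The helper name is hypothetical, and as noted earlier in this post it still does not handle a pattern that starts with an inline flag such as (?i):

```python
import re

def fullmatch_via_search(pattern, string, flags=0):
    # Hypothetical helper: emulate re.fullmatch() with a non-capturing wrapper.
    return re.search(r'\A(?:' + pattern + r')\Z', string, flags)

# Alternation is now anchored as a whole.
print(fullmatch_via_search('a|b', 'ab'))            # None
print(fullmatch_via_search('a|b', 'b').group(0))    # 'b'

# Group numbering and backreferences are left untouched.
m = fullmatch_via_search(r'(a)(b)\2', 'abb')
print(m.group(1), m.group(2))                       # a b

# re.fullmatch() gives the same answers without the boilerplate.
print(re.fullmatch('a|b', 'ab'))                    # None
print(re.fullmatch(r'(a)(b)\2', 'abb').group(1))    # 'a'
```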

It may not be the real historical reason behind re.fullmatch, but
personally I will stick with that one :)

Cheers,

Thierry
