| Started by | Thierry <no@mail.com> |
|---|---|
| First post | 2015-12-22 11:56 +0100 |
| Last post | 2015-12-22 18:18 +0100 |
| Articles | 4 — 4 participants |
match point Thierry <no@mail.com> - 2015-12-22 11:56 +0100
Re: match point Chris Angelico <rosuav@gmail.com> - 2015-12-22 22:07 +1100
Re: match point Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2015-12-22 13:01 +0100
Re: match point Thierry Closen <no@mail.com> - 2015-12-22 18:18 +0100
| From | Thierry <no@mail.com> |
|---|---|
| Date | 2015-12-22 11:56 +0100 |
| Subject | match point |
| Message-ID | <20151222115648.1222c992@eeearch> |
Hi,

Reading the docs about regular expressions, I am under the impression that calling
re.match(pattern, string)
is exactly the same as
re.search(r'\A'+pattern, string)

Same for fullmatch, which amounts to
re.search(r'\A'+pattern+r'\Z', string)

The docs devote a chapter to "6.2.5.3. search() vs. match()", but they only discuss how match() differs from search() with '^', completely omitting the case of search() with r'\A'.

At first I thought those functions could have been introduced at a time when r'\A' and r'\Z' did not exist, but then I noticed that re.fullmatch is a recent addition (Python 3.4).

Surely the Python devs are not cluttering the interface of the re module with useless functions for no reason, so what am I missing?

Maybe re.match has an implementation that makes it more efficient? But then why would I ever use r'\A', since that anchor makes a pattern match in only a single position, and is therefore useless in functions like re.findall, re.finditer or re.split?

Thanks,
Thierry
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-12-22 22:07 +1100 |
| Message-ID | <mailman.55.1450782472.2237.python-list@python.org> |
| In reply to | #100720 |
On Tue, Dec 22, 2015 at 9:56 PM, Thierry <no@mail.com> wrote:
> Maybe re.match has an implementation that makes it more efficient? But
> then why would I ever use r'\A', since that anchor makes a pattern match
> in only a single position, and is therefore useless in functions like
> re.findall, re.finditer or re.split?

Much of the value of regular expressions is that they are NOT string literals (just strings). Effectively, someone who has no authority to change the code of the program can cause it to change from re.search to re.match, simply by putting \A at the beginning of the search string.

ChrisA
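A minimal sketch of that point, assuming the pattern arrives as data (from a config file or a search box, say) while the call in the program stays fixed. The helper name `find_in` and the sample text are made up for illustration.

```python
import re

def find_in(text, user_pattern):
    """Hypothetical helper: the program always calls re.search()."""
    return re.search(user_pattern, text)

text = "spam and eggs"

# Without an anchor, the pattern may match anywhere in the text.
unanchored = find_in(text, r"eggs")

# With a leading \A supplied by the user, the same call behaves like re.match().
anchored_miss = find_in(text, r"\Aeggs")
anchored_hit = find_in(text, r"\Aspam")

print(unanchored.span())      # (9, 13)
print(anchored_miss)          # None
print(anchored_hit.span())    # (0, 4)
```

The code never changes; only the pattern string does.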
| From | Thomas 'PointedEars' Lahn <PointedEars@web.de> |
|---|---|
| Date | 2015-12-22 13:01 +0100 |
| Message-ID | <5716668.JdziER6HFD@PointedEars.de> |
| In reply to | #100720 |
Thierry wrote:
> Reading the docs about regular expressions, I am under the impression
> that calling
> re.match(pattern, string)
> is exactly the same as
> re.search(r'\A'+pattern, string)

Correct.

> Same for fullmatch, that amounts to
> re.search(r'\A'+pattern+r'\Z', string)

Correct.

> The docs devote a chapter to "6.2.5.3. search() vs. match()", but they
> only discuss how match() is different from search() with '^', completely
> eluding the case of search() with r'\A'.
>
> At first I thought those functions could have been introduced at a time
> when r'\A' and r'\Z' did not exist, but then I noticed that re.fullmatch
> is a recent addition (python 3.4)
>
> Maybe re.match has an implementation that makes it more efficient? But
> then why would I ever use r'\A', since that anchor makes a pattern match
> in only a single position, and is therefore useless in functions like
> re.findall, re.finditer or re.split?

(Thank you for pointing out “\A” and “\Z”; this strongly suggests that even in raw mode you should always match literal “\” with the regular expression “\\”, or IOW that you should always use re.escape() when constructing regular expressions from arbitrary strings for matching WinDOS/UNC paths, for example.)

If you used

re.search(r'\Afoo.*^bar$.*baz\Z', string, flags=re.DOTALL | re.MULTILINE)

you could match only strings that start with “foo”, have a line following that which contains only “bar”, and end with “baz”. (In multi-line mode, the meaning of “^” and “$” changes to start-of-line and end-of-line, respectively.)

Presumably, re.fullmatch() was introduced in Python 3.4 so that you can write

re.fullmatch(r'foo.*^bar$.*baz', string, flags=re.DOTALL | re.MULTILINE)

instead, since you are not actually searching, and it makes sure that you *always* match against the whole string, regardless of the expression.

| Note that even in MULTILINE mode, re.match() will only match at the
| beginning of the string and not at the beginning of each line.

and that

| re.search(pattern, string, flags=0)
| Scan through string looking for the first location where the regular
| expression pattern produces a match […]

So with both re.search() and re.fullmatch(), you are more flexible should the expression be dynamically constructed: you can always use re.search().

<https://docs.python.org/3/library/re.html#re.search>

Please add your last name, Thierry #1701.

--
PointedEars
Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.
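The two calls discussed in this post can be tried side by side. This is only a sketch: the sample strings are invented, and the last two lines illustrate the quoted remark about re.match() in MULTILINE mode.

```python
import re

FLAGS = re.DOTALL | re.MULTILINE

good = "foo\nbar\nbaz"     # starts with foo, has a line "bar", ends with baz
bad = "foo\nbarx\nbaz"     # the middle line is not exactly "bar"

# Anchored search and fullmatch agree on these inputs.
searched = re.search(r'\Afoo.*^bar$.*baz\Z', good, flags=FLAGS)
full = re.fullmatch(r'foo.*^bar$.*baz', good, flags=FLAGS)
print(searched.span(), full.span())   # (0, 11) (0, 11)

print(re.search(r'\Afoo.*^bar$.*baz\Z', bad, flags=FLAGS))   # None
print(re.fullmatch(r'foo.*^bar$.*baz', bad, flags=FLAGS))    # None

# MULTILINE changes "^" for search(), but match() still starts at position 0.
line_search = re.search(r'^bar', good, flags=re.MULTILINE)
line_match = re.match(r'^bar', good, flags=re.MULTILINE)
print(line_search.start(), line_match)   # 4 None
```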
| From | Thierry Closen <no@mail.com> |
|---|---|
| Date | 2015-12-22 18:18 +0100 |
| Message-ID | <20151222181825.67b87012@eeearch> |
| In reply to | #100720 |
I found the story behind the creation of re.fullmatch().
I had no luck before because I was searching under "www.python.org/dev",
while in reality it sprang out of a bug report:
https://bugs.python.org/issue16203
In summary, there were repeated bugs where during maintenance of code
the $ symbol disappeared from patterns, hence the decision to create a
function that anchors the pattern to the end of the string independently
of the presence of that symbol.
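A small sketch of the kind of slip described in that bug report, using an invented digits-only check. It also shows a related detail of `$`: without MULTILINE it still matches just before a trailing newline, which fullmatch() does not allow.

```python
import re

# The intended pattern was r"\d+$", but the "$" was lost during maintenance.
pattern = r"\d+"

# match() only anchors at the start, so trailing junk slips through.
loose = re.match(pattern, "123abc")
print(loose.group())                      # 123

# fullmatch() anchors at both ends regardless of what the pattern contains.
print(re.fullmatch(pattern, "123abc"))    # None
print(re.fullmatch(pattern, "123"))       # a match object

# Even with the "$" in place, a trailing newline is tolerated by match().
with_dollar = re.match(r"\d+$", "123\n")
strict = re.fullmatch(r"\d+", "123\n")
print(with_dollar is not None, strict)    # True None
```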
I am perplexed by what I discovered, as I would never have thought that
such prominent functions can be created to scratch such a minor itch:
The creation of fullmatch() might address this very specific issue, but
I would tend to think that if really certain symbols disappear from
patterns inside a code base, this should be seen as the sign of more
profound problems in the code maintenance processes.
Anyway, the discussion around that bug suggested another argument to me,
one that I find more satisfying:
When I was saying that
re.fullmatch(pattern, string)
is exactly the same as
re.search(r'\A'+pattern+r'\Z', string)
I was wrong.
For example if pattern starts with an inline flag like (?i), we cannot
simply stick \A in front of it.
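How strictly this is enforced seems to depend on the Python version: as far as I know, newer releases reject a global inline flag that is not at the very start of the pattern, while some older ones only warned about it or accepted it. The sketch below therefore reports what the running interpreter does rather than assuming one outcome. `try_prefixed` is a hypothetical helper written for this illustration.

```python
import re
import warnings

pattern = r"(?i)abc"   # made-up pattern that starts with an inline flag

def try_prefixed(pat, text):
    """Search with the start anchor prepended; report what the interpreter does."""
    try:
        with warnings.catch_warnings():
            # Treat a deprecation warning about the flag position as a rejection.
            warnings.simplefilter("error")
            found = re.search(r"\A" + pat, text)
    except (re.error, DeprecationWarning):
        return "rejected"
    return "match" if found else "no match"

# The outcome of the prefixed pattern is version dependent.
print(try_prefixed(pattern, "ABC"))

# fullmatch() takes the pattern unchanged, so the flag stays at the start.
print(re.fullmatch(pattern, "ABC") is not None)   # True
```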
Another example: suppose pattern is 'a|b'. We end up with:
re.search(r'\Aa|b\Z', string)
which is not what we want.
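To see why, note that alternation binds more loosely than the anchors, so the concatenated pattern reads as "\Aa" or "b\Z". A quick sketch with invented test strings:

```python
import re

pattern = r"a|b"
naive = r"\A" + pattern + r"\Z"    # parsed as (\Aa) | (b\Z)

# Each alternative carries only one of the two anchors.
starts_with_a = re.search(naive, "axyz")
ends_with_b = re.search(naive, "xyzb")
print(starts_with_a.span())   # (0, 1)
print(ends_with_b.span())     # (3, 4)

# fullmatch() treats the whole alternation as the unit to anchor.
print(re.fullmatch(pattern, "axyz"))   # None
print(re.fullmatch(pattern, "xyzb"))   # None
print(re.fullmatch(pattern, "a"))      # a match object
```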
To avoid that problem we need to add parentheses:
re.search(r'\A('+pattern+r')\Z', string)
But now we created a group, and if the pattern already contained groups
and backreferences we may just have broken it.
So we need to use a non-capturing group:
re.search(r'\A(?:'+pattern+r')\Z', string)
...and now I think we can say we are at a level of complexity where we
cannot reasonably expect the average user to always remember to write
exactly this, so it makes sense to add an easy-to-use fullmatch function
to the re namespace.
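The group-numbering problem can be sketched as follows. The pattern is made up: it has two groups and a backreference to the second one, so wrapping it in a capturing group shifts every group number by one and silently changes what \2 refers to.

```python
import re

pattern = r"(a)(b)\2"   # hypothetical: "a", then "b", then the second group again

capturing = r"\A(" + pattern + r")\Z"        # \2 now points at the group for "a"
non_capturing = r"\A(?:" + pattern + r")\Z"  # group numbers are left alone

# The original pattern accepts "abb" and rejects "aba".
print(re.fullmatch(pattern, "abb").groups())      # ('a', 'b')
print(re.fullmatch(pattern, "aba"))               # None

# The capturing wrapper inverts that behaviour.
print(re.search(capturing, "abb"))                # None
print(re.search(capturing, "aba") is not None)    # True

# The non-capturing wrapper agrees with fullmatch().
print(re.search(non_capturing, "abb").groups())   # ('a', 'b')
print(re.search(non_capturing, "aba"))            # None
```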
It may not be the real historical reason behind re.fullmatch, but
personally I will stick with that one :)
Cheers,
Thierry