Groups > comp.lang.python > #17362 > unrolled thread
| Started by | John Gordon <gordon@panix.com> |
|---|---|
| First post | 2011-12-16 16:49 +0000 |
| Last post | 2011-12-19 19:58 -0700 |
| Articles | 15 — 8 participants |
re.sub(): replace longest match instead of leftmost match? John Gordon <gordon@panix.com> - 2011-12-16 16:49 +0000
Re: re.sub(): replace longest match instead of leftmost match? Devin Jeanpierre <jeanpierreda@gmail.com> - 2011-12-16 11:56 -0500
Re: re.sub(): replace longest match instead of leftmost match? John Gordon <gordon@panix.com> - 2011-12-16 21:04 +0000
Re: re.sub(): replace longest match instead of leftmost match? MRAB <python@mrabarnett.plus.com> - 2011-12-16 21:36 +0000
Re: re.sub(): replace longest match instead of leftmost match? Duncan Booth <duncan.booth@invalid.invalid> - 2011-12-19 15:46 +0000
Re: re.sub(): replace longest match instead of leftmost match? MRAB <python@mrabarnett.plus.com> - 2011-12-16 17:36 +0000
Re: re.sub(): replace longest match instead of leftmost match? Ian Kelly <ian.g.kelly@gmail.com> - 2011-12-16 10:57 -0700
Re: re.sub(): replace longest match instead of leftmost match? Ian Kelly <ian.g.kelly@gmail.com> - 2011-12-16 10:59 -0700
Re: re.sub(): replace longest match instead of leftmost match? John Gordon <gordon@panix.com> - 2011-12-16 21:06 +0000
Re: re.sub(): replace longest match instead of leftmost match? MRAB <python@mrabarnett.plus.com> - 2011-12-16 18:19 +0000
Re: re.sub(): replace longest match instead of leftmost match? Roy Smith <roy@panix.com> - 2011-12-16 13:36 -0500
Re: re.sub(): replace longest match instead of leftmost match? John Gordon <gordon@panix.com> - 2011-12-16 21:07 +0000
Re: re.sub(): replace longest match instead of leftmost match? Terry Reedy <tjreedy@udel.edu> - 2011-12-16 17:26 -0500
Re: re.sub(): replace longest match instead of leftmost match? ting@thsu.org - 2011-12-19 15:15 -0800
Re: re.sub(): replace longest match instead of leftmost match? Ian Kelly <ian.g.kelly@gmail.com> - 2011-12-19 19:58 -0700
| From | John Gordon <gordon@panix.com> |
|---|---|
| Date | 2011-12-16 16:49 +0000 |
| Subject | re.sub(): replace longest match instead of leftmost match? |
| Message-ID | <jcfsrk$skh$1@reader1.panix.com> |
According to the documentation on re.sub(), it replaces the leftmost
matching pattern.
However, I want to replace the *longest* matching pattern, which is
not necessarily the leftmost match. Any suggestions?
I'm working with IPv6 CIDR strings, and I want to replace the longest
match of "(0000:|0000$)+" with ":". But when I use re.sub() it replaces
the leftmost match, even if there is a longer match later in the string.
I'm also looking for a regexp that will remove leading zeroes in each
four-digit group, but will leave a single zero if the group was all
zeroes.
Thanks!
--
John Gordon A is for Amy, who fell down the stairs
gordon@panix.com B is for Basil, assaulted by bears
-- Edward Gorey, "The Gashlycrumb Tinies"
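To make the question concrete, here is a minimal sketch of the behaviour being described. The sample address is an assumption chosen for illustration; it is not from the original post. With `count=1`, `re.sub()` replaces the first run it finds, even though a longer run appears later in the string.

```python
import re

PATTERN = r"(0000:|0000$)+"

# Hypothetical sample address: a one-group run followed by a four-group run.
addr = "2001:0000:0db8:0000:0000:0000:0000:0001"

# count=1 limits the substitution to the leftmost match.
leftmost = re.sub(PATTERN, ":", addr, count=1)

# Listing every run shows that the second one is the longer of the two.
runs = [m.group(0) for m in re.finditer(PATTERN, addr)]

print(leftmost)
print(runs)
```

Without `count`, `re.sub()` would replace every run, which is not what is wanted here either.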
| From | Devin Jeanpierre <jeanpierreda@gmail.com> |
|---|---|
| Date | 2011-12-16 11:56 -0500 |
| Message-ID | <mailman.3737.1324054637.27778.python-list@python.org> |
| In reply to | #17362 |
You could use re.finditer to find the longest match, and then replace
it by hand (via string slicing).

A match is the longest if (m.end() - m.start()) is the largest, so:

max(re.finditer(...), key=lambda m: (m.end() - m.start()))

-- Devin

P.S. Does anyone else get bothered by how it's slice.start and
slice.stop, but match.start() and match.end()?

On Fri, Dec 16, 2011 at 11:49 AM, John Gordon <gordon@panix.com> wrote:
> According to the documentation on re.sub(), it replaces the leftmost
> matching pattern.
>
> However, I want to replace the *longest* matching pattern, which is
> not necessarily the leftmost match. Any suggestions?
>
> I'm working with IPv6 CIDR strings, and I want to replace the longest
> match of "(0000:|0000$)+" with ":". But when I use re.sub() it replaces
> the leftmost match, even if there is a longer match later in the string.
>
> I'm also looking for a regexp that will remove leading zeroes in each
> four-digit group, but will leave a single zero if the group was all
> zeroes.
>
> Thanks!
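The finditer-plus-slicing idea can be sketched as follows. The function name `collapse_longest_run` and the non-capturing form of the pattern are assumptions made for this illustration, not code from the thread.

```python
import re


def collapse_longest_run(addr, pattern=r"(?:0000:|0000$)+"):
    """Replace the longest run of all-zero groups in addr with ':'.

    Hypothetical helper: addr is assumed to be a fully expanded IPv6
    address with four hex digits per group.
    """
    matches = list(re.finditer(pattern, addr))
    if not matches:
        return addr
    # max() returns the first maximal item, so ties go to the leftmost run.
    longest = max(matches, key=lambda m: m.end() - m.start())
    # Splice the replacement in by slicing around the match.
    return addr[:longest.start()] + ":" + addr[longest.end():]


print(collapse_longest_run("2001:0000:0db8:0000:0000:0000:0000:0001"))
```

One caveat of this pattern: a run at the very start of the address is replaced by a single colon (for example ":0001" rather than "::0001"), so that case would need extra handling.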
| From | John Gordon <gordon@panix.com> |
|---|---|
| Date | 2011-12-16 21:04 +0000 |
| Message-ID | <jcgbph$nnj$1@reader1.panix.com> |
| In reply to | #17363 |
In <mailman.3737.1324054637.27778.python-list@python.org> Devin Jeanpierre <jeanpierreda@gmail.com> writes:
> You could use re.finditer to find the longest match, and then replace
> it manually by hand (via string slicing).
> (a match is the longest if (m.end() - m.start()) is the largest --
> so, max(re.finditer(...), key=lambda m: (m.end() - m.start()))
I ended up doing something similar:
# find the longest match
longest_match = ''
for word in re.findall('((0000:?)+)', ip6):
    if len(word[0]) > len(longest_match):
        longest_match = word[0]

# if we found a match, replace it with a colon
if longest_match:
    ip6 = re.sub(longest_match, ':', ip6, 1)
Thanks!
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2011-12-16 21:36 +0000 |
| Message-ID | <mailman.3752.1324071372.27778.python-list@python.org> |
| In reply to | #17384 |
On 16/12/2011 21:04, John Gordon wrote:
> In <mailman.3737.1324054637.27778.python-list@python.org> Devin Jeanpierre <jeanpierreda@gmail.com> writes:
>
>> You could use re.finditer to find the longest match, and then replace
>> it manually by hand (via string slicing).
>
>> (a match is the longest if (m.end() - m.start()) is the largest --
>> so, max(re.finditer(...), key=lambda m: (m.end() - m.start()))
>
> I ended up doing something similar:
>
> # find the longest match
> longest_match = ''
> for word in re.findall('((0000:?)+)', ip6):
>     if len(word[0]) > len(longest_match):
>         longest_match = word[0]
>
> # if we found a match, replace it with a colon
> if longest_match:
>     ip6 = re.sub(longest_match, ':', ip6, 1)
>
For a simple replace, using re is probably overkill. The .replace
method is a better solution:
ip6 = longest_match.replace(ip6, ':', 1)
| From | Duncan Booth <duncan.booth@invalid.invalid> |
|---|---|
| Date | 2011-12-19 15:46 +0000 |
| Message-ID | <Xns9FC0A066BC25Aduncanbooth@127.0.0.1> |
| In reply to | #17387 |
MRAB <python@mrabarnett.plus.com> wrote:
> On 16/12/2011 21:04, John Gordon wrote:
>> In <mailman.3737.1324054637.27778.python-list@python.org> Devin
>> Jeanpierre <jeanpierreda@gmail.com> writes:
>>
>>> You could use re.finditer to find the longest match, and then
>>> replace it manually by hand (via string slicing).
>>
>>> (a match is the longest if (m.end() - m.start()) is the largest --
>>> so, max(re.finditer(...), key=lambda m: (m.end() - m.start()))
>>
>> I ended up doing something similar:
>>
>> # find the longest match
>> longest_match = ''
>> for word in re.findall('((0000:?)+)', ip6):
>>     if len(word[0]) > len(longest_match):
>>         longest_match = word[0]
>>
>> # if we found a match, replace it with a colon
>> if longest_match:
>>     ip6 = re.sub(longest_match, ':', ip6, 1)
>>
> For a simple replace, using re is probably overkill. The .replace
> method is a better solution:
>
> ip6 = longest_match.replace(ip6, ':', 1)
I think you got longest_match/ip6 backwards there.
Anyway, for those who like brevity:
try:
    ip6 = ip6.replace(max(re.findall('((?:0000:?)+)', ip6), key=len), ':', 1)
except ValueError:
    pass
--
Duncan Booth http://kupuguy.blogspot.com
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2011-12-16 17:36 +0000 |
| Message-ID | <mailman.3738.1324056991.27778.python-list@python.org> |
| In reply to | #17362 |
On 16/12/2011 16:49, John Gordon wrote:
> According to the documentation on re.sub(), it replaces the leftmost
> matching pattern.
>
> However, I want to replace the *longest* matching pattern, which is
> not necessarily the leftmost match. Any suggestions?
>
> I'm working with IPv6 CIDR strings, and I want to replace the longest
> match of "(0000:|0000$)+" with ":". But when I use re.sub() it replaces
> the leftmost match, even if there is a longer match later in the string.
>
> I'm also looking for a regexp that will remove leading zeroes in each
> four-digit group, but will leave a single zero if the group was all
> zeroes.
>
How about this:

result = re.sub(r"\b0+(\d)\b", r"\1", string)
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2011-12-16 10:57 -0700 |
| Message-ID | <mailman.3740.1324058258.27778.python-list@python.org> |
| In reply to | #17362 |
On Fri, Dec 16, 2011 at 10:36 AM, MRAB <python@mrabarnett.plus.com> wrote:
> On 16/12/2011 16:49, John Gordon wrote:
>>
>> According to the documentation on re.sub(), it replaces the leftmost
>> matching pattern.
>>
>> However, I want to replace the *longest* matching pattern, which is
>> not necessarily the leftmost match. Any suggestions?
>>
>> I'm working with IPv6 CIDR strings, and I want to replace the longest
>> match of "(0000:|0000$)+" with ":". But when I use re.sub() it replaces
>> the leftmost match, even if there is a longer match later in the string.
>>
>> I'm also looking for a regexp that will remove leading zeroes in each
>> four-digit group, but will leave a single zero if the group was all
>> zeroes.
>>
> How about this:
>
> result = re.sub(r"\b0+(\d)\b", r"\1", string)

Close.

pattern = r'\b0+([1-9a-f]+|0)\b'
re.sub(pattern, r'\1', string, flags=re.IGNORECASE)

Cheers,
Ian
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2011-12-16 10:59 -0700 |
| Message-ID | <mailman.3742.1324058429.27778.python-list@python.org> |
| In reply to | #17362 |
On Fri, Dec 16, 2011 at 10:57 AM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> On Fri, Dec 16, 2011 at 10:36 AM, MRAB <python@mrabarnett.plus.com> wrote:
>> On 16/12/2011 16:49, John Gordon wrote:
>>>
>>> According to the documentation on re.sub(), it replaces the leftmost
>>> matching pattern.
>>>
>>> However, I want to replace the *longest* matching pattern, which is
>>> not necessarily the leftmost match. Any suggestions?
>>>
>>> I'm working with IPv6 CIDR strings, and I want to replace the longest
>>> match of "(0000:|0000$)+" with ":". But when I use re.sub() it replaces
>>> the leftmost match, even if there is a longer match later in the string.
>>>
>>> I'm also looking for a regexp that will remove leading zeroes in each
>>> four-digit group, but will leave a single zero if the group was all
>>> zeroes.
>>>
>> How about this:
>>
>> result = re.sub(r"\b0+(\d)\b", r"\1", string)
>
> Close.
>
> pattern = r'\b0+([1-9a-f]+|0)\b'
> re.sub(pattern, r'\1', string, flags=re.IGNORECASE)
Doh, that's still not quite right.
pattern = r'\b0{1,3}([1-9a-f][0-9a-f]*|0)\b'
re.sub(pattern, r'\1', string, flags=re.IGNORECASE)
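A short check of this final pattern, as a sketch: the helper name `strip_leading_zeroes` and the sample address are assumptions added for illustration. The `0{1,3}` prefix consumes up to three leading zeroes, and the group then keeps either the remaining hex digits or a single `0` for an all-zero group.

```python
import re

# The pattern from the post above, compiled once with IGNORECASE.
LEADING_ZEROES = re.compile(r"\b0{1,3}([1-9a-f][0-9a-f]*|0)\b", re.IGNORECASE)


def strip_leading_zeroes(addr):
    """Drop leading zeroes in each group, keeping one zero for '0000'."""
    return LEADING_ZEROES.sub(r"\1", addr)


# Hypothetical sample address, not from the thread.
print(strip_leading_zeroes("2001:0db8:0000:0042:0000:8a2e:0370:7334"))
```

Groups without a leading zero, such as "2001" or "8a2e", are left untouched because the pattern requires at least one zero right after the word boundary.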
| From | John Gordon <gordon@panix.com> |
|---|---|
| Date | 2011-12-16 21:06 +0000 |
| Message-ID | <jcgbru$nnj$2@reader1.panix.com> |
| In reply to | #17369 |
In <mailman.3742.1324058429.27778.python-list@python.org> Ian Kelly <ian.g.kelly@gmail.com> writes:
> >>> I'm also looking for a regexp that will remove leading zeroes in each
> >>> four-digit group, but will leave a single zero if the group was all
> >>> zeroes.
> pattern = r'\b0{1,3}([1-9a-f][0-9a-f]*|0)\b'
> re.sub(pattern, r'\1', string, flags=re.IGNORECASE)
Perfect. Thanks Ian!
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2011-12-16 18:19 +0000 |
| Message-ID | <mailman.3743.1324059573.27778.python-list@python.org> |
| In reply to | #17362 |
On 16/12/2011 17:57, Ian Kelly wrote:
> On Fri, Dec 16, 2011 at 10:36 AM, MRAB <python@mrabarnett.plus.com> wrote:
>> On 16/12/2011 16:49, John Gordon wrote:
>>>
>>> According to the documentation on re.sub(), it replaces the leftmost
>>> matching pattern.
>>>
>>> However, I want to replace the *longest* matching pattern, which is
>>> not necessarily the leftmost match. Any suggestions?
>>>
>>> I'm working with IPv6 CIDR strings, and I want to replace the longest
>>> match of "(0000:|0000$)+" with ":". But when I use re.sub() it replaces
>>> the leftmost match, even if there is a longer match later in the string.
>>>
>>> I'm also looking for a regexp that will remove leading zeroes in each
>>> four-digit group, but will leave a single zero if the group was all
>>> zeroes.
>>>
>> How about this:
>>
>> result = re.sub(r"\b0+(\d)\b", r"\1", string)
>
> Close.
>
> pattern = r'\b0+([1-9a-f]+|0)\b'
> re.sub(pattern, r'\1', string, flags=re.IGNORECASE)
>
Ah, OK. The OP said "digit" instead of "hex digit". That's my excuse. :-)
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2011-12-16 13:36 -0500 |
| Message-ID | <roy-7C4E8A.13361716122011@news.panix.com> |
| In reply to | #17362 |
In article <jcfsrk$skh$1@reader1.panix.com>,
John Gordon <gordon@panix.com> wrote:

> I'm working with IPv6 CIDR strings, and I want to replace the longest
> match of "(0000:|0000$)+" with ":". But when I use re.sub() it replaces
> the leftmost match, even if there is a longer match later in the string.
>
> I'm also looking for a regexp that will remove leading zeroes in each
> four-digit group, but will leave a single zero if the group was all
> zeroes.

Having done quite a bit of IPv6 work, my opinion here is that you're
trying to do The Wrong Thing.

What you want is an IPv6 class which represents an address in some
canonical form. It would have constructors which accept any of the
RFC-2373 defined formats. It would also have string formatting methods
to convert the internal form into any of these formats.

Then, instead of attempting to regex your way directly from one string
representation to another, you would do something like:

addr_string = "FEDC:BA98:7654:3210:FEDC:BA98:7654:321"
print IPv6(addr_string).to_short_form()
| From | John Gordon <gordon@panix.com> |
|---|---|
| Date | 2011-12-16 21:07 +0000 |
| Message-ID | <jcgbui$nnj$3@reader1.panix.com> |
| In reply to | #17374 |
In <roy-7C4E8A.13361716122011@news.panix.com> Roy Smith <roy@panix.com> writes:
> Having done quite a bit of IPv6 work, my opinion here is that you're
> trying to do The Wrong Thing.
> What you want is an IPv6 class which represents an address in some
> canonical form. It would have constructors which accept any of the
> RFC-2373 defined formats. It would also have string formatting methods
> to convert the internal form into any of these formats.
> Then, instead of attempting to regex your way directly from one string
> representation to another, you would do something like:
> addr_string = "FEDC:BA98:7654:3210:FEDC:BA98:7654:321"
> print IPv6(addr_string).to_short_form()
This does sound like a more robust solution. I'll give it some thought.
Thanks Roy!
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2011-12-16 17:26 -0500 |
| Message-ID | <mailman.3756.1324074610.27778.python-list@python.org> |
| In reply to | #17374 |
On 12/16/2011 1:36 PM, Roy Smith wrote:

> What you want is an IPv6 class which represents an address in some
> canonical form. It would have constructors which accept any of the
> RFC-2373 defined formats. It would also have string formatting methods
> to convert the internal form into any of these formats.
>
> Then, instead of attempting to regex your way directly from one string
> representation to another, you would do something like:
>
> addr_string = "FEDC:BA98:7654:3210:FEDC:BA98:7654:321"
> print IPv6(addr_string).to_short_form()

There are at least 2 third-party IP classes in use. I would not be
surprised if at least one of them does this.

-- 
Terry Jan Reedy
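For readers on Python 3.3 or later, the standard library's `ipaddress` module provides this kind of class. It postdates this thread, so the sketch below is not what the posters had in mind, and the sample addresses are assumptions chosen for illustration.

```python
import ipaddress

# Parse a fully expanded address; the object stores it as an integer
# internally and can render it in either textual form.
addr = ipaddress.IPv6Address("2001:0db8:0000:0000:0000:0000:0000:0001")
short_form = addr.compressed   # leading zeroes dropped, longest zero run collapsed
long_form = addr.exploded      # all eight groups, four hex digits each

# CIDR strings are handled by the network class in the same way.
net = ipaddress.IPv6Network("2001:0db8:0000:0000:0000:0000:0000:0000/32")
short_net = net.compressed

print(short_form)
print(long_form)
print(short_net)
```

This covers both of the original requests (collapsing the longest zero run and stripping leading zeroes) without any regular expressions.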
| From | ting@thsu.org |
|---|---|
| Date | 2011-12-19 15:15 -0800 |
| Message-ID | <b0409b9d-ef6b-4349-936b-5de719235d6f@m10g2000vbc.googlegroups.com> |
| In reply to | #17362 |
On Dec 16, 11:49 am, John Gordon <gor...@panix.com> wrote:
> I'm working with IPv6 CIDR strings, and I want to replace the longest
> match of "(0000:|0000$)+" with ":". But when I use re.sub() it replaces
> the leftmost match, even if there is a longer match later in the string.

Typically this means that your regular expression is not specific
enough.

That is, if you get multiple matches, and you need to sort through
those matches before performing a replace, it usually means that you
should rewrite your expression to get a single match.

Invariably this happens when you try to take short cuts. I can't blame
you for using a short cut, as sometimes short cuts just work, but once
you find that your short cut fails, you need to step back and rethink
the problem, rather than try to hack your short cut.

I don't know what you are doing, but off the top of my head, I'd check
to see if the CIDR string is wrapped in a header message and include
the header as part of the search pattern, or if you know the IPv6
strings are interspersed with IPv4 strings, I would rewrite the regex
to exclude IPv4 strings.

--
// T.Hsu
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2011-12-19 19:58 -0700 |
| Message-ID | <mailman.3842.1324349925.27778.python-list@python.org> |
| In reply to | #17527 |
On Mon, Dec 19, 2011 at 4:15 PM, <ting@thsu.org> wrote:
> On Dec 16, 11:49 am, John Gordon <gor...@panix.com> wrote:
>> I'm working with IPv6 CIDR strings, and I want to replace the longest
>> match of "(0000:|0000$)+" with ":". But when I use re.sub() it replaces
>> the leftmost match, even if there is a longer match later in the string.
>
> Typically this means that your regular expression is not specific
> enough.
>
> That is, if you get multiple matches, and you need to sort through
> those matches before performing a replace, it usually means that you
> should rewrite your expression to get a single match.
>
> Invariably this happens when you try to take short cuts. I can't blame
> you for using a short cut, as sometimes short cuts just work, but once
> you find that your short cut fails, you need to step back and rethink
> the problem, rather than try to hack your short cut.

The problem isn't short cuts. To narrow down multiple matches to a
single longest match here, two additional criteria need to be met:
there must be no other match anywhere in the search string that is
longer than the match being considered, and there must be no match of
equal length preceding the match being considered.

Note that in the general case, the language is not regular and it
would not even be possible to get a single longest match using a
regular expression. For IPv6, it is possible only because IPv6
addresses are bounded in length.

In English, that regular expression would be constructed like this:

Any block of four or more groups of zeroes

OR

Zero or more blocks containing (zero to two groups of zero followed by
a non-zero group) followed by a block of three or more groups of zeroes
followed by zero or more blocks containing (a non-zero group followed
by zero to three groups of zeroes)

OR

Zero or more blocks containing (an optional zero group followed by a
non-zero group) followed by a block of two or more groups of zeroes
followed by zero or more blocks containing (a non-zero group followed
by zero to two groups of zeroes)

OR

Zero or more non-zero groups followed by a group of zeroes followed by
zero or more blocks containing (a non-zero group followed by an
optional zero group).

If anyone wants to give a crack at translating that to an actual
regular expression, I'd be interested in seeing it. The added
complexity is so great, though, that I for one would vastly prefer the
simple solution already proposed of getting all the matches and
iterating to find the longest.

Cheers,
Ian