Groups > comp.lang.python > #53169 > unrolled thread
| Started by | John Levine <johnl@iecc.com> |
|---|---|
| First post | 2013-08-28 16:44 +0000 |
| Last post | 2013-08-29 00:26 -0700 |
| Articles | 9 — 7 participants |
String splitting with exceptions John Levine <johnl@iecc.com> - 2013-08-28 16:44 +0000
Re: String splitting with exceptions Skip Montanaro <skip@pobox.com> - 2013-08-28 11:55 -0500
Re: String splitting with exceptions random832@fastmail.us - 2013-08-28 13:14 -0400
Re: String splitting with exceptions John Levine <johnl@iecc.com> - 2013-08-28 21:35 +0000
Re: String splitting with exceptions Tim Chase <python.list@tim.thechases.com> - 2013-08-28 12:32 -0500
Re: String splitting with exceptions Neil Cerutti <neilc@norwich.edu> - 2013-08-28 18:18 +0000
Re: String splitting with exceptions Neil Cerutti <neilc@norwich.edu> - 2013-08-28 18:08 +0000
Re: String splitting with exceptions Peter Otten <__peter__@web.de> - 2013-08-28 20:31 +0200
Re: String splitting with exceptions wxjmfauth@gmail.com - 2013-08-29 00:26 -0700
| From | John Levine <johnl@iecc.com> |
|---|---|
| Date | 2013-08-28 16:44 +0000 |
| Subject | String splitting with exceptions |
| Message-ID | <kvl9e5$19gk$1@leila.iecc.com> |
I have a crufty old DNS provisioning system that I'm rewriting and I
hope improving in python. (It's based on tinydns if you know what
that is.)

The record formats are, in the worst case, like this:

foo.[DOM]::[IP6::4361:6368:6574]:600::

What I would like to do is to split this string into a list like this:

[ 'foo.[DOM]','','[IP6::4361:6368:6574]','600','' ]

Colons are separators except when they're inside square brackets. I
have been messing around with re.split() and re.findall() and haven't
been able to come up with either a working separator pattern for
split() or a working field pattern for findall(). I came pretty
close with findall() but can't get it to reliably match the
nothing between two adjacent colons not inside brackets.

Any suggestions? I realize I could do it in a loop where I pick stuff
off the front of the string, but yuck.

This is in python 2.7.5.

--
Regards,
John Levine, johnl@iecc.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. http://jl.ly
| From | Skip Montanaro <skip@pobox.com> |
|---|---|
| Date | 2013-08-28 11:55 -0500 |
| Message-ID | <mailman.318.1377708933.19984.python-list@python.org> |
| In reply to | #53169 |
> The record formats are, in the worst case, like this:
>
> foo.[DOM]::[IP6::4361:6368:6574]:600::
> Any suggestions?

Write a little parser that can handle the record format?

Skip
| From | random832@fastmail.us |
|---|---|
| Date | 2013-08-28 13:14 -0400 |
| Message-ID | <mailman.319.1377710047.19984.python-list@python.org> |
| In reply to | #53169 |
On Wed, Aug 28, 2013, at 12:44, John Levine wrote:
> I have a crufty old DNS provisioning system that I'm rewriting and I
> hope improving in python. (It's based on tinydns if you know what
> that is.)
>
> The record formats are, in the worst case, like this:
>
> foo.[DOM]::[IP6::4361:6368:6574]:600::
>
> What I would like to do is to split this string into a list like this:
>
> [ 'foo.[DOM]','','[IP6::4361:6368:6574]','600','' ]
>
> Colons are separators except when they're inside square brackets. I
> have been messing around with re.split() and re.findall() and haven't
> been able to come up with either a working separator pattern for
> split() or a working field pattern for findall(). I came pretty
> close with findall() but can't get it to reliably match the
> nothing between two adjacent colons not inside brackets.
>
> Any suggestions? I realize I could do it in a loop where I pick stuff
> off the front of the string, but yuck.
>
> This is in python 2.7.5.
Can you have brackets within brackets? If so, this is impossible to deal
with within a regex.
Otherwise:
>>> re.findall('((?:[^[:]|\[[^]]*\])*):?',s)
['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '', '']
I'm not sure why _your_ list only has one empty string at the end. Is
the record always terminated by a colon that is not meant to imply an
empty field after it? If so, remove the question mark:
>>> re.findall('((?:[^[:]|\[[^]]*\])*):',s)
['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '']
I've done this kind of thing (for validation, not capturing) for email
addresses (there are some obscure bits of email address syntax that need
it) before, so it came to mind immediately.
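The two patterns above differ only in how the last field is treated. As a rough sketch (not something posted in the thread), the colon-terminated pattern can be wrapped in a helper that appends one colon to the record first, so that every field, including the last, ends in a colon. The name `split_fields` is hypothetical, and the sketch assumes brackets are balanced and never nested; it does no validation.

```python
import re

# One field: a run of characters that are neither '[' nor ':',
# mixed with whole [...] groups, followed by a terminating colon.
FIELD = re.compile(r'((?:[^\[:]|\[[^\]]*\])*):')

def split_fields(record):
    """Split on colons that are outside square brackets (no nesting)."""
    # Appending a colon makes every field colon-terminated, so trailing
    # fields come out the way str.split(':') would report them.
    return FIELD.findall(record + ':')

s = 'foo.[DOM]::[IP6::4361:6368:6574]:600::'
print(split_fields(s))
# ['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '', '']
print(split_fields('a:b'))
# ['a', 'b']
```

With the optional-colon pattern, a record that does not end in a colon (such as 'a:b') picks up one extra empty string from the empty match at the end of the input, which is why this sketch uses the colon-terminated form instead.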
| From | John Levine <johnl@iecc.com> |
|---|---|
| Date | 2013-08-28 21:35 +0000 |
| Message-ID | <kvlqek$hat$1@leila.iecc.com> |
| In reply to | #53171 |
>Can you have brackets within brackets? If so, this is impossible to deal
>with within a regex.
Nope. It's a regular language, not a CFL.
>Otherwise:
>>>> re.findall('((?:[^[:]|\[[^]]*\])*):?',s)
>['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '', '']
That seems to do it, thanks.
--
Regards,
John Levine, johnl@iecc.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. http://jl.ly
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2013-08-28 12:32 -0500 |
| Message-ID | <mailman.320.1377711095.19984.python-list@python.org> |
| In reply to | #53169 |
On 2013-08-28 13:14, random832@fastmail.us wrote:
> On Wed, Aug 28, 2013, at 12:44, John Levine wrote:
> > I have a crufty old DNS provisioning system that I'm rewriting
> > and I hope improving in python. (It's based on tinydns if you
> > know what that is.)
> >
> > The record formats are, in the worst case, like this:
> >
> > foo.[DOM]::[IP6::4361:6368:6574]:600::
>
> Otherwise:
> >>> re.findall('((?:[^[:]|\[[^]]*\])*):?',s)
> ['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '', '']
>
> I'm not sure why _your_ list only has one empty string at the end.
I wondered that. I also wondered about bracketed quoting that
doesn't start at the beginning of a field:
foo.[one:two]::[IP6::1234:5678:9101]:600::
^
This might be bogus, or one might want to catch this case.
-tkc
| From | Neil Cerutti <neilc@norwich.edu> |
|---|---|
| Date | 2013-08-28 18:18 +0000 |
| Message-ID | <b86t73Ff4flU2@mid.individual.net> |
| In reply to | #53172 |
On 2013-08-28, Tim Chase <python.list@tim.thechases.com> wrote:
> On 2013-08-28 13:14, random832@fastmail.us wrote:
>> On Wed, Aug 28, 2013, at 12:44, John Levine wrote:
>> > I have a crufty old DNS provisioning system that I'm rewriting
>> > and I hope improving in python. (It's based on tinydns if you
>> > know what that is.)
>> >
>> > The record formats are, in the worst case, like this:
>> >
>> > foo.[DOM]::[IP6::4361:6368:6574]:600::
>>
>> Otherwise:
>> >>> re.findall('((?:[^[:]|\[[^]]*\])*):?',s)
>> ['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '', '']
>>
>> I'm not sure why _your_ list only has one empty string at the end.
>
> I wondered that.
Good point. My little parser fails on that, too. It'll miss *all*
final fields. My parser needs "if s: yield s[b:]" at the end, to
operate like str.split, where the empty string is special.
--
Neil Cerutti
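For reference, the str.split conventions being compared here can be checked directly. This is only an illustration of standard library behaviour; the variable names are made up for the example.

```python
# With an explicit separator, empty fields are kept, including trailing ones,
# and the empty string still produces one (empty) field.
with_sep = '600::'.split(':')
empty_with_sep = ''.split(':')
print(with_sep)        # ['600', '', '']
print(empty_with_sep)  # ['']

# With no separator, runs of whitespace collapse and the empty string
# produces no fields at all; this is the "special" empty-string case.
empty_no_sep = ''.split()
print(empty_no_sep)    # []
```

So a final `if s: yield s[b:]` mirrors the no-separator behaviour for an empty record, whereas an unconditional final yield would mirror `split(':')`.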
| From | Neil Cerutti <neilc@norwich.edu> |
|---|---|
| Date | 2013-08-28 18:08 +0000 |
| Message-ID | <b86skbFf4flU1@mid.individual.net> |
| In reply to | #53169 |
On 2013-08-28, John Levine <johnl@iecc.com> wrote:
> I have a crufty old DNS provisioning system that I'm rewriting and I
> hope improving in python. (It's based on tinydns if you know what
> that is.)
>
> The record formats are, in the worst case, like this:
>
> foo.[DOM]::[IP6::4361:6368:6574]:600::
>
> What I would like to do is to split this string into a list like this:
>
> [ 'foo.[DOM]','','[IP6::4361:6368:6574]','600','' ]
>
> Colons are separators except when they're inside square
> brackets. I have been messing around with re.split() and
> re.findall() and haven't been able to come up with either a
> working separator pattern for split() or a working field
> pattern for findall(). I came pretty close with findall() but
> can't get it to reliably match the nothing between two adjacent
> colons not inside brackets.
>
> Any suggestions? I realize I could do it in a loop where I pick
> stuff off the front of the string, but yuck.
A little parser, as Skip suggested, is a good way to go.
The brackets make your string context-sensitive, a difficult
concept to cleanly parse with a regex.
I initially hoped a csv module dialect could work, but the quote
character is (currently) hard-coded to be a single, simple
character, i.e., I can't tell it to treat [xxx] as "xxx".
What about Skip's suggestion? A little parser. It might seem
crass or something, but it really is easier than muscling a
regex into a context-sensitive grammar.
def dns_split(s):
    in_brackets = False
    b = 0 # index of beginning of current string
    for i, c in enumerate(s):
        if not in_brackets:
            if c == "[":
                in_brackets = True
            elif c == ':':
                yield s[b:i]
                b = i+1
        elif c == "]":
            in_brackets = False
>>> print(list(dns_split(s)))
['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '']
It'll gag on nested brackets (fixable with a counter) and has no
error handling (requires thought), but it's a start.
--
Neil Cerutti
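The two caveats mentioned above (nested brackets and error handling) can be sketched roughly as follows. This is one possible variant, not the code as posted: it swaps the boolean flag for a depth counter, always yields the final field, and raises ValueError on unbalanced brackets. The error messages and the choice of ValueError are assumptions made for illustration.

```python
def dns_split(s):
    """Yield colon-separated fields, ignoring colons inside [...]."""
    depth = 0   # current bracket nesting level
    begin = 0   # index where the current field starts
    for i, c in enumerate(s):
        if c == '[':
            depth += 1
        elif c == ']':
            if depth == 0:
                raise ValueError('unmatched ] at index %d' % i)
            depth -= 1
        elif c == ':' and depth == 0:
            # A colon outside all brackets ends the current field.
            yield s[begin:i]
            begin = i + 1
    if depth:
        raise ValueError('unclosed [ in %r' % s)
    yield s[begin:]  # the final field, possibly empty

s = 'foo.[DOM]::[IP6::4361:6368:6574]:600::'
print(list(dns_split(s)))
# ['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '', '']
```

Because this is a generator, the ValueError only surfaces once the caller iterates far enough to reach the offending character.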
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2013-08-28 20:31 +0200 |
| Message-ID | <mailman.321.1377714653.19984.python-list@python.org> |
| In reply to | #53174 |
Neil Cerutti wrote:
> On 2013-08-28, John Levine <johnl@iecc.com> wrote:
>> I have a crufty old DNS provisioning system that I'm rewriting and I
>> hope improving in python. (It's based on tinydns if you know what
>> that is.)
>>
>> The record formats are, in the worst case, like this:
>>
>> foo.[DOM]::[IP6::4361:6368:6574]:600::
>>
>> What I would like to do is to split this string into a list like this:
>>
>> [ 'foo.[DOM]','','[IP6::4361:6368:6574]','600','' ]
>>
>> Colons are separators except when they're inside square
>> brackets. I have been messing around with re.split() and
>> re.findall() and haven't been able to come up with either a
>> working separator pattern for split() or a working field
>> pattern for findall(). I came pretty close with findall() but
>> can't get it to reliably match the nothing between two adjacent
>> colons not inside brackets.
>>
>> Any suggestions? I realize I could do it in a loop where I pick
>> stuff off the front of the string, but yuck.
>
> A little parser, as Skip suggested, is a good way to go.
>
> The brackets make your string context-sensitive, a difficult
> concept to cleanly parse with a regex.
>
> I initially hoped a csv module dialect could work, but the quote
> character is (currently) hard-coded to be a single, simple
> character, i.e., I can't tell it to treat [xxx] as "xxx".
>
> What about Skip's suggestion? A little parser. It might seem
> crass or something, but it really is easier than muscling a
> regex into a context-sensitive grammar.
>
> def dns_split(s):
>     in_brackets = False
>     b = 0 # index of beginning of current string
>     for i, c in enumerate(s):
>         if not in_brackets:
>             if c == "[":
>                 in_brackets = True
>             elif c == ':':
>                 yield s[b:i]
>                 b = i+1
>         elif c == "]":
>             in_brackets = False
I think you need one more yield outside the loop.
>>>> print(list(dns_split(s)))
> ['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '']
>
> It'll gag on nested brackets (fixable with a counter) and has no
> error handling (requires thought), but it's a start.
Something similar on top of regex:
>>> def split(s):
...     start = level = 0
...     for m in re.compile(r"[[:\]]").finditer(s):
...         if m.group() == "[": level += 1
...         elif m.group() == "]":
...             assert level
...             level -= 1
...         elif level == 0:
...             yield s[start:m.start()]
...             start = m.end()
...     yield s[start:]
...
>>> list(split("a[b:c:]:d"))
['a[b:c:]', 'd']
>>> list(split("a[b:c[:]]:d"))
['a[b:c[:]]', 'd']
>>> list(split(""))
['']
>>> list(split(":"))
['', '']
>>> list(split(":x"))
['', 'x']
>>> list(split("[:x]"))
['[:x]']
>>> list(split(":[:x]"))
['', '[:x]']
>>> list(split(":[:[:]:x]"))
['', '[:[:]:x]']
>>> list(split("[:::]"))
['[:::]']
>>> s = "foo.[DOM]::[IP6::4361:6368:6574]:600::"
>>> list(split(s))
['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '', '']
Note that there is one more empty string which I believe the OP forgot.
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-08-29 00:26 -0700 |
| Message-ID | <f1014141-3977-49b6-86ac-2de9e72ee3f0@googlegroups.com> |
| In reply to | #53169 |
On Wednesday, 28 August 2013 18:44:53 UTC+2, John Levine wrote:
> I have a crufty old DNS provisioning system that I'm rewriting and I
> hope improving in python. (It's based on tinydns if you know what
> that is.)
>
> The record formats are, in the worst case, like this:
>
> foo.[DOM]::[IP6::4361:6368:6574]:600::
>
> What I would like to do is to split this string into a list like this:
>
> [ 'foo.[DOM]','','[IP6::4361:6368:6574]','600','' ]
>
> Colons are separators except when they're inside square brackets. I
> have been messing around with re.split() and re.findall() and haven't
> been able to come up with either a working separator pattern for
> split() or a working field pattern for findall(). I came pretty
> close with findall() but can't get it to reliably match the
> nothing between two adjacent colons not inside brackets.
>
> Any suggestions? I realize I could do it in a loop where I pick stuff
> off the front of the string, but yuck.
>
> This is in python 2.7.5.
>
> --
> Regards,
> John Levine, johnl@iecc.com, Primary Perpetrator of "The Internet for Dummies",
> Please consider the environment before reading this e-mail. http://jl.ly
----------
Basic idea: protect -> split -> unprotect
>>> s = 'foo.[DOM]::[IP6::4361:6368:6574]:600::'
>>> r = s.replace('[IP6::', '***')
>>> a = r.split('::')
>>> a
['foo.[DOM]', '***4361:6368:6574]:600', '']
>>> a[1] = a[1].replace('***', '[IP6::')
>>> a
['foo.[DOM]', '[IP6::4361:6368:6574]:600', '']
jmf
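The protect -> split -> unprotect idea can be generalised so that it covers every bracketed group and splits on single colons, which is what the original question asked for. The following is a loose sketch under stated assumptions: the function name `protect_split` is hypothetical, brackets are assumed balanced and not nested, and the placeholder character is assumed never to occur in a record.

```python
import re

PLACEHOLDER = '\x00'  # assumption: this character never occurs in a record

def protect_split(record):
    saved = []

    def stash(match):
        # Protect: remember the bracketed group, leave a placeholder behind.
        saved.append(match.group())
        return PLACEHOLDER

    protected = re.sub(r'\[[^\]]*\]', stash, record)

    # Split: no colon inside brackets is visible any more.
    groups = iter(saved)
    fields = []
    for field in protected.split(':'):
        # Unprotect: restore groups in the order they were stashed.
        while PLACEHOLDER in field:
            field = field.replace(PLACEHOLDER, next(groups), 1)
        fields.append(field)
    return fields

s = 'foo.[DOM]::[IP6::4361:6368:6574]:600::'
print(protect_split(s))
# ['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '', '']
```

Fields are restored left to right, so the stashed groups come back in the same order they were removed.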