Groups > comp.lang.python > #53169 > unrolled thread

String splitting with exceptions

Started by: John Levine <johnl@iecc.com>
First post: 2013-08-28 16:44 +0000
Last post: 2013-08-29 00:26 -0700
Articles: 9, from 7 participants



Contents

  String splitting with exceptions John Levine <johnl@iecc.com> - 2013-08-28 16:44 +0000
    Re: String splitting with exceptions Skip Montanaro <skip@pobox.com> - 2013-08-28 11:55 -0500
    Re: String splitting with exceptions random832@fastmail.us - 2013-08-28 13:14 -0400
      Re: String splitting with exceptions John Levine <johnl@iecc.com> - 2013-08-28 21:35 +0000
    Re: String splitting with exceptions Tim Chase <python.list@tim.thechases.com> - 2013-08-28 12:32 -0500
      Re: String splitting with exceptions Neil Cerutti <neilc@norwich.edu> - 2013-08-28 18:18 +0000
    Re: String splitting with exceptions Neil Cerutti <neilc@norwich.edu> - 2013-08-28 18:08 +0000
      Re: String splitting with exceptions Peter Otten <__peter__@web.de> - 2013-08-28 20:31 +0200
    Re: String splitting with exceptions wxjmfauth@gmail.com - 2013-08-29 00:26 -0700

#53169 — String splitting with exceptions

From: John Levine <johnl@iecc.com>
Date: 2013-08-28 16:44 +0000
Subject: String splitting with exceptions
Message-ID: <kvl9e5$19gk$1@leila.iecc.com>

I have a crufty old DNS provisioning system that I'm rewriting and I
hope improving in python.  (It's based on tinydns if you know what
that is.)

The record formats are, in the worst case, like this:

foo.[DOM]::[IP6::4361:6368:6574]:600::

What I would like to do is to split this string into a list like this:

[ 'foo.[DOM]','','[IP6::4361:6368:6574]','600','' ]

Colons are separators except when they're inside square brackets.  I
have been messing around with re.split() and re.findall() and haven't
been able to come up with either a working separator pattern for
split() or a working field pattern for findall().  I came pretty
close with findall() but can't get it to reliably match the
nothing between two adjacent colons not inside brackets.

Any suggestions? I realize I could do it in a loop where I pick stuff
off the front of the string, but yuck.

This is in python 2.7.5.
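
For reference, a minimal sketch of why a plain str.split(':') is not enough for this record (the variable names are only illustrative): it also cuts at the colons inside the brackets.

```python
s = "foo.[DOM]::[IP6::4361:6368:6574]:600::"

# Splitting on every colon ignores the square brackets entirely,
# so the bracketed IPv6 field is broken into several pieces.
naive = s.split(":")
print(naive)
```

The bracketed field comes back as '[IP6', '', '4361', '6368', '6574]' rather than as one item, which is the behaviour the question is trying to avoid.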

-- 
Regards,
John Levine, johnl@iecc.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. http://jl.ly



#53170

From: Skip Montanaro <skip@pobox.com>
Date: 2013-08-28 11:55 -0500
Message-ID: <mailman.318.1377708933.19984.python-list@python.org>
In reply to: #53169

> The record formats are, in the worst case, like this:
>
> foo.[DOM]::[IP6::4361:6368:6574]:600::

> Any suggestions?

Write a little parser that can handle the record format?

Skip



#53171

From: random832@fastmail.us
Date: 2013-08-28 13:14 -0400
Message-ID: <mailman.319.1377710047.19984.python-list@python.org>
In reply to: #53169

On Wed, Aug 28, 2013, at 12:44, John Levine wrote:
> I have a crufty old DNS provisioning system that I'm rewriting and I
> hope improving in python.  (It's based on tinydns if you know what
> that is.)
> 
> The record formats are, in the worst case, like this:
> 
> foo.[DOM]::[IP6::4361:6368:6574]:600::
> 
> What I would like to do is to split this string into a list like this:
> 
> [ 'foo.[DOM]','','[IP6::4361:6368:6574]','600','' ]
> 
> Colons are separators except when they're inside square brackets.  I
> have been messing around with re.split() and re.findall() and haven't
> been able to come up with either a working separator pattern for
> split() or a working field pattern for findall().  I came pretty
> close with findall() but can't get it to reliably match the
> nothing between two adjacent colons not inside brackets.
> 
> Any suggestions? I realize I could do it in a loop where I pick stuff
> off the front of the string, but yuck.
> 
> This is in python 2.7.5.

Can you have brackets within brackets? If so, this is impossible to deal
with within a regex.

Otherwise:
>>> re.findall('((?:[^[:]|\[[^]]*\])*):?',s)
['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '', '']

I'm not sure why _your_ list only has one empty string at the end. Is
the record always terminated by a colon that is not meant to imply an
empty field after it? If so, remove the question mark:

>>> re.findall('((?:[^[:]|\[[^]]*\])*):',s)
['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '']

I've done this kind of thing (for validation, not capturing) for email
addresses (there are some obscure bits of email address syntax that need
it) before, so it came to mind immediately.
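
The same pattern may be easier to follow when written in verbose mode. This is only a sketch of the colon-terminated variant: the names FIELD and split_record are made up for illustration, a raw string is used so the backslashes reach the regex engine untouched, and it assumes brackets are never nested.

```python
import re

# One field followed by the colon that terminates it.
FIELD = re.compile(r"""
    (                     # capture the field text:
      (?:
          [^\[:]          #   a character that is neither '[' nor ':'
        | \[ [^\]]* \]    #   or a whole bracketed group, colons included
      )*
    )
    :                     # the colon that ends the field
""", re.VERBOSE)


def split_record(s):
    """Return the fields of a colon-terminated record (hypothetical helper)."""
    return FIELD.findall(s)


print(split_record("foo.[DOM]::[IP6::4361:6368:6574]:600::"))
```

Because every field must be followed by a colon, a record that does not end in one loses its last field with this variant: split_record('a:b') returns ['a'].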



#53186

From: John Levine <johnl@iecc.com>
Date: 2013-08-28 21:35 +0000
Message-ID: <kvlqek$hat$1@leila.iecc.com>
In reply to: #53171

>Can you have brackets within brackets? If so, this is impossible to deal
>with within a regex.

Nope.  It's a regular language, not a CFL.

>Otherwise:
>  >>> re.findall('((?:[^[:]|\[[^]]*\])*):?',s)
>['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '', '']

That seems to do it, thanks.

-- 
Regards,
John Levine, johnl@iecc.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. http://jl.ly



#53172

From: Tim Chase <python.list@tim.thechases.com>
Date: 2013-08-28 12:32 -0500
Message-ID: <mailman.320.1377711095.19984.python-list@python.org>
In reply to: #53169

On 2013-08-28 13:14, random832@fastmail.us wrote:
> On Wed, Aug 28, 2013, at 12:44, John Levine wrote:
> > I have a crufty old DNS provisioning system that I'm rewriting
> > and I hope improving in python.  (It's based on tinydns if you
> > know what that is.)
> > 
> > The record formats are, in the worst case, like this:
> > 
> > foo.[DOM]::[IP6::4361:6368:6574]:600::
> 
> Otherwise:
> >>> re.findall('((?:[^[:]|\[[^]]*\])*):?',s)
> ['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '', '']
> 
> I'm not sure why _your_ list only has one empty string at the end.

I wondered that.  I also wondered about bracketed quoting that
doesn't start at the beginning of a field:

  foo.[one:two]::[IP6::1234:5678:9101]:600::
          ^
This might be bogus, or one might want to catch this case.
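
If one did want to catch that case, a rough sketch could flag any field in which a bracketed group containing a colon starts somewhere other than the first character. The function name is hypothetical, the input is assumed to be an already-split list of fields, and whether such fields are actually invalid in this record format is an open question in the thread.

```python
import re

# A bracketed group (no nesting) that contains at least one colon.
_BRACKET_WITH_COLON = re.compile(r"\[[^\]]*:[^\]]*\]")


def midfield_quoted_colons(fields):
    """Return (index, field) pairs where colon-bearing brackets start mid-field."""
    hits = []
    for i, field in enumerate(fields):
        for m in _BRACKET_WITH_COLON.finditer(field):
            if m.start() > 0:
                hits.append((i, field))
                break  # one report per field is enough
    return hits


fields = ['foo.[one:two]', '', '[IP6::1234:5678:9101]', '600', '', '']
print(midfield_quoted_colons(fields))
```

A field such as 'foo.[DOM]' is not reported, because its brackets contain no colon.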

-tkc



#53175

From: Neil Cerutti <neilc@norwich.edu>
Date: 2013-08-28 18:18 +0000
Message-ID: <b86t73Ff4flU2@mid.individual.net>
In reply to: #53172

On 2013-08-28, Tim Chase <python.list@tim.thechases.com> wrote:
> On 2013-08-28 13:14, random832@fastmail.us wrote:
>> On Wed, Aug 28, 2013, at 12:44, John Levine wrote:
>> > I have a crufty old DNS provisioning system that I'm rewriting
>> > and I hope improving in python.  (It's based on tinydns if you
>> > know what that is.)
>> > 
>> > The record formats are, in the worst case, like this:
>> > 
>> > foo.[DOM]::[IP6::4361:6368:6574]:600::
>> 
>> Otherwise:
>> >>> re.findall('((?:[^[:]|\[[^]]*\])*):?',s)
>> ['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '', '']
>> 
>> I'm not sure why _your_ list only has one empty string at the end.
>
> I wondered that.

Good point. My little parser fails on that, too. It'll miss *all*
final fields. My parser needs "if s: yield s[b:]" at the end, to
operate like str.split, where the empty string is special.
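
A sketch of the little parser with that final yield added, for anyone following along. It assumes brackets are not nested and does no error checking.

```python
def dns_split(s):
    """Split s on colons that are not inside square brackets."""
    in_brackets = False
    b = 0  # index where the current field begins
    for i, c in enumerate(s):
        if not in_brackets:
            if c == "[":
                in_brackets = True
            elif c == ":":
                yield s[b:i]
                b = i + 1
        elif c == "]":
            in_brackets = False
    if s:
        # Emit whatever follows the last separator (possibly '').
        yield s[b:]


print(list(dns_split("foo.[DOM]::[IP6::4361:6368:6574]:600::")))
```

With the "if s:" guard an empty string yields no fields at all, whereas ''.split(':') returns ['']; dropping the guard would give the latter behaviour.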

-- 
Neil Cerutti



#53174

From: Neil Cerutti <neilc@norwich.edu>
Date: 2013-08-28 18:08 +0000
Message-ID: <b86skbFf4flU1@mid.individual.net>
In reply to: #53169

On 2013-08-28, John Levine <johnl@iecc.com> wrote:
> I have a crufty old DNS provisioning system that I'm rewriting and I
> hope improving in python.  (It's based on tinydns if you know what
> that is.)
>
> The record formats are, in the worst case, like this:
>
> foo.[DOM]::[IP6::4361:6368:6574]:600::
>
> What I would like to do is to split this string into a list like this:
>
> [ 'foo.[DOM]','','[IP6::4361:6368:6574]','600','' ]
>
> Colons are separators except when they're inside square
> brackets.  I have been messing around with re.split() and
> re.findall() and haven't been able to come up with either a
> working separator pattern for split() or a working field
> pattern for findall().  I came pretty close with findall() but
> can't get it to reliably match the nothing between two adjacent
> colons not inside brackets.
>
> Any suggestions? I realize I could do it in a loop where I pick
> stuff off the front of the string, but yuck.

A little parser, as Skip suggested, is a good way to go.

The brackets make the meaning of a colon depend on its context,
which is awkward to express cleanly in a regex.

I initially hoped a csv module dialect could work, but the quote
character is (currently) hard-coded to be a single, simple
character, i.e., I can't tell it to treat [xxx] as "xxx".

What about Skip's suggestion? A little parser. It might seem
crass or something, but it really is easier than muscling a
regex into a context-sensitive grammar.

def dns_split(s):
    in_brackets = False
    b = 0 # index of beginning of current string
    for i, c in enumerate(s):
        if not in_brackets:
            if c == "[":
                in_brackets = True
            elif c == ':':
                yield s[b:i]
                b = i+1
        elif c == "]":
            in_brackets = False

>>> print(list(dns_split(s)))
['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '']

It'll gag on nested brackets (fixable with a counter) and has no
error handling (requires thought), but it's a start.
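
One possible shape for the counter-based version with some basic error handling, as a sketch only. The name dns_split_nested and the choice of ValueError are assumptions rather than anything settled in the thread, and this version also emits the text after the last colon.

```python
def dns_split_nested(s):
    """Split on colons at bracket depth 0; brackets may nest."""
    depth = 0
    b = 0  # index where the current field begins
    for i, c in enumerate(s):
        if c == "[":
            depth += 1
        elif c == "]":
            if depth == 0:
                raise ValueError("unmatched ']' at index %d" % i)
            depth -= 1
        elif c == ":" and depth == 0:
            yield s[b:i]
            b = i + 1
    if depth:
        raise ValueError("unclosed '[' in %r" % s)
    yield s[b:]  # the field after the last separator


print(list(dns_split_nested("a[b:c[:]]:d")))
```

Since this is a generator, a malformed record raises only once iteration reaches the problem, so fields before the bad bracket may already have been produced.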

-- 
Neil Cerutti



#53176

From: Peter Otten <__peter__@web.de>
Date: 2013-08-28 20:31 +0200
Message-ID: <mailman.321.1377714653.19984.python-list@python.org>
In reply to: #53174

Neil Cerutti wrote:

> On 2013-08-28, John Levine <johnl@iecc.com> wrote:
>> I have a crufty old DNS provisioning system that I'm rewriting and I
>> hope improving in python.  (It's based on tinydns if you know what
>> that is.)
>>
>> The record formats are, in the worst case, like this:
>>
>> foo.[DOM]::[IP6::4361:6368:6574]:600::
>>
>> What I would like to do is to split this string into a list like this:
>>
>> [ 'foo.[DOM]','','[IP6::4361:6368:6574]','600','' ]
>>
>> Colons are separators except when they're inside square
>> brackets.  I have been messing around with re.split() and
>> re.findall() and haven't been able to come up with either a
>> working separator pattern for split() or a working field
>> pattern for findall().  I came pretty close with findall() but
>> can't get it to reliably match the nothing between two adjacent
>> colons not inside brackets.
>>
>> Any suggestions? I realize I could do it in a loop where I pick
>> stuff off the front of the string, but yuck.
> 
> A little parser, as Skip suggested, is a good way to go.
> 
> The brackets make your string context-sensitive, a difficult
> concept to cleanly parse with a regex.
> 
> I initially hoped a csv module dialect could work, but the quote
> character is (currently) hard-coded to be a single, simple
> character, i.e., I can't tell it to treat [xxx] as "xxx".
> 
> What about Skip's suggestion? A little parser. It might seem
> crass or something, but it really is easier than muscling a
> regex into a context-sensitive grammar.
> 
> def dns_split(s):
>     in_brackets = False
>     b = 0 # index of beginning of current string
>     for i, c in enumerate(s):
>         if not in_brackets:
>             if c == "[":
>                 in_brackets = True
>             elif c == ':':
>                 yield s[b:i]
>                 b = i+1
>         elif c == "]":
>             in_brackets = False

I think you need one more yield outside the loop.

> >>> print(list(dns_split(s)))
> ['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '']
> 
> It'll gag on nested brackets (fixable with a counter) and has no
> error handling (requires thought), but it's a start.
 
Something similar on top of regex:

>>> def split(s):
...     start = level = 0
...     for m in re.compile(r"[[:\]]").finditer(s):
...             if m.group() == "[": level += 1
...             elif m.group() == "]":
...                     assert level
...                     level -= 1
...             elif level == 0:
...                     yield s[start:m.start()]
...                     start = m.end()
...     yield s[start:]
... 
>>> list(split("a[b:c:]:d"))
['a[b:c:]', 'd']
>>> list(split("a[b:c[:]]:d"))
['a[b:c[:]]', 'd']
>>> list(split(""))
['']
>>> list(split(":"))
['', '']
>>> list(split(":x"))
['', 'x']
>>> list(split("[:x]"))
['[:x]']
>>> list(split(":[:x]"))
['', '[:x]']
>>> list(split(":[:[:]:x]"))
['', '[:[:]:x]']
>>> list(split("[:::]"))
['[:::]']
>>> s = "foo.[DOM]::[IP6::4361:6368:6574]:600::"
>>> list(split(s))
['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '', '']

Note that there is one more empty string which I believe the OP forgot.
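
If the final colon turns out to be a terminator rather than a separator (an assumption; the thread does not settle it), the extra empty string can be dropped after splitting. A minimal sketch, with a made-up helper name, that works on the list produced by any of the splitters discussed here:

```python
def drop_terminator(fields):
    """Discard the empty last field produced by a trailing separator."""
    if len(fields) > 1 and fields[-1] == "":
        return fields[:-1]
    return list(fields)


full = ['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '', '']
print(drop_terminator(full))
```

Only one trailing empty string is removed, so a genuinely empty last field before the terminator is kept.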



#53223

From: wxjmfauth@gmail.com
Date: 2013-08-29 00:26 -0700
Message-ID: <f1014141-3977-49b6-86ac-2de9e72ee3f0@googlegroups.com>
In reply to: #53169

On Wednesday, 28 August 2013 18:44:53 UTC+2, John Levine wrote:
> I have a crufty old DNS provisioning system that I'm rewriting and I
> hope improving in python.  (It's based on tinydns if you know what
> that is.)
>
> The record formats are, in the worst case, like this:
>
> foo.[DOM]::[IP6::4361:6368:6574]:600::
>
> What I would like to do is to split this string into a list like this:
>
> [ 'foo.[DOM]','','[IP6::4361:6368:6574]','600','' ]
>
> Colons are separators except when they're inside square brackets.  I
> have been messing around with re.split() and re.findall() and haven't
> been able to come up with either a working separator pattern for
> split() or a working field pattern for findall().  I came pretty
> close with findall() but can't get it to reliably match the
> nothing between two adjacent colons not inside brackets.
>
> Any suggestions? I realize I could do it in a loop where I pick stuff
> off the front of the string, but yuck.
>
> This is in python 2.7.5.
>
> --
> Regards,
> John Levine, johnl@iecc.com, Primary Perpetrator of "The Internet for Dummies",
> Please consider the environment before reading this e-mail. http://jl.ly

----------

Basic idea: protect -> split -> unprotect

>>> s = 'foo.[DOM]::[IP6::4361:6368:6574]:600::'
>>> r = s.replace('[IP6::', '***')
>>> a = r.split('::')
>>> a
['foo.[DOM]', '***4361:6368:6574]:600', '']
>>> a[1] = a[1].replace('***', '[IP6::')
>>> a
['foo.[DOM]', '[IP6::4361:6368:6574]:600', '']
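
The transcript above splits on '::', so '600' stays attached to the IPv6 field. A sketch of the same protect -> split -> unprotect idea applied to every bracketed group and to single colons might look like this; the helper name and the NUL placeholder are assumptions (the placeholder must never occur in real records), and nested brackets are not handled.

```python
import re

_BRACKETED = re.compile(r"\[[^\]]*\]")  # one bracketed group, no nesting
_PLACEHOLDER = "\x00"                   # assumed absent from real records


def split_protected(s):
    """Hide colons inside brackets, split on ':', then restore them."""
    protected = _BRACKETED.sub(
        lambda m: m.group().replace(":", _PLACEHOLDER), s)
    return [f.replace(_PLACEHOLDER, ":") for f in protected.split(":")]


print(split_protected("foo.[DOM]::[IP6::4361:6368:6574]:600::"))
```

Like str.split, this keeps the empty field after the final colon.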

jmf


