Groups > comp.lang.python > #53169 > unrolled thread

String splitting with exceptions

Started by	John Levine <johnl@iecc.com>
First post	2013-08-28 16:44 +0000
Last post	2013-08-29 00:26 -0700
Articles	9 — 7 participants


  String splitting with exceptions John Levine <johnl@iecc.com> - 2013-08-28 16:44 +0000
    Re: String splitting with exceptions Skip Montanaro <skip@pobox.com> - 2013-08-28 11:55 -0500
    Re: String splitting with exceptions random832@fastmail.us - 2013-08-28 13:14 -0400
      Re: String splitting with exceptions John Levine <johnl@iecc.com> - 2013-08-28 21:35 +0000
    Re: String splitting with exceptions Tim Chase <python.list@tim.thechases.com> - 2013-08-28 12:32 -0500
      Re: String splitting with exceptions Neil Cerutti <neilc@norwich.edu> - 2013-08-28 18:18 +0000
    Re: String splitting with exceptions Neil Cerutti <neilc@norwich.edu> - 2013-08-28 18:08 +0000
      Re: String splitting with exceptions Peter Otten <__peter__@web.de> - 2013-08-28 20:31 +0200
    Re: String splitting with exceptions wxjmfauth@gmail.com - 2013-08-29 00:26 -0700

#53169 — String splitting with exceptions

From	John Levine <johnl@iecc.com>
Date	2013-08-28 16:44 +0000
Subject	String splitting with exceptions
Message-ID	<kvl9e5$19gk$1@leila.iecc.com>

I have a crufty old DNS provisioning system that I'm rewriting and I
hope improving in python.  (It's based on tinydns if you know what
that is.)

The record formats are, in the worst case, like this:

foo.[DOM]::[IP6::4361:6368:6574]:600::

What I would like to do is to split this string into a list like this:

[ 'foo.[DOM]','','[IP6::4361:6368:6574]','600','' ]

Colons are separators except when they're inside square brackets.  I
have been messing around with re.split() and re.findall() and haven't
been able to come up with either a working separator pattern for
split() or a working field pattern for findall().  I came pretty
close with findall() but can't get it to reliably match the
nothing between two adjacent colons not inside brackets.

Any suggestions? I realize I could do it in a loop where I pick stuff
off the front of the string, but yuck.

This is in python 2.7.5.

-- 
Regards,
John Levine, johnl@iecc.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. http://jl.ly


#53170

From	Skip Montanaro <skip@pobox.com>
Date	2013-08-28 11:55 -0500
Message-ID	<mailman.318.1377708933.19984.python-list@python.org>
In reply to	#53169

> The record formats are, in the worst case, like this:
>
> foo.[DOM]::[IP6::4361:6368:6574]:600::

> Any suggestions?

Write a little parser that can handle the record format?

Skip


#53171

From	random832@fastmail.us
Date	2013-08-28 13:14 -0400
Message-ID	<mailman.319.1377710047.19984.python-list@python.org>
In reply to	#53169

On Wed, Aug 28, 2013, at 12:44, John Levine wrote:
> I have a crufty old DNS provisioning system that I'm rewriting and I
> hope improving in python.  (It's based on tinydns if you know what
> that is.)
> 
> The record formats are, in the worst case, like this:
> 
> foo.[DOM]::[IP6::4361:6368:6574]:600::
> 
> What I would like to do is to split this string into a list like this:
> 
> [ 'foo.[DOM]','','[IP6::4361:6368:6574]','600','' ]
> 
> Colons are separators except when they're inside square brackets.  I
> have been messing around with re.split() and re.findall() and haven't
> been able to come up with either a working separator pattern for
> split() or a working field pattern for findall().  I came pretty
> close with findall() but can't get it to reliably match the
> nothing between two adjacent colons not inside brackets.
> 
> Any suggestions? I realize I could do it in a loop where I pick stuff
> off the front of the string, but yuck.
> 
> This is in python 2.7.5.

Can you have brackets within brackets? If so, this is impossible to deal
with within a regex.

Otherwise:
>>> import re
>>> s = 'foo.[DOM]::[IP6::4361:6368:6574]:600::'
>>> re.findall(r'((?:[^[:]|\[[^]]*\])*):?', s)
['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '', '']

I'm not sure why _your_ list only has one empty string at the end. Is
the record always terminated by a colon that is not meant to imply an
empty field after it? If so, remove the question mark:

>>> re.findall(r'((?:[^[:]|\[[^]]*\])*):', s)
['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '']

I've done this kind of thing (for validation, not capturing) for email
addresses (there are some obscure bits of email address syntax that need
it) before, so it came to mind immediately.
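
A minimal sketch of the same pattern wrapped in a function, in case it helps anyone trying this on Python 3. The name split_fields and the trick of appending one colon (so that every field, including the last, is colon-terminated) are assumptions for illustration, not something from the posts above. The brackets inside the character classes are escaped only to keep the pattern unambiguous.

```python
import re

# One field: any run of characters other than '[' and ':', or complete
# [...] groups, followed by the colon that terminates the field.
FIELD = re.compile(r'((?:[^\[:]|\[[^\]]*\])*):')

def split_fields(record):
    """Split on colons that are outside square brackets (no nesting)."""
    # Appending ':' terminates the last field too, so the result follows
    # str.split(':') conventions for bracket-free input.
    return FIELD.findall(record + ':')

print(split_fields('foo.[DOM]::[IP6::4361:6368:6574]:600::'))
```

Unbalanced brackets are not detected by this sketch; input like that would need separate validation.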


#53186

From	John Levine <johnl@iecc.com>
Date	2013-08-28 21:35 +0000
Message-ID	<kvlqek$hat$1@leila.iecc.com>
In reply to	#53171

>Can you have brackets within brackets? If so, this is impossible to deal
>with within a regex.

Nope.  It's a regular language, not a CFL.

>Otherwise:
>>>> re.findall('((?:[^[:]|\[[^]]*\])*):?',s)
>['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '', '']

That seems to do it, thanks.

-- 
Regards,
John Levine, johnl@iecc.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. http://jl.ly


#53172

From	Tim Chase <python.list@tim.thechases.com>
Date	2013-08-28 12:32 -0500
Message-ID	<mailman.320.1377711095.19984.python-list@python.org>
In reply to	#53169

On 2013-08-28 13:14, random832@fastmail.us wrote:
> On Wed, Aug 28, 2013, at 12:44, John Levine wrote:
> > I have a crufty old DNS provisioning system that I'm rewriting
> > and I hope improving in python.  (It's based on tinydns if you
> > know what that is.)
> > 
> > The record formats are, in the worst case, like this:
> > 
> > foo.[DOM]::[IP6::4361:6368:6574]:600::
> 
> Otherwise:
> >>> re.findall('((?:[^[:]|\[[^]]*\])*):?',s)
> ['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '', '']
> 
> I'm not sure why _your_ list only has one empty string at the end.

I wondered that.  I also wondered about bracketed quoting that
doesn't start at the beginning of a field:

  foo.[one:two]::[IP6::1234:5678:9101]:600::
          ^
This might be bogus, or one might want to catch this case.

-tkc


#53175

From	Neil Cerutti <neilc@norwich.edu>
Date	2013-08-28 18:18 +0000
Message-ID	<b86t73Ff4flU2@mid.individual.net>
In reply to	#53172

On 2013-08-28, Tim Chase <python.list@tim.thechases.com> wrote:
> On 2013-08-28 13:14, random832@fastmail.us wrote:
>> On Wed, Aug 28, 2013, at 12:44, John Levine wrote:
>> > I have a crufty old DNS provisioning system that I'm rewriting
>> > and I hope improving in python.  (It's based on tinydns if you
>> > know what that is.)
>> > 
>> > The record formats are, in the worst case, like this:
>> > 
>> > foo.[DOM]::[IP6::4361:6368:6574]:600::
>> 
>> Otherwise:
>> >>> re.findall('((?:[^[:]|\[[^]]*\])*):?',s)
>> ['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '', '']
>> 
>> I'm not sure why _your_ list only has one empty string at the end.
>
> I wondered that.

Good point. My little parser fails on that, too. It'll miss *all*
final fields. My parser needs "if s: yield s[b:]" at the end, to
operate like str.split, where the empty string is special.
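
For reference, a sketch of the dns_split generator from #53174 with that one line added after the loop. Everything except the final "if s:" block is the code as posted; treat it as an illustration of the fix rather than a tested replacement.

```python
def dns_split(s):
    in_brackets = False
    b = 0  # index where the current field begins
    for i, c in enumerate(s):
        if not in_brackets:
            if c == "[":
                in_brackets = True
            elif c == ":":
                yield s[b:i]
                b = i + 1
        elif c == "]":
            in_brackets = False
    if s:
        # The added line: emit whatever follows the last separator,
        # which is '' when the record ends with a colon.
        yield s[b:]

print(list(dns_split("foo.[DOM]::[IP6::4361:6368:6574]:600::")))
```

With the extra yield, the sample record produces six fields, the last two being empty strings, and an empty input produces no fields at all.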

-- 
Neil Cerutti


#53174

From	Neil Cerutti <neilc@norwich.edu>
Date	2013-08-28 18:08 +0000
Message-ID	<b86skbFf4flU1@mid.individual.net>
In reply to	#53169

On 2013-08-28, John Levine <johnl@iecc.com> wrote:
> I have a crufty old DNS provisioning system that I'm rewriting and I
> hope improving in python.  (It's based on tinydns if you know what
> that is.)
>
> The record formats are, in the worst case, like this:
>
> foo.[DOM]::[IP6::4361:6368:6574]:600::
>
> What I would like to do is to split this string into a list like this:
>
> [ 'foo.[DOM]','','[IP6::4361:6368:6574]','600','' ]
>
> Colons are separators except when they're inside square
> brackets.  I have been messing around with re.split() and
> re.findall() and haven't been able to come up with either a
> working separator pattern for split() or a working field
> pattern for findall().  I came pretty close with findall() but
> can't get it to reliably match the nothing between two adjacent
> colons not inside brackets.
>
> Any suggestions? I realize I could do it in a loop where I pick
> stuff off the front of the string, but yuck.

A little parser, as Skip suggested, is a good way to go.

The brackets make your string context-sensitive, a difficult
concept to cleanly parse with a regex.

I initially hoped a csv module dialect could work, but the quote
character is (currently) hard-coded to be a single, simple
character, i.e., I can't tell it to treat [xxx] as "xxx".

What about Skip's suggestion? A little parser. It might seem
crass or something, but it really is easier than muscling a
regex into a context-sensitive grammar.

def dns_split(s):
    in_brackets = False
    b = 0 # index of beginning of current string
    for i, c in enumerate(s):
        if not in_brackets:
            if c == "[":
                in_brackets = True
            elif c == ':':
                yield s[b:i]
                b = i+1
        elif c == "]":
            in_brackets = False

>>> print(list(dns_split(s)))
['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '']

It'll gag on nested brackets (fixable with a counter) and has no
error handling (requires thought), but it's a start.
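
A rough sketch of what the counter and some basic error handling could look like. The name dns_split_checked and the choice of raising ValueError on unbalanced brackets are assumptions made for illustration; a real provisioning tool may want different behavior. This variant also always yields the text after the last colon, so it follows str.split(':') conventions.

```python
def dns_split_checked(s):
    depth = 0  # current bracket nesting level
    b = 0      # index where the current field begins
    for i, c in enumerate(s):
        if c == "[":
            depth += 1
        elif c == "]":
            if depth == 0:
                raise ValueError("unmatched ']' at index %d" % i)
            depth -= 1
        elif c == ":" and depth == 0:
            # A colon outside all brackets ends the current field.
            yield s[b:i]
            b = i + 1
    if depth:
        raise ValueError("unclosed '[' in %r" % s)
    yield s[b:]  # the final field, possibly empty

print(list(dns_split_checked("foo.[DOM]::[IP6::4361:6368:6574]:600::")))
```

Because this is a generator, the ValueError only surfaces once the caller iterates far enough to reach the bad bracket.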

-- 
Neil Cerutti


#53176

From	Peter Otten <__peter__@web.de>
Date	2013-08-28 20:31 +0200
Message-ID	<mailman.321.1377714653.19984.python-list@python.org>
In reply to	#53174

Neil Cerutti wrote:

> On 2013-08-28, John Levine <johnl@iecc.com> wrote:
>> I have a crufty old DNS provisioning system that I'm rewriting and I
>> hope improving in python.  (It's based on tinydns if you know what
>> that is.)
>>
>> The record formats are, in the worst case, like this:
>>
>> foo.[DOM]::[IP6::4361:6368:6574]:600::
>>
>> What I would like to do is to split this string into a list like this:
>>
>> [ 'foo.[DOM]','','[IP6::4361:6368:6574]','600','' ]
>>
>> Colons are separators except when they're inside square
>> brackets.  I have been messing around with re.split() and
>> re.findall() and haven't been able to come up with either a
>> working separator pattern for split() or a working field
>> pattern for findall().  I came pretty close with findall() but
>> can't get it to reliably match the nothing between two adjacent
>> colons not inside brackets.
>>
>> Any suggestions? I realize I could do it in a loop where I pick
>> stuff off the front of the string, but yuck.
> 
> A little parser, as Skip suggested, is a good way to go.
> 
> The brackets make your string context-sensitive, a difficult
> concept to cleanly parse with a regex.
> 
> I initially hoped a csv module dialect could work, but the quote
> character is (currently) hard-coded to be a single, simple
> character, i.e., I can't tell it to treat [xxx] as "xxx".
> 
> What about Skip's suggestion? A little parser. It might seem
> crass or something, but it really is easier than muscling a
> regex into a context-sensitive grammar.
> 
> def dns_split(s):
>     in_brackets = False
>     b = 0 # index of beginning of current string
>     for i, c in enumerate(s):
>         if not in_brackets:
>             if c == "[":
>                 in_brackets = True
>             elif c == ':':
>                 yield s[b:i]
>                 b = i+1
>         elif c == "]":
>             in_brackets = False

I think you need one more yield outside the loop.

>>>> print(list(dns_split(s)))
> ['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '']
> 
> It'll gag on nested brackets (fixable with a counter) and has no
> error handling (requires thought), but it's a start.
 
Something similar on top of regex:

>>> def split(s):
...     start = level = 0
...     for m in re.compile(r"[[:\]]").finditer(s):
...         if m.group() == "[":
...             level += 1
...         elif m.group() == "]":
...             assert level
...             level -= 1
...         elif level == 0:
...             yield s[start:m.start()]
...             start = m.end()
...     yield s[start:]
... 
>>> list(split("a[b:c:]:d"))
['a[b:c:]', 'd']
>>> list(split("a[b:c[:]]:d"))
['a[b:c[:]]', 'd']
>>> list(split(""))
['']
>>> list(split(":"))
['', '']
>>> list(split(":x"))
['', 'x']
>>> list(split("[:x]"))
['[:x]']
>>> list(split(":[:x]"))
['', '[:x]']
>>> list(split(":[:[:]:x]"))
['', '[:[:]:x]']
>>> list(split("[:::]"))
['[:::]']
>>> s = "foo.[DOM]::[IP6::4361:6368:6574]:600::"
>>> list(split(s))
['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '', '']

Note that there is one more empty string which I believe the OP forgot.


#53223

From	wxjmfauth@gmail.com
Date	2013-08-29 00:26 -0700
Message-ID	<f1014141-3977-49b6-86ac-2de9e72ee3f0@googlegroups.com>
In reply to	#53169

On Wednesday, 28 August 2013 18:44:53 UTC+2, John Levine wrote:
> I have a crufty old DNS provisioning system that I'm rewriting and I
> hope improving in python.  (It's based on tinydns if you know what
> that is.)
>
> The record formats are, in the worst case, like this:
>
> foo.[DOM]::[IP6::4361:6368:6574]:600::
>
> What I would like to do is to split this string into a list like this:
>
> [ 'foo.[DOM]','','[IP6::4361:6368:6574]','600','' ]
>
> Colons are separators except when they're inside square brackets.  I
> have been messing around with re.split() and re.findall() and haven't
> been able to come up with either a working separator pattern for
> split() or a working field pattern for findall().  I came pretty
> close with findall() but can't get it to reliably match the
> nothing between two adjacent colons not inside brackets.
>
> Any suggestions? I realize I could do it in a loop where I pick stuff
> off the front of the string, but yuck.
>
> This is in python 2.7.5.
>
> -- 
> Regards,
> John Levine, johnl@iecc.com, Primary Perpetrator of "The Internet for Dummies",
> Please consider the environment before reading this e-mail. http://jl.ly

----------

Basic idea: protect -> split -> unprotect

>>> s = 'foo.[DOM]::[IP6::4361:6368:6574]:600::'
>>> r = s.replace('[IP6::', '***')
>>> a = r.split('::')
>>> a
['foo.[DOM]', '***4361:6368:6574]:600', '']
>>> a[1] = a[1].replace('***', '[IP6::')
>>> a
['foo.[DOM]', '[IP6::4361:6368:6574]:600', '']
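
The same protect -> split -> unprotect idea can be sketched more generally, splitting on single colons and protecting every bracketed group rather than one hard-coded prefix. The function name protect_split and the use of a NUL character as the placeholder are assumptions for illustration; the placeholder must be something that never occurs in a real record.

```python
import re

SENTINEL = "\0"  # assumed never to appear in a record

def protect_split(record):
    # protect: hide every colon inside a [...] group behind the sentinel
    protected = re.sub(
        r"\[[^\]]*\]",
        lambda m: m.group().replace(":", SENTINEL),
        record,
    )
    # split on the colons that remain, then unprotect each field
    return [field.replace(SENTINEL, ":") for field in protected.split(":")]

print(protect_split("foo.[DOM]::[IP6::4361:6368:6574]:600::"))
```

This handles one level of brackets only; nested brackets would need a counter-based approach instead.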

jmf

