
Noob Parsing question

Started by	kai.peters@gmail.com
First post	2015-02-17 20:07 -0800
Last post	2015-02-18 08:57 -0800
Articles	5 — 2 participants


  Noob Parsing question kai.peters@gmail.com - 2015-02-17 20:07 -0800
    Re: Noob Parsing question Chris Angelico <rosuav@gmail.com> - 2015-02-18 15:16 +1100
      Re: Noob Parsing question kai.peters@gmail.com - 2015-02-17 20:35 -0800
        Re: Noob Parsing question Chris Angelico <rosuav@gmail.com> - 2015-02-18 15:54 +1100
          Re: Noob Parsing question kai.peters@gmail.com - 2015-02-18 08:57 -0800

#85766 — Noob Parsing question

From	kai.peters@gmail.com
Date	2015-02-17 20:07 -0800
Subject	Noob Parsing question
Message-ID	<c41fcec3-ea9f-4cce-8f6b-0f51d8cf3912@googlegroups.com>

Given

data = '{[<a=14^b=Fred^c=45.22^><a=22^b=Joe^><a=17^c=3.20^>][<a=72^b=Soup^>]}'

How can I efficiently get dictionaries for each of the data blocks framed by <> ?

Thanks for any help

KP


#85767

From	Chris Angelico <rosuav@gmail.com>
Date	2015-02-18 15:16 +1100
Message-ID	<mailman.18802.1424232968.18130.python-list@python.org>
In reply to	#85766

On Wed, Feb 18, 2015 at 3:07 PM,  <kai.peters@gmail.com> wrote:
> Given
>
> data = '{[<a=14^b=Fred^c=45.22^><a=22^b=Joe^><a=17^c=3.20^>][<a=72^b=Soup^>]}'
>
> How can I efficiently get dictionaries for each of the data blocks framed by <> ?
>
> Thanks for any help

The question here is: What _can't_ happen? For instance, what happens
if Fred's name contains a greater-than symbol, or a caret?

If those absolutely cannot happen, your parser can be fairly
straight-forward. Just put together some basic splitting (maybe a
regex), and then split on the caret inside that. Otherwise, you may
need a more stateful parser.
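As a rough illustration of the "basic splitting" route, here is a minimal sketch that uses only str.split and no regex. It assumes that none of the characters "<", ">", "^" or "=" can appear inside a key or value; the function name split_blocks is made up for this example.

```python
data = '{[<a=14^b=Fred^c=45.22^><a=22^b=Joe^><a=17^c=3.20^>][<a=72^b=Soup^>]}'

def split_blocks(text):
    """Return one dict per <...> block, using only str.split."""
    result = []
    # Everything before the first '<' is outer framing, so skip element 0.
    for chunk in text.split('<')[1:]:
        # Keep only the part of the chunk that precedes the closing '>'.
        body = chunk.split('>', 1)[0]
        block = {}
        for pair in body.split('^'):
            if pair:  # the trailing '^' leaves one empty string behind
                key, value = pair.split('=', 1)
                block[key] = value
        result.append(block)
    return result

blocks = split_blocks(data)
```

All values come back as strings here; the outer "{[" and "][" framing is simply ignored.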

ChrisA


#85768

From	kai.peters@gmail.com
Date	2015-02-17 20:35 -0800
Message-ID	<af5861ab-1ba2-435d-a494-6e7ff759064e@googlegroups.com>
In reply to	#85767

> > Given
> >
> > data = '{[<a=14^b=Fred^c=45.22^><a=22^b=Joe^><a=17^c=3.20^>][<a=72^b=Soup^>]}'
> >
> > How can I efficiently get dictionaries for each of the data blocks framed by <> ?
> >
> > Thanks for any help
> 
> The question here is: What _can't_ happen? For instance, what happens
> if Fred's name contains a greater-than symbol, or a caret?
> 
> If those absolutely cannot happen, your parser can be fairly
> straight-forward. Just put together some basic splitting (maybe a
> regex), and then split on the caret inside that. Otherwise, you may
> need a more stateful parser.
> 
> ChrisA

The data string is guaranteed to be clean - no such irregularities occur.


#85769

From	Chris Angelico <rosuav@gmail.com>
Date	2015-02-18 15:54 +1100
Message-ID	<mailman.18803.1424235256.18130.python-list@python.org>
In reply to	#85768

On Wed, Feb 18, 2015 at 3:35 PM,  <kai.peters@gmail.com> wrote:
>> > Given
>> >
>> > data = '{[<a=14^b=Fred^c=45.22^><a=22^b=Joe^><a=17^c=3.20^>][<a=72^b=Soup^>]}'
>> >
>> > How can I efficiently get dictionaries for each of the data blocks framed by <> ?
>> >
>> > Thanks for any help
>>
>> The question here is: What _can't_ happen? For instance, what happens
>> if Fred's name contains a greater-than symbol, or a caret?
>>
>> If those absolutely cannot happen, your parser can be fairly
>> straight-forward. Just put together some basic splitting (maybe a
>> regex), and then split on the caret inside that. Otherwise, you may
>> need a more stateful parser.
>>
>> ChrisA
>
> The data string is guaranteed to be clean - no such irregularities occur.

Okay!

(Side point: You've stripped off all citations, here, so it's not
clear who said what. My shorthand signature isn't as useful as the
full line identifying date, time, and person. It's polite to keep
those lines, at least for the first level of quoting.)

What you want can be done with a regular expression. (Yes, yes, I
know; now you have two problems.)

>>> import re
>>> data = '{[<a=14^b=Fred^c=45.22^><a=22^b=Joe^><a=17^c=3.20^>][<a=72^b=Soup^>]}'
>>> re.findall("<.*?>",data)
['<a=14^b=Fred^c=45.22^>', '<a=22^b=Joe^>', '<a=17^c=3.20^>', '<a=72^b=Soup^>']

From there, you can crack open the different pieces:

>>> for piece in re.findall("<.*?>",data):
...     d = {}
...     # piece[1:-2] drops the leading "<" and the trailing "^>"
...     for elem in piece[1:-2].split("^"):
...         key, value = elem.split("=",1)
...         d[key] = value
...     print(d)
...
{'c': '45.22', 'b': 'Fred', 'a': '14'}
{'b': 'Joe', 'a': '22'}
{'c': '3.20', 'a': '17'}
{'b': 'Soup', 'a': '72'}

If you need some of those to be integers or floats, you'll need to do
some post-processing on it, but this guarantees that you get the data
out reliably. It depends on not having any of the special characters
"=^<>" inside the elements, but other than that, it should be safe.
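The post-processing step could look something like the sketch below. It is only one possible approach: the helper names convert and parse are invented for this example, and the rule "try int, then float, otherwise keep the string" is an assumption about what the data means, not something stated in the thread.

```python
import re

def convert(value):
    """Try int, then float; fall back to the original string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

def parse(data):
    """Return a list with one dict per <...> block, values converted."""
    blocks = []
    # The capture group returns the text between "<" and ">".
    for piece in re.findall(r"<(.*?)>", data):
        block = {}
        for elem in piece.split("^"):
            if elem:  # skip the empty string left by the trailing "^"
                key, value = elem.split("=", 1)
                block[key] = convert(value)
        blocks.append(block)
    return blocks

data = '{[<a=14^b=Fred^c=45.22^><a=22^b=Joe^><a=17^c=3.20^>][<a=72^b=Soup^>]}'
records = parse(data)
```

Guessing the type from the value is fragile: float() also accepts strings such as "nan" and "inf" in any letter case, so a name like "Nan" would be turned into a float. If the meaning of each key is known in advance, a per-key mapping of converters is the safer choice.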

ChrisA


#85800

From	kai.peters@gmail.com
Date	2015-02-18 08:57 -0800
Message-ID	<ffa4884b-86bf-4299-accf-8ae6f85d4715@googlegroups.com>
In reply to	#85769

> >> > Given
> >> >
> >> > data = '{[<a=14^b=Fred^c=45.22^><a=22^b=Joe^><a=17^c=3.20^>][<a=72^b=Soup^>]}'
> >> >
> >> > How can I efficiently get dictionaries for each of the data blocks framed by <> ?
> >> >
> >> > Thanks for any help
> >>
> >> The question here is: What _can't_ happen? For instance, what happens
> >> if Fred's name contains a greater-than symbol, or a caret?
> >>
> >> If those absolutely cannot happen, your parser can be fairly
> >> straight-forward. Just put together some basic splitting (maybe a
> >> regex), and then split on the caret inside that. Otherwise, you may
> >> need a more stateful parser.
> >>
> >> ChrisA
> >
> > The data string is guaranteed to be clean - no such irregularities occur.
> 
> Okay!
> 
> (Side point: You've stripped off all citations, here, so it's not
> clear who said what. My shorthand signature isn't as useful as the
> full line identifying date, time, and person. It's polite to keep
> those lines, at least for the first level of quoting.)
> 
> What you want can be done with a regular expression. (Yes, yes, I
> know; now you have two problems.)
> 
> >>> data = '{[<a=14^b=Fred^c=45.22^><a=22^b=Joe^><a=17^c=3.20^>][<a=72^b=Soup^>]}'
> >>> re.findall("<.*?>",data)
> ['<a=14^b=Fred^c=45.22^>', '<a=22^b=Joe^>', '<a=17^c=3.20^>', '<a=72^b=Soup^>']
> 
> From there, you can crack open the different pieces:
> 
> >>> for piece in re.findall("<.*?>",data):
> ...     d = {}
> ...     for elem in piece[1:-2].split("^"):
> ...         key, value = elem.split("=",1)
> ...         d[key] = value
> ...     print(d)
> ...
> {'c': '45.22', 'b': 'Fred', 'a': '14'}
> {'b': 'Joe', 'a': '22'}
> {'c': '3.20', 'a': '17'}
> {'b': 'Soup', 'a': '72'}
> 
> If you need some of those to be integers or floats, you'll need to do
> some post-processing on it, but this guarantees that you get the data
> out reliably. It depends on not having any of the special characters
> "=^<>" inside the elements, but other than that, it should be safe.
> 
> ChrisA

Thanks for your help - much appreciated!

KP

