

Groups > comp.lang.python > #97253 > unrolled thread

Question about regular expression

Started by: massi_srb@msn.com
First post: 2015-09-30 11:34 -0700
Last post: 2015-10-01 21:31 +0000
Articles: 10 (5 participants)



Contents

  Question about regular expression massi_srb@msn.com - 2015-09-30 11:34 -0700
    Re: Question about regular expression Emile van Sebille <emile@fenx.com> - 2015-09-30 11:50 -0700
    Re: Question about regular expression Tim Chase <python.list@tim.thechases.com> - 2015-09-30 14:20 -0500
    Re: Question about regular expression Denis McMahon <denismfmcmahon@gmail.com> - 2015-09-30 23:30 +0000
      Re: Question about regular expression Denis McMahon <denismfmcmahon@gmail.com> - 2015-10-02 18:25 +0000
    Re: Question about regular expression Emile van Sebille <emile@fenx.com> - 2015-09-30 20:58 -0700
    Re: Question about regular expression Tim Chase <python.list@tim.thechases.com> - 2015-10-01 07:39 -0500
    Re: Question about regular expression Rob Gaddi <rgaddi@technologyhighland.invalid> - 2015-10-01 15:53 +0000
      Re: Question about regular expression Denis McMahon <denismfmcmahon@gmail.com> - 2015-10-01 21:41 +0000
    Re: Question about regular expression Denis McMahon <denismfmcmahon@gmail.com> - 2015-10-01 21:31 +0000

#97253 — Question about regular expression

From: massi_srb@msn.com
Date: 2015-09-30 11:34 -0700
Subject: Question about regular expression
Message-ID: <811788b6-9955-4dcc-bf49-9647891d17ec@googlegroups.com>
Hi everyone,

firstly the description of my problem. I have a string in the following form:

s = "name1 name2(1) name3 name4 (1, 4) name5(2) ..."

that is, a string made up of groups in the form 'name' (letters only) plus possibly a tuple containing 1 or 2 integer values. Blanks can be placed between names and tuples or not, but they surely are placed between two groups. I would like to process this string in order to get a dictionary like this:

d = {
    "name1":(0, 0),
    "name2":(1, 0),
    "name3":(0, 0),
    "name4":(1, 4),
    "name5":(2, 0),
}

I guess this problem can be tackled with regular expressions, but I have no idea about how to use them in this case (I'm not a regexp guy). Can anyone give me a hint? Any possible different approach is absolutely welcome.

Thanks in advance!



#97254

From: Emile van Sebille <emile@fenx.com>
Date: 2015-09-30 11:50 -0700
Message-ID: <mailman.274.1443639036.28679.python-list@python.org>
In reply to: #97253
On 9/30/2015 11:34 AM, massi_srb@msn.com wrote:
> Hi everyone,
>
> firstly the description of my problem. I have a string in the following form:
>
> s = "name1 name2(1) name3 name4 (1, 4) name5(2) ..."
>
> that is, a string made up of groups in the form 'name' (letters only) plus possibly a tuple containing 1 or 2 integer values. Blanks can be placed between names and tuples or not, but they surely are placed between two groups. I would like to process this string in order to get a dictionary like this:
>
> d = {
>      "name1":(0, 0),
>      "name2":(1, 0),
>      "name3":(0, 0),
>      "name4":(1, 4),
>      "name5":(2, 0),
> }
>
> I guess this problem can be tackled with regular expressions,

Stop there!  :)

I'd use string functions.  If you can control the string output to drop 
the spaces and always output in namex(a,b)<space>namey(c,d)... format, 
try starting with

 >>> "name1 name2(1) name3 name4(1,4) name5(2)".split()
['name1', 'name2(1)', 'name3', 'name4(1,4)', 'name5(2)']

then create the dict from the result.

Emile
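
A minimal sketch of the "create the dict from the result" step suggested above. It assumes the string has already been normalised so that every group is a single token such as name4(1,4), with no spaces inside it; the function name parse_groups is made up for illustration.

```python
def parse_groups(s):
    """Turn 'name1 name2(1) name4(1,4)' into {name: (a, b)}.

    Assumes no whitespace inside a group, so str.split() yields one
    token per group.
    """
    result = {}
    for token in s.split():
        # partition() returns (name, '(', rest) or (name, '', '') if no paren
        name, sep, rest = token.partition("(")
        numbers = [int(n) for n in rest.rstrip(")").split(",")] if sep else []
        # pad missing values with zeros so every tuple has two entries
        numbers += [0] * (2 - len(numbers))
        result[name] = tuple(numbers)
    return result


print(parse_groups("name1 name2(1) name3 name4(1,4) name5(2)"))
```

An empty pair of parentheses such as name() would raise ValueError here, which is probably what you want for malformed input.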



#97255

From: Tim Chase <python.list@tim.thechases.com>
Date: 2015-09-30 14:20 -0500
Message-ID: <mailman.276.1443641310.28679.python-list@python.org>
In reply to: #97253
On 2015-09-30 11:34, massi_srb@msn.com wrote:
> firstly the description of my problem. I have a string in the
> following form:
> 
> s = "name1 name2(1) name3 name4 (1, 4) name5(2) ..."
> 
> that is a string made up of groups in the form 'name' (letters
> only) plus possibly a tuple containing 1 or 2 integer values.
> Blanks can be placed between names and tuples or not, but they
> surely are placed between two groups. I would like to process this
> string in order to get a dictionary like this:
> 
> d = {
>     "name1":(0, 0),
>     "name2":(1, 0),
>     "name3":(0, 0),
>     "name4":(1, 4),
>     "name5":(2, 0),
> }
> 
> I guess this problem can be tackled with regular expressions, b

First out of the gate, I suggest you follow Emile's advice and try
using string expressions.  However, if you *want* to do it with
regular expressions, you can.  It's ugly and might be fragile, but

#############################################################
import re
s = "name1 name2(1) name3 name4 (1, 4) name5(2) ..."
r = re.compile(r"""
    \b       # start at a word boundary
    (\w+)    # capture the word
    \s*      # optional whitespace
    (?:      # start an optional grouping for things in the parens
     \(      # a literal open-paren
      \s*    # optional whitespace
      (\d+)  # capture the number in those parens
      (?:    # start a second optional grouping for the stuff after a comma
       \s*   # optional whitespace
       ,     # a literal comma
       \s*   # optional whitespace
       (\d+) # the second number
      )?     # make the comma and following number optional
     \)      # a literal close-paren
    )?       # make that stuff in parens optional
    """, re.X)
d = {}
for m in r.finditer(s):
    a, b, c  = m.groups()
    d[a] = (int(b or 0), int(c or 0))

from pprint import pprint
pprint(d)
#############################################################


I'd stick with the commented version of the regexp if you were to use
this anywhere so that others can follow what you're doing.

-tkc
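
One fragile spot in the pattern as posted: it has no \s* before the closing paren, so a group like "janet( 7,6 )" does not match as a unit and the digits get picked up as separate "words". A possible variant is sketched below. The names GROUP and parse are hypothetical, and requiring a name to start with a letter is an assumption rather than part of the original pattern.

```python
import re

# Variant of the verbose pattern: whitespace is tolerated everywhere
# inside the parens, and a name must start with a letter.
GROUP = re.compile(r"""
    \b([A-Za-z]\w*)        # capture the name
    (?:                    # optional parenthesised part
      \s*\(\s*             # open-paren with optional whitespace around it
      (\d+)                # first number
      (?:\s*,\s*(\d+))?    # optional comma and second number
      \s*\)                # optional whitespace, then close-paren
    )?
    """, re.VERBOSE)


def parse(s):
    """Map each name to a 2-tuple, using 0 for any missing number."""
    return {m.group(1): (int(m.group(2) or 0), int(m.group(3) or 0))
            for m in GROUP.finditer(s)}


print(parse("name1 name2(1) name3 name4 (1, 4) name5(2) ..."))
print(parse("janet( 7,6 ) james ( 7 ) jon  (  6  ,  3  )"))
```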






#97262

From: Denis McMahon <denismfmcmahon@gmail.com>
Date: 2015-09-30 23:30 +0000
Message-ID: <muhrb7$elp$2@dont-email.me>
In reply to: #97253
On Wed, 30 Sep 2015 11:34:04 -0700, massi_srb wrote:

> firstly the description of my problem. I have a string in the following
> form: .....

The way I solved this was to:

1) replace all the punctuation in the string with spaces

2) split the string on space

3) process each thing in the list to test if it was a number or word

4a) add words to the dictionary as keys with value of a default list, or
4b) add numbers to the dictionary in the list at the appropriate position

5) convert the list values of the dictionary to tuples

It seems to work on my test case:

s = "fred jim(1) alice tom (1, 4) peter (2) andrew(3,4) janet( 7,6 ) james ( 7 ) mike ( 9 )"

d = {'mike': (9, 0), 'janet': (7, 6), 'james': (7, 0), 'jim': (1, 0), 
'andrew': (3, 4), 'alice': (0, 0), 'tom': (1, 4), 'peter': (2, 0), 'fred': 
(0, 0)}




-- 
Denis McMahon, denismfmcmahon@gmail.com



#97345

From: Denis McMahon <denismfmcmahon@gmail.com>
Date: 2015-10-02 18:25 +0000
Message-ID: <mumi6u$9d6$1@dont-email.me>
In reply to: #97262
On Wed, 30 Sep 2015 23:30:47 +0000, Denis McMahon wrote:

> On Wed, 30 Sep 2015 11:34:04 -0700, massi_srb wrote:
> 
>> firstly the description of my problem. I have a string in the following
>> form: .....
> 
> The way I solved this was to:
> 
> 1) replace all the punctuation in the string with spaces
> 
> 2) split the string on space
> 
> 3) process each thing in the list to test if it was a number or word
> 
> 4a) add words to the dictionary as keys with value of a default list, or
> 4b) add numbers to the dictionary in the list at the appropriate
> position
> 
> 5) convert the list values of the dictionary to tuples
> 
> It seems to work on my test case:
> 
> s = "fred jim(1) alice tom (1, 4) peter (2) andrew(3,4) janet( 7,6 )
> james ( 7 ) mike ( 9 )"
> 
> d = {'mike': (9, 0), 'janet': (7, 6), 'james': (7, 0), 'jim': (1, 0),
> 'andrew': (3, 4), 'alice': (0, 0), 'tom': (1, 4), 'peter': (2, 0),
> 'fred':
> (0, 0)}

Oh yeah, the code:

#!/usr/bin/python

import re

s = ('fred jim(1) alice tom (1, 4) peter (2) andrew(3,4) janet( 7,6 ) james '
     '( 7 ) mike ( 9 ) jon  (  6  ,  3  )   charles(0,12)')

# turn the punctuation into spaces, then split into candidate tokens
bits = s.replace('(', ' ').replace(',', ' ').replace(')', ' ').split(' ')

d = {}

namep = re.compile('^[A-Za-z]+$')
numbp = re.compile('^[0-9]+$')

for bit in bits:
    if namep.match(bit):
        # a new name: start with the default values
        d[bit] = [0,0]
        w = bit
        nums = 0
    if numbp.match(bit):
        # a number: store it in the next slot of the current name
        n = int(bit)
        d[w][nums] = n
        nums += 1

# convert the list values to tuples
d = {x:tuple(d[x]) for x in d}

print(s)
print(d)

It uses regex to determine if the list element being processed is a name 
or a number, which makes for 2 very simple patterns.

-- 
Denis McMahon, denismfmcmahon@gmail.com
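
For comparison, the same tokenise-and-classify idea can be sketched without the re module, using str.isalpha() and str.isdecimal() in place of the two patterns. This is an illustrative variant, not the code posted above: the name parse_tokens is made up, and silently dropping a third number is a choice made here.

```python
def parse_tokens(s):
    """Blank out the punctuation, then classify each token with str methods."""
    for ch in "(),":
        s = s.replace(ch, " ")

    result = {}
    current = None                # name that the next numbers belong to
    for token in s.split():       # split() with no argument drops empty strings
        if token.isdecimal() and current is not None:
            result[current].append(int(token))
        elif token.isalpha():
            current = token
            result[current] = []

    # pad each list to two values (extra values are dropped) and make it a tuple
    return {name: tuple((nums + [0, 0])[:2]) for name, nums in result.items()}


print(parse_tokens("fred jim(1) tom (1, 4) jon  (  6  ,  3  )   charles(0,12)"))
```

Like the two patterns above, isalpha() only accepts letters-only names, so a token such as name1 from the original example would be skipped.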



#97263

From: Emile van Sebille <emile@fenx.com>
Date: 2015-09-30 20:58 -0700
Message-ID: <mailman.281.1443671907.28679.python-list@python.org>
In reply to: #97253
On 9/30/2015 12:20 PM, Tim Chase wrote:
> On 2015-09-30 11:34, massi_srb@msn.com wrote:
<snip>
>> I guess this problem can be tackled with regular expressions, b
> ... However, if you *want* to do it with
> regular expressions, you can.  It's ugly and might be fragile, but
>
> #############################################################
> import re
> s = "name1 name2(1) name3 name4 (1, 4) name5(2) ..."
> r = re.compile(r"""
>      \b       # start at a word boundary
>      (\w+)    # capture the word
>      \s*      # optional whitespace
>      (?:      # start an optional grouping for things in the parens
>       \(      # a literal open-paren
>        \s*    # optional whitespace
>        (\d+)  # capture the number in those parens
>        (?:    # start a second optional grouping for the stuff after a comma
>         \s*   # optional whitespace
>         ,     # a literal comma
>         \s*   # optional whitespace
>         (\d+) # the second number
>        )?     # make the comma and following number optional
>       \)      # a literal close-paren
>      )?       # make that stuff in parens optional
>      """, re.X)
> d = {}
> for m in r.finditer(s):
>      a, b, c  = m.groups()
>      d[a] = (int(b or 0), int(c or 0))
>
> from pprint import pprint
> pprint(d)
> #############################################################

:)

>
> I'd stick with the commented version of the regexp if you were to use
> this anywhere so that others can follow what you're doing.

... and this is why I use python.  That looks too much like a hex sector 
disk dump rot /x20.  :)

No-really-that's-sick-ly yr's,

Emile





#97280

From: Tim Chase <python.list@tim.thechases.com>
Date: 2015-10-01 07:39 -0500
Message-ID: <mailman.291.1443703705.28679.python-list@python.org>
In reply to: #97253
On 2015-10-01 01:48, gal kauffman wrote:
> items = s.replace(' (', '(').replace(', ',',').split()

s = "name1  (1)"

Your suggestion doesn't catch cases where more than one space can
occur before the paren.

-tkc
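
To make the failure concrete, here is a sketch that wraps the quoted approach in a function and feeds it both inputs. The wrapper name parse_items is made up; the loop body follows gal kauffman's message as quoted elsewhere in this thread.

```python
def parse_items(s):
    """The replace-then-split approach, wrapped in a function for testing."""
    items = s.replace(' (', '(').replace(', ', ',').split()
    items_dict = {}
    for item in items:
        if '(' not in item:
            item += '(0,0)'
        if ',' not in item:
            item = item.replace(')', ',0)')
        name, raw_data = item.split('(')
        items_dict[name] = tuple(int(v) for v in raw_data.replace(')', '').split(','))
    return items_dict


print(parse_items("name1 (1)"))    # {'name1': (1, 0)}
print(parse_items("name1  (1)"))   # {'name1': (0, 0), '': (1, 0)}
```

With two spaces, only one is removed by the replace, so the name and its tuple end up as separate tokens: the name gets the default values and the tuple is stored under an empty key, with no error raised.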




#97286

From: Rob Gaddi <rgaddi@technologyhighland.invalid>
Date: 2015-10-01 15:53 +0000
Message-ID: <mujku2$ia9$2@dont-email.me>
In reply to: #97253
On Wed, 30 Sep 2015 11:34:04 -0700, massi_srb wrote:

> Hi everyone,
> 
> firstly the description of my problem. I have a string in the following
> form:
> 
> s = "name1 name2(1) name3 name4 (1, 4) name5(2) ..."
> 
> that is a string made up of groups in the form 'name' (letters only)
> plus possibly a tuple containing 1 or 2 integer values. Blanks can be
> placed between names and tuples or not, but they surely are placed
> between two groups. I would like to process this string in order to get a
> dictionary like this:
> 
> d = {
>     "name1":(0, 0),
>     "name2":(1, 0),
>     "name3":(0, 0),
>     "name4":(1, 4),
>     "name5":(2, 0),
> }
> 
> I guess this problem can be tackled with regular expressions, but I have
> no idea about how to use them in this case (I'm not a regexp guy). Can
> anyone give me a hint? any possible different approach is absolutely
> welcome.
> 
> Thanks in advance!

There's a quote for this.  'Some people, when confronted with a problem, 
think “I know, I'll use regular expressions.”  Now they have two 
problems.'  

That one's not always true, but any time you're debating a regex solution 
it should at least come to mind.  Lots of people have posted lots of pure 
Python solutions.  I will simply comment that using any of them will make 
you fundamentally happier as time goes on than trying to shoehorn a regex 
in.

-- 
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order.  See above to fix.



#97311

From: Denis McMahon <denismfmcmahon@gmail.com>
Date: 2015-10-01 21:41 +0000
Message-ID: <muk99h$ogn$2@dont-email.me>
In reply to: #97286
On Thu, 01 Oct 2015 15:53:38 +0000, Rob Gaddi wrote:

> There's a quote for this.  'Some people, when confronted with a problem,
> think “I know, I'll use regular expressions.”  Now they have two
> problems.'

I actually used 2 regexes:

wordpatt = re.compile('[a-zA-Z]+')

numpatt = re.compile('[0-9]+')

replace all '(', ',' and ')' in the string with spaces

split the string on space

create an empty dict d

process each thing in the split list setting d[word]=[0,0] for each word 
element (wordpatt.match(thing)) (a list because I want to be able to 
modify it)

setting d[word][n] = int(num) for each num element (numpatt.match(thing)) 
with n depending on whether it was the first or second num following the 
previous word

then:

d = {x:tuple(d[x]) for x in d}

to convert the lists in the new dict to tuples

-- 
Denis McMahon, denismfmcmahon@gmail.com



#97310

From: Denis McMahon <denismfmcmahon@gmail.com>
Date: 2015-10-01 21:31 +0000
Message-ID: <muk8od$ogn$1@dont-email.me>
In reply to: #97253
On Thu, 01 Oct 2015 01:48:03 -0700, gal kauffman wrote:

> items = s.replace(' (', '(').replace(', ',',').split()
> 
> items_dict = dict()
> for item in items:
>     if '(' not in item:
>         item += '(0,0)'
>     if ',' not in item:
>         item = item.replace(')', ',0)')
> 
>     name, raw_data = item.split('(')
>     data_tuple = tuple((int(v) for v in
>                         raw_data.replace(')','').split(',')))
> 
>     items_dict[name] = data_tuple

Please don't top post.

What happens if there's more whitespace than you allow for preceding a 
'(' or following a ',', or if there's whitespace following '('?

-- 
Denis McMahon, denismfmcmahon@gmail.com
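
One way a split-based approach could cope with arbitrary whitespace is to normalise the string first. The sketch below uses re.sub for that; the function name normalize is hypothetical and this is not code posted by anyone in the thread.

```python
import re


def normalize(s):
    """Strip whitespace around '(' and ',' and before ')'.

    After this, every group is a single whitespace-free token, so a plain
    str.split() separates the groups.
    """
    s = re.sub(r"\s*\(\s*", "(", s)   # "tom  (  1" -> "tom(1"
    s = re.sub(r"\s*,\s*", ",", s)    # "1 ,  4"    -> "1,4"
    s = re.sub(r"\s*\)", ")", s)      # "4 )"       -> "4)"
    return s


print(normalize("tom  (  1 ,  4 ) fred janet( 7,6 ) james ( 7 )").split())
```

Whitespace after a closing paren is left alone on purpose, since that is what separates one group from the next.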
