
Python Regular Expressions

Started by	Andy Barnes <andy.barnes@gmail.com>
First post	2011-06-22 07:00 -0700
Last post	2011-06-22 21:50 -0700
Articles	5 — 5 participants


  Python Regular Expressions Andy  Barnes <andy.barnes@gmail.com> - 2011-06-22 07:00 -0700
    Re: Python Regular Expressions Andy Barnes <andy.barnes@gmail.com> - 2011-06-22 07:26 -0700
      Re: Python Regular Expressions Neil Cerutti <neilc@norwich.edu> - 2011-06-22 14:58 +0000
    Re: Python Regular Expressions Peter Otten <__peter__@web.de> - 2011-06-22 17:05 +0200
    Re: Python Regular Expressions Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2011-06-22 21:50 -0700

#8210 — Python Regular Expressions

From	Andy Barnes <andy.barnes@gmail.com>
Date	2011-06-22 07:00 -0700
Subject	Python Regular Expressions
Message-ID	<3b65fd1f-d377-49b8-b00a-f80462a1213d@35g2000prp.googlegroups.com>

Hi,

I am hoping someone here can help me with a problem I'd like to
resolve with Python. I have used it before for some other projects but
have never needed to use Regular Expressions before. It's quite
possible I am following completely the wrong tack for this task (so
any advice is appreciated).

I have a source file in csv format that follows certain rules. I
basically want to parse the source file and spit out a second file
built from some rules and the content of the first file.

Source File Format:

Name, Type, Create, Study, Read, Teach, Prerequisite
# column headers

Distil Mana, Lore, n/a, 70, 38, 21
Theurgic Lore, Lore, n/a, 105, 70, 30, Distil Mana
Talismantic Lore, Lore, n/a, 150, 100, 50
Advanced Talismantic Lore, Lore, n/a, 100, 60, 30, Talismantic Lore, Theurgic Lore

The input file I have has over 700 unique entries. I have tried to
cover the four main exceptions above. Before I detail them, this is
what I would like the above input file to be output as (the dot
diagramming language, in case anyone recognises it):

Name, Type, Create, Study, Read, Teach, Prerequisite
# column headers

DistilMana [label="{ Distil Mana |{Lore|n/a}|{70|38|21}}"];
TheurgicLore [label="{ Theurgic Lore |{Lore|n/a}|{105|70|30}}"];
DistilMana -> TheurgicLore;
TalismanticLore [label="{ Talismantic Lore |{Lore|n/a}|{150|100|50}}"];
AdvancedTalismanticLore [label="{ Advanced Talismantic Lore |{Lore|n/a}|{100|60|30}}"];
TalismanticLore -> AdvancedTalismanticLore;
TheurgicLore -> AdvancedTalismanticLore;

It's quite a complicated find and replace operation that can be broken
down into some easy stages. The main thing the sample above shows is
that some of the entries won't list any prerequisites - these only need
the descriptor entry creating. Some of them have more than one
prerequisite. A line is needed for each prerequisite listed, linking
it to its parent.

You can also see that the 'name' needs to have spaces removed and it's
repeated a few times in the process. I hope it's easy to see what I am
trying to achieve from the above. I'd be very happy to accept
assistance in automating the conversion of my ever-expanding csv file
into the dot format described above.

Andy


#8213

From	Andy Barnes <andy.barnes@gmail.com>
Date	2011-06-22 07:26 -0700
Message-ID	<77c4973c-7315-4b4d-8eae-3f5770dfb530@22g2000prx.googlegroups.com>
In reply to	#8210

To expand: I have parsed one of the lines manually to try to break
down the process I'm trying to automate.

source:
Theurgic Lore, Lore, n/a, 105, 70, 30, Distil Mana

output:
TheurgicLore [label="{ Theurgic Lore |{Lore|n/a}|{105|70|30}}"];
DistilMana -> TheurgicLore;

These are the steps I would take to do this conversion manually:

1) Take everything prior to the first comma, remove all the spaces,
and put it on a new line:

TheurgicLore

2) append the following string ' [label="{ '

TheurgicLore [label="{

3) append everything prior to the first comma (this time we don't need
to remove the spaces)

TheurgicLore [label="{ Theurgic Lore

4) append the following string ' |{'

TheurgicLore [label="{ Theurgic Lore |{

5) append everything between the 1st and 2nd comma of the source file
followed by a '|'

TheurgicLore [label="{ Theurgic Lore |{Lore|

6) append everything between the 2nd and 3rd comma of the source file
followed by a '}|{'

TheurgicLore [label="{ Theurgic Lore |{Lore|n/a}|{

7) append everything between the 3rd and 4th comma of the source file
followed by a '|'

TheurgicLore [label="{ Theurgic Lore |{Lore|n/a}|{105|

8) append everything between the 4th and 5th comma of the source file
followed by a '|'

TheurgicLore [label="{ Theurgic Lore |{Lore|n/a}|{105|70|

9) append everything between the 5th and 6th comma of the source file
followed by a '}}"];'

TheurgicLore [label="{ Theurgic Lore |{Lore|n/a}|{105|70|30}}"];

Those 9 steps spit out the first line of my output file as above
"TheurgicLore [label="{ Theurgic Lore |{Lore|n/a}|{105|70|30}}"];" I
now have to parse the dependencies onto a new line.

# this next process needs to be repeated for each prerequisite, so if
there are two prerequisites it would need to keep parsing for more
commas.
1a) take everything between the 6th and 7th comma and put it at the
start of a new line (remove spaces)

DistilMana

2a) append ' -> '

DistilMana ->

3a) append everything prior to the first comma, with spaces removed,
followed by a ';'

DistilMana -> TheurgicLore;

This should now be all the steps to spit out:

TheurgicLore [label="{ Theurgic Lore |{Lore|n/a}|{105|70|30}}"];
DistilMana -> TheurgicLore;
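
The manual steps above can be collapsed into a single function. This is only a minimal sketch, assuming every line has at least six comma-separated fields and that anything after the sixth field is a prerequisite; the name convert_line is hypothetical and does not come from the thread.

```python
def convert_line(line):
    """Turn one CSV line into a list of dot statements (illustrative sketch)."""
    fields = [field.strip() for field in line.split(",")]
    name, type_, create, study, read, teach = fields[:6]
    # Assumption: every field after the sixth is a prerequisite name
    prerequisites = [p for p in fields[6:] if p]

    # Step 1: the name with all spaces removed
    node = "".join(name.split())

    # Steps 2-9: build the node description in one go
    statements = [
        '%s [label="{ %s |{%s|%s}|{%s|%s|%s}}"];'
        % (node, name, type_, create, study, read, teach)
    ]

    # Steps 1a-3a, repeated once per prerequisite
    for prerequisite in prerequisites:
        statements.append("%s -> %s;" % ("".join(prerequisite.split()), node))
    return statements


for statement in convert_line("Theurgic Lore, Lore, n/a, 105, 70, 30, Distil Mana"):
    print(statement)
```

Splitting on commas once and indexing the resulting list replaces the repeated "everything between the Nth and Mth comma" scans, and a line without prerequisites simply yields the node description alone.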


#8214

From	Neil Cerutti <neilc@norwich.edu>
Date	2011-06-22 14:58 +0000
Message-ID	<96ee9bFlrrU1@mid.individual.net>
In reply to	#8213

On 2011-06-22, Andy Barnes <andy.barnes@gmail.com> wrote:
> To expand: I have parsed one of the lines manually to try to break
> down the process I'm trying to automate.
>
> source:
> Theurgic Lore, Lore, n/a, 105, 70, 30, Distil Mana
>
> output:
> TheurgicLore [label="{ Theurgic Lore |{Lore|n/a}|{105|70|30}}"];
> DistilMana -> TheurgicLore;
>
> These are the steps I would take to do this conversion manually:

It seems to me that parsing the file into an intermediate model
and then using that model to serialize your output would be
easier to understand and more robust than modifying the csv
entries in place. It decouples deciphering the meaning of the
data from emitting the data, which also makes the code easier
to extend.

The amount of ingenuity required is less, though. ;)
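
One way to read that suggestion, as a rough sketch only: parse every row into a small record first, then generate the dot statements from the records. The Skill record, its field names, and the parse, node_id and serialize helpers are all assumptions made for illustration; none of them appear in the thread.

```python
from collections import namedtuple

# Hypothetical intermediate model; the field names are assumptions
Skill = namedtuple("Skill", "name type create study read teach prerequisites")


def parse(lines):
    """Read CSV lines into Skill records; no output formatting happens here."""
    skills = []
    header_seen = False
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        if not header_seen:
            header_seen = True  # the first data line holds the column headers
            continue
        fields = [f.strip() for f in line.split(",")]
        skills.append(Skill(*fields[:6], prerequisites=[f for f in fields[6:] if f]))
    return skills


def node_id(name):
    """Name with all whitespace removed, used as the dot node identifier."""
    return "".join(name.split())


def serialize(skills):
    """Yield dot statements from the model; no parsing happens here."""
    for s in skills:
        yield '%s [label="{ %s |{%s|%s}|{%s|%s|%s}}"];' % (
            node_id(s.name), s.name, s.type, s.create, s.study, s.read, s.teach)
        for prerequisite in s.prerequisites:
            yield "%s -> %s;" % (node_id(prerequisite), node_id(s.name))
```

With the two halves separated, a change to the input layout touches only parse, and a change to the diagram style touches only serialize.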

-- 
Neil Cerutti


#8215

From	Peter Otten <__peter__@web.de>
Date	2011-06-22 17:05 +0200
Message-ID	<mailman.281.1308755152.1164.python-list@python.org>
In reply to	#8210

Andy Barnes wrote:

> Hi,
> 
> I am hoping someone here can help me with a problem I'd like to
> resolve with Python. I have used it before for some other projects but
> have never needed to use Regular Expressions before. It's quite
> possible I am following completely the wrong tack for this task (so
> any advice is appreciated).
> 
> I have a source file in csv format that follows certain rules. I
> basically want to parse the source file and spit out a second file
> built from some rules and the content of the first file.
> 
> Source File Format:
> 
> Name, Type, Create, Study, Read, Teach, Prerequisite
> # column headers
> 
> Distil Mana, Lore, n/a, 70, 38, 21
> Theurgic Lore, Lore, n/a, 105, 70, 30, Distil Mana
> Talismantic Lore, Lore, n/a, 150, 100, 50
> Advanced Talismantic Lore, Lore, n/a, 100, 60, 30, Talismantic Lore, Theurgic Lore
> 
> The input file I have has over 700 unique entries. I have tried to
> cover the four main exceptions above. Before I detail them, this is
> what I would like the above input file to be output as (the dot
> diagramming language, in case anyone recognises it):
> 
> Name, Type, Create, Study, Read, Teach, Prerequisite
> # column headers
> 
> DistilMana [label="{ Distil Mana |{Lore|n/a}|{70|38|21}}"];
> TheurgicLore [label="{ Theurgic Lore |{Lore|n/a}|{105|70|30}}"];
> DistilMana -> TheurgicLore;
> TalismanticLore [label="{ Talismantic Lore |{Lore|n/a}|{150|100|50}}"];
> AdvancedTalismanticLore [label="{ Advanced Talismantic Lore |{Lore|n/a}|{100|60|30}}"];
> TalismanticLore -> AdvancedTalismanticLore;
> TheurgicLore -> AdvancedTalismanticLore;
> 
> It's quite a complicated find and replace operation that can be broken
> down into some easy stages. The main thing the sample above shows is
> that some of the entries won't list any prerequisites - these only need
> the descriptor entry creating. Some of them have more than one
> prerequisite. A line is needed for each prerequisite listed, linking
> it to its parent.
> 
> You can also see that the 'name' needs to have spaces removed and it's
> repeated a few times in the process. I hope it's easy to see what I am
> trying to achieve from the above. I'd be very happy to accept
> assistance in automating the conversion of my ever-expanding csv file
> into the dot format described above.

Forget about regexes. If there's any complexity it's in writing the output 
rather than reading the input file. You can tackle that by putting your data 
into a dictionary and using a format string:

import sys

def camelized(s):
    # Remove all whitespace: "Distil Mana" -> "DistilMana"
    return "".join(s.split())

# One node description per row; kept on one logical line so that no
# newline ends up inside the label
template = ('%(camel)s [label="{ %(name)s |{%(type)s|%(create)s}|'
            '{%(study)s|%(read)s|%(teach)s}}"];')

def process(instream, outstream):
    # Skip blank lines and comment lines
    instream = (line for line in instream
                if not (line.isspace() or line.startswith("#")))
    rows = ([field.strip() for field in line.split(",")] for line in instream)
    headers = [header.lower() for header in next(rows)]
    for row in rows:
        rowdict = dict(zip(headers, row))
        camel = rowdict["camel"] = camelized(rowdict["name"])
        print(template % rowdict, file=outstream)
        # Everything from the last header column onwards is a prerequisite
        for for_lack_of_better_name in row[len(headers)-1:]:
            print("%s -> %s;" % (camelized(for_lack_of_better_name), camel),
                  file=outstream)

if __name__ == "__main__":
    from io import StringIO
    instream = StringIO("""\
Name, Type, Create, Study, Read, Teach, Prerequisite
# column headers

Distil Mana, Lore, n/a, 70, 38, 21
Theurgic Lore, Lore, n/a, 105, 70, 30, Distil Mana
Talismantic Lore, Lore, n/a, 150, 100, 50
Advanced Talismantic Lore, Lore, n/a, 100, 60, 30, Talismantic Lore, Theurgic Lore
""")
    process(instream, sys.stdout)
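
For dot to render node and edge statements like these, they normally need to sit inside a graph body, and labels of the `{ a | b }` form are only split into fields when the node shape is a record. The following is a small sketch of that wrapping step; the helper name wrap_digraph and the graph name "skills" are assumptions for illustration, not part of the original script.

```python
def wrap_digraph(statements, graph_name="skills"):
    """Wrap node and edge statements in a complete dot graph (sketch)."""
    lines = [
        "digraph %s {" % graph_name,
        # Record-shaped nodes make dot interpret "{ a | b }" labels as fields
        "    node [shape=record];",
    ]
    lines.extend("    " + statement for statement in statements)
    lines.append("}")
    return "\n".join(lines)


print(wrap_digraph([
    'DistilMana [label="{ Distil Mana |{Lore|n/a}|{70|38|21}}"];',
    'TheurgicLore [label="{ Theurgic Lore |{Lore|n/a}|{105|70|30}}"];',
    "DistilMana -> TheurgicLore;",
]))
```

The resulting text can be saved to a file and passed to the dot command-line tool.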


#8263

From	Dennis Lee Bieber <wlfraed@ix.netcom.com>
Date	2011-06-22 21:50 -0700
Message-ID	<mailman.312.1308804907.1164.python-list@python.org>
In reply to	#8210

On Wed, 22 Jun 2011 07:00:42 -0700 (PDT), Andy  Barnes
<andy.barnes@gmail.com> declaimed the following in
gmane.comp.python.general:

> 
> I have a source file in csv format that follows certain rules. I

	So why not use the csv module to read and split the fields...

> It's quite a complicated find and replace operation that can be broken
> down into some easy stages. The main thing the sample above showed was
> that some of the entries won't list any prerequisites - these only need
> the descriptor entry creating. Some of them have more than one
> prerequisite. A line is needed for each prerequisite listed, linking
> it to its parent.
>
	That's output formatting, I won't bother trying to create an
algorithm for that...
 
> You can also see that the 'name' needs to have spaces removed and it's
> repeated a few times in the process. I hope it's easy to see what I am

	strippedName = "".join(originalName.split())
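
The two hints above can be combined roughly as follows. This is a sketch only, assuming the column layout from the original post and treating every field after the sixth as a prerequisite; the in-memory sample stands in for the real file.

```python
import csv
import io

# In-memory stand-in for the real input file
sample = io.StringIO(
    "Name, Type, Create, Study, Read, Teach, Prerequisite\n"
    "# column headers\n"
    "\n"
    "Theurgic Lore, Lore, n/a, 105, 70, 30, Distil Mana\n"
)

# skipinitialspace drops the blank that follows each comma
reader = csv.reader(sample, skipinitialspace=True)
header = next(reader)

rows = []
for row in reader:
    # csv.reader yields an empty list for a blank line
    if not row or row[0].startswith("#"):
        continue
    originalName = row[0]
    strippedName = "".join(originalName.split())
    # Assumption: fields beyond the sixth are prerequisite names
    prerequisites = [field for field in row[6:] if field]
    rows.append((strippedName, prerequisites))
    print(strippedName, prerequisites)
```

Note that skipinitialspace only removes whitespace directly after a delimiter, so trailing blanks in a field would still need an explicit strip().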
-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
        wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/
