Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #73789 > unrolled thread
| Started by | flebber <flebber.crue@gmail.com> |
|---|---|
| First post | 2014-07-01 07:04 -0700 |
| Last post | 2014-07-04 15:24 +0200 |
| Articles | 12 — 5 participants |
Back to article view | Back to comp.lang.python
fixing an horrific formatted csv file. flebber <flebber.crue@gmail.com> - 2014-07-01 07:04 -0700
Re: fixing an horrific formatted csv file. MRAB <python@mrabarnett.plus.com> - 2014-07-01 15:32 +0100
Re: fixing an horrific formatted csv file. "F.R." <anthra.norell@bluewin.ch> - 2014-07-01 22:49 +0200
Re: fixing an horrific formatted csv file. flebber <flebber.crue@gmail.com> - 2014-07-01 14:41 -0700
Re: fixing an horrific formatted csv file. Chris Angelico <rosuav@gmail.com> - 2014-07-02 11:20 +1000
Re: fixing an horrific formatted csv file. flebber <flebber.crue@gmail.com> - 2014-07-02 02:13 -0700
Re: fixing an horrific formatted csv file. "F.R." <anthra.norell@bluewin.ch> - 2014-07-02 17:51 +0200
Re: fixing an horrific formatted csv file. flebber <flebber.crue@gmail.com> - 2014-07-03 21:12 -0700
Re: fixing an horrific formatted csv file. Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-07-04 18:19 +1200
Re: fixing an horrific formatted csv file. flebber <flebber.crue@gmail.com> - 2014-07-04 03:48 -0700
Re: fixing an horrific formatted csv file. flebber <flebber.crue@gmail.com> - 2014-07-04 03:28 -0700
Re: fixing an horrific formatted csv file. "F.R." <anthra.norell@bluewin.ch> - 2014-07-04 15:24 +0200
| From | flebber <flebber.crue@gmail.com> |
|---|---|
| Date | 2014-07-01 07:04 -0700 |
| Subject | fixing an horrific formatted csv file. |
| Message-ID | <47e2e29d-b5c3-4aa6-abf9-3b1e46eb0dec@googlegroups.com> |
What I am trying to do is to reformat a csv file into something more usable.
currently the file has no headers, multiple lines with varying columns that are not related.
This is a sample
Meeting,05/07/14,RHIL,Rosehill Gardens,Weights,TAB,+3m Entire Circuit, ,
Race,1,CIVIC STAKES,CIVIC,CIVIC,1350,~ ,3U ,~ ,QLT ,54,0,0,5/07/2014,, , , , ,No class restriction, Quality, For Three-Years-Old and Upwards, No sex restriction, (Listed),Of $100000. First $60000, second $20000, third $10000, fourth $5000, fifth $2000, sixth $1000, seventh $1000, eighth $1000
Horse,1,Bennetta,0,"Grahame Begg",Randwick,,0,0,16-3-1-3 $390450.00,,0,0,0,,98.00,M,
Horse,2,Breakfast in Bed,0,"David Vandyke",Warwick Farm,,0,0,20-6-1-5 $201250.00,,0,0,0,,81.00,M,
Horse,3,Capital Commander,0,"Gerald Ryan",Rosehill,,0,0,43-9-9-3 $438625.00,,0,0,0,,85.00,M,
Horse,4,Coup Ay Tee (NZ),0,"Chris Waller",Rosehill,,0,0,35-9-6-5 $519811.00,,0,0,0,,101.00,G,
Horse,5,Generalife,0,"John O'Shea",Warwick Farm,,0,0,19-6-1-3 $235045.00,,0,0,0,,87.00,G,
Horse,6,He's Your Man (FR),0,"Chris Waller",Rosehill,,0,0,13-2-3-1 $108110.00,,0,0,0,,93.00,G,
Horse,7,Hidden Kisses,0,"Chris Waller",Rosehill,,0,0,40-8-8-5 $565750.00,,0,0,0,,96.00,M,
Horse,8,Oakfield Commands,0,"Gerald Ryan",Rosehill,,0,0,22-7-4-6 $269530.00,,0,0,0,,94.00,G,
Horse,9,Taxmeifyoucan,0,"Gregory Hickman",Warwick Farm,,0,0,18-2-4-4 $539730.00,,0,0,0,,91.00,G,
Horse,10,The Peak,0,"Bart & James Cummings",Randwick,,0,0,15-6-1-0 $426732.00,,0,0,0,,95.00,G,
Horse,11,Tougher Than Ever (NZ),0,"Chris Waller",Rosehill,,0,0,17-3-2-3 $321613.00,,0,0,0,,97.00,H,
Horse,12,TROMSO,0,"Chris Waller",Rosehill,,0,0,47-8-11-2 $622300.00,,0,0,0,,103.00,G,
Race,2,FLYING WELTER - BENCHMARK 95 HCP,BM95,BM95,1100,BM95 ,3U ,~ ,HCP ,54,0,0,5/07/2014,, , , , ,BenchMark 95, Handicap, For Three-Years-Old and Upwards, No sex restriction,Of $85000. First $48750, second $16750, third $8350, fourth $4150, fifth $2000, sixth $1000, seventh $1000, eighth $1000, ninth $1000, tenth $1000
Horse,1,Big Bonanza,0,"Don Robb",Wyong,,0,57.5,31-9-4-3 $366860.00,,0,0,0,,92.00,G,
Horse,2,Casual Choice,0,"Joseph Pride",Warwick Farm,,0,54,8-2-3-0 $105930.00,,0,0,0,
So what I am trying to so is end up with an output like this.
Meeting, Date, Race, Number, Name, Trainer, Location
Rosehill, 05/07/14, 1, 1,Bennetta,"Grahame Begg",Randwick,
Rosehill, 05/07/14, 1, 2,Breakfast in Bed,"David Vandyke",Warwick Farm,
So as a start i thought i would try inserting the Meeting and Race number however I am just not getting it right.
import csv
outfile = open("/home/sayth/Scripts/cleancsv.csv", "w")
with open('/home/sayth/Scripts/test.csv') as f:
f_csv = csv.reader(f)
headers = next(f_csv)
for row in f_csv:
meeting = row[3] in row[0] == 'Meeting'
new = row.insert(0, meeting)
while row[1] in row[0] == 'Race' < 9: # pref less than next found row[0]
# grab row[1] as id number
id = row[1]
# from row[0] and insert it in first position
new_lines = new.insert(1, id)
outfile.write(new_lines)
outfile.close()
How should I go about this?
Thanks
Sayth
[toc] | [next] | [standalone]
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2014-07-01 15:32 +0100 |
| Message-ID | <mailman.11369.1404225146.18130.python-list@python.org> |
| In reply to | #73789 |
On 2014-07-01 15:04, flebber wrote:
> What I am trying to do is to reformat a csv file into something more usable.
> currently the file has no headers, multiple lines with varying columns that are not related.
>
> This is a sample
>
> Meeting,05/07/14,RHIL,Rosehill Gardens,Weights,TAB,+3m Entire Circuit, ,
> Race,1,CIVIC STAKES,CIVIC,CIVIC,1350,~ ,3U ,~ ,QLT ,54,0,0,5/07/2014,, , , , ,No class restriction, Quality, For Three-Years-Old and Upwards, No sex restriction, (Listed),Of $100000. First $60000, second $20000, third $10000, fourth $5000, fifth $2000, sixth $1000, seventh $1000, eighth $1000
> Horse,1,Bennetta,0,"Grahame Begg",Randwick,,0,0,16-3-1-3 $390450.00,,0,0,0,,98.00,M,
> Horse,2,Breakfast in Bed,0,"David Vandyke",Warwick Farm,,0,0,20-6-1-5 $201250.00,,0,0,0,,81.00,M,
> Horse,3,Capital Commander,0,"Gerald Ryan",Rosehill,,0,0,43-9-9-3 $438625.00,,0,0,0,,85.00,M,
> Horse,4,Coup Ay Tee (NZ),0,"Chris Waller",Rosehill,,0,0,35-9-6-5 $519811.00,,0,0,0,,101.00,G,
> Horse,5,Generalife,0,"John O'Shea",Warwick Farm,,0,0,19-6-1-3 $235045.00,,0,0,0,,87.00,G,
> Horse,6,He's Your Man (FR),0,"Chris Waller",Rosehill,,0,0,13-2-3-1 $108110.00,,0,0,0,,93.00,G,
> Horse,7,Hidden Kisses,0,"Chris Waller",Rosehill,,0,0,40-8-8-5 $565750.00,,0,0,0,,96.00,M,
> Horse,8,Oakfield Commands,0,"Gerald Ryan",Rosehill,,0,0,22-7-4-6 $269530.00,,0,0,0,,94.00,G,
> Horse,9,Taxmeifyoucan,0,"Gregory Hickman",Warwick Farm,,0,0,18-2-4-4 $539730.00,,0,0,0,,91.00,G,
> Horse,10,The Peak,0,"Bart & James Cummings",Randwick,,0,0,15-6-1-0 $426732.00,,0,0,0,,95.00,G,
> Horse,11,Tougher Than Ever (NZ),0,"Chris Waller",Rosehill,,0,0,17-3-2-3 $321613.00,,0,0,0,,97.00,H,
> Horse,12,TROMSO,0,"Chris Waller",Rosehill,,0,0,47-8-11-2 $622300.00,,0,0,0,,103.00,G,
> Race,2,FLYING WELTER - BENCHMARK 95 HCP,BM95,BM95,1100,BM95 ,3U ,~ ,HCP ,54,0,0,5/07/2014,, , , , ,BenchMark 95, Handicap, For Three-Years-Old and Upwards, No sex restriction,Of $85000. First $48750, second $16750, third $8350, fourth $4150, fifth $2000, sixth $1000, seventh $1000, eighth $1000, ninth $1000, tenth $1000
> Horse,1,Big Bonanza,0,"Don Robb",Wyong,,0,57.5,31-9-4-3 $366860.00,,0,0,0,,92.00,G,
> Horse,2,Casual Choice,0,"Joseph Pride",Warwick Farm,,0,54,8-2-3-0 $105930.00,,0,0,0,
>
> So what I am trying to so is end up with an output like this.
>
> Meeting, Date, Race, Number, Name, Trainer, Location
> Rosehill, 05/07/14, 1, 1,Bennetta,"Grahame Begg",Randwick,
> Rosehill, 05/07/14, 1, 2,Breakfast in Bed,"David Vandyke",Warwick Farm,
>
> So as a start i thought i would try inserting the Meeting and Race number however I am just not getting it right.
>
> import csv
>
> outfile = open("/home/sayth/Scripts/cleancsv.csv", "w")
> with open('/home/sayth/Scripts/test.csv') as f:
> f_csv = csv.reader(f)
> headers = next(f_csv)
> for row in f_csv:
> meeting = row[3] in row[0] == 'Meeting'
> new = row.insert(0, meeting)
> while row[1] in row[0] == 'Race' < 9: # pref less than next found row[0]
>
> # grab row[1] as id number
> id = row[1]
> # from row[0] and insert it in first position
> new_lines = new.insert(1, id)
> outfile.write(new_lines)
> outfile.close()
>
> How should I go about this?
>
There's no point in reading the first row as the headers because it
clearly doesn't contain just the headings.
First write a row for the header.
Then, for each row:
If the first field is 'Meeting', then remember the meeting, etc.
If the first field is 'Race', then remember the race, etc.
If the first field is 'Horse', then write the row with the additional
fields for race, etc.
And so on.
BTW, the indentation for the 'outfile.close()' line is wrong. It would,
of course, be better to use the 'with' statement for that file too.
[toc] | [prev] | [next] | [standalone]
| From | "F.R." <anthra.norell@bluewin.ch> |
|---|---|
| Date | 2014-07-01 22:49 +0200 |
| Message-ID | <mailman.11385.1404247829.18130.python-list@python.org> |
| In reply to | #73789 |
On 07/01/2014 04:04 PM, flebber wrote:
> What I am trying to do is to reformat a csv file into something more usable.
> currently the file has no headers, multiple lines with varying columns that are not related.
>
> This is a sample
>
> Meeting,05/07/14,RHIL,Rosehill Gardens,Weights,TAB,+3m Entire Circuit, ,
> Race,1,CIVIC STAKES,CIVIC,CIVIC,1350,~ ,3U ,~ ,QLT ,54,0,0,5/07/2014,, , , , ,No class restriction, Quality, For Three-Years-Old and Upwards, No sex restriction, (Listed),Of $100000. First $60000, second $20000, third $10000, fourth $5000, fifth $2000, sixth $1000, seventh $1000, eighth $1000
> Horse,1,Bennetta,0,"Grahame Begg",Randwick,,0,0,16-3-1-3 $390450.00,,0,0,0,,98.00,M,
> Horse,2,Breakfast in Bed,0,"David Vandyke",Warwick Farm,,0,0,20-6-1-5 $201250.00,,0,0,0,,81.00,M,
> Horse,3,Capital Commander,0,"Gerald Ryan",Rosehill,,0,0,43-9-9-3 $438625.00,,0,0,0,,85.00,M,
> Horse,4,Coup Ay Tee (NZ),0,"Chris Waller",Rosehill,,0,0,35-9-6-5 $519811.00,,0,0,0,,101.00,G,
> Horse,5,Generalife,0,"John O'Shea",Warwick Farm,,0,0,19-6-1-3 $235045.00,,0,0,0,,87.00,G,
> Horse,6,He's Your Man (FR),0,"Chris Waller",Rosehill,,0,0,13-2-3-1 $108110.00,,0,0,0,,93.00,G,
> Horse,7,Hidden Kisses,0,"Chris Waller",Rosehill,,0,0,40-8-8-5 $565750.00,,0,0,0,,96.00,M,
> Horse,8,Oakfield Commands,0,"Gerald Ryan",Rosehill,,0,0,22-7-4-6 $269530.00,,0,0,0,,94.00,G,
> Horse,9,Taxmeifyoucan,0,"Gregory Hickman",Warwick Farm,,0,0,18-2-4-4 $539730.00,,0,0,0,,91.00,G,
> Horse,10,The Peak,0,"Bart & James Cummings",Randwick,,0,0,15-6-1-0 $426732.00,,0,0,0,,95.00,G,
> Horse,11,Tougher Than Ever (NZ),0,"Chris Waller",Rosehill,,0,0,17-3-2-3 $321613.00,,0,0,0,,97.00,H,
> Horse,12,TROMSO,0,"Chris Waller",Rosehill,,0,0,47-8-11-2 $622300.00,,0,0,0,,103.00,G,
> Race,2,FLYING WELTER - BENCHMARK 95 HCP,BM95,BM95,1100,BM95 ,3U ,~ ,HCP ,54,0,0,5/07/2014,, , , , ,BenchMark 95, Handicap, For Three-Years-Old and Upwards, No sex restriction,Of $85000. First $48750, second $16750, third $8350, fourth $4150, fifth $2000, sixth $1000, seventh $1000, eighth $1000, ninth $1000, tenth $1000
> Horse,1,Big Bonanza,0,"Don Robb",Wyong,,0,57.5,31-9-4-3 $366860.00,,0,0,0,,92.00,G,
> Horse,2,Casual Choice,0,"Joseph Pride",Warwick Farm,,0,54,8-2-3-0 $105930.00,,0,0,0,
>
> So what I am trying to so is end up with an output like this.
>
> Meeting, Date, Race, Number, Name, Trainer, Location
> Rosehill, 05/07/14, 1, 1,Bennetta,"Grahame Begg",Randwick,
> Rosehill, 05/07/14, 1, 2,Breakfast in Bed,"David Vandyke",Warwick Farm,
>
> So as a start i thought i would try inserting the Meeting and Race number however I am just not getting it right.
>
> import csv
>
> outfile = open("/home/sayth/Scripts/cleancsv.csv", "w")
> with open('/home/sayth/Scripts/test.csv') as f:
> f_csv = csv.reader(f)
> headers = next(f_csv)
> for row in f_csv:
> meeting = row[3] in row[0] == 'Meeting'
> new = row.insert(0, meeting)
> while row[1] in row[0] == 'Race' < 9: # pref less than next found row[0]
>
> # grab row[1] as id number
> id = row[1]
> # from row[0] and insert it in first position
> new_lines = new.insert(1, id)
> outfile.write(new_lines)
> outfile.close()
>
> How should I go about this?
>
> Thanks
>
> Sayth
Reformatting is what I do most and over time I have acquired some
practice. Complete solutions are not
often proposed, possibly sneered on for their officiousness. In that
case I apologize. I couldn't resist. It is such a nice example. Having
solved it, I figure why not share it . . .
Frederic
------------------------------------------------------------------------------------------------------------
def race_table (csv_text):
input_table = [[item.strip(' "') for item in record.split (',')]
for record in csv_text.splitlines ()]
# At this point look at input_table to find the record indices
output_table = []
for record in input_table:
if record [0] == 'Meeting':
meeting = record [3]
elif record [0] == 'Race':
date = record [13]
race = record [1]
elif record [0] == 'Horse':
number = record [1]
name = record [2]
trainer = record [4]
location = record [5]
output_table.append ((meeting, date, race, number, name,
trainer, location))
return output_table
>>> for record in race_table (your_csv_text): print record
('Rosehill Gardens', '5/07/2014', '1', '1', 'Bennetta', 'Grahame Begg',
'Randwick')
('Rosehill Gardens', '5/07/2014', '1', '2', 'Breakfast in Bed', 'David
Vandyke', 'Warwick Farm')
('Rosehill Gardens', '5/07/2014', '1', '3', 'Capital Commander', 'Gerald
Ryan', 'Rosehill')
('Rosehill Gardens', '5/07/2014', '1', '4', 'Coup Ay Tee (NZ)', 'Chris
Waller', 'Rosehill')
('Rosehill Gardens', '5/07/2014', '1', '5', 'Generalife', "John O'Shea",
'Warwick Farm')
('Rosehill Gardens', '5/07/2014', '1', '6', "He's Your Man (FR)", 'Chris
Waller', 'Rosehill')
('Rosehill Gardens', '5/07/2014', '1', '7', 'Hidden Kisses', 'Chris
Waller', 'Rosehill')
('Rosehill Gardens', '5/07/2014', '1', '8', 'Oakfield Commands', 'Gerald
Ryan', 'Rosehill')
('Rosehill Gardens', '5/07/2014', '1', '9', 'Taxmeifyoucan', 'Gregory
Hickman', 'Warwick Farm')
('Rosehill Gardens', '5/07/2014', '1', '10', 'The Peak', 'Bart & James
Cummings', 'Randwick')
('Rosehill Gardens', '5/07/2014', '1', '11', 'Tougher Than Ever (NZ)',
'Chris Waller', 'Rosehill')
('Rosehill Gardens', '5/07/2014', '1', '12', 'TROMSO', 'Chris Waller',
'Rosehill')
('Rosehill Gardens', '5/07/2014', '2', '1', 'Big Bonanza', 'Don Robb',
'Wyong')
('Rosehill Gardens', '5/07/2014', '2', '2', 'Casual Choice', 'Joseph
Pride', 'Warwick Farm')
>>> TM = TX.Table_Maker (headings =
('Meeting','Date','Race','Number','Name','Trainer','Location'))
>>> TM (race_table (your_csv_text)).write ()
Meeting | Date | Race | Number | Name
| Trainer | Location |
Rosehill Gardens | 5/07/2014 | 1 | 1 | Bennetta
| Grahame Begg | Randwick |
Rosehill Gardens | 5/07/2014 | 1 | 2 | Breakfast in Bed
| David Vandyke | Warwick Farm |
Rosehill Gardens | 5/07/2014 | 1 | 3 | Capital Commander
| Gerald Ryan | Rosehill |
Rosehill Gardens | 5/07/2014 | 1 | 4 | Coup Ay Tee (NZ)
| Chris Waller | Rosehill |
Rosehill Gardens | 5/07/2014 | 1 | 5 | Generalife
| John O'Shea | Warwick Farm |
Rosehill Gardens | 5/07/2014 | 1 | 6 | He's Your Man (FR)
| Chris Waller | Rosehill |
Rosehill Gardens | 5/07/2014 | 1 | 7 | Hidden Kisses
| Chris Waller | Rosehill |
Rosehill Gardens | 5/07/2014 | 1 | 8 | Oakfield Commands
| Gerald Ryan | Rosehill |
Rosehill Gardens | 5/07/2014 | 1 | 9 | Taxmeifyoucan
| Gregory Hickman | Warwick Farm |
Rosehill Gardens | 5/07/2014 | 1 | 10 | The Peak
| Bart & James Cummings | Randwick |
Rosehill Gardens | 5/07/2014 | 1 | 11 | Tougher Than Ever (NZ)
| Chris Waller | Rosehill |
Rosehill Gardens | 5/07/2014 | 1 | 12 | TROMSO
| Chris Waller | Rosehill |
Rosehill Gardens | 5/07/2014 | 2 | 1 | Big Bonanza
| Don Robb | Wyong |
Rosehill Gardens | 5/07/2014 | 2 | 2 | Casual Choice
| Joseph Pride | Warwick Farm |
[toc] | [prev] | [next] | [standalone]
| From | flebber <flebber.crue@gmail.com> |
|---|---|
| Date | 2014-07-01 14:41 -0700 |
| Message-ID | <0d3871c6-81d4-4168-9408-ad85299b0955@googlegroups.com> |
| In reply to | #73809 |
That's a really cool solution. I understand why providing full solutions is frowned upon, because it doesn't assist in learning. Which is true, it's incredibly helpful in this case. The python cookbook is really good and what I was using as a start for dealing with csv. But it doesn't even go anywhere near this. Lots of examples with simple inputs. Anyway Thanks again Sayth
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-07-02 11:20 +1000 |
| Message-ID | <mailman.11392.1404264061.18130.python-list@python.org> |
| In reply to | #73811 |
On Wed, Jul 2, 2014 at 7:41 AM, flebber <flebber.crue@gmail.com> wrote: > I understand why providing full solutions is frowned upon, because it doesn't assist in learning. Which is true, it's incredibly helpful in this case. In this case, my main reason for not providing a full solution is that the work tends to be iterative. When I have a huge and messy file, what I usually do is grab the first half-dozen lines and work out how I'd go about fixing them manually, then write a script that does that. Then run the script on the whole file, and see where it either chokes or produces wrong data. Pick up the first few lines of wrong data, figure out how to tweak the program to handle those. Rinse and repeat. Often, what that results in is a file that gets progressively tidier. When the scope of the mess is infinite (like with human-entered data - believe you me, you haven't seen messy until you've seen what a committee can do to a simple job), this means you stop working on the script at exactly the point where it stops being worth the effort - which is something that only you can decide. ChrisA
[toc] | [prev] | [next] | [standalone]
| From | flebber <flebber.crue@gmail.com> |
|---|---|
| Date | 2014-07-02 02:13 -0700 |
| Message-ID | <a84826ea-4018-40bc-88c1-812be5417e6b@googlegroups.com> |
| In reply to | #73819 |
> >>> TM = TX.Table_Maker (headings =
> ('Meeting','Date','Race','Number','Name','Trainer','Location'))
> >>> TM (race_table (your_csv_text)).write ()
Where do I find TX? Found this mention in the list, was it available in pip by any name?
https://mail.python.org/pipermail/python-list/2014-February/667464.html
Sayth
[toc] | [prev] | [next] | [standalone]
| From | "F.R." <anthra.norell@bluewin.ch> |
|---|---|
| Date | 2014-07-02 17:51 +0200 |
| Message-ID | <mailman.11411.1404316334.18130.python-list@python.org> |
| In reply to | #73830 |
On 07/02/2014 11:13 AM, flebber wrote:
>>>>> TM = TX.Table_Maker (headings =
>> ('Meeting','Date','Race','Number','Name','Trainer','Location'))
>>>>> TM (race_table (your_csv_text)).write ()
> Where do I find TX? Found this mention in the list, was it available in pip by any name?
> https://mail.python.org/pipermail/python-list/2014-February/667464.html
>
> Sayth
I'd have to make it available. I proposed it some time ago and received
a couple of suggestions in return. It is a modular transformation
framework written entirely in python (2.7). It consists essentially of a
base class "Transformer" that handles input and output in such a way
that Transformer objects can be chained. It saved me from drowning an a
horrible and growing tangle of hacks. Finding something usable I had
previously done took time. Understanding how it worked took more time
and adapting it took still more time, so that writing yet another hack
from scratch was faster.
A number of hacks I could quickly wrap into a Transformer object
and so could start building a library of standard Transformers. The
Table_Maker is one of them. The table making code is quite bad. It
suffers from feature overload. I would clean it up for distribution.
I'd be happy to distribute the base class and a few standard
Translators, such as I use every day. (File Reader, File Writer, DB Run
Command, DB Write, Table Maker, PDF To Text, Text To Lines, Lines To
Text, Sort, Sort And Unique, etc.) Writing one's own Transformers is a
breeze. Testing too, because a Transformer keeps its input and output
and, in line with the system's design philosophy, does only its own
single thing.
A Chain is a list of Transformers that run in sequence. It is
itself derived from Transformer and is a functional equivalent. So
Chains nest. Fixing a Chain that nothing comes out of is a
straightforward matter too. It will still have run up to the failing
element. Chain.show () reveals the culprit as the first one to have no
output.
I am not up to date on distributing and would depend on qualified
help on that.
Frederic
--------------------------------------------------------------------------------
A brief overview
The TX solution to your race table would be (TX is the name of the module):
class Race_Table (TX.Transformer):
'''
In: CSV text
Out: Tabular data (2-dimensional list)
'''
name = 'Race_Table'
@TX.setup # Checks timestamps to prevent needless reruns in
the absence of new input
def transform (self):
for line in self.Input.data:
# See my post
self.Output.take (output_table)
Example file to file:
>>> Race_Schedule_F2F = TX.Chain (TX.File_Reader (), Race_Table (),
TX.List_To_CSV (delimiter = ';'), TX.File_Writer (terminal = out_file_name)
>>> Race_Schedule_F2F (input_file_name) # Does it all!
Example web to database:
>>> Race_Schedule_WWW2DB = TX.Chain (TX.WWW_Reader (),
Race_Schedule_HTML_Reader (), Race_Table (), TX.DB_Writer (table_name =
'horses'))
>>> Race_Schedule_WWW2DB (url) # Does is all! You'd have to write
the Race_Schedule_HTML_Reader
Verify your table:
>>> Table_Viewer = TX.Chain (TX.Table_Maker (), TX.Table_Writer ())
>>> Race_Schedule_WWW2DB.show_tree () # See which one should display
Chain
Chain[0] - WWW Reader
Chain[1] - Race_Schedule_HTML_Reader
Chain[2] - Race_Table
Chain[3] - DB Writer
>>> print Table_Viewer (Race_Schedule_WWW2DB[2]()) # All
Transformers keep their data
(Display of table)
Verify database:
>>> print Table_Viewer (TX.DB_Reader (table_name = 'horses')())
(Display of database table)
[toc] | [prev] | [next] | [standalone]
| From | flebber <flebber.crue@gmail.com> |
|---|---|
| Date | 2014-07-03 21:12 -0700 |
| Message-ID | <11ecf009-6f81-4fa5-bee9-b52b9407f0af@googlegroups.com> |
| In reply to | #73840 |
I have taken the code and gone a little further, but I need to be able to protect myself against commas and single quotes in names.
How is it the best to do this?
so in my file I had on line 44 this trainer name.
"Michael, Wayne & John Hawkes"
and in line 95 this horse name.
Inz'n'out
this throws of my capturing correct item 9. How do I protect against this?
Here is current code.
import re
from sys import argv
SCRIPT, FILENAME = argv
def out_file_name(file_name):
"""take an input file and keep the name with appended _clean"""
file_parts = file_name.split(".",)
output_file = file_parts[0] + '_clean.' + file_parts[1]
return output_file
def race_table(text_file):
"""utility to reorganise poorly made csv entry"""
input_table = [[item.strip(' "') for item in record.split(',')]
for record in text_file.splitlines()]
# At this point look at input_table to find the record indices
output_table = []
for record in input_table:
if record[0] == 'Meeting':
meeting = record[3]
elif record[0] == 'Race':
date = record[13]
race = record[1]
elif record[0] == 'Horse':
number = record[1]
name = record[2]
results = record[9]
res_split = re.split('[- ]', results)
starts = res_split[0]
wins = res_split[1]
seconds = res_split[2]
thirds = res_split[3]
prizemoney = res_split[4]
trainer = record[4]
location = record[5]
print(name, wins, seconds)
output_table.append((meeting, date, race, number, name,
starts, wins, seconds, thirds, prizemoney,
trainer, location))
return output_table
MY_FILE = out_file_name(FILENAME)
# with open(FILENAME, 'r') as f_in, open(MY_FILE, 'w') as f_out:
# for line in race_table(f_in.readline()):
# new_row = line
with open(FILENAME, 'r') as f_in, open(MY_FILE, 'w') as f_out:
CONTENT = f_in.read()
# print(content)
FILE_CONTENTS = race_table(CONTENT)
# print new_name
f_out.write(str(FILE_CONTENTS))
if __name__ == '__main__':
pass
[toc] | [prev] | [next] | [standalone]
| From | Gregory Ewing <greg.ewing@canterbury.ac.nz> |
|---|---|
| Date | 2014-07-04 18:19 +1200 |
| Message-ID | <c1mvavFhpj7U1@mid.individual.net> |
| In reply to | #73936 |
flebber wrote: > so in my file I had on line 44 this trainer name. > > "Michael, Wayne & John Hawkes" > > and in line 95 this horse name. Inz'n'out > > this throws of my capturing correct item 9. How do I protect against this? Use python's csv module to read the file. Don't try to do it yourself; the rules for handling embedded commas and quotes in csv are quite complicated. As long as the file is a well-formed csv file, the csv module should parse fields like that correctly. -- Greg
[toc] | [prev] | [next] | [standalone]
| From | flebber <flebber.crue@gmail.com> |
|---|---|
| Date | 2014-07-04 03:48 -0700 |
| Message-ID | <bee0ef3a-4ec3-4ca6-aa82-dffccf09dc67@googlegroups.com> |
| In reply to | #73937 |
On Friday, 4 July 2014 16:19:09 UTC+10, Gregory Ewing wrote:
> flebber wrote:
>
> > so in my file I had on line 44 this trainer name.
>
> >
>
> > "Michael, Wayne & John Hawkes"
>
> >
>
> > and in line 95 this horse name. Inz'n'out
>
> >
>
> > this throws of my capturing correct item 9. How do I protect against this?
>
>
>
> Use python's csv module to read the file. Don't try to
>
> do it yourself; the rules for handling embedded commas
>
> and quotes in csv are quite complicated. As long as
>
> the file is a well-formed csv file, the csv module
>
> should parse fields like that correctly.
>
>
>
> --
>
> Greg
True Greg worked easier
def race_table(text_file):
"""utility to reorganise poorly made csv entry"""
# input_table = [[item.strip(' "') for item in record.split(',')]
# for record in text_file.splitlines()]
# At this point look at input_table to find the record indices
# identity = string.maketrans("", "")
# print(input_table)
# input_table = [s.translate(identity, ",'") for s
# in input_table]
output_table = []
for record in text_file:
if record[0] == 'Meeting':
meeting = record[3]
elif record[0] == 'Race':
date = record[13]
race = record[1]
elif record[0] == 'Horse':
number = record[1]
name = record[2]
results = record[9]
res_split = re.split('[- ]', results)
starts = res_split[0]
wins = res_split[1]
seconds = res_split[2]
thirds = res_split[3]
try:
prizemoney = res_split[4]
finally:
prizemoney = 0
trainer = record[4]
location = record[5]
print(name, wins, seconds)
output_table.append((meeting, date, race, number, name,
starts, wins, seconds, thirds, prizemoney,
trainer, location))
return output_table
MY_FILE = out_file_name(FILENAME)
# with open(FILENAME, 'r') as f_in, open(MY_FILE, 'w') as f_out:
# for line in race_table(f_in.readline()):
# new_row = line
with open(FILENAME, 'r') as f_in, open(MY_FILE, 'w') as f_out:
CONTENT = csv.reader(f_in)
# print(content)
FILE_CONTENTS = race_table(CONTENT)
# print new_name
f_out.write(str(FILE_CONTENTS))
if __name__ == '__main__':
pass
Sayth
[toc] | [prev] | [next] | [standalone]
| From | flebber <flebber.crue@gmail.com> |
|---|---|
| Date | 2014-07-04 03:28 -0700 |
| Message-ID | <c783c959-081d-42d1-8bd0-96342bd2557c@googlegroups.com> |
| In reply to | #73936 |
On Friday, 4 July 2014 14:12:15 UTC+10, flebber wrote:
> I have taken the code and gone a little further, but I need to be able to protect myself against commas and single quotes in names.
>
>
>
> How is it the best to do this?
>
>
>
> so in my file I had on line 44 this trainer name.
>
>
>
> "Michael, Wayne & John Hawkes"
>
>
>
> and in line 95 this horse name.
>
> Inz'n'out
>
>
>
> this throws of my capturing correct item 9. How do I protect against this?
>
>
>
> Here is current code.
>
>
>
> import re
>
> from sys import argv
>
> SCRIPT, FILENAME = argv
>
>
>
>
>
> def out_file_name(file_name):
>
> """take an input file and keep the name with appended _clean"""
>
> file_parts = file_name.split(".",)
>
> output_file = file_parts[0] + '_clean.' + file_parts[1]
>
> return output_file
>
>
>
>
>
> def race_table(text_file):
>
> """utility to reorganise poorly made csv entry"""
>
> input_table = [[item.strip(' "') for item in record.split(',')]
>
> for record in text_file.splitlines()]
>
> # At this point look at input_table to find the record indices
>
> output_table = []
>
> for record in input_table:
>
> if record[0] == 'Meeting':
>
> meeting = record[3]
>
> elif record[0] == 'Race':
>
> date = record[13]
>
> race = record[1]
>
> elif record[0] == 'Horse':
>
> number = record[1]
>
> name = record[2]
>
> results = record[9]
>
> res_split = re.split('[- ]', results)
>
> starts = res_split[0]
>
> wins = res_split[1]
>
> seconds = res_split[2]
>
> thirds = res_split[3]
>
> prizemoney = res_split[4]
>
> trainer = record[4]
>
> location = record[5]
>
> print(name, wins, seconds)
>
> output_table.append((meeting, date, race, number, name,
>
> starts, wins, seconds, thirds, prizemoney,
>
> trainer, location))
>
> return output_table
>
>
>
> MY_FILE = out_file_name(FILENAME)
>
>
>
> # with open(FILENAME, 'r') as f_in, open(MY_FILE, 'w') as f_out:
>
> # for line in race_table(f_in.readline()):
>
> # new_row = line
>
> with open(FILENAME, 'r') as f_in, open(MY_FILE, 'w') as f_out:
>
> CONTENT = f_in.read()
>
> # print(content)
>
> FILE_CONTENTS = race_table(CONTENT)
>
> # print new_name
>
> f_out.write(str(FILE_CONTENTS))
>
>
>
>
>
> if __name__ == '__main__':
>
> pass
So I found this on stack overflow
In [2]: import string
In [3]: identity = string.maketrans("", "")
In [4]: x = ['+5556', '-1539', '-99', '+1500']
In [5]: x = [s.translate(identity, "+-") for s in x]
In [6]: x
Out[6]: ['5556', '1539', '99', '1500']
but it fails in my file, due to I believe mine being a list of list. Is there an easy way to iterate the sublists without flattening?
Current code.
input_table = [[item.strip(' "') for item in record.split(',')]
for record in text_file.splitlines()]
# At this point look at input_table to find the record indices
identity = string.maketrans("", "")
print(input_table)
input_table = [s.translate(identity, ",'") for s
in input_table]
Sayth
[toc] | [prev] | [next] | [standalone]
| From | "F.R." <anthra.norell@bluewin.ch> |
|---|---|
| Date | 2014-07-04 15:24 +0200 |
| Message-ID | <mailman.11491.1404480345.18130.python-list@python.org> |
| In reply to | #73947 |
On 07/04/2014 12:28 PM, flebber wrote:
> On Friday, 4 July 2014 14:12:15 UTC+10, flebber wrote:
>> I have taken the code and gone a little further, but I need to be able to protect myself against commas and single quotes in names.
>>
>>
>>
>> How is it the best to do this?
>>
>>
>>
>> so in my file I had on line 44 this trainer name.
>>
>>
>>
>> "Michael, Wayne & John Hawkes"
>>
>>
>>
>> and in line 95 this horse name.
>>
>> Inz'n'out
>>
>>
>>
>> this throws of my capturing correct item 9. How do I protect against this?
>>
>>
>>
>> Here is current code.
>>
>>
>>
>> import re
>>
>> from sys import argv
>>
>> SCRIPT, FILENAME = argv
>>
>>
>>
>>
>>
>> def out_file_name(file_name):
>>
>> """take an input file and keep the name with appended _clean"""
>>
>> file_parts = file_name.split(".",)
>>
>> output_file = file_parts[0] + '_clean.' + file_parts[1]
>>
>> return output_file
>>
>>
>>
>>
>>
>> def race_table(text_file):
>>
>> """utility to reorganise poorly made csv entry"""
>>
>> input_table = [[item.strip(' "') for item in record.split(',')]
>>
>> for record in text_file.splitlines()]
>>
>> # At this point look at input_table to find the record indices
>>
>> output_table = []
>>
>> for record in input_table:
>>
>> if record[0] == 'Meeting':
>>
>> meeting = record[3]
>>
>> elif record[0] == 'Race':
>>
>> date = record[13]
>>
>> race = record[1]
>>
>> elif record[0] == 'Horse':
>>
>> number = record[1]
>>
>> name = record[2]
>>
>> results = record[9]
>>
>> res_split = re.split('[- ]', results)
>>
>> starts = res_split[0]
>>
>> wins = res_split[1]
>>
>> seconds = res_split[2]
>>
>> thirds = res_split[3]
>>
>> prizemoney = res_split[4]
>>
>> trainer = record[4]
>>
>> location = record[5]
>>
>> print(name, wins, seconds)
>>
>> output_table.append((meeting, date, race, number, name,
>>
>> starts, wins, seconds, thirds, prizemoney,
>>
>> trainer, location))
>>
>> return output_table
>>
>>
>>
>> MY_FILE = out_file_name(FILENAME)
>>
>>
>>
>> # with open(FILENAME, 'r') as f_in, open(MY_FILE, 'w') as f_out:
>>
>> # for line in race_table(f_in.readline()):
>>
>> # new_row = line
>>
>> with open(FILENAME, 'r') as f_in, open(MY_FILE, 'w') as f_out:
>>
>> CONTENT = f_in.read()
>>
>> # print(content)
>>
>> FILE_CONTENTS = race_table(CONTENT)
>>
>> # print new_name
>>
>> f_out.write(str(FILE_CONTENTS))
>>
>>
>>
>>
>>
>> if __name__ == '__main__':
>>
>> pass
> So I found this on stack overflow
>
> In [2]: import string
>
> In [3]: identity = string.maketrans("", "")
>
> In [4]: x = ['+5556', '-1539', '-99', '+1500']
>
> In [5]: x = [s.translate(identity, "+-") for s in x]
>
> In [6]: x
> Out[6]: ['5556', '1539', '99', '1500']
>
> but it fails in my file, due to I believe mine being a list of list. Is there an easy way to iterate the sublists without flattening?
>
> Current code.
>
> input_table = [[item.strip(' "') for item in record.split(',')]
> for record in text_file.splitlines()]
> # At this point look at input_table to find the record indices
> identity = string.maketrans("", "")
> print(input_table)
> input_table = [s.translate(identity, ",'") for s
> in input_table]
>
> Sayth
Take Gregory's advice and use the csv module. Don't reinvent a csv
parser. My "csv" splitter was the simplest approach possible, which I
tend to use with undocumented formats, tweaking for unexpected features
as they come along.
Frederic
[toc] | [prev] | [standalone]
Back to top | Article view | comp.lang.python
csiph-web