Groups > comp.lang.python > #73789 > unrolled thread

fixing an horrific formatted csv file.

Started by	flebber <flebber.crue@gmail.com>
First post	2014-07-01 07:04 -0700
Last post	2014-07-04 15:24 +0200
Articles	12 — 5 participants

Back to article view | Back to comp.lang.python

  fixing an horrific formatted csv file. flebber <flebber.crue@gmail.com> - 2014-07-01 07:04 -0700
    Re: fixing an horrific formatted csv file. MRAB <python@mrabarnett.plus.com> - 2014-07-01 15:32 +0100
    Re: fixing an horrific formatted csv file. "F.R." <anthra.norell@bluewin.ch> - 2014-07-01 22:49 +0200
      Re: fixing an horrific formatted csv file. flebber <flebber.crue@gmail.com> - 2014-07-01 14:41 -0700
        Re: fixing an horrific formatted csv file. Chris Angelico <rosuav@gmail.com> - 2014-07-02 11:20 +1000
          Re: fixing an horrific formatted csv file. flebber <flebber.crue@gmail.com> - 2014-07-02 02:13 -0700
            Re: fixing an horrific formatted csv file. "F.R." <anthra.norell@bluewin.ch> - 2014-07-02 17:51 +0200
              Re: fixing an horrific formatted csv file. flebber <flebber.crue@gmail.com> - 2014-07-03 21:12 -0700
                Re: fixing an horrific formatted csv file. Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-07-04 18:19 +1200
                  Re: fixing an horrific formatted csv file. flebber <flebber.crue@gmail.com> - 2014-07-04 03:48 -0700
                Re: fixing an horrific formatted csv file. flebber <flebber.crue@gmail.com> - 2014-07-04 03:28 -0700
                  Re: fixing an horrific formatted csv file. "F.R." <anthra.norell@bluewin.ch> - 2014-07-04 15:24 +0200

#73789 — fixing an horrific formatted csv file.

From	flebber <flebber.crue@gmail.com>
Date	2014-07-01 07:04 -0700
Subject	fixing an horrific formatted csv file.
Message-ID	<47e2e29d-b5c3-4aa6-abf9-3b1e46eb0dec@googlegroups.com>

What I am trying to do is to reformat a csv file into something more usable.
currently the file has no headers, multiple lines with varying columns that are not related.

This is a sample

Meeting,05/07/14,RHIL,Rosehill Gardens,Weights,TAB,+3m Entire Circuit,          ,
Race,1,CIVIC STAKES,CIVIC,CIVIC,1350,~         ,3U        ,~         ,QLT       ,54,0,0,5/07/2014,,          ,          ,          ,          ,No class restriction, Quality, For Three-Years-Old and Upwards, No sex restriction, (Listed),Of $100000. First $60000, second $20000, third $10000, fourth $5000, fifth $2000, sixth $1000, seventh $1000, eighth $1000
Horse,1,Bennetta,0,"Grahame Begg",Randwick,,0,0,16-3-1-3 $390450.00,,0,0,0,,98.00,M,
Horse,2,Breakfast in Bed,0,"David Vandyke",Warwick Farm,,0,0,20-6-1-5 $201250.00,,0,0,0,,81.00,M,
Horse,3,Capital Commander,0,"Gerald Ryan",Rosehill,,0,0,43-9-9-3 $438625.00,,0,0,0,,85.00,M,
Horse,4,Coup Ay Tee (NZ),0,"Chris Waller",Rosehill,,0,0,35-9-6-5 $519811.00,,0,0,0,,101.00,G,
Horse,5,Generalife,0,"John O'Shea",Warwick Farm,,0,0,19-6-1-3 $235045.00,,0,0,0,,87.00,G,
Horse,6,He's Your Man (FR),0,"Chris Waller",Rosehill,,0,0,13-2-3-1 $108110.00,,0,0,0,,93.00,G,
Horse,7,Hidden Kisses,0,"Chris Waller",Rosehill,,0,0,40-8-8-5 $565750.00,,0,0,0,,96.00,M,
Horse,8,Oakfield Commands,0,"Gerald Ryan",Rosehill,,0,0,22-7-4-6 $269530.00,,0,0,0,,94.00,G,
Horse,9,Taxmeifyoucan,0,"Gregory Hickman",Warwick Farm,,0,0,18-2-4-4 $539730.00,,0,0,0,,91.00,G,
Horse,10,The Peak,0,"Bart & James Cummings",Randwick,,0,0,15-6-1-0 $426732.00,,0,0,0,,95.00,G,
Horse,11,Tougher Than Ever (NZ),0,"Chris Waller",Rosehill,,0,0,17-3-2-3 $321613.00,,0,0,0,,97.00,H,
Horse,12,TROMSO,0,"Chris Waller",Rosehill,,0,0,47-8-11-2 $622300.00,,0,0,0,,103.00,G,
Race,2,FLYING WELTER - BENCHMARK 95 HCP,BM95,BM95,1100,BM95      ,3U        ,~         ,HCP       ,54,0,0,5/07/2014,,          ,          ,          ,          ,BenchMark 95, Handicap, For Three-Years-Old and Upwards, No sex restriction,Of $85000. First $48750, second $16750, third $8350, fourth $4150, fifth $2000, sixth $1000, seventh $1000, eighth $1000, ninth $1000, tenth $1000
Horse,1,Big Bonanza,0,"Don Robb",Wyong,,0,57.5,31-9-4-3 $366860.00,,0,0,0,,92.00,G,
Horse,2,Casual Choice,0,"Joseph Pride",Warwick Farm,,0,54,8-2-3-0 $105930.00,,0,0,0,

So what I am trying to so is end up with an output like this.

Meeting, Date, Race, Number, Name, Trainer, Location
Rosehill, 05/07/14, 1, 1,Bennetta,"Grahame Begg",Randwick,
Rosehill, 05/07/14, 1, 2,Breakfast in Bed,"David Vandyke",Warwick Farm,

So as a start i thought i would try inserting the Meeting and Race number however I am just not getting it right.

import csv

outfile = open("/home/sayth/Scripts/cleancsv.csv", "w")
with open('/home/sayth/Scripts/test.csv') as f:
    f_csv = csv.reader(f)
    headers = next(f_csv)
    for row in f_csv:
        meeting = row[3] in row[0] == 'Meeting'
        new = row.insert(0, meeting)
        while row[1] in row[0] == 'Race' < 9:  # pref less than next found row[0]

            # grab row[1] as id number
            id = row[1]
            # from row[0] and insert it in first position
            new_lines = new.insert(1, id)
            outfile.write(new_lines)
        outfile.close()

How should I go about this?

Thanks

Sayth

[toc] | [next] | [standalone]

#73791

From	MRAB <python@mrabarnett.plus.com>
Date	2014-07-01 15:32 +0100
Message-ID	<mailman.11369.1404225146.18130.python-list@python.org>
In reply to	#73789

On 2014-07-01 15:04, flebber wrote:
> What I am trying to do is to reformat a csv file into something more usable.
> currently the file has no headers, multiple lines with varying columns that are not related.
>
> This is a sample
>
> Meeting,05/07/14,RHIL,Rosehill Gardens,Weights,TAB,+3m Entire Circuit,          ,
> Race,1,CIVIC STAKES,CIVIC,CIVIC,1350,~         ,3U        ,~         ,QLT       ,54,0,0,5/07/2014,,          ,          ,          ,          ,No class restriction, Quality, For Three-Years-Old and Upwards, No sex restriction, (Listed),Of $100000. First $60000, second $20000, third $10000, fourth $5000, fifth $2000, sixth $1000, seventh $1000, eighth $1000
> Horse,1,Bennetta,0,"Grahame Begg",Randwick,,0,0,16-3-1-3 $390450.00,,0,0,0,,98.00,M,
> Horse,2,Breakfast in Bed,0,"David Vandyke",Warwick Farm,,0,0,20-6-1-5 $201250.00,,0,0,0,,81.00,M,
> Horse,3,Capital Commander,0,"Gerald Ryan",Rosehill,,0,0,43-9-9-3 $438625.00,,0,0,0,,85.00,M,
> Horse,4,Coup Ay Tee (NZ),0,"Chris Waller",Rosehill,,0,0,35-9-6-5 $519811.00,,0,0,0,,101.00,G,
> Horse,5,Generalife,0,"John O'Shea",Warwick Farm,,0,0,19-6-1-3 $235045.00,,0,0,0,,87.00,G,
> Horse,6,He's Your Man (FR),0,"Chris Waller",Rosehill,,0,0,13-2-3-1 $108110.00,,0,0,0,,93.00,G,
> Horse,7,Hidden Kisses,0,"Chris Waller",Rosehill,,0,0,40-8-8-5 $565750.00,,0,0,0,,96.00,M,
> Horse,8,Oakfield Commands,0,"Gerald Ryan",Rosehill,,0,0,22-7-4-6 $269530.00,,0,0,0,,94.00,G,
> Horse,9,Taxmeifyoucan,0,"Gregory Hickman",Warwick Farm,,0,0,18-2-4-4 $539730.00,,0,0,0,,91.00,G,
> Horse,10,The Peak,0,"Bart & James Cummings",Randwick,,0,0,15-6-1-0 $426732.00,,0,0,0,,95.00,G,
> Horse,11,Tougher Than Ever (NZ),0,"Chris Waller",Rosehill,,0,0,17-3-2-3 $321613.00,,0,0,0,,97.00,H,
> Horse,12,TROMSO,0,"Chris Waller",Rosehill,,0,0,47-8-11-2 $622300.00,,0,0,0,,103.00,G,
> Race,2,FLYING WELTER - BENCHMARK 95 HCP,BM95,BM95,1100,BM95      ,3U        ,~         ,HCP       ,54,0,0,5/07/2014,,          ,          ,          ,          ,BenchMark 95, Handicap, For Three-Years-Old and Upwards, No sex restriction,Of $85000. First $48750, second $16750, third $8350, fourth $4150, fifth $2000, sixth $1000, seventh $1000, eighth $1000, ninth $1000, tenth $1000
> Horse,1,Big Bonanza,0,"Don Robb",Wyong,,0,57.5,31-9-4-3 $366860.00,,0,0,0,,92.00,G,
> Horse,2,Casual Choice,0,"Joseph Pride",Warwick Farm,,0,54,8-2-3-0 $105930.00,,0,0,0,
>
> So what I am trying to so is end up with an output like this.
>
> Meeting, Date, Race, Number, Name, Trainer, Location
> Rosehill, 05/07/14, 1, 1,Bennetta,"Grahame Begg",Randwick,
> Rosehill, 05/07/14, 1, 2,Breakfast in Bed,"David Vandyke",Warwick Farm,
>
> So as a start i thought i would try inserting the Meeting and Race number however I am just not getting it right.
>
> import csv
>
> outfile = open("/home/sayth/Scripts/cleancsv.csv", "w")
> with open('/home/sayth/Scripts/test.csv') as f:
>      f_csv = csv.reader(f)
>      headers = next(f_csv)
>      for row in f_csv:
>          meeting = row[3] in row[0] == 'Meeting'
>          new = row.insert(0, meeting)
>          while row[1] in row[0] == 'Race' < 9:  # pref less than next found row[0]
>
>              # grab row[1] as id number
>              id = row[1]
>              # from row[0] and insert it in first position
>              new_lines = new.insert(1, id)
>              outfile.write(new_lines)
>          outfile.close()
>
> How should I go about this?
>
There's no point in reading the first row as the headers because it
clearly doesn't contain just the headings.

First write a row for the header.

Then, for each row:

If the first field is 'Meeting', then remember the meeting, etc.

If the first field is 'Race', then remember the race, etc.

If the first field is 'Horse', then write the row with the additional
fields for race, etc.

And so on.

BTW, the indentation for the 'outfile.close()' line is wrong. It would,
of course, be better to use the 'with' statement for that file too.

[toc] | [prev] | [next] | [standalone]

#73809

From	"F.R." <anthra.norell@bluewin.ch>
Date	2014-07-01 22:49 +0200
Message-ID	<mailman.11385.1404247829.18130.python-list@python.org>
In reply to	#73789

On 07/01/2014 04:04 PM, flebber wrote:
> What I am trying to do is to reformat a csv file into something more usable.
> currently the file has no headers, multiple lines with varying columns that are not related.
>
> This is a sample
>
> Meeting,05/07/14,RHIL,Rosehill Gardens,Weights,TAB,+3m Entire Circuit,          ,
> Race,1,CIVIC STAKES,CIVIC,CIVIC,1350,~         ,3U        ,~         ,QLT       ,54,0,0,5/07/2014,,          ,          ,          ,          ,No class restriction, Quality, For Three-Years-Old and Upwards, No sex restriction, (Listed),Of $100000. First $60000, second $20000, third $10000, fourth $5000, fifth $2000, sixth $1000, seventh $1000, eighth $1000
> Horse,1,Bennetta,0,"Grahame Begg",Randwick,,0,0,16-3-1-3 $390450.00,,0,0,0,,98.00,M,
> Horse,2,Breakfast in Bed,0,"David Vandyke",Warwick Farm,,0,0,20-6-1-5 $201250.00,,0,0,0,,81.00,M,
> Horse,3,Capital Commander,0,"Gerald Ryan",Rosehill,,0,0,43-9-9-3 $438625.00,,0,0,0,,85.00,M,
> Horse,4,Coup Ay Tee (NZ),0,"Chris Waller",Rosehill,,0,0,35-9-6-5 $519811.00,,0,0,0,,101.00,G,
> Horse,5,Generalife,0,"John O'Shea",Warwick Farm,,0,0,19-6-1-3 $235045.00,,0,0,0,,87.00,G,
> Horse,6,He's Your Man (FR),0,"Chris Waller",Rosehill,,0,0,13-2-3-1 $108110.00,,0,0,0,,93.00,G,
> Horse,7,Hidden Kisses,0,"Chris Waller",Rosehill,,0,0,40-8-8-5 $565750.00,,0,0,0,,96.00,M,
> Horse,8,Oakfield Commands,0,"Gerald Ryan",Rosehill,,0,0,22-7-4-6 $269530.00,,0,0,0,,94.00,G,
> Horse,9,Taxmeifyoucan,0,"Gregory Hickman",Warwick Farm,,0,0,18-2-4-4 $539730.00,,0,0,0,,91.00,G,
> Horse,10,The Peak,0,"Bart & James Cummings",Randwick,,0,0,15-6-1-0 $426732.00,,0,0,0,,95.00,G,
> Horse,11,Tougher Than Ever (NZ),0,"Chris Waller",Rosehill,,0,0,17-3-2-3 $321613.00,,0,0,0,,97.00,H,
> Horse,12,TROMSO,0,"Chris Waller",Rosehill,,0,0,47-8-11-2 $622300.00,,0,0,0,,103.00,G,
> Race,2,FLYING WELTER - BENCHMARK 95 HCP,BM95,BM95,1100,BM95      ,3U        ,~         ,HCP       ,54,0,0,5/07/2014,,          ,          ,          ,          ,BenchMark 95, Handicap, For Three-Years-Old and Upwards, No sex restriction,Of $85000. First $48750, second $16750, third $8350, fourth $4150, fifth $2000, sixth $1000, seventh $1000, eighth $1000, ninth $1000, tenth $1000
> Horse,1,Big Bonanza,0,"Don Robb",Wyong,,0,57.5,31-9-4-3 $366860.00,,0,0,0,,92.00,G,
> Horse,2,Casual Choice,0,"Joseph Pride",Warwick Farm,,0,54,8-2-3-0 $105930.00,,0,0,0,
>
> So what I am trying to so is end up with an output like this.
>
> Meeting, Date, Race, Number, Name, Trainer, Location
> Rosehill, 05/07/14, 1, 1,Bennetta,"Grahame Begg",Randwick,
> Rosehill, 05/07/14, 1, 2,Breakfast in Bed,"David Vandyke",Warwick Farm,
>
> So as a start i thought i would try inserting the Meeting and Race number however I am just not getting it right.
>
> import csv
>
> outfile = open("/home/sayth/Scripts/cleancsv.csv", "w")
> with open('/home/sayth/Scripts/test.csv') as f:
>      f_csv = csv.reader(f)
>      headers = next(f_csv)
>      for row in f_csv:
>          meeting = row[3] in row[0] == 'Meeting'
>          new = row.insert(0, meeting)
>          while row[1] in row[0] == 'Race' < 9:  # pref less than next found row[0]
>
>              # grab row[1] as id number
>              id = row[1]
>              # from row[0] and insert it in first position
>              new_lines = new.insert(1, id)
>              outfile.write(new_lines)
>          outfile.close()
>
> How should I go about this?
>
> Thanks
>
> Sayth

Reformatting is what I do most and over time I have acquired some 
practice. Complete solutions are not
often proposed, possibly sneered on for their officiousness. In that 
case I apologize. I couldn't resist. It is such a nice example. Having 
solved it, I figure why not share it . . .

Frederic

------------------------------------------------------------------------------------------------------------

def race_table (csv_text):
     input_table = [[item.strip(' "') for item in record.split (',')] 
for record in csv_text.splitlines ()]
     # At this point look at input_table to find the record indices
     output_table = []
     for record in input_table:
         if record [0] == 'Meeting':
             meeting = record [3]
         elif record [0] == 'Race':
             date = record [13]
             race = record [1]
         elif record [0] == 'Horse':
             number = record [1]
             name = record [2]
             trainer = record [4]
             location = record [5]
             output_table.append ((meeting, date, race, number, name, 
trainer, location))
     return output_table


 >>> for record in race_table (your_csv_text): print record

('Rosehill Gardens', '5/07/2014', '1', '1', 'Bennetta', 'Grahame Begg', 
'Randwick')
('Rosehill Gardens', '5/07/2014', '1', '2', 'Breakfast in Bed', 'David 
Vandyke', 'Warwick Farm')
('Rosehill Gardens', '5/07/2014', '1', '3', 'Capital Commander', 'Gerald 
Ryan', 'Rosehill')
('Rosehill Gardens', '5/07/2014', '1', '4', 'Coup Ay Tee (NZ)', 'Chris 
Waller', 'Rosehill')
('Rosehill Gardens', '5/07/2014', '1', '5', 'Generalife', "John O'Shea", 
'Warwick Farm')
('Rosehill Gardens', '5/07/2014', '1', '6', "He's Your Man (FR)", 'Chris 
Waller', 'Rosehill')
('Rosehill Gardens', '5/07/2014', '1', '7', 'Hidden Kisses', 'Chris 
Waller', 'Rosehill')
('Rosehill Gardens', '5/07/2014', '1', '8', 'Oakfield Commands', 'Gerald 
Ryan', 'Rosehill')
('Rosehill Gardens', '5/07/2014', '1', '9', 'Taxmeifyoucan', 'Gregory 
Hickman', 'Warwick Farm')
('Rosehill Gardens', '5/07/2014', '1', '10', 'The Peak', 'Bart & James 
Cummings', 'Randwick')
('Rosehill Gardens', '5/07/2014', '1', '11', 'Tougher Than Ever (NZ)', 
'Chris Waller', 'Rosehill')
('Rosehill Gardens', '5/07/2014', '1', '12', 'TROMSO', 'Chris Waller', 
'Rosehill')
('Rosehill Gardens', '5/07/2014', '2', '1', 'Big Bonanza', 'Don Robb', 
'Wyong')
('Rosehill Gardens', '5/07/2014', '2', '2', 'Casual Choice', 'Joseph 
Pride', 'Warwick Farm')

 >>> TM = TX.Table_Maker (headings = 
('Meeting','Date','Race','Number','Name','Trainer','Location'))
 >>> TM (race_table (your_csv_text)).write ()

  Meeting          | Date      | Race | Number | Name                   
| Trainer               | Location     |

  Rosehill Gardens | 5/07/2014 |    1 |      1 | Bennetta               
| Grahame Begg          | Randwick     |
  Rosehill Gardens | 5/07/2014 |    1 |      2 | Breakfast in Bed       
| David Vandyke         | Warwick Farm |
  Rosehill Gardens | 5/07/2014 |    1 |      3 | Capital Commander      
| Gerald Ryan           | Rosehill     |
  Rosehill Gardens | 5/07/2014 |    1 |      4 | Coup Ay Tee (NZ)       
| Chris Waller          | Rosehill     |
  Rosehill Gardens | 5/07/2014 |    1 |      5 | Generalife             
| John O'Shea           | Warwick Farm |
  Rosehill Gardens | 5/07/2014 |    1 |      6 | He's Your Man (FR)     
| Chris Waller          | Rosehill     |
  Rosehill Gardens | 5/07/2014 |    1 |      7 | Hidden Kisses          
| Chris Waller          | Rosehill     |
  Rosehill Gardens | 5/07/2014 |    1 |      8 | Oakfield Commands      
| Gerald Ryan           | Rosehill     |
  Rosehill Gardens | 5/07/2014 |    1 |      9 | Taxmeifyoucan          
| Gregory Hickman       | Warwick Farm |
  Rosehill Gardens | 5/07/2014 |    1 |     10 | The Peak               
| Bart & James Cummings | Randwick     |
  Rosehill Gardens | 5/07/2014 |    1 |     11 | Tougher Than Ever (NZ) 
| Chris Waller          | Rosehill     |
  Rosehill Gardens | 5/07/2014 |    1 |     12 | TROMSO                 
| Chris Waller          | Rosehill     |
  Rosehill Gardens | 5/07/2014 |    2 |      1 | Big Bonanza            
| Don Robb              | Wyong        |
  Rosehill Gardens | 5/07/2014 |    2 |      2 | Casual Choice          
| Joseph Pride          | Warwick Farm |

[toc] | [prev] | [next] | [standalone]

#73811

From	flebber <flebber.crue@gmail.com>
Date	2014-07-01 14:41 -0700
Message-ID	<0d3871c6-81d4-4168-9408-ad85299b0955@googlegroups.com>
In reply to	#73809

That's a really cool solution.

I understand why providing full solutions is frowned upon, because it doesn't assist in learning. Which is true,  it's incredibly helpful in this case.

The python cookbook is really good and what I was using as a start for dealing with csv. But it doesn't even go anywhere near this. Lots of examples with simple inputs.

Anyway Thanks again

 Sayth

[toc] | [prev] | [next] | [standalone]

#73819

From	Chris Angelico <rosuav@gmail.com>
Date	2014-07-02 11:20 +1000
Message-ID	<mailman.11392.1404264061.18130.python-list@python.org>
In reply to	#73811

On Wed, Jul 2, 2014 at 7:41 AM, flebber <flebber.crue@gmail.com> wrote:
> I understand why providing full solutions is frowned upon, because it doesn't assist in learning. Which is true,  it's incredibly helpful in this case.

In this case, my main reason for not providing a full solution is that
the work tends to be iterative. When I have a huge and messy file,
what I usually do is grab the first half-dozen lines and work out how
I'd go about fixing them manually, then write a script that does that.
Then run the script on the whole file, and see where it either chokes
or produces wrong data. Pick up the first few lines of wrong data,
figure out how to tweak the program to handle those. Rinse and repeat.

Often, what that results in is a file that gets progressively tidier.
When the scope of the mess is infinite (like with human-entered data -
believe you me, you haven't seen messy until you've seen what a
committee can do to a simple job), this means you stop working on the
script at exactly the point where it stops being worth the effort -
which is something that only you can decide.

ChrisA

[toc] | [prev] | [next] | [standalone]

#73830

From	flebber <flebber.crue@gmail.com>
Date	2014-07-02 02:13 -0700
Message-ID	<a84826ea-4018-40bc-88c1-812be5417e6b@googlegroups.com>
In reply to	#73819

> >>> TM = TX.Table_Maker (headings = 
> ('Meeting','Date','Race','Number','Name','Trainer','Location')) 
> >>> TM (race_table (your_csv_text)).write () 

Where do I find TX? Found this mention in the list, was it available in pip by any name?
https://mail.python.org/pipermail/python-list/2014-February/667464.html

Sayth

[toc] | [prev] | [next] | [standalone]

#73840

From	"F.R." <anthra.norell@bluewin.ch>
Date	2014-07-02 17:51 +0200
Message-ID	<mailman.11411.1404316334.18130.python-list@python.org>
In reply to	#73830

On 07/02/2014 11:13 AM, flebber wrote:
>>>>> TM = TX.Table_Maker (headings =
>> ('Meeting','Date','Race','Number','Name','Trainer','Location'))
>>>>> TM (race_table (your_csv_text)).write ()
> Where do I find TX? Found this mention in the list, was it available in pip by any name?
> https://mail.python.org/pipermail/python-list/2014-February/667464.html
>
> Sayth

I'd have to make it available. I proposed it some time ago and received 
a couple of suggestions in return. It is a modular transformation 
framework written entirely in python (2.7). It consists essentially of a 
base class "Transformer" that handles input and output in such a way 
that Transformer objects can be chained. It saved me from drowning an a 
horrible and growing tangle of hacks. Finding something usable I had 
previously done took time. Understanding how it worked took more time 
and adapting it took still more time, so that writing yet another hack 
from scratch was faster.
     A number of hacks I could quickly wrap into a Transformer object 
and so could start building a library of standard Transformers. The 
Table_Maker is one of them. The table making code is quite bad. It 
suffers from feature overload. I would clean it up for distribution.
     I'd be happy to distribute the base class and a few standard 
Translators, such as I use every day. (File Reader, File Writer, DB Run 
Command, DB Write, Table Maker, PDF To Text, Text To Lines, Lines To 
Text, Sort, Sort And Unique, etc.) Writing one's own Transformers is a 
breeze. Testing too, because a Transformer keeps its input and output 
and, in line with the system's design philosophy, does only its own 
single thing.
     A Chain is a list of Transformers that run in sequence. It is 
itself derived from Transformer and is a functional equivalent. So 
Chains nest. Fixing a Chain that nothing comes out of is a 
straightforward matter too. It will still have run up to the failing 
element. Chain.show () reveals the culprit as the first one to have no 
output.
     I am not up to date on distributing and would depend on qualified 
help on that.

Frederic


--------------------------------------------------------------------------------


A brief overview


The TX solution to your race table would be (TX is the name of the module):

     class Race_Table (TX.Transformer):
         '''
         In: CSV text
         Out: Tabular data (2-dimensional list)
         '''
         name = 'Race_Table'
         @TX.setup   # Checks timestamps to prevent needless reruns in 
the absence of new input
         def transform (self):
             for line in self.Input.data:
                 # See my post
             self.Output.take (output_table)

     Example file to file:
     >>> Race_Schedule_F2F = TX.Chain (TX.File_Reader (), Race_Table (), 
TX.List_To_CSV (delimiter = ';'), TX.File_Writer (terminal = out_file_name)
     >>> Race_Schedule_F2F (input_file_name)   # Does it all!

     Example web to database:
     >>> Race_Schedule_WWW2DB = TX.Chain (TX.WWW_Reader (), 
Race_Schedule_HTML_Reader (), Race_Table (), TX.DB_Writer (table_name = 
'horses'))
     >>> Race_Schedule_WWW2DB (url)   # Does is all! You'd have to write 
the Race_Schedule_HTML_Reader

     Verify your table:
     >>> Table_Viewer = TX.Chain (TX.Table_Maker (), TX.Table_Writer ())
     >>> Race_Schedule_WWW2DB.show_tree () # See which one should display
     Chain
     Chain[0] - WWW Reader
     Chain[1] - Race_Schedule_HTML_Reader
     Chain[2] - Race_Table
     Chain[3] - DB Writer
     >>> print Table_Viewer (Race_Schedule_WWW2DB[2]()) # All 
Transformers keep their data
     (Display of table)

     Verify database:
     >>> print Table_Viewer (TX.DB_Reader (table_name = 'horses')())
     (Display of database table)

[toc] | [prev] | [next] | [standalone]

#73936

From	flebber <flebber.crue@gmail.com>
Date	2014-07-03 21:12 -0700
Message-ID	<11ecf009-6f81-4fa5-bee9-b52b9407f0af@googlegroups.com>
In reply to	#73840

I have taken the code and gone a little further, but I need to be able to protect myself against commas and single quotes in names.

How is it the best to do this?

so in my file I had on line 44 this trainer name.

"Michael, Wayne & John Hawkes" 

and in line 95 this horse name.
Inz'n'out

this throws of my capturing correct item 9. How do I protect against this?

Here is current code.

import re
from sys import argv
SCRIPT, FILENAME = argv


def out_file_name(file_name):
    """take an input file and keep the name with appended _clean"""
    file_parts = file_name.split(".",)
    output_file = file_parts[0] + '_clean.' + file_parts[1]
    return output_file


def race_table(text_file):
    """utility to reorganise poorly made csv entry"""
    input_table = [[item.strip(' "') for item in record.split(',')]
                   for record in text_file.splitlines()]
    # At this point look at input_table to find the record indices
    output_table = []
    for record in input_table:
        if record[0] == 'Meeting':
            meeting = record[3]
        elif record[0] == 'Race':
            date = record[13]
            race = record[1]
        elif record[0] == 'Horse':
            number = record[1]
            name = record[2]
            results = record[9]
            res_split = re.split('[- ]', results)
            starts = res_split[0]
            wins = res_split[1]
            seconds = res_split[2]
            thirds = res_split[3]
            prizemoney = res_split[4]
            trainer = record[4]
            location = record[5]
            print(name, wins, seconds)
            output_table.append((meeting, date, race, number, name,
                                 starts, wins, seconds, thirds, prizemoney,
                                 trainer, location))
    return output_table

MY_FILE = out_file_name(FILENAME)

# with open(FILENAME, 'r') as f_in, open(MY_FILE, 'w') as f_out:
#     for line in race_table(f_in.readline()):
#         new_row = line
with open(FILENAME, 'r') as f_in, open(MY_FILE, 'w') as f_out:
    CONTENT = f_in.read()
    # print(content)
    FILE_CONTENTS = race_table(CONTENT)
    # print new_name
    f_out.write(str(FILE_CONTENTS))


if __name__ == '__main__':
    pass

[toc] | [prev] | [next] | [standalone]

#73937

From	Gregory Ewing <greg.ewing@canterbury.ac.nz>
Date	2014-07-04 18:19 +1200
Message-ID	<c1mvavFhpj7U1@mid.individual.net>
In reply to	#73936

flebber wrote:
> so in my file I had on line 44 this trainer name.
> 
> "Michael, Wayne & John Hawkes"
> 
> and in line 95 this horse name. Inz'n'out
> 
> this throws of my capturing correct item 9. How do I protect against this?

Use python's csv module to read the file. Don't try to
do it yourself; the rules for handling embedded commas
and quotes in csv are quite complicated. As long as
the file is a well-formed csv file, the csv module
should parse fields like that correctly.

-- 
Greg

[toc] | [prev] | [next] | [standalone]

#73949

From	flebber <flebber.crue@gmail.com>
Date	2014-07-04 03:48 -0700
Message-ID	<bee0ef3a-4ec3-4ca6-aa82-dffccf09dc67@googlegroups.com>
In reply to	#73937

On Friday, 4 July 2014 16:19:09 UTC+10, Gregory Ewing  wrote:
> flebber wrote:
> 
> > so in my file I had on line 44 this trainer name.
> 
> > 
> 
> > "Michael, Wayne & John Hawkes"
> 
> > 
> 
> > and in line 95 this horse name. Inz'n'out
> 
> > 
> 
> > this throws of my capturing correct item 9. How do I protect against this?
> 
> 
> 
> Use python's csv module to read the file. Don't try to
> 
> do it yourself; the rules for handling embedded commas
> 
> and quotes in csv are quite complicated. As long as
> 
> the file is a well-formed csv file, the csv module
> 
> should parse fields like that correctly.
> 
> 
> 
> -- 
> 
> Greg

True Greg worked easier

def race_table(text_file):
    """utility to reorganise poorly made csv entry"""
#     input_table = [[item.strip(' "') for item in record.split(',')]
#                    for record in text_file.splitlines()]
# At this point look at input_table to find the record indices
#     identity = string.maketrans("", "")
#     print(input_table)
#     input_table = [s.translate(identity, ",'") for s
#                    in input_table]
    output_table = []
    for record in text_file:
        if record[0] == 'Meeting':
            meeting = record[3]
        elif record[0] == 'Race':
            date = record[13]
            race = record[1]
        elif record[0] == 'Horse':
            number = record[1]
            name = record[2]
            results = record[9]
            res_split = re.split('[- ]', results)
            starts = res_split[0]
            wins = res_split[1]
            seconds = res_split[2]
            thirds = res_split[3]
            try:
                prizemoney = res_split[4]
            finally:
                prizemoney = 0
            trainer = record[4]
            location = record[5]
            print(name, wins, seconds)
            output_table.append((meeting, date, race, number, name,
                                 starts, wins, seconds, thirds, prizemoney,
                                 trainer, location))
    return output_table

MY_FILE = out_file_name(FILENAME)

# with open(FILENAME, 'r') as f_in, open(MY_FILE, 'w') as f_out:
#     for line in race_table(f_in.readline()):
#         new_row = line
with open(FILENAME, 'r') as f_in, open(MY_FILE, 'w') as f_out:
    CONTENT = csv.reader(f_in)
    # print(content)
    FILE_CONTENTS = race_table(CONTENT)
    # print new_name
    f_out.write(str(FILE_CONTENTS))


if __name__ == '__main__':
    pass

Sayth

[toc] | [prev] | [next] | [standalone]

#73947

From	flebber <flebber.crue@gmail.com>
Date	2014-07-04 03:28 -0700
Message-ID	<c783c959-081d-42d1-8bd0-96342bd2557c@googlegroups.com>
In reply to	#73936

On Friday, 4 July 2014 14:12:15 UTC+10, flebber  wrote:
> I have taken the code and gone a little further, but I need to be able to protect myself against commas and single quotes in names.
> 
> 
> 
> How is it the best to do this?
> 
> 
> 
> so in my file I had on line 44 this trainer name.
> 
> 
> 
> "Michael, Wayne & John Hawkes" 
> 
> 
> 
> and in line 95 this horse name.
> 
> Inz'n'out
> 
> 
> 
> this throws of my capturing correct item 9. How do I protect against this?
> 
> 
> 
> Here is current code.
> 
> 
> 
> import re
> 
> from sys import argv
> 
> SCRIPT, FILENAME = argv
> 
> 
> 
> 
> 
> def out_file_name(file_name):
> 
>     """take an input file and keep the name with appended _clean"""
> 
>     file_parts = file_name.split(".",)
> 
>     output_file = file_parts[0] + '_clean.' + file_parts[1]
> 
>     return output_file
> 
> 
> 
> 
> 
> def race_table(text_file):
> 
>     """utility to reorganise poorly made csv entry"""
> 
>     input_table = [[item.strip(' "') for item in record.split(',')]
> 
>                    for record in text_file.splitlines()]
> 
>     # At this point look at input_table to find the record indices
> 
>     output_table = []
> 
>     for record in input_table:
> 
>         if record[0] == 'Meeting':
> 
>             meeting = record[3]
> 
>         elif record[0] == 'Race':
> 
>             date = record[13]
> 
>             race = record[1]
> 
>         elif record[0] == 'Horse':
> 
>             number = record[1]
> 
>             name = record[2]
> 
>             results = record[9]
> 
>             res_split = re.split('[- ]', results)
> 
>             starts = res_split[0]
> 
>             wins = res_split[1]
> 
>             seconds = res_split[2]
> 
>             thirds = res_split[3]
> 
>             prizemoney = res_split[4]
> 
>             trainer = record[4]
> 
>             location = record[5]
> 
>             print(name, wins, seconds)
> 
>             output_table.append((meeting, date, race, number, name,
> 
>                                  starts, wins, seconds, thirds, prizemoney,
> 
>                                  trainer, location))
> 
>     return output_table
> 
> 
> 
> MY_FILE = out_file_name(FILENAME)
> 
> 
> 
> # with open(FILENAME, 'r') as f_in, open(MY_FILE, 'w') as f_out:
> 
> #     for line in race_table(f_in.readline()):
> 
> #         new_row = line
> 
> with open(FILENAME, 'r') as f_in, open(MY_FILE, 'w') as f_out:
> 
>     CONTENT = f_in.read()
> 
>     # print(content)
> 
>     FILE_CONTENTS = race_table(CONTENT)
> 
>     # print new_name
> 
>     f_out.write(str(FILE_CONTENTS))
> 
> 
> 
> 
> 
> if __name__ == '__main__':
> 
>     pass

So I found this on stack overflow

In [2]: import string

In [3]: identity = string.maketrans("", "")

In [4]: x = ['+5556', '-1539', '-99', '+1500']

In [5]: x = [s.translate(identity, "+-") for s in x]

In [6]: x
Out[6]: ['5556', '1539', '99', '1500']

but it fails in my file, due to I believe mine being a list of list. Is there an easy way to iterate the sublists without flattening?

Current code.

    input_table = [[item.strip(' "') for item in record.split(',')]
                   for record in text_file.splitlines()]
    # At this point look at input_table to find the record indices
    identity = string.maketrans("", "")
    print(input_table)
    input_table = [s.translate(identity, ",'") for s
                   in input_table]

Sayth

[toc] | [prev] | [next] | [standalone]

#73953

From	"F.R." <anthra.norell@bluewin.ch>
Date	2014-07-04 15:24 +0200
Message-ID	<mailman.11491.1404480345.18130.python-list@python.org>
In reply to	#73947

On 07/04/2014 12:28 PM, flebber wrote:
> On Friday, 4 July 2014 14:12:15 UTC+10, flebber  wrote:
>> I have taken the code and gone a little further, but I need to be able to protect myself against commas and single quotes in names.
>>
>>
>>
>> How is it the best to do this?
>>
>>
>>
>> so in my file I had on line 44 this trainer name.
>>
>>
>>
>> "Michael, Wayne & John Hawkes"
>>
>>
>>
>> and in line 95 this horse name.
>>
>> Inz'n'out
>>
>>
>>
>> this throws of my capturing correct item 9. How do I protect against this?
>>
>>
>>
>> Here is current code.
>>
>>
>>
>> import re
>>
>> from sys import argv
>>
>> SCRIPT, FILENAME = argv
>>
>>
>>
>>
>>
>> def out_file_name(file_name):
>>
>>      """take an input file and keep the name with appended _clean"""
>>
>>      file_parts = file_name.split(".",)
>>
>>      output_file = file_parts[0] + '_clean.' + file_parts[1]
>>
>>      return output_file
>>
>>
>>
>>
>>
>> def race_table(text_file):
>>
>>      """utility to reorganise poorly made csv entry"""
>>
>>      input_table = [[item.strip(' "') for item in record.split(',')]
>>
>>                     for record in text_file.splitlines()]
>>
>>      # At this point look at input_table to find the record indices
>>
>>      output_table = []
>>
>>      for record in input_table:
>>
>>          if record[0] == 'Meeting':
>>
>>              meeting = record[3]
>>
>>          elif record[0] == 'Race':
>>
>>              date = record[13]
>>
>>              race = record[1]
>>
>>          elif record[0] == 'Horse':
>>
>>              number = record[1]
>>
>>              name = record[2]
>>
>>              results = record[9]
>>
>>              res_split = re.split('[- ]', results)
>>
>>              starts = res_split[0]
>>
>>              wins = res_split[1]
>>
>>              seconds = res_split[2]
>>
>>              thirds = res_split[3]
>>
>>              prizemoney = res_split[4]
>>
>>              trainer = record[4]
>>
>>              location = record[5]
>>
>>              print(name, wins, seconds)
>>
>>              output_table.append((meeting, date, race, number, name,
>>
>>                                   starts, wins, seconds, thirds, prizemoney,
>>
>>                                   trainer, location))
>>
>>      return output_table
>>
>>
>>
>> MY_FILE = out_file_name(FILENAME)
>>
>>
>>
>> # with open(FILENAME, 'r') as f_in, open(MY_FILE, 'w') as f_out:
>>
>> #     for line in race_table(f_in.readline()):
>>
>> #         new_row = line
>>
>> with open(FILENAME, 'r') as f_in, open(MY_FILE, 'w') as f_out:
>>
>>      CONTENT = f_in.read()
>>
>>      # print(content)
>>
>>      FILE_CONTENTS = race_table(CONTENT)
>>
>>      # print new_name
>>
>>      f_out.write(str(FILE_CONTENTS))
>>
>>
>>
>>
>>
>> if __name__ == '__main__':
>>
>>      pass
> So I found this on stack overflow
>
> In [2]: import string
>
> In [3]: identity = string.maketrans("", "")
>
> In [4]: x = ['+5556', '-1539', '-99', '+1500']
>
> In [5]: x = [s.translate(identity, "+-") for s in x]
>
> In [6]: x
> Out[6]: ['5556', '1539', '99', '1500']
>
> but it fails in my file, due to I believe mine being a list of list. Is there an easy way to iterate the sublists without flattening?
>
> Current code.
>
>      input_table = [[item.strip(' "') for item in record.split(',')]
>                     for record in text_file.splitlines()]
>      # At this point look at input_table to find the record indices
>      identity = string.maketrans("", "")
>      print(input_table)
>      input_table = [s.translate(identity, ",'") for s
>                     in input_table]
>
> Sayth

Take Gregory's advice and use the csv module. Don't reinvent a csv 
parser. My "csv" splitter was the simplest approach possible, which I 
tend to use with undocumented formats, tweaking for unexpected features 
as they come along.

Frederic

[toc] | [prev] | [standalone]

csiph-web

fixing an horrific formatted csv file.

Contents

#73789 — fixing an horrific formatted csv file.

#73791

#73809

#73811

#73819

#73830

#73840

#73936

#73937

#73949

#73947

#73953