Groups > comp.lang.python > #54344 > unrolled thread

.csv to .txt after adding columns

Started by	Bryan Britten <britten.bryan@gmail.com>
First post	2013-09-17 18:42 -0700
Last post	2013-09-18 18:47 -0400
Articles	8 — 5 participants

Back to article view | Back to comp.lang.python

  *.csv to *.txt after adding columns Bryan Britten <britten.bryan@gmail.com> - 2013-09-17 18:42 -0700
    Re: *.csv to *.txt after adding columns Dave Angel <davea@davea.name> - 2013-09-18 02:18 +0000
    Re: *.csv to *.txt after adding columns Bryan Britten <britten.bryan@gmail.com> - 2013-09-17 19:28 -0700
      Re: *.csv to *.txt after adding columns Dave Angel <davea@davea.name> - 2013-09-18 07:55 +0000
    Re: *.csv to *.txt after adding columns Peter Otten <__peter__@web.de> - 2013-09-18 09:14 +0200
    Re: *.csv to *.txt after adding columns rusi <rustompmody@gmail.com> - 2013-09-18 00:44 -0700
    Re: *.csv to *.txt after adding columns Bryan Britten <britten.bryan@gmail.com> - 2013-09-18 04:42 -0700
    Re: *.csv to *.txt after adding columns Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-09-18 18:47 -0400

#54344 — .csv to .txt after adding columns

From	Bryan Britten <britten.bryan@gmail.com>
Date	2013-09-17 18:42 -0700
Subject	.csv to .txt after adding columns
Message-ID	<41edba54-31d3-48ad-a50f-41f87f32d251@googlegroups.com>

Hey, gang, I've got a problem here that I'm sure a handful of you will know how to solve. I've got about 6 *.csv files that I am trying to open; change the header names (to get rid of spaces); add two new columns, which are just the results of a string.split() command; drop the column I just split; and then finally export to *.txt files. Here's the code I'm using:

import os
import csv


fileHandle = 'Path/To/Data'
varNames = 'ID\tCaseNum\tDate\tTime\tBlock\tIUCR\tPrimaryType\tDescription\tLocDesc\tArrest\tDomestic\tBeat\tDistrict\tWard\tCommArea\tFBICode\tXCoord\tYCoord\tYear\tUpdatedOn\tLat\tLong\tLoc\n'

for csvFile in os.listdir(fileHandle):
    outFile = open(fileHandle + os.path.splitext(csvFile)[0] + '.txt', 'w')
    inFile = open(fileHandle + csvFile, 'rb')
    reader = csv.reader(inFile, delimiter=',')
    rowNum = 0
    for row in reader:
        if rowNum == 0:
            outFile.write(varNames)
            rowNum += 1
        else:
            date, time = row[2].split()
            row.insert(3, date)
            row.insert(4, time)
            row.remove(row[2])
            outFile.write('\t'.join(row) + '\n')
    outFile.close()
    inFile.close()  


The issue I'm having is that the *.txt files I'm generating are empty. I assume some unraised error is being thrown, but I'm new to Python and am self taught, so I don't know where to look or how to troubleshoot.

When I run the code on just one file and print the output instead of writing it, it looks exactly like what I'd want. So I'm at a loss for where the problem is.

Any help is appreciated!

[toc] | [next] | [standalone]

#54347

From	Dave Angel <davea@davea.name>
Date	2013-09-18 02:18 +0000
Message-ID	<mailman.99.1379470752.18130.python-list@python.org>
In reply to	#54344

On 17/9/2013 21:42, Bryan Britten wrote:

> Hey, gang, I've got a problem here that I'm sure a handful of you will know how to solve. I've got about 6 *.csv files that I am trying to open; change the header names (to get rid of spaces); add two new columns, which are just the results of a string.split() command; drop the column I just split; and then finally export to *.txt files. Here's the code I'm using:
>
> import os
> import csv
>
>
> fileHandle = 'Path/To/Data'
> varNames = 'ID\tCaseNum\tDate\tTime\tBlock\tIUCR\tPrimaryType\tDescription\tLocDesc\tArrest\tDomestic\tBeat\tDistrict\tWard\tCommArea\tFBICode\tXCoord\tYCoord\tYear\tUpdatedOn\tLat\tLong\tLoc\n'
>
> for csvFile in os.listdir(fileHandle):
>     outFile = open(fileHandle + os.path.splitext(csvFile)[0] + '.txt', 'w')
>     inFile = open(fileHandle + csvFile, 'rb')
>     reader = csv.reader(inFile, delimiter=',')
>     rowNum = 0
>     for row in reader:
>         if rowNum == 0:
>             outFile.write(varNames)
>             rowNum += 1
>         else:
>             date, time = row[2].split()
>             row.insert(3, date)
>             row.insert(4, time)
>             row.remove(row[2])
>             outFile.write('\t'.join(row) + '\n')
>     outFile.close()
>     inFile.close()  
>
>
> The issue I'm having is that the *.txt files I'm generating are empty. I assume some unraised error is being thrown, 

How can you NOT know if an exception is being raised?  How are you
running the code, in a terminal window?  Can't you just see what gets
printed (as stderr)?


> but I'm new to Python and am self taught, so I don't know where to look
> or how to troubleshoot.
>
> When I run the code on just one file and print the output instead of writing it, it looks exactly like what I'd want. So I'm at a loss for where the problem is.

You describe two changes, but don't show them.  How about you do them
one at a time, or perhaps even better, add prints so it both does the
file(s) AND prints the output?

What I think is happening is that you're missing the path separator
between the "Path/To/Data" and the basename.  You should be combining
those with os.path.join(), not with +

For a quick & dirty check, add a trailing slash to the fileHandle. But
that's not the right way to fix it.


-- 
DaveA

[toc] | [prev] | [next] | [standalone]

#54348

From	Bryan Britten <britten.bryan@gmail.com>
Date	2013-09-17 19:28 -0700
Message-ID	<c7b5595f-b847-4172-9a1d-86287e0804d5@googlegroups.com>
In reply to	#54344

Dave -

I can't print the output because there are close to 1,000,000 records. It would be extremely inefficient and resource intensive to look at every row. Like I said, when I take just one file and run the code over the first few records I get what I'd expect to see. Here's an example(non-redacted code):

INPUT:

import csv

fileHandle = 'C:/Users/Bryan/Data Analysis/Crime Analysis/Data/'

varNames = 'ID\tCaseNum\tDate\tTime\tBlock\tIUCR\tPrimaryType\tDescription\tLocDesc\tArrest\tDomestic\tBeat\tDistrict\tWard\tCommArea\tFBICode\tXCoord\tYCoord\tYear\tUpdatedOn\tLat\tLong\tLoc\n'

outFile = open(fileHandle + 'ChiCrime01_02.txt', 'w')
inFile = open(fileHandle + 'ChiCrime01_02.csv', 'rb')
reader = csv.reader(inFile, delimiter=',')
rowNum = 0
for row in reader:
    if rowNum < 5:
        if rowNum >= 1:
            date, time = row[2].split()
            row.insert(3, date)
            row.insert(4, time)
            row.remove(row[2])
            print '\t'.join(row)
            rowNum+=1
        else:
            print varNames
            rowNum+=1


OUTPUT:

ID      CaseNum Date    Time    Block   IUCR    PrimaryType     Description     LocDesc Arrest  Domestic        Beat    District        Ward    CommArea        FBICode XCoord  YCoord  Year    UpdatedOn       Lat     Long    Loc

2924745 HJ602602        12/31/2002      23:59   006XX W 117TH ST        841     THEFT   FINANCIAL ID THEFT:$300 &UNDER  RESIDENCE PORCH/HALLWAY FALSE   FALSE   524     5       34      53      6       1173831 1827387 2002    3/30/2006 21:10 41.68175482     -87.63931351    (41.681754819160666, -87.63931350564216)

2523290 HJ101091        12/31/2002      23:59   002XX W 112TH PL        1310    CRIMINAL DAMAGE TO PROPERTY     APARTMENT       FALSE   FALSE   522             34      49      14      1176848 1830383 2002    3/30/2006 21:10 41.68990907     -87.62817988    (41.689909072449474, -87.62817987594765)

2527332 HJ105139        12/31/2002      23:55   005XX E 89TH PL 486     BATTERY DOMESTIC BATTERY SIMPLE RESIDENCE       FALSE   TRUE    633             6       44      08B     1181369 1845794 2002    3/30/2006 21:10 41.73209609     -87.61115533    (41.732096089465905, -87.61115532670617)

2524251 HJ100175        12/31/2002      23:55   012XX S KARLOV AVE      041A    BATTERY AGGRAVATED: HANDGUN     SIDEWALK        FALSE   FALSE   1011            24      29      04B     1149196 1894387 2002    3/30/2006 21:10 41.86612296     -87.72776536    (41.86612295941429, -87.72776535755746)


Like I said, the output is exactly what I want, but it doesn't seem to be writing to the file and I don't know why. I said I didn't know if it was raising an exception because I'm new to Python and I didn't know if there were some methods that included "silent" errors where it would continue the code but produce the wrong results, such as not writing my files. 

Lastly, why does everyone seem to push for os.path.join versus the method I have used? Is it just a 'standard' that people like to see?

Thanks for your help

[toc] | [prev] | [next] | [standalone]

#54362

From	Dave Angel <davea@davea.name>
Date	2013-09-18 07:55 +0000
Message-ID	<mailman.107.1379490982.18130.python-list@python.org>
In reply to	#54348

On 17/9/2013 22:28, Bryan Britten wrote:

> Dave -
>
> I can't print the output because there are close to 1,000,000 records. It would be extremely inefficient and resource intensive to look at every row.

Not if you made a sample directory with about 3 files, each containing
half a dozen lines.

> Like I said, when I take just one file and run the code over the first
> few records I get what I'd expect to see. Here's an example(non-redacted code):
>
> INPUT:
>
> import csv
>
> fileHandle = 'C:/Users/Bryan/Data Analysis/Crime Analysis/Data/'

Now, that directory specification ends with a slash.  So "+" will work
correctly.  But your original did not.

>
> varNames = 'ID\tCaseNum\tDate\tTime\tBlock\tIUCR\tPrimaryType\tDescription\tLocDesc\tArrest\tDomestic\tBeat\tDistrict\tWard\tCommArea\tFBICode\tXCoord\tYCoord\tYear\tUpdatedOn\tLat\tLong\tLoc\n'
>
> outFile = open(fileHandle + 'ChiCrime01_02.txt', 'w')
> inFile = open(fileHandle + 'ChiCrime01_02.csv', 'rb')

Instead of changing this code, you could have switched to a directory
containing only one file.

>
> Like I said, the output is exactly what I want, but it doesn't seem to be writing to the file and I don't know why. I said I didn't know if it was raising an exception because I'm new to Python and I didn't know if there were some methods that included "silent" errors where it would continue the code but produce the wrong results, such as not writing my files. 
>

The only "silent" exception I know of is the one triggered by
sys.exit(), used to terminate the process.

> Lastly, why does everyone seem to push for os.path.join versus the method I have used? Is it just a 'standard' that people like to see?
>

Because os.path.join will be smart enough to add the slashes only if
they are necessary.   That can be useful, especially if the directory
you're using as a prefix came from the user.

I think Peter's suggestion is probably right on;  you don't limit your
infiles to *.csv, so you will be processing *.txt files the second time.

Another useful debugging aid would have been to add print statements
right after opening the files, perhaps something like:

print "Reading :", infile.name
print "Writing:", outfile.name

If those names had been missing slashes, I would have been vindicated,
while if they were the same, you'd know Peter had nailed it.

-- 
DaveA

[toc] | [prev] | [next] | [standalone]

#54360

From	Peter Otten <__peter__@web.de>
Date	2013-09-18 09:14 +0200
Message-ID	<mailman.106.1379488457.18130.python-list@python.org>
In reply to	#54344

Bryan Britten wrote:

> Hey, gang, I've got a problem here that I'm sure a handful of you will
> know how to solve. I've got about 6 *.csv files that I am trying to open;
> change the header names (to get rid of spaces); add two new columns, which
> are just the results of a string.split() command; drop the column I just
> split; and then finally export to *.txt files. Here's the code I'm using:
> 
> import os
> import csv
> 
> 
> fileHandle = 'Path/To/Data'
> varNames =
> 
'ID\tCaseNum\tDate\tTime\tBlock\tIUCR\tPrimaryType\tDescription\tLocDesc\tArrest\tDomestic\tBeat\tDistrict\tWard\tCommArea\tFBICode\tXCoord\tYCoord\tYear\tUpdatedOn\tLat\tLong\tLoc\n'
> 

You may be processing not just the .csv files, but also the .txt files from 
a previous run of your script. Remove them manually to verify my guess.
If I'm right you can (in order of my preference)

(1) write the output into another directory and also do (2)
(2) filter the input files with glob.glob("/path/to/data/*.csv") 
(3) or just add the two lines below to your current code.

> for csvFile in os.listdir(fileHandle):

      if not csvFile.endswith(".csv"):
          continue

>     outFile = open(fileHandle + os.path.splitext(csvFile)[0] + '.txt',
>     'w') 
>     inFile = open(fileHandle + csvFile, 'rb')
>     reader = csv.reader(inFile, delimiter=',')
>     rowNum = 0
>     for row in reader:
>         if rowNum == 0:
>             outFile.write(varNames)
>             rowNum += 1
>         else:
>             date, time = row[2].split()
>             row.insert(3, date)
>             row.insert(4, time)
>             row.remove(row[2])
>             outFile.write('\t'.join(row) + '\n')
>     outFile.close()
>     inFile.close()
> 
> 
> The issue I'm having is that the *.txt files I'm generating are empty. I
> assume some unraised error is being thrown, but I'm new to Python and am
> self taught, so I don't know where to look or how to troubleshoot.
> 
> When I run the code on just one file and print the output instead of
> writing it, it looks exactly like what I'd want. So I'm at a loss for
> where the problem is.
> 
> Any help is appreciated!

[toc] | [prev] | [next] | [standalone]

#54361

From	rusi <rustompmody@gmail.com>
Date	2013-09-18 00:44 -0700
Message-ID	<19f72b93-27c8-4426-9a97-25d88a9be702@googlegroups.com>
In reply to	#54344

On Wednesday, September 18, 2013 7:12:21 AM UTC+5:30, Bryan Britten wrote:
> Hey, gang, I've got a problem here that I'm sure a handful of you will know how to solve. I've got about 6 *.csv files that I am trying to open; change the header names (to get rid of spaces); add two new columns, which are just the results of a string.split() command; drop the column I just split; and then finally export to *.txt files. Here's the code I'm using:

Its generally a good idea in programming to separate separable problems. Since you say:

> The issue I'm having is that the *.txt files I'm generating are empty. I 
> assume some unraised error is being thrown, but I'm new to Python and am self 
> taught, so I don't know where to look or how to troubleshoot.

> When I run the code on just one file and print the output instead of writing 
> it, it looks exactly like what I'd want. So I'm at a loss for where the 
> problem is. 

you should probably write and have checked the code that handles one file. For example a function that takes as parameters the two: the input/output files (maybe names or handles), and then
- check that that function works for one file
- call it in your loop

[BTW your code seems to mixup the variable-names for file-names and handles]

[toc] | [prev] | [next] | [standalone]

#54373

From	Bryan Britten <britten.bryan@gmail.com>
Date	2013-09-18 04:42 -0700
Message-ID	<a8549e34-648f-4c74-9879-c8b46f6a2ab3@googlegroups.com>
In reply to	#54344

Peter nailed it. Adding in the two lines of code to ensure I was just working with *.csv files fixed the problem. Thanks to everyone for the help and suggestions on best practices.

[toc] | [prev] | [next] | [standalone]

#54405

From	Dennis Lee Bieber <wlfraed@ix.netcom.com>
Date	2013-09-18 18:47 -0400
Message-ID	<mailman.140.1379544469.18130.python-list@python.org>
In reply to	#54344

On Tue, 17 Sep 2013 18:42:21 -0700 (PDT), Bryan Britten
<britten.bryan@gmail.com> declaimed the following:

>Hey, gang, I've got a problem here that I'm sure a handful of you will know how to solve. I've got about 6 *.csv files that I am trying to open; change the header names (to get rid of spaces); add two new columns, which are just the results of a string.split() command; drop the column I just split; and then finally export to *.txt files. Here's the code I'm using:
>
>import os
>import csv
>
>
>fileHandle = 'Path/To/Data'

	That seems a misnomer... It is not a "file handle" of any sort, just a
base path.

>varNames = 'ID\tCaseNum\tDate\tTime\tBlock\tIUCR\tPrimaryType\tDescription\tLocDesc\tArrest\tDomestic\tBeat\tDistrict\tWard\tCommArea\tFBICode\tXCoord\tYCoord\tYear\tUpdatedOn\tLat\tLong\tLoc\n'
>
>for csvFile in os.listdir(fileHandle):
>    outFile = open(fileHandle + os.path.splitext(csvFile)[0] + '.txt', 'w')

	Note that your "fileHandle" does not end with a / -- so I have no idea
what types of names you are trying to open...

	Recommend you use

	os.path.join(fileHande, os.path.splitext(csvFile)[0] + ".txt")


>    inFile = open(fileHandle + csvFile, 'rb')

	ditto

>    reader = csv.reader(inFile, delimiter=',')
>    rowNum = 0
>    for row in reader:
>        if rowNum == 0:
>            outFile.write(varNames)
>            rowNum += 1
>        else:
>            date, time = row[2].split()
>            row.insert(3, date)
>            row.insert(4, time)
>            row.remove(row[2])
>            outFile.write('\t'.join(row) + '\n')

	Apparently your "text file" is a tab-separated file...

	The CSV module is capable of processing TSV just as easily as CSV.

	I'd also drop the whole confusion of rowNum


-=-=-=-=- pseudo-code, not runnable

	fin = open(os.path.join(basepath, filename), "rb")
	csvin = csv.reader(fin, delimiter=",")
	fout = open(os.path.join(basepath, otherfilename), "wb")
	csvout = csv.writer(fout, delimiter="\t")

	junk = reader.next()	#skip header
	csvout.write(["ID", "caseNum", ... ])

	for row in reader:
		out = row[:2] 
		out.extend(row[2].split())
		out.extend(row[3:])
		csvout.write(row)

	fout.close()
	fin.close()
	del csvout	#just paranoia, free up structures
	del csvin	#ditto

-=-=-=-=-

-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
    wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/

[toc] | [prev] | [standalone]

csiph-web

*.csv to *.txt after adding columns

Contents

#54344 — *.csv to *.txt after adding columns

#54347

#54348

#54362

#54360

#54361

#54373

#54405

.csv to .txt after adding columns

#54344 — .csv to .txt after adding columns