Groups > comp.lang.python > #32640 > unrolled thread

csv read clean up and write out to csv

Started by	Sacha Rook <sacharook@gmail.com>
First post	2012-11-02 10:25 -0700
Last post	2012-11-02 20:51 +0100
Articles	3 — 3 participants

Back to article view | Back to comp.lang.python

  csv read clean up and write out to csv Sacha Rook <sacharook@gmail.com> - 2012-11-02 10:25 -0700
    Re: csv read clean up and write out to csv Neil Cerutti <neilc@norwich.edu> - 2012-11-02 19:33 +0000
    Re: csv read clean up and write out to csv Hans Mulder <hansmu@xs4all.nl> - 2012-11-02 20:51 +0100

#32640 — csv read clean up and write out to csv

From	Sacha Rook <sacharook@gmail.com>
Date	2012-11-02 10:25 -0700
Subject	csv read clean up and write out to csv
Message-ID	<10133786-7f4c-4b19-8678-a50781ff453d@googlegroups.com>

Hi

I have a problem with a csv file from a supplier, so they export data to csv however the last column in the record is a description which is marked up with html.

trying to automate the processing of this csv to upload elsewhere in a useable format. If i open the csv with csved it looks like all the records aren't escaped correctly as after a while i find html tags and text on the next line/record.

If I 'openwith' excel the description stays on the correct line/record?

I want to use python to read these records in and output a valid csv with the descriptions intact preferably without the html tags so a string of text formatted with newline/CR where appropriate.

So far I have this but don't know where to go from here can someone help me?

import csv

infile = open('c:\data\input.csv', 'rb')
outfile = open('c:\data\output.csv', 'wb')

reader = csv.reader(infile)
writer = csv.writer(outfile)


for line in reader:
    print line
    writer.writerow(line)


The input.csv is set out as follows;
HEADER ROW 1st
"FileDate","ProductID","Name","StandardPrice","DropshipPrice","SRP","Brand","Xline","InStock","Stock","Barcode","Weight","CategoryID","Category","SmallImage","LargeImage","Description"

A COMPLETE RECORD LOOKS LIKE THIS WITH THE DESCRIPTION FIELD POPULATED SOME RECORDS DON'T HAVE THE DESCRIPTION FIELD POPULATED

"2012-11-01T18:28:45.25+00:00","10198","(Venom) PS2 DVD Remote Control (Black)","3.7800","4.3500","12.9800","Venom","true","In Stock","1","5031300025009","200","1339","PC/Games_Console / Playstation / PS2 / Remote Controls","http://www.atssitecentre.co.uk/images/products/10000/10198.gif","http://www.atssitecentre.co.uk/images/products/10000/10198f.jpg","Never have to unplug your joypad / DVD user friendly / Works up to 30 feet from PS/2 / IR wireless technology."

THIS IS AN EXAMPLE OF THE BAD RECORD CAUSING PROBLEMS, THE DESCRIPTION FIELD STARTS ""features:</p>
AS YOU CAN SEE CONTAINS HTML BUT BECAUSE OF THIS FORMAT SUBSEQUENT HTML TAGS ARE
ADDED AS NEW RECORDS ON SUBSEQUENT LINES. 

"2012-11-01T18:28:45.25+00:00","11116","3.5 inch Disk Drive Lock","2.9500","2.9500","9.9500","None","true","In Stock","3","077511994166","131","1332","PC/Games_Console / PC Security / General","http://www.atssitecentre.co.uk/images/products/11000/11116.gif","http://www.atssitecentre.co.uk/images/products/11000/11116f.jpg","features:</p>
<ul>
<li>3 1/2&quot; FDD Lock.</li>
<li>Die casting housing and cylinder chrome plated.</li>
<li>Lock Cover : PBT + GF 15%. (PLASTIC)</li>
<li>2 Keys supplied per lock. <br /></li>
</ul>"

I know I am far from complete but don't know how to proceed :-)
As I said I want to reconstruct a clean record either strip out the html tags or 
at least escape the records appropriately..

Thanks all

[toc] | [next] | [standalone]

#32652

From	Neil Cerutti <neilc@norwich.edu>
Date	2012-11-02 19:33 +0000
Message-ID	<afilglFndcnU1@mid.individual.net>
In reply to	#32640

On 2012-11-02, Sacha Rook <sacharook@gmail.com> wrote:
> Hi
>
> I have a problem with a csv file from a supplier, so they
> export data to csv however the last column in the record is a
> description which is marked up with html.
>
> trying to automate the processing of this csv to upload
> elsewhere in a useable format. If i open the csv with csved it
> looks like all the records aren't escaped correctly as after a
> while i find html tags and text on the next line/record.

Maybe compose a simple parter to disambiguate the lines from the
file.

Something like (you'll have to write is_html, and my Python 2 is
mighty rusty, you'll have to fix up. Note that infile doesn't
have to be in binary mode with this scheme, but it would fail on
bizarre newlines in the file):

def parse_records(iter):
    for line in iter:
        if is_html(line):
	    yield ('html', line)
	else:
	    yield ('csv', csv.reader([line.strip()]).next())

infile = open('c:\data\input.csv')
outfile = open('c:\data\output.csv', 'wb')

writer = csv.writer(outfile)

for tag, rec in parse_record(infile):
    if tag == 'html':
        print rec
    elif tag == 'csv':
        writer.writerow(rec)
    else:
        raise ValueError("Unknown record type %s" % tag)

-- 
Neil Cerutti

[toc] | [prev] | [next] | [standalone]

#32653

From	Hans Mulder <hansmu@xs4all.nl>
Date	2012-11-02 20:51 +0100
Message-ID	<5094245b$0$6847$e4fe514c@news2.news.xs4all.nl>
In reply to	#32640

On 2/11/12 18:25:09, Sacha Rook wrote:
> I have a problem with a csv file from a supplier, so they export data to csv
> however the last column in the record is a description which is marked up
> with html.
> 
> trying to automate the processing of this csv to upload elsewhere in a
> useable format. If i open the csv with csved it looks like all the records
> aren't escaped correctly as after a while i find html tags and text on the
> next line/record.

The example line you gave was correctly escaped: the description starts
with a double quote, and ends several lines later with another double
quote.  Double quotes in the HTML are represented by '&quot;'.

Maybe csved doesn't recognize this escape convention?

> If I 'openwith' excel the description stays on the correct line/record?

Excel implements this convention

> I want to use python to read these records in and output a valid csv with
> the descriptions intact preferably without the html tags so a string of
> text formatted with newline/CR where appropriate.

How about this:

import csv

infile = file("input.csv", "rb")
outfile = file("output.csv", "wb")

reader = csv.reader(infile)
writer = csv.writer(outfile)

for line in reader:
    line[-1] = line[-1].replace("\n", " ")
    print line
    writer.writerow(line)

infile.close()
outfile.close()

That will replace the newlines inside the HTML, which your csved
doesn't seem to recognize, by spaces.  When viewed as HTML code,
spaces have the same effect as newlines, so this replacement
shouldn't alter the meaning of the HTML text.

Hope this helps,

-- HansM

[toc] | [prev] | [standalone]

csiph-web

csv read clean up and write out to csv

Contents

#32640 — csv read clean up and write out to csv

#32652

#32653