Groups > comp.lang.python > #34366 > unrolled thread

Some help in refining this regex for CSV files

Started by	Oltmans <rolf.oltmans@gmail.com>
First post	2012-12-05 23:21 -0800
Last post	2012-12-06 08:27 -0600
Articles	3 — 3 participants

#34366 — Some help in refining this regex for CSV files

From	Oltmans <rolf.oltmans@gmail.com>
Date	2012-12-05 23:21 -0800
Subject	Some help in refining this regex for CSV files
Message-ID	<374d15bd-b20c-431b-b9bb-37ec0b1f4df3@googlegroups.com>

Hi guys,

I have to deal with CSVs that look like the following:

CSV (with one header and 3 legit rows where each legit row has 3 columns)
----
Some info
Date: 12/6/2012
Author: Some guy
Total records: 100

header1, header2, header3
one, two, three
one, "Python is great, so are other languages, isn't ?", three
one, two, 'some languages, are realyl beautiful\r\n, I really cannot deny \n this \t\t\t fact. \t\t\t\tthis fact alone is amazing'
----

So inside this CSV there will always be bad lines like the top 4 (they could end up at the beginning, in the middle, or even at the end). The sample above has 3 legit lines and a header. I want to read those three lines, and here is a regex that I came up with (which clearly isn't working):

    #print line
    pattern = r"([^\t]+\t|,+)"
    matches = re.match(pattern, line) 

Do you have any better ideas, guys? I'd really appreciate any help.

#34367

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2012-12-06 07:57 +0000
Message-ID	<mailman.547.1354780692.29569.python-list@python.org>
In reply to	#34366

On 06/12/2012 07:21, Oltmans wrote:
> [...]

I'd simply use the csv module from the standard library to read your 
files, discarding anything that you regard as bad.  I'd certainly not 
use a regex for this.
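
A minimal sketch of that idea, assuming Python 3 and assuming a "legit" row is any line that parses into exactly three fields. The name `legit_rows`, the `width` parameter and the in-memory `SAMPLE` are made up for illustration; they are not from the original post.

```python
import csv
import io

# Hypothetical sample mirroring the file described in the question.
SAMPLE = """\
Some info
Date: 12/6/2012
Author: Some guy
Total records: 100

header1, header2, header3
one, two, three
one, "Python is great, so are other languages, isn't ?", three
"""

def legit_rows(lines, width=3):
    """Yield only the rows that parse into exactly `width` fields."""
    # skipinitialspace=True lets a quote that follows ", " open a quoted field.
    reader = csv.reader(lines, skipinitialspace=True)
    for row in reader:
        if len(row) == width:
            yield row

rows = list(legit_rows(io.StringIO(SAMPLE)))
for row in rows:
    print(row)
```

Note that a junk line which happens to contain exactly two commas would slip through this filter, so the field count is only a rough test.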

-- 
Cheers.

Mark Lawrence.

#34392

From	Tim Chase <python.list@tim.thechases.com>
Date	2012-12-06 08:27 -0600
Message-ID	<mailman.560.1354803981.29569.python-list@python.org>
In reply to	#34366

On 12/06/12 01:21, Oltmans wrote:
> [...]

I agree with Mark that using the "csv" module will likely be your
easiest way to go.  Either consume the lines you don't want before
passing the file to csv.reader(), or parse every line and discard
the invalid rows.  The first could be done something like

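  import csv
  # Python 3: open in text mode with newline="", as the csv docs recommend
  with open("data.csv", newline="") as f:
      # consume the preamble, up to and including the first blank line
      for line in f:
          if not line.rstrip("\r\n"): break
      # skipinitialspace lets a quote that follows ", " open a quoted field
      r = csv.reader(f, skipinitialspace=True)
      for row in r:
          print(repr(row))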

The latter might be done something like

  import csv
  with open("data.csv", newline="") as f:
      r = csv.reader(f, skipinitialspace=True)
      for row in r:
          # junk lines do not split into exactly 3 fields
          if len(row) != 3: continue
          print(repr(row))

However, I also noticed that your example file doesn't seem to fit a
true csv file definition, as you seem to switch quoting notations,
sometimes using single, sometimes using double quotes.
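
To illustrate why the mixed quoting matters: csv.reader() takes a single quotechar, so a row quoted with the other character gets split on its embedded commas. The sketch below assumes Python 3; `parse` and `parse_either` are hypothetical helpers, and trying each quote character in turn is only one possible workaround (it cannot handle a row that mixes both styles).

```python
import csv

double_quoted = 'one, "a, b", three'
single_quoted = "one, two, 'a, b'"

def parse(line, quotechar):
    # csv.reader accepts any iterable of strings; next() returns the first row.
    reader = csv.reader([line], quotechar=quotechar, skipinitialspace=True)
    return next(reader)

def parse_either(line, width=3):
    """Try each quote character; return the first parse with `width` fields."""
    for quotechar in ('"', "'"):
        row = parse(line, quotechar)
        if len(row) == width:
            return row
    return None  # neither quoting style produced a legit row

print(parse(single_quoted, '"'))    # splits inside the single-quoted field
print(parse_either(single_quoted))  # retried with quotechar="'"
print(parse_either(double_quoted))
```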

-tkc
