


Re: Some help in refining this regex for CSV files

Date Thu, 06 Dec 2012 08:27:21 -0600
From Tim Chase <python.list@tim.thechases.com>
To Oltmans <rolf.oltmans@gmail.com>
Subject Re: Some help in refining this regex for CSV files
Newsgroups comp.lang.python
Message-ID <mailman.560.1354803981.29569.python-list@python.org>


On 12/06/12 01:21, Oltmans wrote:
> Hi guys,
> 
> I've to deal with CSVs that look like following
> 
> CSV (with one header and 3 legit rows where each legit row has 3 columns)
> ----
> Some info
> Date: 12/6/2012
> Author: Some guy
> Total records: 100
> 
> header1, header2, header3
> one, two, three
> one, "Python is great, so are other languages, isn't ?", three
> one, two, 'some languages, are realyl beautiful\r\n, I really cannot deny \n this \t\t\t fact. \t\t\t\tthis fact alone is amazing'
> ----
> 
> So inside this CSV, there will always be bad lines like the top 4 (they could end up in the beginning, in the middle and even in the last). So above sample, csv has 3 legit lines and a header. I want to read those three lines and here is a regex that I came up with (which clearly isn't working)
> 
>     #print line
>     pattern = r"([^\t]+\t|,+)"
>     matches = re.match(pattern, line) 
> 
> Do you've any better ideas guys? I will really appreciate all help.

I agree with Mark that using the "csv" module will likely be your
easiest way to go.  Either consume the lines you don't want before
passing the file to csv.reader(), or parse every line and discard
the invalid rows.  The first could be done something like

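  import csv

  # newline="" is what the csv module expects for files in Python 3
  with open("data.csv", newline="") as f:
      # consume the preamble, up to and including the first blank line
      for line in f:
          if not line.rstrip("\r\n"):
              break
      r = csv.reader(f)
      for row in r:
          print(repr(row))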

The latter might be done something like

  import csv

  with open("data.csv", newline="") as f:
      r = csv.reader(f)
      for row in r:
          # discard anything that isn't a 3-column row
          if len(row) != 3:
              continue
          print(repr(row))
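
For anyone who wants to try both approaches without a file on disk, here is a minimal sketch that runs them against an in-memory sample. The SAMPLE text and the two helper names are made up for illustration (modeled on the data in the question), and skipinitialspace=True is an assumption added here because the sample puts a space after each comma, which otherwise stops the reader from treating the following quote as the start of a quoted field.

```python
import csv
import io

# Hypothetical in-memory stand-in for data.csv, modeled on the question's sample.
SAMPLE = (
    "Some info\n"
    "Date: 12/6/2012\n"
    "Author: Some guy\n"
    "Total records: 100\n"
    "\n"
    "header1, header2, header3\n"
    "one, two, three\n"
    'one, "Python is great, so are other languages, isn\'t ?", three\n'
)

def rows_after_blank_line(f):
    """Skip everything up to and including the first blank line, then parse."""
    for line in f:
        if not line.rstrip("\r\n"):
            break
    # skipinitialspace=True lets a quote that follows ", " open a quoted field
    return list(csv.reader(f, skipinitialspace=True))

def rows_with_n_columns(f, n=3):
    """Parse every line and keep only rows with exactly n columns."""
    return [row for row in csv.reader(f, skipinitialspace=True) if len(row) == n]

first = rows_after_blank_line(io.StringIO(SAMPLE))
second = rows_with_n_columns(io.StringIO(SAMPLE))
print(first)
print(second)
```

On this sample both helpers should return the header plus the two data rows. The column-count filter is the more forgiving of the two when junk lines can show up in the middle or at the end of the file, since it does not depend on a blank line separating the preamble from the data.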

However, I also noticed that your example file doesn't seem to fit a
true csv file definition, as you seem to switch quoting notations,
sometimes using single, sometimes using double quotes.
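
A small sketch of why the mixed quoting matters, using two made-up rows in the style of the sample: csv.reader honors one quotechar at a time, so whichever quote style it is not configured for gets split on the embedded commas. The row contents and variable names here are illustrative only.

```python
import csv

# Two rows modeled on the sample: one quoted with ", one quoted with '.
lines = [
    'one, "a, b", three',
    "one, two, 'c, d'",
]

# Default dialect: only " is treated as a quote character.
default_rows = list(csv.reader(lines, skipinitialspace=True))

# Same input, but telling the reader that ' is the quote character.
single_rows = list(csv.reader(lines, skipinitialspace=True, quotechar="'"))

print(default_rows)
print(single_rows)
```

With either setting, one of the two rows comes back with four fields instead of three, so a len(row) != 3 filter would silently drop it. If the real files do mix quote styles, they would need normalizing before the csv module can read every row correctly.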

-tkc




Thread

Some help in refining this regex for CSV files Oltmans <rolf.oltmans@gmail.com> - 2012-12-05 23:21 -0800
  Re: Some help in refining this regex for CSV files Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-12-06 07:57 +0000
  Re: Some help in refining this regex for CSV files Tim Chase <python.list@tim.thechases.com> - 2012-12-06 08:27 -0600
