Groups > comp.lang.python > #37119 > unrolled thread

RE Help splitting CVS data

Started by	Garry <ggkraemer@gmail.com>
First post	2013-01-20 14:04 -0800
Last post	2013-01-21 14:12 +0000
Articles	10 — 9 participants


  RE Help splitting CVS data Garry <ggkraemer@gmail.com> - 2013-01-20 14:04 -0800
    Re: RE Help splitting CVS data Mitya Sirenef <msirenef@lightbird.net> - 2013-01-20 17:14 -0500
    Re: RE Help splitting CVS data Terry Reedy <tjreedy@udel.edu> - 2013-01-20 17:16 -0500
    Re: Help splitting CVS data Dave Angel <d@davea.name> - 2013-01-20 17:21 -0500
    Re: RE Help splitting CVS data Roy Smith <roy@panix.com> - 2013-01-20 19:00 -0500
    Re: RE Help splitting CVS data Tim Chase <python.list@tim.thechases.com> - 2013-01-20 18:10 -0600
    Re: RE Help splitting CVS data Garry <ggkraemer@gmail.com> - 2013-01-20 16:41 -0800
      Re: RE Help splitting CVS data Chris Angelico <rosuav@gmail.com> - 2013-01-21 12:30 +1100
      Re: RE Help splitting CVS data Alister <alister.ware@ntlworld.com> - 2013-01-21 08:28 +0000
      Re: RE Help splitting CVS data Neil Cerutti <neilc@norwich.edu> - 2013-01-21 14:12 +0000

#37119 — RE Help splitting CVS data

From	Garry <ggkraemer@gmail.com>
Date	2013-01-20 14:04 -0800
Subject	RE Help splitting CVS data
Message-ID	<3e1e8567-b9f4-446a-8a59-75f45367d2ac@googlegroups.com>

I'm trying to manipulate family tree data using Python.
I'm using linux and Python 2.7.3 and have data files saved as Linux formatted cvs files
The data appears in this format:

Marriage,Husband,Wife,Date,Place,Source,Note0x0a
Note: the Source field or the Note field can contain quoted data (same as the Place field)

Actual data:
[F0244],[I0690],[I0354],1916-06-08,"Neely's Landing, Cape Gir. Co, MO",,0x0a
[F0245],[I0692],[I0355],1919-09-04,"Cape Girardeau Co, MO",,0x0a

code snippet follows:

import os
import re
#I'm using the following regex in an attempt to decode the data:
RegExp2 = "^(\[[A-Z]\d{1,}\])\,(\[[A-Z]\d{1,}\])\,(\[[A-Z]\d{1,}\])\,(\d{,4}\-\d{,2}\-\d{,2})\,(.*|\".*\")\,(.*|\".*\")\,(.*|\".*\")"
#
line = "[F0244],[I0690],[I0354],1916-06-08,\"Neely's Landing, Cape Gir. Co, MO\",,"
#
(Marriage,Husband,Wife,Date,Place,Source,Note) = re.split(RegExp2,line)
#
#However, this does not decode the 7 fields.
# The following error is displayed:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: too many values to unpack
#
# When I use xx the fields apparently get unpacked.
xx = re.split(RegExp2,line)
#
>>> print xx[0]

>>> print xx[1]
[F0244]
>>> print xx[5]
"Neely's Landing, Cape Gir. Co, MO"
>>> print xx[6]

>>> print xx[7]

>>> print xx[8]

Why is there an extra NULL field before and after my record contents?
I'm stuck, comments and solutions greatly appreciated.

Garry
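
On the empty first and last elements: a likely explanation is that re.split treats the pattern as a separator. When the pattern contains capturing groups, the captured text is included in the result, together with the text before and after the match. Here the pattern matches the whole line, so both of those pieces are empty strings and the list has nine items rather than seven. A minimal sketch, reusing the pattern from the post as a raw string and swapping in re.match (a substitution for illustration, not something the post uses):

```python
import re

# Pattern copied from the post, written as a raw string.
RegExp2 = r'^(\[[A-Z]\d{1,}\])\,(\[[A-Z]\d{1,}\])\,(\[[A-Z]\d{1,}\])\,(\d{,4}\-\d{,2}\-\d{,2})\,(.*|\".*\")\,(.*|\".*\")\,(.*|\".*\")'
line = "[F0244],[I0690],[I0354],1916-06-08,\"Neely's Landing, Cape Gir. Co, MO\",,"

# re.split returns the text before the match, the seven captured groups,
# and the text after the match.
parts = re.split(RegExp2, line)
print(len(parts))

# re.match(...).groups() returns only the captured groups, so unpacking works.
match = re.match(RegExp2, line)
(Marriage, Husband, Wife, Date, Place, Source, Note) = match.groups()
print(Place)
```

Note that Place still carries its surrounding double quotes here, which is one more reason a CSV parser is the easier route for this data.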


#37120

From	Mitya Sirenef <msirenef@lightbird.net>
Date	2013-01-20 17:14 -0500
Message-ID	<mailman.707.1358720081.2939.python-list@python.org>
In reply to	#37119

Gosh, you really don't want to use regex to split csv lines like that....

Use csv module:

 >>> s
'[F0244],[I0690],[I0354],1916-06-08,"Neely\'s Landing, Cape Gir. Co, MO",,0x0a'
 >>> import csv
 >>> r = csv.reader([s])
 >>> for l in r: print(l)
...
['[F0244]', '[I0690]', '[I0354]', '1916-06-08', "Neely's Landing, Cape Gir. Co, MO", '', '0x0a']


The argument to csv.reader can be a file object (or a list of lines).
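
A minimal sketch of the list-of-lines form, unpacking each row into the seven names from the original post. It assumes the "0x0a" in the sample data is only notation for the newline byte, so it is left out of the strings; the `records` list is a hypothetical name added for illustration.

```python
import csv

# Two sample records from the post, without the 0x0a marker
# (assumed to stand for the trailing newline).
lines = [
    '[F0244],[I0690],[I0354],1916-06-08,"Neely\'s Landing, Cape Gir. Co, MO",,',
    '[F0245],[I0692],[I0355],1919-09-04,"Cape Girardeau Co, MO",,',
]

records = []
# csv.reader accepts any iterable of strings; an open file works the same way.
for row in csv.reader(lines):
    # Each row is a list of seven strings; quoted commas stay inside one field.
    Marriage, Husband, Wife, Date, Place, Source, Note = row
    records.append(row)
    print("%s: %s" % (Marriage, Place))
```

The unpacking raises ValueError for any row that does not have exactly seven fields, which can be a useful early check on the data.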

  - mitya


-- 
Lark's Tongue Guide to Python: http://lightbird.net/larks/


#37122

From	Terry Reedy <tjreedy@udel.edu>
Date	2013-01-20 17:16 -0500
Message-ID	<mailman.709.1358720389.2939.python-list@python.org>
In reply to	#37119

On 1/20/2013 5:04 PM, Garry wrote:
> I'm trying to manipulate family tree data using Python.
> I'm using linux and Python 2.7.3 and have data files saved as Linux formatted cvs files
...
> I'm stuck, comments and solutions greatly appreciated.

Why are you not using the cvs module?

-- 
Terry Jan Reedy


#37123 — Re: Help splitting CVS data

From	Dave Angel <d@davea.name>
Date	2013-01-20 17:21 -0500
Subject	Re: Help splitting CVS data
Message-ID	<mailman.710.1358720516.2939.python-list@python.org>
In reply to	#37119

On 01/20/2013 05:04 PM, Garry wrote:
> I'm trying to manipulate family tree data using Python.
> I'm using linux and Python 2.7.3 and have data files saved as Linux formatted cvs files
> The data appears in this format:
>
> Marriage,Husband,Wife,Date,Place,Source,Note0x0a
> Note: the Source field or the Note field can contain quoted data (same as the Place field)
>
> Actual data:
> [F0244],[I0690],[I0354],1916-06-08,"Neely's Landing, Cape Gir. Co, MO",,0x0a
> [F0245],[I0692],[I0355],1919-09-04,"Cape Girardeau Co, MO",,0x0a
>
> code snippet follows:
>
> import os
> import re
> #I'm using the following regex in an attempt to decode the data:
> RegExp2 = "^(\[[A-Z]\d{1,}\])\,(\[[A-Z]\d{1,}\])\,(\[[A-Z]\d{1,}\])\,(\d{,4}\-\d{,2}\-\d{,2})\,(.*|\".*\")\,(.*|\".*\")\,(.*|\".*\")"
> #

Well, you lost me about there.  For a csv file, why not use the csv module:


import csv

ifile  = open('test.csv', "rb")
reader = csv.reader(ifile)


For reference, see http://docs.python.org/2/library/csv.html

and for sample use and discussion, see

http://www.linuxjournal.com/content/handling-csv-files-python
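
A hedged sketch of reading such a file by column name with csv.DictReader. It is written for Python 3; on the Python 2.7 used in this thread the file would be opened with "rb" as in the snippet above. The temporary file, the FIELDS list and the assumption that the file has no header row are illustrative, not taken from the original data.

```python
import csv
import os
import tempfile

# Column names from the format line in the original post.
FIELDS = ["Marriage", "Husband", "Wife", "Date", "Place", "Source", "Note"]

# Write two sample records to a temporary file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "test.csv")
with open(path, "w") as ofile:
    ofile.write('[F0244],[I0690],[I0354],1916-06-08,"Neely\'s Landing, Cape Gir. Co, MO",,\n')
    ofile.write('[F0245],[I0692],[I0355],1919-09-04,"Cape Girardeau Co, MO",,\n')

# Python 3 style open; newline="" is what the csv documentation recommends.
with open(path, newline="") as ifile:
    # Passing fieldnames means the first line is treated as data, not a header.
    rows = list(csv.DictReader(ifile, fieldnames=FIELDS))

for row in rows:
    print("%s %s %s" % (row["Marriage"], row["Date"], row["Place"]))
```

If the real file does start with a header line, dropping the fieldnames argument lets DictReader take the names from that line instead.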

-- 
DaveA


#37126

From	Roy Smith <roy@panix.com>
Date	2013-01-20 19:00 -0500
Message-ID	<roy-DB743D.19005020012013@news.panix.com>
In reply to	#37119

In article <3e1e8567-b9f4-446a-8a59-75f45367d2ac@googlegroups.com>,
 Garry <ggkraemer@gmail.com> wrote:

> Actual data:
> [F0244],[I0690],[I0354],1916-06-08,"Neely's Landing, Cape Gir. Co, MO",,0x0a
> [F0245],[I0692],[I0355],1919-09-04,"Cape Girardeau Co, MO",,0x0a
> 
> code snippet follows:
> 
> import os
> import re
> #I'm using the following regex in an attempt to decode the data:

First suggestion, don't try to parse CSV data with regex.  I'm a huge 
regex fan, but it's just the wrong tool for this job.  Use the built-in 
csv module (http://docs.python.org/2/library/csv.html).  Or, if you want 
something fancier, read_csv() from pandas (http://tinyurl.com/ajxdxjm).

Second, when you use regexes, *always* use raw strings around the 
pattern:

RegExp2 = r'....'

Lastly, take a look at the re.VERBOSE flag.  It lets you write monster 
regexes split up into several lines.  Between re.VERBOSE and raw 
strings, it can make the difference between line noise like this:

> RegExp2 = "^(\[[A-Z]\d{1,}\])\,(\[[A-Z]\d{1,}\])\,(\[[A-Z]\d{1,}\])\,(\d{,4}\-\d{,2}\-\d{,2})\,(.*|\".*\")\,(.*|\".*\")\,(.*|\".*\")"

and something that mere mortals can understand.
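
As a rough illustration of that combination, here is one way the pattern might look with a raw string and re.VERBOSE. This is a sketch, not a drop-in replacement: the field patterns are tightened guesses (for example, a fixed YYYY-MM-DD date and `[^,]*` for unquoted fields), and the name RECORD is hypothetical.

```python
import re

# In VERBOSE mode, whitespace in the pattern is ignored and '#' starts a comment.
RECORD = re.compile(r'''
    ^
    (\[[A-Z]\d+\]) ,          # marriage id, e.g. [F0244]
    (\[[A-Z]\d+\]) ,          # husband id
    (\[[A-Z]\d+\]) ,          # wife id
    (\d{4}-\d{2}-\d{2}) ,     # date as YYYY-MM-DD (assumed fixed width)
    ("[^"]*" | [^,]*) ,       # place, optionally double-quoted
    ("[^"]*" | [^,]*) ,       # source, optionally double-quoted
    ("[^"]*" | [^,]*)         # note, optionally double-quoted
    $
''', re.VERBOSE)

line = '[F0244],[I0690],[I0354],1916-06-08,"Neely\'s Landing, Cape Gir. Co, MO",,'

match = RECORD.match(line)
fields = match.groups() if match else None
print(fields)
```

Even in this form the regex does not handle doubled quotes inside quoted fields, so the csv module remains the safer choice for real data.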


#37127

From	Tim Chase <python.list@tim.thechases.com>
Date	2013-01-20 18:10 -0600
Message-ID	<mailman.712.1358726924.2939.python-list@python.org>
In reply to	#37119

On 01/20/13 16:16, Terry Reedy wrote:
> On 1/20/2013 5:04 PM, Garry wrote:
>> I'm trying to manipulate family tree data using Python.
>> I'm using linux and Python 2.7.3 and have data files saved as Linux formatted cvs files
> ...
>> I'm stuck, comments and solutions greatly appreciated.
>
> Why are you not using the cvs module?

That's an easy answer:

 >>> import cvs
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
ImportError: No module named cvs


Now the *csv* module... ;-)

-tkc


#37129

From	Garry <ggkraemer@gmail.com>
Date	2013-01-20 16:41 -0800
Message-ID	<9e90bf6e-9072-4747-917f-0291b1ae9a05@googlegroups.com>
In reply to	#37119

Thanks everyone for your comments.  I'm new to Python, but can get around in Perl and regular expressions.  I sure was taking the long way trying to get the cvs data parsed.  

Sure hope to teach myself python.  Maybe I need to look into courses offered at the local Jr College!

Garry


#37134

From	Chris Angelico <rosuav@gmail.com>
Date	2013-01-21 12:30 +1100
Message-ID	<mailman.715.1358731819.2939.python-list@python.org>
In reply to	#37129

On Mon, Jan 21, 2013 at 11:41 AM, Garry <ggkraemer@gmail.com> wrote:
> Thanks everyone for your comments.  I'm new to Python, but can get around in Perl and regular expressions.  I sure was taking the long way trying to get the cvs data parsed.

As has been hinted by Tim, you're actually talking about csv data -
Comma Separated Values. Not to be confused with cvs, an old vcs. (See?
The v can go anywhere...) Not a big deal, but it's much easier to find
stuff on PyPI or similar when you have the right keyword to search
for!

ChrisA


#37168

From	Alister <alister.ware@ntlworld.com>
Date	2013-01-21 08:28 +0000
Message-ID	<0%6Ls.1$J74.0@fx31.fr7>
In reply to	#37129

On Sun, 20 Jan 2013 16:41:12 -0800, Garry wrote:

> Thanks everyone for your comments.  I'm new to Python, but can get
> around in Perl and regular expressions.  I sure was taking the long way
> trying to get the cvs data parsed.
> 
> Sure hope to teach myself python.  Maybe I need to look into courses
> offered at the local Jr College!
> 
> Garry

Don't waste time at college (at least not yet); there are many good
tutorials available on the web.
As you are already an experienced programmer, "Dive into Python" would not
be a bad start, and the official Python tutorial is also one not to miss.




-- 
"Just think of a computer as hardware you can program."
-- Nigel de la Tierre


#37195

From	Neil Cerutti <neilc@norwich.edu>
Date	2013-01-21 14:12 +0000
Message-ID	<am50lmF4146U1@mid.individual.net>
In reply to	#37129

On 2013-01-21, Garry <ggkraemer@gmail.com> wrote:
> Thanks everyone for your comments.  I'm new to Python, but can
> get around in Perl and regular expressions.  I sure was taking
> the long way trying to get the cvs data parsed.  
>
> Sure hope to teach myself python.  Maybe I need to look into
> courses offered at the local Jr College!

There are more than enough free resources online for the
resourceful Perl programmer to get going. It sounds like you
might be interested in Text Processing in Python.

http://gnosis.cx/TPiP/

Also good for your purposes is Dive Into Python.

http://www.diveintopython.net/

-- 
Neil Cerutti

