Re: Subsetting a dataset

Date Sun, 12 Jun 2011 23:08:05 -0700
Subject Re: Subsetting a dataset
From Benjamin Kaplan <benjamin.kaplan@case.edu>
Newsgroups comp.lang.python

On Sun, Jun 12, 2011 at 9:53 PM, Kumar Mainali <kpmainali@gmail.com> wrote:
> I have a huge dataset containing millions of rows and several dozen columns
> in a tab delimited text file.  I need to extract a small subset of rows and
> only three columns. One of the three columns has two word string with header
> “Scientific Name”. The other two columns carry numbers for Longitude and
> Latitude, as below.
> Sci Name Longitude Latitude Column4
> Gen sp1 82.5 28.4 …
> Gen sp2 45.9 29.7 …
> Gen sp1 57.9 32.9 …
> … … … …
> Of the many species listed under the column “Sci Name”, I am interested in
> only one species which will have multiple records interspersed in the
> millions of rows, and I will probably have to use filename.readline() to
> read the rows one at a time. How would I search for a particular species in
> the dataset and create a new dataset for the species with only the three
> columns?
> Next, I have to create such datasets for hundreds of species. All these
> species are listed in another text file. There must be a way to define an
> iterative function that looks at one species at a time in the list of
> species and creates separate dataset for each species. The huge dataset
> contains more species than those listed in the list of my interest.
> I very much appreciate any help. I am a beginner in Python. So, complete
> code would be more helpful.
> - Kumar

Read in the file with the list of species. For each line in that
list, open an output file and store it in a dictionary where the key
is the species name and the value is the file object. Once all your
output files are created, open the second (data) file. For each line
in it, split on the tabs and check whether the species-name column
is in the dict. If it is, grab the values you need and write them to
the corresponding file. In rough, untested code:

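# placeholder column positions (0-based); adjust to match your file
name_col, lon_col, lat_col = 0, 1, 2

animals = dict()
for line in open('species_list'):
    name = line.strip()
    if name:
        # make a file for that animal and associate it with the name
        animals[name] = open('%s_data.csv' % name, 'w')

# now open the second file
for line in open('animal_data'):
    data = line.rstrip('\n').split('\t')
    if data[name_col] in animals:
        animals[data[name_col]].write('%s\t%s\t%s\n' % (data[name_col],
            data[lat_col], data[lon_col]))

# close the output files so everything is flushed to disk
for f in animals.values():
    f.close()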

replacing the respective file names and column numbers as appropriate,
of course.
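
If it helps, here is the same idea as a slightly fuller sketch. It is
not a definitive implementation, and it makes several assumptions that
are not in the code above: Python 3, a header row in the data file with
the column names from the sample ('Sci Name', 'Longitude', 'Latitude'),
and a made-up function name, split_by_species. It finds the columns by
header name instead of by number, and it only opens an output file the
first time a wanted species actually turns up.

```python
import csv
import os


def split_by_species(data_path, species_path, out_dir,
                     name_header='Sci Name',
                     lon_header='Longitude',
                     lat_header='Latitude'):
    """Write one tab-delimited file per wanted species.

    Returns a dict mapping each wanted species to the number of rows
    written for it. The header names are assumptions taken from the
    sample in the question; change them to match the real file.
    """
    # one species name per line; blank lines are ignored
    with open(species_path) as f:
        wanted = {line.strip() for line in f if line.strip()}

    counts = dict.fromkeys(wanted, 0)
    files = {}     # species name -> open file object
    writers = {}   # species name -> csv writer for that file
    try:
        with open(data_path, newline='') as f:
            reader = csv.reader(f, delimiter='\t')
            header = next(reader)
            # look the columns up by name rather than hard-coding numbers
            name_col = header.index(name_header)
            lon_col = header.index(lon_header)
            lat_col = header.index(lat_header)
            last_col = max(name_col, lon_col, lat_col)

            # the reader yields one row at a time, so the whole
            # dataset never has to fit in memory
            for row in reader:
                if len(row) <= last_col:
                    continue  # skip blank or truncated rows
                name = row[name_col]
                if name not in wanted:
                    continue
                if name not in writers:
                    # open the output file on first use only
                    path = os.path.join(out_dir, '%s_data.csv' % name)
                    files[name] = open(path, 'w', newline='')
                    writers[name] = csv.writer(
                        files[name], delimiter='\t', lineterminator='\n')
                writers[name].writerow([name, row[lon_col], row[lat_col]])
                counts[name] += 1
    finally:
        # close every output file, even if something went wrong
        for out in files.values():
            out.close()
    return counts
```

You would call it as split_by_species('animal_data', 'species_list', '.')
with your real file names. Species that never appear in the data get a
count of 0 and no output file. With hundreds of species this still keeps
one file open per species found, so a very long species list could run
into the operating system's limit on open files.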
