


Re: Subsetting a dataset

From Terry Reedy <tjreedy@udel.edu>
Subject Re: Subsetting a dataset
Date Mon, 13 Jun 2011 02:21:39 -0400
Newsgroups comp.lang.python
Message-ID <mailman.162.1307946113.11593.python-list@python.org>



On 6/13/2011 12:53 AM, Kumar Mainali wrote:
> I have a huge dataset containing millions of rows and several dozen
> columns in a tab delimited text file.  I need to extract a small subset
> of rows and only three columns. One of the three columns has two word
> string with header “Scientific Name”. The other two columns carry
> numbers for Longitude and Latitude, as below.
>
> Sci Name    Longitude    Latitude    Column4
> Gen sp1     82.5         28.4        …
> Gen sp2     45.9         29.7        …
> Gen sp1     57.9         32.9        …
> …           …            …           …
>
> Of the many species listed under the column “Sci Name”, I am interested
> in only one species which will have multiple records interspersed in the
> millions of rows, and I will probably have to use filename.readline() to
> read the rows one at a time. How would I search for a particular species
> in the dataset and create a new dataset for the species with only the
> three columns?
>
> Next, I have to create such datasets for hundreds of species. All these
> species are listed in another text file. There must be a way to define
> an iterative function that looks at one species at a time in the list of
> species and creates separate dataset for each species. The huge dataset
> contains more species than those listed in the list of my interest.

Consider using a real database program with Sci_name indexed. Then you
can extract the rows for any species as needed. You should only need
separate files if you want to export them or to split the database more
or less permanently. You could try sqlite, which comes with Python, or
one of the other free database programs.
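
A minimal sketch of the sqlite route, using only the standard-library
csv and sqlite3 modules. The table name "records", the index name, the
function names, and the header strings "Sci Name", "Longitude" and
"Latitude" are assumptions made for illustration; adjust them to match
the real file.

```python
import csv
import sqlite3


def load_records(conn, tsv_path):
    """Copy the three needed columns of a tab-delimited file into SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records "
        "(sci_name TEXT, longitude REAL, latitude REAL)"
    )
    with open(tsv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        # A generator feeds rows one at a time, so the whole file is
        # never held in memory.
        rows = (
            (r["Sci Name"], float(r["Longitude"]), float(r["Latitude"]))
            for r in reader
        )
        conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
    # Build the index after the bulk insert; lookups by species use it.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_sci_name ON records (sci_name)"
    )
    conn.commit()


def rows_for_species(conn, species):
    """Return (sci_name, longitude, latitude) tuples for one species."""
    cur = conn.execute(
        "SELECT sci_name, longitude, latitude FROM records "
        "WHERE sci_name = ?",
        (species,),
    )
    return cur.fetchall()


def export_species(conn, species, out_path):
    """Write one species' rows to a tab-delimited file (only if needed)."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["Sci Name", "Longitude", "Latitude"])
        writer.writerows(rows_for_species(conn, species))
```

Typical use would be to open a connection with
sqlite3.connect("species.db"), call load_records() once, and then call
rows_for_species() or export_species() for each name read from the
species list file.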

-- 
Terry Jan Reedy
