Path: csiph.com!usenet.pasdenom.info!news.redatomik.org!newsfeed.xs4all.nl!newsfeed2.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
To: python-list@python.org
From: Russell Owen <rowen@uw.edu>
Subject: Re: Picking apart a text line
Date: Mon, 02 Mar 2015 12:25:34 -0800
References: <mcopnu$s6t$1@ger.gmane.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:31.0) Gecko/20100101 Thunderbird/31.4.0
In-Reply-To: <mcopnu$s6t$1@ger.gmane.org>
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.67.1425327960.13471.python-list@python.org>
Lines: 35
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:86784

On 2/26/15 7:53 PM, memilanuk wrote:
> So... okay.  I've got a bunch of PDFs of tournament reports that I want
> to sift thru for information.  Ended up using 'pdftotext -layout
> file.pdf file.txt' to extract the text from the PDF.  Still have a few
> little glitches to iron out there, but I'm getting decent enough results
> for the moment to move on.
>
...
> So back to the lines of text I have stored as strings in a list.  I
> think I want to convert that to a list of lists, i.e. split each line
> up, store that info in another list and ditch the whitespace.  Or would
> I be better off using dicts?  Originally I was thinking of how to
> process each line and split it up based on what information was
> where - some sort of nested for/if mess.  Now I'm starting to think that
> the lines of text are pretty uniform in structure i.e. the same field is
> always in the same location, and that list slicing might be the way to
> go, if a bit tedious to set up initially...?
>
> Any thoughts or suggestions from people who've gone down this particular
> path would be greatly appreciated.  I think I have a general
> idea/direction, but I'm open to other ideas if the path I'm on is just
> blatantly wrong.

It sounds to me as if the best way to handle all this is to keep the 
information in a database, preferably one that is available over the 
network and centrally managed, so that whoever enters the information in 
the first place enters it there. But I admit that setting such a thing 
up involves some overhead.

Simpler alternatives include SQLite, a simple file-based database 
system, or numpy structured arrays (arrays with named fields). Python's 
standard library includes the sqlite3 module, and numpy is easy to 
install.
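
As a rough sketch of the SQLite route, combined with the slicing idea 
from your post: the column positions, field names, table name and sample 
rows below are all made up for illustration, so you would need to adjust 
them to whatever pdftotext -layout actually gives you.

```python
import sqlite3

# Hypothetical fixed-width layout: (field name, start column, end column).
# The real positions depend on the pdftotext -layout output.
FIELDS = [
    ("place", 0, 4),
    ("name", 4, 24),
    ("score", 24, 30),
]


def parse_line(line):
    """Slice one fixed-width line into a dict of stripped strings."""
    return {name: line[start:end].strip() for name, start, end in FIELDS}


def load_results(lines, db_path=":memory:"):
    """Parse the lines and store them in a table named 'results'."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(place INTEGER, name TEXT, score INTEGER)"
    )
    # Named placeholders take their values from each parsed dict.
    conn.executemany(
        "INSERT INTO results (place, name, score) "
        "VALUES (:place, :name, :score)",
        [parse_line(line) for line in lines],
    )
    conn.commit()
    return conn


# Invented sample rows, built with format() so the columns line up.
sample = [
    "{:<4}{:<20}{:>6}".format(1, "Jane Doe", 595),
    "{:<4}{:<20}{:>6}".format(2, "John Smith", 590),
]

conn = load_results(sample)
top = conn.execute(
    "SELECT name, score FROM results ORDER BY score DESC"
).fetchall()
print(top)
```

Pass a file name instead of ":memory:" to keep the data on disk between 
runs. Once the rows are in a table, questions such as "who scored 
highest" become a single query rather than a nested for/if loop.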

-- Russell