Date: Sat, 18 Jun 2011 17:16:31 -0700
Subject: Re: NEED HELP-process words in a text file
From: Chris Rebert <clp2@rebertia.com>
To: Cathy James <nambo4jb@gmail.com>
Cc: python-list@python.org
Newsgroups: comp.lang.python
Message-ID: <mailman.136.1308442594.1164.python-list@python.org>

On Sat, Jun 18, 2011 at 4:21 PM, Cathy James <nambo4jb@gmail.com> wrote:
> Dear Python Experts,
>
> First, I'd like to convey my appreciation to you all for your support
> and contributions. I am a Python newborn and need help with my
> function. I commented on my program as to what it should do, but
> nothing is printing. I know I am off, but not sure where. Please
> help:(
>
> import string
> def fileProcess(filename):
>    """Call the program with an argument,
>    it should treat the argument as a filename,
>    splitting it up into words, and computes the length of each word.
>    print a table showing the word count for each of the word lengths
> that has been encountered.
>    Example:
>    Length Count
>    1 16
>    2 267
>    3 267
>    4 169
>    >>>"&"
>    Length    Count
>    0    0
>    >>>
>    >>>"right."
>    Length    Count
>    5    10
>    """
>    freq = [] #empty dict to accumulate words and word length

Er, that's an empty *list*, not an empty dict. Dicts use curly braces,
i.e. {}. Lists use square brackets, i.e. [].
So:

freq = {}

>    filename=open('declaration.txt, r')

1. You should be using the passed-in filename; you're currently
ignoring the function's argument and just hardcoding the filename as
declaration.txt.
2. You're missing 2 quotes inside the open() call. It should be:
open('declaration.txt', 'r')
3. `filename` is misnamed; you're using it for a file object as
opposed to a string representing the name of the file.

Taking all that into account:

f = open(filename, 'r')
for line in f:
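
As an aside, here is a minimal sketch of the same open-and-iterate
pattern written with a `with` block, which closes the file for you
when the block ends. The helper name `count_lines` and the temporary
file are my own inventions for illustration, not part of your program:

```python
import os
import tempfile

def count_lines(filename):
    # Hypothetical helper: `filename` is a string naming the file,
    # and `f` is the file object we actually iterate over.
    n = 0
    with open(filename, 'r') as f:  # f is closed when the block exits
        for line in f:
            n += 1
    return n

# Write a throwaway file so the example can run anywhere.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("first line\nsecond line\n")

line_count = count_lines(tmp.name)
os.remove(tmp.name)
print(line_count)  # 2
```

The point is only that the function uses the name it was given rather
than a hardcoded one.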

>    for line in filename:
>        punc = string.punctuation + string.whitespace#use Python's
> built-in punctuation and whiitespace
>        for i, word in enumerate (line.replace (punc, "").lower().split()):

str.replace() does not match the characters of the needle string as a
set. Rather, it matches it as a contiguous sequence of characters. By
way of example:
>>> "abc abc abc".replace("ac", "Q") # no effect
'abc abc abc'
>>> "abc abc abc".replace("bc", "Q")
'aQ aQ aQ'

Order matters; the needle is a substring, not a set of characters.

(Jargon: needle = what you're searching for; as opposed to: haystack =
what you're searching through).

Also, since whitespace is part of `punc`, even if str.replace() were to
have the semantics you thought it did, you'd end up with
nospacesbetweenthewords whatsoever, making the str.split() call quite
useless.

So, to rewrite this, let's first define a helper function to remove
all the punctuation from a string:

def withoutPunct(word):
    # Look up "generator expressions" (a close cousin of list
    # comprehensions) if you don't understand this code.
    return ''.join(char for char in word if char not in string.punctuation)
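
To see what the helper does, here is a small self-contained check. The
sample string is made up. The str.translate() variant at the end is an
alternative I'm adding for comparison, and it assumes Python 3, where
str.maketrans() accepts a third argument of characters to delete:

```python
import string

def withoutPunct(word):
    # Keep only the characters that are not punctuation.
    return ''.join(char for char in word if char not in string.punctuation)

cleaned = [withoutPunct(w) for w in "Right. & (wrong), it's".split()]
print(cleaned)  # ['Right', '', 'wrong', 'its']

# Alternative (Python 3 only): delete every punctuation character
# using a translation table instead of a generator expression.
table = str.maketrans('', '', string.punctuation)
print("it's".translate(table))  # its
```

Note that a "word" consisting only of punctuation, such as "&", comes
back as the empty string, whose length is 0.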

Now, rewriting everything inside the enumerate() call:
withoutPunct(word) for word in line.split()

In fact, you never use `i`, so there's no need to use enumerate() in
the inner for-loop in the first place:

words = (withoutPunct(word) for word in line.split())
for word in words:

>            if word in freq:
>                freq[word] +=1 #increment current count if word already in dict
>
>            else:
>                freq[word] = 0 #if punctuation encountered,
> frequency=0 word length = 0

Problem: What about the very first time you see a word? It won't be in
freq, so you'll set its count to 0, when in fact you've now seen it
once.
Moreover, we don't care about what the words actually are; we only
care about their lengths. So freq should use word lengths, not the
actual words themselves, as keys.
Corrected version (there are several ways to do this):

length = len(word)
freq[length] =3D freq.get(length, 0) + 1 # See dict.get() docs for details
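
Here is that counting idiom run on a made-up word list, so you can see
what ends up in the dict. The collections.Counter line is one of the
"several ways" and is my own addition; it should give the same counts:

```python
from collections import Counter

words = ["we", "hold", "these", "truths", "to", "be"]  # sample data

freq = {}
for word in words:
    length = len(word)
    # dict.get() returns 0 the first time a length is seen,
    # so the first occurrence is counted as 1, not 0.
    freq[length] = freq.get(length, 0) + 1

print(freq)  # {2: 3, 4: 1, 5: 1, 6: 1}

# Same counts using collections.Counter.
counter_freq = Counter(len(word) for word in words)
print(dict(counter_freq) == freq)  # True
```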

>        for word in freq.items():

freq.items() returns a collection of key-value pairs, not a collection
of keys. If you just want the keys, omit the `.items()`.
Also, this seems to be indented wrong. You're running the output loop
once per line rather than once per file.
Finally, the dictionary yields its keys/items in no particular order;
based on the sample output, you'll need to sort the word lengths if
you want to output the table's rows in ascending order.
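
Putting the pieces together, one possible sketch of the whole thing
follows. The function names `without_punct`, `length_counts` and
`format_table` and the sample lines are hypothetical, and I'm assuming
that a word made entirely of punctuation should be counted under
length 0; adjust that if your assignment says otherwise:

```python
import string

def without_punct(word):
    # Strip punctuation characters from a single word.
    return ''.join(ch for ch in word if ch not in string.punctuation)

def length_counts(lines):
    # `lines` can be any iterable of strings, e.g. an open file object.
    freq = {}
    for line in lines:
        for word in line.split():
            length = len(without_punct(word))
            freq[length] = freq.get(length, 0) + 1
    return freq

def format_table(freq):
    # Build the table once, after all lines have been counted,
    # with rows sorted by ascending word length.
    rows = ["Length Count"]
    for length in sorted(freq):
        rows.append("%d %d" % (length, freq[length]))
    return "\n".join(rows)

sample = ["We hold these truths,", "to be self-evident."]
table = format_table(length_counts(sample))
print(table)
```

To run it on a real file, you could pass the file object straight in,
e.g. `with open(filename) as f: print(format_table(length_counts(f)))`.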

Cheers,
Chris
--
http://rebertia.com