Date: Sat, 18 Jun 2011 17:16:31 -0700
Subject: Re: NEED HELP-process words in a text file
From: Chris Rebert
To: Cathy James
Cc: python-list@python.org
Newsgroups: comp.lang.python

On Sat, Jun 18, 2011 at 4:21 PM, Cathy James wrote:
> Dear Python Experts,
>
> First, I'd like to convey my appreciation to you all for your support
> and contributions. I am a Python newborn and need help with my
> function. I commented on my program as to what it should do, but
> nothing is printing. I know I am off, but not sure where. Please
> help:(
>
> import string
> def fileProcess(filename):
>     """Call the program with an argument,
>     it should treat the argument as a filename,
>     splitting it up into words, and computes the length of each word.
>     print a table showing the word count for each of the word lengths
> that has been encountered.
>     Example:
>     Length Count
>     1 16
>     2 267
>     3 267
>     4 169
>     >>>"&"
>     Length    Count
>     0    0
>     >>>
>     >>>"right."
>     Length    Count
>     5    10
>     """
>     freq = [] #empty dict to accumulate words and word length

Er, that's an empty *list*, not an empty dict. Dicts use curly braces,
i.e. {}. Lists use square brackets, i.e. [].
So:
    freq = {}

>     filename=open('declaration.txt, r')

1. You should be using the passed-in filename; you're currently
ignoring the function's argument and just hardcoding the filename as
declaration.txt.
2. You're missing 2 quotes inside the open() call. It should be:
    open('declaration.txt', 'r')
3. `filename` is misnamed; you're using it for a file object as
opposed to a string representing the name of the file.

Taking all that into account:
    f = open(filename, 'r')
    for line in f:

>     for line in filename:
>         punc = string.punctuation + string.whitespace#use Python's
> built-in punctuation and whiitespace
>         for i, word in enumerate (line.replace (punc, "").lower().split()):

str.replace() does not match the characters of the needle string as a
set. Rather, it matches it as a contiguous sequence of characters. By
way of example:
    >>> "abc abc abc".replace("ac", "Q") # no effect
    'abc abc abc'
    >>> "abc abc abc".replace("bc", "Q")
    'aQ aQ aQ'
Order matters; the needle is a substring, not a set of characters.
(Jargon: needle = what you're searching for; as opposed to: haystack =
what you're searching through).

Also, since whitespace is part of `punc`, even if str.replace() were to
have the semantics you thought it did, you'd end up with
nospacesbetweenthewords whatsoever, making the str.split() call quite
useless.

So, to rewrite this, let's first define a helper function to remove
all the punctuation from a string:
def withoutPunct(word):
    # Look up "list comprehensions" (and the closely related "generator
    # expressions") if you don't understand this code.
    return ''.join(char for char in word if char not in string.punctuation)

Now, rewriting everything inside the enumerate() call:
    withoutPunct(word) for word in line.split()

In fact, you never use `i`, so there's no need to use enumerate in the
first place in the inner for-loop:
    words = (withoutPunct(word) for word in line.split())
    for word in words:

>             if word in freq:
>                 freq[word] +=1 #increment current count if word already in dict
>
>             else:
>                 freq[word] = 0 #if punctuation encountered,
> frequency=0 word length = 0

Problem: What about the very first time you see a word? It won't be in
freq, so you'll set its count to 0, when in fact you've now seen it
once.

Moreover, we don't care about what the words actually are; we only
care about their lengths. So freq should use word lengths, not the
actual words themselves, as keys.

Corrected version (there are several ways to do this):
    length = len(word)
    freq[length] = freq.get(length, 0) + 1 # See dict.get() docs for details

>         for word in freq.items():

items() returns a collection of key-value pairs, not a collection of
keys. If you just want the keys, omit the `.items()`.

Also, this seems to be indented wrong. You're running the output loop
once per line rather than once per file.

Finally, the dictionary yields its keys/items in no particular order;
based on the sample output, you'll need to sort the word lengths if
you want to output the table's rows in ascending order.

Cheers,
Chris
--
http://rebertia.com
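
For reference, the corrections above can be assembled into one piece. This is only a sketch of one possible result, not necessarily the version the poster had in mind. It assumes Python 3 (`print()` as a function), and the names `without_punct`, `word_length_counts` and `file_process` are hypothetical, chosen for illustration. Skipping tokens that become empty once punctuation is stripped (such as a lone "&") is also an assumption; the original docstring is ambiguous on that point.

```python
import string


def without_punct(word):
    # Same idea as withoutPunct above: keep every character that is
    # not an ASCII punctuation mark.
    return ''.join(char for char in word if char not in string.punctuation)


def word_length_counts(lines):
    # Hypothetical helper: `lines` is any iterable of strings,
    # for example an open file object.
    freq = {}  # maps word length -> number of words with that length
    for line in lines:
        for word in line.split():
            length = len(without_punct(word))
            if length == 0:
                # Assumption: a token made only of punctuation is not a word.
                continue
            freq[length] = freq.get(length, 0) + 1
    return freq


def file_process(filename):
    # Use the argument instead of a hardcoded name, and close the file
    # automatically when done.
    with open(filename, 'r') as f:
        freq = word_length_counts(f)
    # Print once per file, with the lengths in ascending order.
    print("Length Count")
    for length in sorted(freq):
        print(length, freq[length])
```

Calling `file_process('declaration.txt')` would then print one row per word length found in that file. Keeping the counting in `word_length_counts` separate from the printing makes it possible to check the counts on a few strings without touching a file.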