Date: Sat, 18 Jun 2011 19:09:18 -0500
From: Tim Chase
To: Cathy James
Cc: python-list@python.org
Subject: Re: NEED HELP-process words in a text file
Newsgroups: comp.lang.python

On 06/18/2011 06:21 PM, Cathy James wrote:
> freq = [] #empty dict to accumulate words and word length

While you say you create an empty dict, using "[]" creates an
empty *list*, not a dict.  Either your comment is wrong or your
code is wrong. :)  Given your usage, I presume you want a dict,
not a list.

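To make the list-vs-dict point concrete, here is a minimal sketch (the names freq_list and freq_dict and the key "spam" are made up for illustration). Note that "word in freq" quietly returns False on a list, so with "freq = []" the first visible failure would be the assignment:

```python
freq_list = []   # a list: indexed by integer position only
freq_dict = {}   # a dict: maps keys (words) to values (counts)

# Membership tests work on both, so this line alone hides the problem.
in_list = "spam" in freq_list   # False
in_dict = "spam" in freq_dict   # False

# Assigning by string key is where the list version blows up.
try:
    freq_list["spam"] = 1
    list_error = None
except TypeError as exc:
    list_error = exc            # list indices must be integers

freq_dict["spam"] = 1           # fine: creates the key
print(type(list_error).__name__, freq_dict)
```
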
>     for line in filename:
>         punc = string.punctuation + string.whitespace#use Python's
> built-in punctuation and whiitespace

Since you don't change "punc" in your loop, you'd get better
performance by hoisting this outside of the loop so it's only
evaluated once.  Not that it should matter *that* greatly, but
it's a bit of a code smell.

>         for i, word in enumerate (line.replace (punc, "").lower().split()):

.replace() doesn't operate on sets of characters, but rather on
whole strings.  So unless your line contains the exact text in
"punc" (unlikely), that replacement is a no-op.  Also note that
"punc" includes whitespace: if the removal did work, it would
glue all the words on the line together before .split() ever saw
them, so you probably want to strip string.punctuation only.
There are a couple of ways to go about removing unwanted
characters:

- make a set of those characters and produce a resulting string
  from things not in that set:

    punc_set = set(punc)
    line = ''.join(c for c in line if c not in punc_set)

- use a regexp to strip them out...something like

    punc_re = re.compile("[" + re.escape(punc) + "]")
    ...
    line = punc_re.sub('', line)

- use string translations.  I'm not as familiar with these, but
  the following seemed to work for me, abusing the 2nd
  "deletechars" parameter for your particular use-case:

    line = line.translate(None, punc)

  As far as I can tell, passing None as the table is a 2.6
  addition: it worked for me in 2.6, but fails in 2.5 and prior.
  In Python 3, str.translate() takes only a table, so you'd build
  one with str.maketrans('', '', punc) instead.  YMMV.

>             if word in freq:
>                 freq[word] +=1 #increment current count if word already in dict
>
>             else:
>                 freq[word] = 0 #if punctuation encountered,
> frequency=0 word length = 0

Again, your 2nd comment disagrees with your code.  (The else
branch is where a word is seen for the first time, so it should
presumably start the count at 1, not 0.)  As an aside, if you're
using 2.5 or greater, I'd use collections.defaultdict(int) as the
accumulator:

    freq = collections.defaultdict(int)
    ...
    freq[word] += 1  # no need to check presence

>     for word in freq.items():
>         print("Length /t"+"Count/n"+ freq[word],+'/t' +
> len(word))#print word count and length of word separated by a tab

Where to begin:

- Your escapes are using "/" instead of "\" for tab and newline,
  which I expect will mess up the formatting.

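- You're also labeling them "Length/Count" but printing
  "count/length".

- You're iterating over freq.items(), which yields (word, count)
  pairs rather than bare words, so that should be written as

    for word, count in freq.items():

  or

    for word in freq:

- Additionally, adding the bits together makes it somewhat hard
  to understand (and it won't run as written, since you can't add
  a str and an int).  I'd use something like the following, with
  the header printed once rather than for every word:

    print("Word \tLength \tCount")
    for word, count in freq.items():
        print("%s \t%i \t%i" % (word, len(word), count))

-tkc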
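
P.S. Pulling the suggestions above together, a rough sketch of how the whole thing might look. This is only one way to arrange it, not a drop-in fix: the function names word_frequencies and report and the sample lines are made up for illustration, and I'm assuming you only want string.punctuation removed (whitespace stays so .split() can still find the word boundaries). It uses the set-based removal, which doesn't depend on which .translate() signature your Python has.

```python
import collections
import string

# Built once, outside any loop. Punctuation only: whitespace is kept
# so that split() can still separate the words.
PUNC_SET = set(string.punctuation)


def word_frequencies(lines):
    """Count how often each lower-cased word occurs in an iterable of lines."""
    freq = collections.defaultdict(int)
    for line in lines:
        cleaned = ''.join(c for c in line if c not in PUNC_SET)
        for word in cleaned.lower().split():
            freq[word] += 1  # missing keys start at 0, so no presence check
    return freq


def report(freq):
    """Return a tab-separated table: one header row, then word, length, count."""
    rows = ["Word\tLength\tCount"]
    for word, count in sorted(freq.items()):
        rows.append("%s\t%i\t%i" % (word, len(word), count))
    return "\n".join(rows)


# Hypothetical sample input standing in for the lines of the real file.
sample = ["The cat, the hat!", "A cat?"]
print(report(word_frequencies(sample)))
```

Any iterable of strings works for "lines", so an open file object can be passed directly. One side effect of deleting punctuation outright is that "don't" becomes "dont"; if that matters, replacing punctuation with a space instead is an alternative.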