
Re: NEED HELP-process words in a text file

Date Sat, 18 Jun 2011 19:09:18 -0500
From Tim Chase <python.list@tim.thechases.com>
To Cathy James <nambo4jb@gmail.com>
Subject Re: NEED HELP-process words in a text file
Newsgroups comp.lang.python
Message-ID <mailman.135.1308442167.1164.python-list@python.org> (permalink)



On 06/18/2011 06:21 PM, Cathy James wrote:

>      freq = [] #empty dict to accumulate words and word length

While you say you create an empty dict, using "[]" creates an 
empty *list*, not a dict.  Either your comment is wrong or your 
code is wrong. :)  Given your usage, I presume you want a dict, 
not a list.
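A minimal sketch of the difference, using made-up names (freq_list, freq_dict) purely for illustration: indexing a list by a word raises a TypeError, while a dict accepts the word as a key.

```python
freq_list = []   # square brackets: an empty list
freq_dict = {}   # curly braces: an empty dict

freq_dict["spam"] = 1          # fine: creates the key "spam"

try:
    freq_list["spam"] = 1      # lists only accept integer (or slice) indices
except TypeError as exc:
    error = str(exc)           # keep the message for inspection
```
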

>      for line in filename:
>          punc = string.punctuation + string.whitespace#use Python's
> built-in punctuation and whiitespace

Since you don't change "punc" in your loop, you'd get better 
performance by hoisting this outside of the loop so it's only 
evaluated once.  Not that it should matter *that* greatly, but 
it's just a bad-code-smell.
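As a rough illustration of the hoisting, assuming a small list of strings (lines) stands in for the open file, "punc" is built once before the loop and only read inside it:

```python
import string

# Built once, outside the loop, because it never changes.
punc = string.punctuation + string.whitespace

# Stand-in for the open file object; purely illustrative data.
lines = ["Hello, world!\n", "Goodbye, world.\n"]

seen = []
for line in lines:
    # punc is only read here, never rebuilt on each iteration
    seen.append([c for c in line if c in punc])
```
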

>          for i, word in enumerate (line.replace (punc, "").lower().split()):

.replace() doesn't operate on sets of characters, but rather on 
whole strings.  So unless your line contains the exact text in 
"punc" (unlikely), that replacement is a no-op.  There are a couple 
of ways to go about removing unwanted characters:

- make a set of those characters and produce a resulting string 
from things not in that set:

  punc_set = set(punc)
  line = ''.join(c for c in line if c not in punc_set)

- use a regexp to strip them out...something like

   punc_re = re.compile("[" + re.escape(punc) + "]")
   ...
   line = punc_re.sub('', line)

- use string translations.  I'm not as familiar with these, but 
the following seemed to work for me, abusing the 2nd 
"deletechars" parameter for your particular use-case:

   line = line.translate(None, punc)

I don't see .translate(None) documented anywhere.  My random 
effort seemed to work in 2.6, but fails in 2.5 and prior.  YMMV.
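The three approaches above can be sketched side by side for Python 3. Two assumptions to flag: the sample line is made up, and only string.punctuation is stripped here, because removing string.whitespace as well would leave .split() nothing to split on. The two-argument translate(None, punc) call is specific to Python 2 byte strings; Python 3's str.translate takes a single table, which str.maketrans can build.

```python
import re
import string

punc = string.punctuation            # whitespace is kept so .split() still works
line = "Hello, world! It's a test."  # illustrative input

# 1. Set membership: keep every character that is not punctuation.
punc_set = set(punc)
via_set = "".join(c for c in line if c not in punc_set)

# 2. Regular expression: a character class built from the escaped characters.
punc_re = re.compile("[" + re.escape(punc) + "]")
via_re = punc_re.sub("", line)

# 3. Python 3 str.translate: the third maketrans argument lists characters to delete.
delete_table = str.maketrans("", "", punc)
via_translate = line.translate(delete_table)
```

All three produce the same string here, so the choice is mostly about readability and speed on your data.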

>              if word in freq:
>                  freq[word] +=1 #increment current count if word already in dict
>
>              else:
>                  freq[word] = 0 #if punctuation encountered,
> frequency=0 word length = 0

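Again, your 2nd comment disagrees with your code: the else branch 
runs the first time a word is seen, not when punctuation is 
encountered.  It should also start the count at 1 rather than 0, 
or every total comes out one too low.  As an aside, if you're 
using 2.5 or greater, I'd use collections.defaultdict(int) as the 
accumulator: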

   freq = collections.defaultdict(int)
   ...
   freq[word] += 1
   # no need to check presence
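A small sketch of that accumulator on a made-up word list. collections.Counter, added later (Python 2.7 and 3.1), is shown alongside as an alternative rather than as anything the original code used:

```python
import collections

words = ["the", "cat", "the", "hat"]  # illustrative input

freq = collections.defaultdict(int)
for word in words:
    freq[word] += 1   # a missing key starts at int() == 0, then becomes 1

# Alternative: Counter does the same tallying in one call.
counts = collections.Counter(words)
```
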

>          for word in freq.items():
>              print("Length /t"+"Count/n"+ freq[word],+'/t' +
> len(word))#print word count and length of word separated by a tab

Where to begin:

- Your escapes are using "/" instead of "\" for <tab> and 
<newline> which I expect will mess up the formatting.

- You're also labeling them "Length/Count" but printing 
"count/length".

- You're iterating over freq.items(), which yields (word, count) 
pairs rather than bare words, so that should be written as

   for word, count in freq.items():

or

   for word in freq:

- Additionally, adding the bits together with "+" makes it somewhat 
hard to understand, and it fails outright because you can't add 
an int (the count or the length) to a str.
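A short sketch of the points in that list, with a one-entry dict invented for the purpose: forward slashes stay literal characters, .items() hands back tuples, and str + int raises a TypeError.

```python
freq = {"spam": 2}  # illustrative data

wrong = "Length/tCount/n"   # forward slashes: just literal "/t" and "/n"
right = "Length\tCount\n"   # backslashes: a real tab and a real newline

pairs = list(freq.items())  # each item is a (word, count) tuple

try:
    "count: " + freq["spam"]   # str + int is not allowed
except TypeError as exc:
    concat_error = str(exc)
```
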

I'd use something like

   # print the header once, then one row per word
   print("Word \tLength \tCount")
   for word, count in freq.items():
     print("%s \t%i \t%i" % (word, len(word), count))
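Pulling the suggestions together, here is one possible end-to-end sketch for Python 3. The function name word_frequencies and the sample lines are hypothetical, and an open file object could be passed in place of the list:

```python
import collections
import string


def word_frequencies(lines):
    """Count lowercase words in an iterable of lines, ignoring punctuation."""
    # Built once; deletes punctuation but leaves whitespace for .split().
    delete_table = str.maketrans("", "", string.punctuation)
    freq = collections.defaultdict(int)
    for line in lines:
        for word in line.translate(delete_table).lower().split():
            freq[word] += 1
    return freq


sample = ["The cat sat.", "The hat!"]  # stand-in for the text file
freq = word_frequencies(sample)

print("Word\tLength\tCount")
for word, count in sorted(freq.items()):
    print("%s\t%i\t%i" % (word, len(word), count))
```

Sorting the items is only there to make the output order predictable; drop it if order doesn't matter.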

-tkc
