Re: How can I count word frequency in a web site?

From Michiel Overtoom <motoom@xs4all.nl>
Newsgroups comp.lang.python
Subject Re: How can I count word frequency in a web site?
Date Mon, 30 Nov 2015 08:56:32 +0100
Message-ID <mailman.20.1448870261.14615.python-list@python.org>


> On 30 Nov 2015, at 03:54, ryguy7272 <ryanshuell@gmail.com> wrote:
> 
> Now, how can I count specific words like 'fraud' and 'lawsuit'?

- convert the page to plain text
- remove any punctuation
- split into words
- see what words occur
- enumerate all the words and increase a counter for each word

Something like this:

s = """Today we're rounding out our planetary tour with ice giants Uranus
and Neptune. Both have small rocky cores, thick mantles of ammonia, water,
and methane, and atmospheres that make them look greenish and blue. Uranus
has a truly weird rotation and relatively dull weather, while Neptune has
clouds and storms whipped by tremendous winds. Both have rings and moons,
with Neptune's Triton probably being a captured iceball that has active
geology."""

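import collections

# Lowercase, join the lines, drop periods and commas; an apostrophe becomes a
# space, so "neptune's" splits into "neptune" and "s".
cleaned = s.lower().replace("\n", " ").replace(".", "").replace(",", "").replace("'", " ")
count = collections.Counter(cleaned.split(" "))
for interesting in ("neptune", "and"):
    # print() with a single argument behaves the same on Python 2 and 3
    print("The word '%s' occurs %d times" % (interesting, count[interesting]))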


# Outputs:

The word 'neptune' occurs 3 times
The word 'and' occurs 7 times
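
The snippet above starts from text that is already plain. For an actual web page, the first step (converting the page to plain text) could be sketched roughly as follows, assuming Python 3 and only the standard library. The names TextExtractor and count_words and the sample page are made up for illustration, and the [a-z]+ pattern assumes English text in ASCII letters; like the apostrophe handling above, it splits "we're" into "we" and "re".

```python
import collections
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is not inside a skipped element.
        if self.skip_depth == 0:
            self.chunks.append(data)


def count_words(html):
    """Return a Counter of lowercased words in the visible text of html."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    # Runs of letters are words; punctuation and digits act as separators.
    words = re.findall(r"[a-z]+", text)
    return collections.Counter(words)


# Hypothetical page content; a real script would download it first,
# for example with urllib.request.urlopen(url).read().decode("utf-8").
page = """<html><head><style>p { color: red }</style></head>
<body><p>The lawsuit alleges fraud.</p>
<p>A second lawsuit, filed later, also alleges fraud; fraud is the key word.</p>
<script>var fraud = 1;</script></body></html>"""

count = count_words(page)
for interesting in ("fraud", "lawsuit"):
    print("The word '%s' occurs %d times" % (interesting, count[interesting]))
```

Skipping script and style content keeps identifiers in embedded code from inflating the counts. Pages with malformed markup or non-English text would probably be better served by a dedicated HTML library and a broader word pattern.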





