Path: csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail
From: Michiel Overtoom <motoom@xs4all.nl>
Newsgroups: comp.lang.python
Subject: Re: How can I count word frequency in a web site?
Date: Mon, 30 Nov 2015 08:56:32 +0100
Lines: 43
Message-ID: <mailman.20.1448870261.14615.python-list@python.org>
References: <6851e3b8-0d46-4808-9f7f-372b71bf327c@googlegroups.com> <mailman.14.1448850720.14615.python-list@python.org> <88ec2ba2-6b06-421b-89d5-ece408bb4c8e@googlegroups.com>
Mime-Version: 1.0 (Mac OS X Mail 8.2 \(2104\))
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: quoted-printable
In-Reply-To: <88ec2ba2-6b06-421b-89d5-ece408bb4c8e@googlegroups.com>
Precedence: list
Xref: csiph.com comp.lang.python:99722


> On 30 Nov 2015, at 03:54, ryguy7272 <ryanshuell@gmail.com> wrote:
>
> Now, how can I count specific words like 'fraud' and 'lawsuit'?

- convert the page to plain text
- remove any punctuation
- split into words
- see what words occur
- enumerate all the words and increase a counter for each word

Something like this:

s = """Today we're rounding out our planetary tour with ice giants Uranus
and Neptune. Both have small rocky cores, thick mantles of ammonia, water,
and methane, and atmospheres that make them look greenish and blue. Uranus
has a truly weird rotation and relatively dull weather, while Neptune has
clouds and storms whipped by tremendous winds. Both have rings and moons,
with Neptune's Triton probably being a captured iceball that has active
geology."""

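import collections

# Lowercase everything and strip punctuation; the apostrophe becomes a space,
# so "neptune's" is counted as "neptune" (plus a stray "s").
cleaned = s.lower().replace("\n", " ").replace(".", "").replace(",", "").replace("'", " ")
count = collections.Counter(cleaned.split(" "))
for interesting in ("neptune", "and"):
    # The parenthesized form works as a statement in Python 2 and a call in Python 3
    print("The word '%s' occurs %d times" % (interesting, count[interesting]))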


# Outputs:

The word 'neptune' occurs 3 times
The word 'and' occurs 7 times
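
The example above starts from plain text, so it skips the first step in the list (converting the page to plain text). Here is a rough sketch of that step combined with the counting, using only the standard library on Python 3. The names TextExtractor and word_counts and the sample page are made up for illustration, and downloading the page is not shown. It assumes plain English text: the regular expression only keeps runs of a-z, so accented letters and digits are treated as separators.

```python
import collections
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect text nodes, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a script or style element
        if not self._skip_depth:
            self.chunks.append(data)


def word_counts(html):
    """Return a Counter of lowercase words found in the visible text of html."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    # Runs of letters are words; anything else (punctuation, digits) separates them
    words = re.findall(r"[a-z]+", text)
    return collections.Counter(words)


# Made-up sample page, standing in for the downloaded site
page = ("<html><body><h1>Fraud lawsuit</h1>"
        "<p>The lawsuit alleges fraud. Fraud!</p>"
        "<script>var fraud = 1;</script></body></html>")

count = word_counts(page)
for interesting in ("fraud", "lawsuit"):
    print("The word '%s' occurs %d times" % (interesting, count[interesting]))
```

For the sample page this reports 3 occurrences of 'fraud' and 2 of 'lawsuit'; the 'fraud' inside the script element is not counted. For messy real-world HTML a dedicated parser may cope better than the standard library one.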