From: Michiel Overtoom
Newsgroups: comp.lang.python
Subject: Re: How can I count word frequency in a web site?
Date: Mon, 30 Nov 2015 08:56:32 +0100

> On 30 Nov 2015, at 03:54, ryguy7272 wrote:
>
> Now, how can I count specific words like 'fraud' and 'lawsuit'?

- convert the page to plain text
- remove any punctuation
- split into words
- see what words occur
- enumerate all the words and increase a counter for each word

Something like this:

s = """Today we're rounding out our planetary tour with ice giants
Uranus and Neptune. Both have small rocky cores, thick mantles of ammonia,
water, and methane, and atmospheres that make them look greenish and blue.
Uranus has a truly weird rotation and relatively dull weather, while Neptune
has clouds and storms whipped by tremendous winds. Both have rings and
moons, with Neptune's Triton probably being a captured iceball that has
active geology."""

import collections

# Lowercase, join the lines, drop periods and commas, and turn the
# apostrophe into a space so that "Neptune's" also counts as "neptune".
cleaned = s.lower().replace("\n", " ").replace(".", "").replace(",", "").replace("'", " ")
count = collections.Counter(cleaned.split(" "))
for interesting in ("neptune", "and"):
    # The parenthesized form of print works in both Python 2 and Python 3.
    print("The word '%s' occurs %d times" % (interesting, count[interesting]))

# Outputs:

The word 'neptune' occurs 3 times
The word 'and' occurs 7 times
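
The snippet above starts from text that is already plain, so it skips the first step in the list (converting the page to plain text). As a rough sketch of how all the steps could fit together for an HTML page, here is a Python 3 version that uses only the standard library. The names TextExtractor, html_to_text and count_words, and the sample page, are made up for this illustration and are not part of the original post. It also swaps the chained replace() calls for a regular expression that keeps runs of letters, which is an assumption about what should count as a word (like the original, it splits "Neptune's" into "neptune" and "s").

```python
import collections
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Step 1: convert the page to plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def count_words(text):
    """Steps 2-5: drop punctuation, split into words, count each word."""
    # Lowercase, then keep runs of letters; everything else is a separator.
    return collections.Counter(re.findall(r"[a-z]+", text.lower()))


# Hypothetical sample page, standing in for a page you have already downloaded.
page = """<html><head><style>p { color: red }</style></head>
<body><h1>Fraud lawsuit filed</h1>
<p>The lawsuit alleges fraud. A second lawsuit is expected.</p>
<script>var fraud = 1;</script></body></html>"""

count = count_words(html_to_text(page))
for interesting in ("fraud", "lawsuit"):
    print("The word '%s' occurs %d times" % (interesting, count[interesting]))
```

For a real site, the page string would come from a download (for example with urllib.request.urlopen) rather than from a literal. Note that the word inside the script element is not counted, because its text never reaches the counter.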