How can I count word frequency in a web site?

Started by: ryguy7272 <ryanshuell@gmail.com>
First post: 2015-11-29 16:49 -0800
Last post: 2015-11-30 07:04 -0800
Articles 7 — 4 participants


Contents

  How can I count word frequency in a web site? ryguy7272 <ryanshuell@gmail.com> - 2015-11-29 16:49 -0800
    Re: How can I count word frequency in a web site? Cem Karan <cfkaran2@gmail.com> - 2015-11-29 21:31 -0500
      Re: How can I count word frequency in a web site? ryguy7272 <ryanshuell@gmail.com> - 2015-11-29 18:54 -0800
        Re: How can I count word frequency in a web site? Michiel Overtoom <motoom@xs4all.nl> - 2015-11-30 08:56 +0100
    Re: How can I count word frequency in a web site? Laura Creighton <lac@openend.se> - 2015-11-30 03:51 +0100
      Re: How can I count word frequency in a web site? ryguy7272 <ryanshuell@gmail.com> - 2015-11-30 07:04 -0800
    Re: How can I count word frequency in a web site? ryguy7272 <ryanshuell@gmail.com> - 2015-11-30 07:04 -0800

#99705 — How can I count word frequency in a web site?

From: ryguy7272 <ryanshuell@gmail.com>
Date: 2015-11-29 16:49 -0800
Subject: How can I count word frequency in a web site?
Message-ID: <6851e3b8-0d46-4808-9f7f-372b71bf327c@googlegroups.com>

I'm trying to figure out how to count words in a web site.  Here is a sample of the link I want to scrape data from and count specific words.
http://finance.yahoo.com/q/h?s=STRP+Headlines

I only want to count certain words, like 'fraud', 'lawsuit', etc.  I want to have a way to control for specific words.  I have a couple Python scripts that do this for a text file, but not for a web site.  I can post that, if that's helpful.


#99714

From: Cem Karan <cfkaran2@gmail.com>
Date: 2015-11-29 21:31 -0500
Message-ID: <mailman.14.1448850720.14615.python-list@python.org>
In reply to: #99705

You might want to look into Beautiful Soup (https://pypi.python.org/pypi/beautifulsoup4), which is an HTML screen-scraping tool.  I've never used it, but I've heard good things about it.

Good luck,
Cem Karan


#99719

From: ryguy7272 <ryanshuell@gmail.com>
Date: 2015-11-29 18:54 -0800
Message-ID: <88ec2ba2-6b06-421b-89d5-ece408bb4c8e@googlegroups.com>
In reply to: #99714

Ok, this small script will grab everything from the link.

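import requests
from bs4 import BeautifulSoup

# Fetch the page and parse it; naming the parser explicitly avoids bs4's "no parser specified" warning.
r = requests.get("http://finance.yahoo.com/q/h?s=STRP+Headlines")
soup = BeautifulSoup(r.content, "html.parser")
htmltext = soup.prettify()
print(htmltext)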


Now, how can I count specific words like 'fraud' and 'lawsuit'?


#99722

From: Michiel Overtoom <motoom@xs4all.nl>
Date: 2015-11-30 08:56 +0100
Message-ID: <mailman.20.1448870261.14615.python-list@python.org>
In reply to: #99719

> On 30 Nov 2015, at 03:54, ryguy7272 <ryanshuell@gmail.com> wrote:
> 
> Now, how can I count specific words like 'fraud' and 'lawsuit'?

- convert the page to plain text
- remove any punctuation
- split into words
- see what words occur
- enumerate all the words and increase a counter for each word

Something like this:

s = """Today we're rounding out our planetary tour with ice giants Uranus
and Neptune. Both have small rocky cores, thick mantles of ammonia, water,
and methane, and atmospheres that make them look greenish and blue. Uranus
has a truly weird rotation and relatively dull weather, while Neptune has
clouds and storms whipped by tremendous winds. Both have rings and moons,
with Neptune's Triton probably being a captured iceball that has active
geology."""

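import collections
# Lowercase, turn newlines and apostrophes into spaces, and drop periods and commas.
cleaned = s.lower().replace("\n", " ").replace(".", "").replace(",", "").replace("'", " ")
# split() with no argument also collapses runs of whitespace, so no empty "words" are counted.
count = collections.Counter(cleaned.split())
for interesting in ("neptune", "and"):
    print("The word '%s' occurs %d times" % (interesting, count[interesting]))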


# Outputs:

The word 'neptune' occurs 3 times
The word 'and' occurs 7 times
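
The same steps can be applied to a downloaded page instead of a literal string. What follows is only a rough sketch, not the method used anywhere in this thread: it assumes Python 3, assumes the HTML is already in a string, and uses the standard library's html.parser rather than Beautiful Soup so that it runs without extra installs. The names TextExtractor, count_words and sample_html are made up for illustration, and the sample page is invented, not taken from the Yahoo link.

```python
import collections
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def count_words(html, wanted):
    """Return {word: occurrences} for each word in `wanted`, ignoring case."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    # Treat every run of ASCII letters as a word; punctuation acts as a separator.
    counts = collections.Counter(re.findall(r"[a-z]+", text))
    return {word: counts[word.lower()] for word in wanted}


# Hypothetical sample page, standing in for the downloaded HTML.
sample_html = """<html><head><style>p { color: red }</style></head>
<body><h1>Fraud probe widens</h1>
<p>A lawsuit alleging fraud was filed. The lawsuit's outcome is unclear.</p>
<script>var fraud = 1;</script></body></html>"""

result = count_words(sample_html, ["fraud", "lawsuit", "merger"])
for word in ("fraud", "lawsuit", "merger"):
    print("The word '%s' occurs %d times" % (word, result[word]))
```

With Beautiful Soup installed, soup.get_text() could probably replace the TextExtractor class; the counting part would stay the same.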




#99718

From: Laura Creighton <lac@openend.se>
Date: 2015-11-30 03:51 +0100
Message-ID: <mailman.18.1448851896.14615.python-list@python.org>
In reply to: #99705

In a message of Sun, 29 Nov 2015 21:31:49 -0500, Cem Karan writes:
>You might want to look into Beautiful Soup (https://pypi.python.org/pypi/beautifulsoup4), which is an HTML screen-scraping tool.  I've never used it, but I've heard good things about it.
>
>Good luck,
>Cem Karan

http://codereview.stackexchange.com/questions/73887/finding-the-occurrences-of-all-words-in-movie-scripts

scrapes a site of movie scripts and then spits out the 10 most common
words.  I suspect the OP could modify this script to suit his or her needs.

Laura
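
For reference, the "most common words" idea can be sketched with collections.Counter.most_common. This is not the code from the linked review, just a small illustration under assumptions: Python 3, plain text already extracted from the page, and a made-up helper name (top_words), stopword set and sample sentence.

```python
import collections
import re


def top_words(text, n=10, stopwords=()):
    """Return the n most frequent words as (word, count) pairs."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = collections.Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)


# Invented sample text, standing in for the scraped page text.
sample = "the fraud case and the lawsuit; the lawsuit cites fraud, fraud, and more fraud"
for word, occurrences in top_words(sample, n=3, stopwords={"the", "and"}):
    print("%s: %d" % (word, occurrences))
```

To count only chosen words such as 'fraud' and 'lawsuit', it should be enough to look them up in the Counter directly instead of calling most_common.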


#99737

From: ryguy7272 <ryanshuell@gmail.com>
Date: 2015-11-30 07:04 -0800
Message-ID: <b67bd51c-9f92-471c-b9ca-73292a8315d0@googlegroups.com>
In reply to: #99718

Thanks Laura!


#99736

From: ryguy7272 <ryanshuell@gmail.com>
Date: 2015-11-30 07:04 -0800
Message-ID: <a9dbdd7e-6c32-49aa-ae6c-9a4f42b8e497@googlegroups.com>
In reply to: #99705

This works great!  Thanks for sharing!!
