Groups > comp.lang.python > #99705 > unrolled thread
| Started by | ryguy7272 <ryanshuell@gmail.com> |
|---|---|
| First post | 2015-11-29 16:49 -0800 |
| Last post | 2015-11-30 07:04 -0800 |
| Articles | 7 — 4 participants |
How can I count word frequency in a web site? ryguy7272 <ryanshuell@gmail.com> - 2015-11-29 16:49 -0800
Re: How can I count word frequency in a web site? Cem Karan <cfkaran2@gmail.com> - 2015-11-29 21:31 -0500
Re: How can I count word frequency in a web site? ryguy7272 <ryanshuell@gmail.com> - 2015-11-29 18:54 -0800
Re: How can I count word frequency in a web site? Michiel Overtoom <motoom@xs4all.nl> - 2015-11-30 08:56 +0100
Re: How can I count word frequency in a web site? Laura Creighton <lac@openend.se> - 2015-11-30 03:51 +0100
Re: How can I count word frequency in a web site? ryguy7272 <ryanshuell@gmail.com> - 2015-11-30 07:04 -0800
Re: How can I count word frequency in a web site? ryguy7272 <ryanshuell@gmail.com> - 2015-11-30 07:04 -0800
| From | ryguy7272 <ryanshuell@gmail.com> |
|---|---|
| Date | 2015-11-29 16:49 -0800 |
| Subject | How can I count word frequency in a web site? |
| Message-ID | <6851e3b8-0d46-4808-9f7f-372b71bf327c@googlegroups.com> |
I'm trying to figure out how to count words in a web site. Here is a sample of the link I want to scrape data from and count specific words.
http://finance.yahoo.com/q/h?s=STRP+Headlines

I only want to count certain words, like 'fraud', 'lawsuit', etc. I want to have a way to control for specific words. I have a couple Python scripts that do this for a text file, but not for a web site. I can post that, if that's helpful.
| From | Cem Karan <cfkaran2@gmail.com> |
|---|---|
| Date | 2015-11-29 21:31 -0500 |
| Message-ID | <mailman.14.1448850720.14615.python-list@python.org> |
| In reply to | #99705 |
You might want to look into Beautiful Soup (https://pypi.python.org/pypi/beautifulsoup4), which is an HTML screen-scraping tool. I've never used it, but I've heard good things about it.

Good luck,
Cem Karan

On Nov 29, 2015, at 7:49 PM, ryguy7272 <ryanshuell@gmail.com> wrote:
> I'm trying to figure out how to count words in a web site. Here is a sample of the link I want to scrape data from and count specific words.
> http://finance.yahoo.com/q/h?s=STRP+Headlines
>
> I only want to count certain words, like 'fraud', 'lawsuit', etc. I want to have a way to control for specific words. I have a couple Python scripts that do this for a text file, but not for a web site. I can post that, if that's helpful.
>
> --
> https://mail.python.org/mailman/listinfo/python-list
| From | ryguy7272 <ryanshuell@gmail.com> |
|---|---|
| Date | 2015-11-29 18:54 -0800 |
| Message-ID | <88ec2ba2-6b06-421b-89d5-ece408bb4c8e@googlegroups.com> |
| In reply to | #99714 |
On Sunday, November 29, 2015 at 9:32:22 PM UTC-5, Cem Karan wrote:
> You might want to look into Beautiful Soup (https://pypi.python.org/pypi/beautifulsoup4), which is an HTML screen-scraping tool. I've never used it, but I've heard good things about it.
>
> Good luck,
> Cem Karan
>
> On Nov 29, 2015, at 7:49 PM, ryguy7272 wrote:
>
> > I'm trying to figure out how to count words in a web site. Here is a sample of the link I want to scrape data from and count specific words.
> > http://finance.yahoo.com/q/h?s=STRP+Headlines
> >
> > I only want to count certain words, like 'fraud', 'lawsuit', etc. I want to have a way to control for specific words. I have a couple Python scripts that do this for a text file, but not for a web site. I can post that, if that's helpful.
> >
> > --
> > https://mail.python.org/mailman/listinfo/python-list
Ok, this small script will grab everything from the link.
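import requests
from bs4 import BeautifulSoup

# Download the page and parse it; naming the parser explicitly avoids a bs4 warning.
r = requests.get("http://finance.yahoo.com/q/h?s=STRP+Headlines")
soup = BeautifulSoup(r.content, "html.parser")
# prettify() returns the whole document as indented HTML, tags included.
htmltext = soup.prettify()
print(htmltext)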
Now, how can I count specific words like 'fraud' and 'lawsuit'?
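The prettified HTML still contains every tag, so counting words on it would also count markup. Below is a minimal sketch of the "plain text" step, assuming Python 3 and only the standard library. `TextExtractor` and `html_to_text` are hypothetical names used for illustration, not something from this thread, and the sample string stands in for a downloaded page. With Beautiful Soup, `soup.get_text()` fills a similar role.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Entering a tag whose contents are not visible text.
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside <script>/<style>.
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Return the text of an HTML string with the markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces and collapse runs of whitespace into single spaces.
    return " ".join(" ".join(parser.chunks).split())


# Made-up sample markup standing in for r.text from the script above.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Fraud lawsuit filed</h1>"
    "<p>The lawsuit alleges <b>fraud</b>.</p></body></html>"
)
print(html_to_text(sample))
```

The result is an ordinary string, so whatever already counts words in a text file should be able to take it from there.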
| From | Michiel Overtoom <motoom@xs4all.nl> |
|---|---|
| Date | 2015-11-30 08:56 +0100 |
| Message-ID | <mailman.20.1448870261.14615.python-list@python.org> |
| In reply to | #99719 |
> On 30 Nov 2015, at 03:54, ryguy7272 <ryanshuell@gmail.com> wrote:
>
> Now, how can I count specific words like 'fraud' and 'lawsuit'?
- convert the page to plain text
- remove any punctuation
- split into words
- see what words occur
- enumerate all the words and increase a counter for each word
Something like this:
s = """Today we're rounding out our planetary tour with ice giants Uranus
and Neptune. Both have small rocky cores, thick mantles of ammonia, water,
and methane, and atmospheres that make them look greenish and blue. Uranus
has a truly weird rotation and relatively dull weather, while Neptune has
clouds and storms whipped by tremendous winds. Both have rings and moons,
with Neptune's Triton probably being a captured iceball that has active
geology."""
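import collections

# Lower-case the text, turn newlines and apostrophes into spaces, and drop periods and commas.
cleaned = s.lower().replace("\n", " ").replace(".", "").replace(",", "").replace("'", " ")
# split() with no argument splits on any run of whitespace, so no empty strings are counted.
count = collections.Counter(cleaned.split())
for interesting in ("neptune", "and"):
    print("The word '%s' occurs %d times" % (interesting, count[interesting]))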
# Outputs:
The word 'neptune' occurs 3 times
The word 'and' occurs 7 times
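The same steps can be wrapped in a function that reports only a chosen list of words, which is what the original question asks for. This is a rough sketch, not the code from the post above: `count_target_words` is a hypothetical name, the headline string is made up, and the chained `replace` calls are swapped for a regular expression that treats anything other than the letters a-z as a separator (so digits and accented letters are ignored, and "Neptune's" still splits into "neptune" and "s").

```python
import collections
import re


def count_target_words(text, targets):
    """Count case-insensitive occurrences of each word in targets."""
    # Every run of a-z is a word; punctuation, digits and whitespace separate words.
    words = re.findall(r"[a-z]+", text.lower())
    counts = collections.Counter(words)
    # A Counter returns 0 for words it never saw, so missing targets are safe.
    return {word: counts[word.lower()] for word in targets}


# Made-up text standing in for the plain text of a scraped page.
headlines = "Lawsuit filed over alleged fraud. Fraud claims denied; second lawsuit expected."
print(count_target_words(headlines, ["fraud", "lawsuit", "bankruptcy"]))
# {'fraud': 2, 'lawsuit': 2, 'bankruptcy': 0}
```

Keeping the watch list as an argument means the words of interest can change without touching the counting code.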
| From | Laura Creighton <lac@openend.se> |
|---|---|
| Date | 2015-11-30 03:51 +0100 |
| Message-ID | <mailman.18.1448851896.14615.python-list@python.org> |
| In reply to | #99705 |
In a message of Sun, 29 Nov 2015 21:31:49 -0500, Cem Karan writes:
>You might want to look into Beautiful Soup (https://pypi.python.org/pypi/beautifulsoup4), which is an HTML screen-scraping tool. I've never used it, but I've heard good things about it.
>
>Good luck,
>Cem Karan

http://codereview.stackexchange.com/questions/73887/finding-the-occurrences-of-all-words-in-movie-scripts

scrapes a site of movie scripts and then spits out the 10 most common words. I suspect the OP could modify this script to suit his or her needs.

Laura
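For the "most common words" idea, `collections.Counter` has a `most_common` method. The sketch below is only an illustration and is not taken from the linked script; `top_words` is a hypothetical name and the sample sentence is made up.

```python
import collections
import re


def top_words(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Same simple tokenization as before: every run of a-z is a word.
    words = re.findall(r"[a-z]+", text.lower())
    return collections.Counter(words).most_common(n)


text = "the quick brown fox jumps over the lazy dog and the fox runs"
print(top_words(text, 2))
# [('the', 3), ('fox', 2)]
```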
| From | ryguy7272 <ryanshuell@gmail.com> |
|---|---|
| Date | 2015-11-30 07:04 -0800 |
| Message-ID | <b67bd51c-9f92-471c-b9ca-73292a8315d0@googlegroups.com> |
| In reply to | #99718 |
On Sunday, November 29, 2015 at 9:51:46 PM UTC-5, Laura Creighton wrote:
> In a message of Sun, 29 Nov 2015 21:31:49 -0500, Cem Karan writes:
> >You might want to look into Beautiful Soup (https://pypi.python.org/pypi/beautifulsoup4), which is an HTML screen-scraping tool. I've never used it, but I've heard good things about it.
> >
> >Good luck,
> >Cem Karan
>
> http://codereview.stackexchange.com/questions/73887/finding-the-occurrences-of-all-words-in-movie-scripts
>
> scrapes a site of movie scripts and then spits out the 10 most common
> words. I suspect the OP could modify this script to suit his or her needs.
>
> Laura

Thanks Laura!
| From | ryguy7272 <ryanshuell@gmail.com> |
|---|---|
| Date | 2015-11-30 07:04 -0800 |
| Message-ID | <a9dbdd7e-6c32-49aa-ae6c-9a4f42b8e497@googlegroups.com> |
| In reply to | #99705 |
On Sunday, November 29, 2015 at 7:49:40 PM UTC-5, ryguy7272 wrote:
> I'm trying to figure out how to count words in a web site. Here is a sample of the link I want to scrape data from and count specific words.
> http://finance.yahoo.com/q/h?s=STRP+Headlines
>
> I only want to count certain words, like 'fraud', 'lawsuit', etc. I want to have a way to control for specific words. I have a couple Python scripts that do this for a text file, but not for a web site. I can post that, if that's helpful.

This works great! Thanks for sharing!!