
I wonder if I would be able to collect data from such page using Python

Started by	Comment Holder <commentholder@gmail.com>
First post	2013-08-21 07:55 -0700
Last post	2013-08-23 01:11 +1000
Articles	11 — 5 participants


  I wonder if I would be able to collect data from such page using Python Comment Holder <commentholder@gmail.com> - 2013-08-21 07:55 -0700
    Re: I wonder if I would be able to collect data from such page using Python Joel Goldstick <joel.goldstick@gmail.com> - 2013-08-21 11:30 -0400
      Re: I wonder if I would be able to collect data from such page using Python Comment Holder <commentholder@gmail.com> - 2013-08-21 08:44 -0700
        Re: I wonder if I would be able to collect data from such page using Python Joel Goldstick <joel.goldstick@gmail.com> - 2013-08-21 11:58 -0400
          Re: I wonder if I would be able to collect data from such page using Python Comment Holder <commentholder@gmail.com> - 2013-08-21 10:41 -0700
            Re: I wonder if I would be able to collect data from such page using Python Joel Goldstick <joel.goldstick@gmail.com> - 2013-08-21 13:52 -0400
            Re: I wonder if I would be able to collect data from such page using Python Terry Reedy <tjreedy@udel.edu> - 2013-08-21 15:18 -0400
              Re: I wonder if I would be able to collect data from such page using Python Comment Holder <commentholder@gmail.com> - 2013-08-22 07:58 -0700
    Re: I wonder if I would be able to collect data from such page using Python Piet van Oostrum <piet@vanoostrum.org> - 2013-08-22 00:54 -0400
      Re: I wonder if I would be able to collect data from such page using Python Comment Holder <commentholder@gmail.com> - 2013-08-22 08:03 -0700
        Re: I wonder if I would be able to collect data from such page using Python Chris Angelico <rosuav@gmail.com> - 2013-08-23 01:11 +1000

#52767 — I wonder if I would be able to collect data from such page using Python

From	Comment Holder <commentholder@gmail.com>
Date	2013-08-21 07:55 -0700
Subject	I wonder if I would be able to collect data from such page using Python
Message-ID	<a50210f8-8959-46da-a386-2d9a7a17a79e@googlegroups.com>

Hi,
I am totally new to Python. I have noticed that there are many videos showing how to collect data using Python, but I am not sure whether I would be able to accomplish my goal with it, and I would like to know before I start learning.

Here is the example of the target page:
http://and.medianewsonline.com/hello.html
In this example, there are 10 articles.

What I need to do, exactly, is the following:
1- Collect the article title, date, source, and contents.
2- Export the final results to Excel or a database client. That is, I need all of the items from step 1 in one row per article, with each item saved in a separate column. For example:

Title1    Date1   Source1   Contents1
Title2    Date2   Source2   Contents2

I would appreciate any advice regarding my case.

Thanks & Regards//


#52768

From	Joel Goldstick <joel.goldstick@gmail.com>
Date	2013-08-21 11:30 -0400
Message-ID	<mailman.81.1377099024.19984.python-list@python.org>
In reply to	#52767

On Wed, Aug 21, 2013 at 10:55 AM, Comment Holder
<commentholder@gmail.com> wrote:
> Hi,
> I am totally new to Python. I noticed that there are many videos showing how to collect data from Python, but I am not sure if I would be able to accomplish my goal using Python so I can start learning.
>
> Here is the example of the target page:
> http://and.medianewsonline.com/hello.html
> In this example, there are 10 articles.
>
> What I exactly need is to do the following:
> 1- Collect the article title, date, source, and contents.
> 2- I need to be able to export the final results to excel or a database client. That is, I need to have all of those specified in step 1 in one row, while each of them saved in separate column. For example:
>
> Title1    Date1   Source1   Contents1
> Title2    Date2   Source2   Contents2
>
> I appreciate any advise regarding my case.
>
> Thanks & Regards//

I'm guessing that you are not only new to Python, but that you haven't
much experience in writing computer programs at all.  So, you need to
do that.  There is a good tutorial on the python site, and lots of
links to other resources.

Then do this:

1. Write code to access the page you require.  The Requests module can
help with that.
2. Write code to select the data you want.  The BeautifulSoup module
is excellent for this.
3. Write code to save your data in comma-separated value (CSV) format.
4. Import the file into Excel or wherever.
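
As a rough illustration of steps 2 and 3, here is a minimal sketch that uses only the standard library (html.parser stands in for BeautifulSoup so that it runs without installing anything). The sample HTML, the assumed layout (an h2 for each title, p tags for the body), and the names ArticleParser and to_csv are all assumptions made for illustration; the real page's markup will almost certainly differ.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup: the real page's structure is not shown in this thread.
SAMPLE_HTML = """
<div><h2>Title1</h2><p>First paragraph.</p><p>Second paragraph.</p></div>
<div><h2>Title2</h2><p>Only paragraph.</p></div>
"""


class ArticleParser(HTMLParser):
    """Collect articles, assuming <h2> holds a title and <p> holds body text."""

    def __init__(self):
        super().__init__()
        self.articles = []  # one dict per article
        self._tag = None    # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            # Each title starts a new article record.
            self.articles.append({"title": "", "paragraphs": []})
            self._tag = "h2"
        elif tag == "p" and self.articles:
            self.articles[-1]["paragraphs"].append("")
            self._tag = "p"

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        if self._tag == "h2":
            self.articles[-1]["title"] += data
        elif self._tag == "p":
            self.articles[-1]["paragraphs"][-1] += data


def to_csv(articles):
    """Return CSV text with a header row and one row per article."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["title", "contents"])
    for art in articles:
        writer.writerow([art["title"], " ".join(art["paragraphs"])])
    return out.getvalue()


parser = ArticleParser()
parser.feed(SAMPLE_HTML)
csv_text = to_csv(parser.articles)
print(csv_text)
```

Writing csv_text to a file with a .csv extension would give something Excel can open; date and source columns would be added the same way once the real markup is known.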

Now, go off and write the code.  When you get stuck, copy and paste
the portion of the code that is giving you problems, along with the
traceback.  You can also get help at the python-tutor mailing list

-- 
Joel Goldstick
http://joelgoldstick.com


#52769

From	Comment Holder <commentholder@gmail.com>
Date	2013-08-21 08:44 -0700
Message-ID	<bfd5cc17-8901-47b4-944f-7841c8d7cc15@googlegroups.com>
In reply to	#52768

Many thanks Joel,

You are right to some extent. I come from a finance background, but I am very familiar with what could be called non-native languages such as Matlab and VBA. In fact, I have developed a couple of complete programs.

I asked this question because I am a little worried about the structure of this particular page, as there are no specifically defined classes.

I know how powerful Python is, but I wonder if it could do the job with this particular page.

Again, many thanks Joel, I appreciate your guidance.
All Best//


#52770

From	Joel Goldstick <joel.goldstick@gmail.com>
Date	2013-08-21 11:58 -0400
Message-ID	<mailman.83.1377100719.19984.python-list@python.org>
In reply to	#52769

On Wed, Aug 21, 2013 at 11:44 AM, Comment Holder
<commentholder@gmail.com> wrote:
> Many thanks Joel,
>
> You are right to some extent. I come from Finance background, but I am very familiar with what could be referred to as non-native languages such as Matlab, VBA,.. actually, I have developed couple of complete programs.
>
> I have asked this question, because I am a little worried about the structure of this particular page, as there are no specific defined classes.
>
> I know how powerful Python is, but I wonder if it could do the job with this particular page.
>
> Again, many thanks Joel, I appreciate your guidance.
> All Best//

Your biggest hurdle will be getting proficient with Python.  Give
yourself a weekend with a good tutorial.  You won't be very skilled,
but you will get the gist of things.

Also, google Beautiful Soup.  You need the latest version.  It's v4, I
think.  They have a GREAT tutorial.  Spend a few hours with it and you
will see your way to getting the data you want from your web pages.

Since you gave a sample web page, I am guessing that you need to log
in to the site for 'real data'.  For that, you need to really
understand stuff that you might not.  At any rate, study the Requests
module documentation.  Python comes with urllib and urllib2, which
cover the same ground, but Requests is a lot simpler to understand.
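
For a sense of what the page-fetching step involves, here is a minimal sketch using the standard library's urllib.request (the Python 3 home of what urllib and urllib2 provided). It assumes the page needs no login; fetch_html is a hypothetical helper name, and the User-Agent value is just a placeholder. With Requests, the rough equivalent would be requests.get(url).text.

```python
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Download a URL and return its body as text (hypothetical helper)."""
    # Some servers reject requests that carry no User-Agent header.
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Example call (needs network access):
# html = fetch_html("http://and.medianewsonline.com/hello.html")
```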

-- 
Joel Goldstick
http://joelgoldstick.com


#52776

From	Comment Holder <commentholder@gmail.com>
Date	2013-08-21 10:41 -0700
Message-ID	<02caf0a8-1506-4746-9136-3452cbdea14b@googlegroups.com>
In reply to	#52770

Dear Joel,

Many thanks for your help - I think I shall start this way and see how it goes. My concern was whether the task could be accomplished with Python, and from your posts I gather it can - so I shall give it a try :).

Again, thanks a lot & all best//


#52777

From	Joel Goldstick <joel.goldstick@gmail.com>
Date	2013-08-21 13:52 -0400
Message-ID	<mailman.89.1377107547.19984.python-list@python.org>
In reply to	#52776

On Wed, Aug 21, 2013 at 1:41 PM, Comment Holder <commentholder@gmail.com> wrote:
> Dear Joel,
>
> Many thanks for your help - I think I shall start with this way and see how it goes. My concerns were if the task can be accomplished with Python, and from your posts, I guess it can - so I shall give it a try :).
>
> Again, thanks a lot & all best//

You're welcome.  One thought popped into my mind.  Since the site
seems to be from the Wall Street Journal, you may want to look into
whether they have an API for searching and retrieving articles.  If
they do, this would be simpler and probably safer than parsing web
pages.  From time to time, websites change their layout, which would
probably break your program.  APIs, however, tend to be more stable.

Good luck to you.
-- 
Joel Goldstick
http://joelgoldstick.com


#52786

From	Terry Reedy <tjreedy@udel.edu>
Date	2013-08-21 15:18 -0400
Message-ID	<mailman.97.1377112715.19984.python-list@python.org>
In reply to	#52776

On 8/21/2013 1:52 PM, Joel Goldstick wrote:
> On Wed, Aug 21, 2013 at 1:41 PM, Comment Holder <commentholder@gmail.com> wrote:

>> Many thanks for your help - I think I shall start with this way and see how it goes. My concerns were if the task can be accomplished with Python, and from your posts, I guess it can - so I shall give it a try :).

CM: You still seem a bit doubtful. If you are wondering why no one else 
has answered, it is because Joel has given you a really good answer that 
cannot be beat without writing your code for you.

> You're welcome.  One thought popped into my mind.  Since the site
> seems to be from the Wall Street Journal, you may want to look into
> whether they have an api for searching and retrieving articles.  If
> they do, this would be simpler and probably safer than parsing web
> pages.  From time to time, websites change their layout, which would
> probably break your program.  However APIs are more stable

Including this suggestion, which I did not think of.

-- 
Terry Jan Reedy


#52834

From	Comment Holder <commentholder@gmail.com>
Date	2013-08-22 07:58 -0700
Message-ID	<c650302e-1c31-44bd-bf9f-96ae90926691@googlegroups.com>
In reply to	#52786

Dear Terry,

Many thanks for your comments. Actually, I was, because the target page doesn't have a neat structure. But after all of your contributions, I think the task can be achieved very well with Python.

Thanks again & all best//


#52809

From	Piet van Oostrum <piet@vanoostrum.org>
Date	2013-08-22 00:54 -0400
Message-ID	<m2haeiiaur.fsf@cochabamba.vanoostrum.org>
In reply to	#52767

[Multipart message: attachments visible in raw view]

> Hi,
> I am totally new to Python. I noticed that there are many videos showing how to collect data from Python, but I am not sure if I would be able to accomplish my goal using Python so I can start learning.
>
> Here is the example of the target page:
> http://and.medianewsonline.com/hello.html
> In this example, there are 10 articles.
>
> What I exactly need is to do the following:
> 1- Collect the article title, date, source, and contents.
> 2- I need to be able to export the final results to excel or a database client. That is, I need to have all of those specified in step 1 in one row, while each of them saved in separate column. For example:
>
> Title1    Date1   Source1   Contents1
> Title2    Date2   Source2   Contents2
>
> I appreciate any advise regarding my case. 
>
> Thanks & Regards//

Here is an attempt for you. It uses BeautifulSoup 4. It is written in Python 3.3, so if you want to use Python 2.x you will have to make some small changes, like
from urllib import urlopen
and probably something with the print statements.
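
For the import difference mentioned above, one common pattern is to try the Python 3 location first and fall back to the Python 2 one. This is only a sketch of that idea, not the attached code, which is not reproduced in this view.

```python
try:
    # Python 3: urlopen lives in urllib.request.
    from urllib.request import urlopen
except ImportError:
    # Python 2: urlopen is importable directly from urllib.
    from urllib import urlopen

# Calling print with one parenthesised argument works on both versions.
print("urlopen imported from " + urlopen.__module__)
```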

The formatting in columns is left as an exercise for you. I wonder how you would want that with multiparagraph contents.


#52835

From	Comment Holder <commentholder@gmail.com>
Date	2013-08-22 08:03 -0700
Message-ID	<d1390c28-91d2-46f5-aff0-7703b8a165e3@googlegroups.com>
In reply to	#52809

Dear Piet,

Many thanks for your assistance. It is much appreciated. I have just installed Python 3.3.2 and BeautifulSoup 4.3.1. I tried running the code, but ran into some syntax errors.

>  I wonder how you would want that with multiparagraph contents.

I am looking to save all the paragraphs of an article in one field, so that the subsequent analysis becomes easier.
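
A small sketch of what "all paragraphs in one field" could look like with the standard csv module, assuming the paragraphs have already been extracted into lists. The rows below are placeholder data, not real articles. The csv writer quotes any field that contains a newline, so each article remains a single record.

```python
import csv
import io

# Placeholder data: (title, date, source, list of paragraphs).
rows = [
    ("Title1", "Date1", "Source1", ["Paragraph one.", "Paragraph two."]),
    ("Title2", "Date2", "Source2", ["Single paragraph."]),
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["title", "date", "source", "contents"])
for title, date, source, paragraphs in rows:
    # Join the paragraphs with newlines so they share one field.
    writer.writerow([title, date, source, "\n".join(paragraphs)])

# Reading the text back: one header plus one record per article.
records = list(csv.reader(io.StringIO(buffer.getvalue(), newline="")))
print(len(records))
```

Joining with a space instead of a newline would keep every record on a single physical line, which some tools handle more gracefully.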

As I am new, I won't ask for assistance before I get some general idea about Python. I shall dedicate the weekend to this purpose, or at least Sunday. Once I am done, I will post my results back here.

Thanks again & all best//


#52836

From	Chris Angelico <rosuav@gmail.com>
Date	2013-08-23 01:11 +1000
Message-ID	<mailman.135.1377184301.19984.python-list@python.org>
In reply to	#52835

On Fri, Aug 23, 2013 at 1:03 AM, Comment Holder <commentholder@gmail.com> wrote:
> As I am new, I won't ask for assistance before I get some general idea about Python. I shall dedicate the weekend for this purpose, or at least Sunday. Once I am done, I will post my results back in here.


Smart move :) I strongly recommend the inbuilt tutorial, if you
haven't seen it already:

http://docs.python.org/3/tutorial/

And you're using the current version, which is good. Saves the hassle
of figuring out what's different in an old version.

All the best!

ChrisA
