Groups > comp.lang.python > #52767 > unrolled thread
| Started by | Comment Holder <commentholder@gmail.com> |
|---|---|
| First post | 2013-08-21 07:55 -0700 |
| Last post | 2013-08-23 01:11 +1000 |
| Articles | 11 — 5 participants |
I wonder if I would be able to collect data from such page using Python Comment Holder <commentholder@gmail.com> - 2013-08-21 07:55 -0700
Re: I wonder if I would be able to collect data from such page using Python Joel Goldstick <joel.goldstick@gmail.com> - 2013-08-21 11:30 -0400
Re: I wonder if I would be able to collect data from such page using Python Comment Holder <commentholder@gmail.com> - 2013-08-21 08:44 -0700
Re: I wonder if I would be able to collect data from such page using Python Joel Goldstick <joel.goldstick@gmail.com> - 2013-08-21 11:58 -0400
Re: I wonder if I would be able to collect data from such page using Python Comment Holder <commentholder@gmail.com> - 2013-08-21 10:41 -0700
Re: I wonder if I would be able to collect data from such page using Python Joel Goldstick <joel.goldstick@gmail.com> - 2013-08-21 13:52 -0400
Re: I wonder if I would be able to collect data from such page using Python Terry Reedy <tjreedy@udel.edu> - 2013-08-21 15:18 -0400
Re: I wonder if I would be able to collect data from such page using Python Comment Holder <commentholder@gmail.com> - 2013-08-22 07:58 -0700
Re: I wonder if I would be able to collect data from such page using Python Piet van Oostrum <piet@vanoostrum.org> - 2013-08-22 00:54 -0400
Re: I wonder if I would be able to collect data from such page using Python Comment Holder <commentholder@gmail.com> - 2013-08-22 08:03 -0700
Re: I wonder if I would be able to collect data from such page using Python Chris Angelico <rosuav@gmail.com> - 2013-08-23 01:11 +1000
| From | Comment Holder <commentholder@gmail.com> |
|---|---|
| Date | 2013-08-21 07:55 -0700 |
| Subject | I wonder if I would be able to collect data from such page using Python |
| Message-ID | <a50210f8-8959-46da-a386-2d9a7a17a79e@googlegroups.com> |
Hi,

I am totally new to Python. I noticed that there are many videos showing how to collect data with Python, but I am not sure whether I would be able to accomplish my goal using Python, so that I can start learning.

Here is an example of the target page:
http://and.medianewsonline.com/hello.html
In this example, there are 10 articles.

What I need to do is the following:
1- Collect the article title, date, source, and contents.
2- Export the final results to Excel or a database client. That is, I need everything specified in step 1 in one row, with each item saved in a separate column. For example:

Title1 Date1 Source1 Contents1
Title2 Date2 Source2 Contents2

I appreciate any advice regarding my case.

Thanks & Regards//
| From | Joel Goldstick <joel.goldstick@gmail.com> |
|---|---|
| Date | 2013-08-21 11:30 -0400 |
| Message-ID | <mailman.81.1377099024.19984.python-list@python.org> |
| In reply to | #52767 |
On Wed, Aug 21, 2013 at 10:55 AM, Comment Holder <commentholder@gmail.com> wrote:
> Hi,
> I am totally new to Python. I noticed that there are many videos showing how to collect data with Python, but I am not sure whether I would be able to accomplish my goal using Python, so that I can start learning.
>
> Here is an example of the target page:
> http://and.medianewsonline.com/hello.html
> In this example, there are 10 articles.
>
> What I need to do is the following:
> 1- Collect the article title, date, source, and contents.
> 2- Export the final results to Excel or a database client. That is, I need everything specified in step 1 in one row, with each item saved in a separate column. For example:
>
> Title1 Date1 Source1 Contents1
> Title2 Date2 Source2 Contents2
>
> I appreciate any advice regarding my case.
>
> Thanks & Regards//

I'm guessing that you are not only new to Python, but that you haven't much experience in writing computer programs at all. So, you need to do that. There is a good tutorial on the Python site, and lots of links to other resources.

Then do this:

1. Write code to access the page you require. The Requests module can help with that.
2. Write code to select the data you want. The BeautifulSoup module is excellent for this.
3. Write code to save your data in comma-separated value (CSV) format.
4. Import into Excel or wherever.

Now, go off and write the code. When you get stuck, copy and paste the portion of the code that is giving you problems, along with the traceback.

You can also get help at the python-tutor mailing list.

--
Joel Goldstick
http://joelgoldstick.com
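Steps 3 and 4 of the list above can be sketched with the standard-library csv module. This is only an illustration, not code from the thread: the function name `articles_to_csv`, the lowercase field names, and the sample records are all made up, and the rows are written to an in-memory buffer rather than a real file.

```python
import csv
import io

# Column order taken from the original question.
FIELDS = ["title", "date", "source", "contents"]


def articles_to_csv(articles, out):
    """Write a header row plus one row per article to the file-like `out`."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for article in articles:
        writer.writerow(article)


# Made-up records standing in for the scraped data.
sample = [
    {"title": "Title1", "date": "Date1", "source": "Source1",
     "contents": "Contents1"},
    {"title": "Title2", "date": "Date2", "source": "Source2",
     "contents": "Contents, with a comma"},
]

buffer = io.StringIO()
articles_to_csv(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To produce a file that Excel can open, the same function could be given a file opened with `open("articles.csv", "w", newline="")` in place of the buffer; the file name is likewise an assumption.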
| From | Comment Holder <commentholder@gmail.com> |
|---|---|
| Date | 2013-08-21 08:44 -0700 |
| Message-ID | <bfd5cc17-8901-47b4-944f-7841c8d7cc15@googlegroups.com> |
| In reply to | #52768 |
Many thanks Joel,

You are right to some extent. I come from a finance background, but I am very familiar with what could be referred to as non-native languages such as MATLAB and VBA. In fact, I have developed a couple of complete programs.

I asked this question because I am a little worried about the structure of this particular page, as there are no specifically defined classes.

I know how powerful Python is, but I wonder if it could do the job with this particular page.

Again, many thanks Joel, I appreciate your guidance.
All Best//
| From | Joel Goldstick <joel.goldstick@gmail.com> |
|---|---|
| Date | 2013-08-21 11:58 -0400 |
| Message-ID | <mailman.83.1377100719.19984.python-list@python.org> |
| In reply to | #52769 |
On Wed, Aug 21, 2013 at 11:44 AM, Comment Holder <commentholder@gmail.com> wrote:
> Many thanks Joel,
>
> You are right to some extent. I come from a finance background, but I am very familiar with what could be referred to as non-native languages such as MATLAB and VBA. In fact, I have developed a couple of complete programs.
>
> I asked this question because I am a little worried about the structure of this particular page, as there are no specifically defined classes.
>
> I know how powerful Python is, but I wonder if it could do the job with this particular page.
>
> Again, many thanks Joel, I appreciate your guidance.
> All Best//

Your biggest hurdle will be to get proficient with Python. Give yourself a weekend with a good tutorial. You won't be very skilled, but you will get the gist of things.

Also, google Beautiful Soup. You need the latest version. It's v4, I think. They have a GREAT tutorial. Spend a few hours with it and you will see your way to getting the data you want from your web pages.

Since you gave a sample web page, I am guessing that you need to log in to the site for 'real data'. For that, you need to really understand stuff that you might not. At any rate, study the Requests module documentation. Python comes with urllib and urllib2, which cover the same ground, but Requests is a lot simpler to understand.

--
Joel Goldstick
http://joelgoldstick.com
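As a rough illustration of the fetching step, here is a minimal sketch that uses the standard library's urllib.request (the Python 3 home of what urllib and urllib2 did in Python 2) instead of Requests, only so that it runs without installing anything. The function name `fetch_html`, the 10-second timeout, and the UTF-8 fallback are assumptions made for the example. It does not handle logging in to a site.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download `url` and return the response body decoded to text."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset the server declares; fall back to UTF-8
        # (an assumption) when none is given.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


# Example call (needs network access; the page from the original post
# may no longer be online):
# html = fetch_html("http://and.medianewsonline.com/hello.html")
```

With Requests installed, the rough equivalent would be `requests.get(url).text`, which is the simpler route recommended above.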
| From | Comment Holder <commentholder@gmail.com> |
|---|---|
| Date | 2013-08-21 10:41 -0700 |
| Message-ID | <02caf0a8-1506-4746-9136-3452cbdea14b@googlegroups.com> |
| In reply to | #52770 |
Dear Joel,

Many thanks for your help - I think I shall start this way and see how it goes. My concern was whether the task can be accomplished with Python, and from your posts, I guess it can - so I shall give it a try :).

Again, thanks a lot & all best//
| From | Joel Goldstick <joel.goldstick@gmail.com> |
|---|---|
| Date | 2013-08-21 13:52 -0400 |
| Message-ID | <mailman.89.1377107547.19984.python-list@python.org> |
| In reply to | #52776 |
On Wed, Aug 21, 2013 at 1:41 PM, Comment Holder <commentholder@gmail.com> wrote:
> Dear Joel,
>
> Many thanks for your help - I think I shall start this way and see how it goes. My concern was whether the task can be accomplished with Python, and from your posts, I guess it can - so I shall give it a try :).
>
> Again, thanks a lot & all best//

You're welcome. One thought popped into my mind. Since the site seems to be from the Wall Street Journal, you may want to look into whether they have an API for searching and retrieving articles. If they do, this would be simpler and probably safer than parsing web pages. From time to time, websites change their layout, which would probably break your program. However, APIs are more stable.

Good luck to you.

--
Joel Goldstick
http://joelgoldstick.com
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2013-08-21 15:18 -0400 |
| Message-ID | <mailman.97.1377112715.19984.python-list@python.org> |
| In reply to | #52776 |
On 8/21/2013 1:52 PM, Joel Goldstick wrote:
> On Wed, Aug 21, 2013 at 1:41 PM, Comment Holder <commentholder@gmail.com> wrote:
>> Many thanks for your help - I think I shall start with this way and see how it goes. My concerns were if the task can be accomplished with Python, and from your posts, I guess it can - so I shall give it a try :).

CM: You still seem a bit doubtful. If you are wondering why no one else has answered, it is because Joel has given you a really good answer that cannot be beat without writing your code for you.

> You're welcome. One thought popped into my mind. Since the site
> seems to be from the Wall Street Journal, you may want to look into
> whether they have an api for searching and retrieving articles. If
> they do, this would be simpler and probably safer than parsing web
> pages. From time to time, websites change their layout, which would
> probably break your program. However APIs are more stable

Including this suggestion, which I did not think of.

--
Terry Jan Reedy
| From | Comment Holder <commentholder@gmail.com> |
|---|---|
| Date | 2013-08-22 07:58 -0700 |
| Message-ID | <c650302e-1c31-44bd-bf9f-96ae90926691@googlegroups.com> |
| In reply to | #52786 |
Dear Terry, Many thanks for your comments. Actually I was, because the target-page doesn't have a neat structure. But, after all of your contributions, I think the task can be achieved very well with Python. Thanks again & all best//
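A page without class attributes can still be parsed by relying on tag order. The sketch below illustrates the idea under a loudly stated assumption: the real page's markup is not shown in this thread, so the layout used here (an `<h2>` title followed by `<p>` paragraphs) is hypothetical. It also swaps in the standard library's html.parser for the Beautiful Soup recommended earlier, only so that it runs without installing anything. The class name `ArticleParser` is made up.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Group <p> text under the most recent <h2>, ignoring class attributes."""

    def __init__(self):
        super().__init__()
        self.articles = []   # list of {"title": str, "paragraphs": [str]}
        self._tag = None     # tag whose text is currently being collected
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "p"):
            self._tag = tag
            self._text = []

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = " ".join("".join(self._text).split())  # collapse whitespace
        if tag == "h2":
            self.articles.append({"title": text, "paragraphs": []})
        elif self.articles and text:
            self.articles[-1]["paragraphs"].append(text)
        self._tag = None


# Hypothetical markup standing in for the real page.
SAMPLE = """
<h2>First headline</h2>
<p>Opening paragraph.</p>
<p>Second   paragraph.</p>
<h2>Second headline</h2>
<p>Only paragraph.</p>
"""

parser = ArticleParser()
parser.feed(SAMPLE)
parser.close()
for article in parser.articles:
    print(article["title"], len(article["paragraphs"]))
```

For the real page, the tags to watch for would have to be read off its actual HTML source.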
| From | Piet van Oostrum <piet@vanoostrum.org> |
|---|---|
| Date | 2013-08-22 00:54 -0400 |
| Message-ID | <m2haeiiaur.fsf@cochabamba.vanoostrum.org> |
| In reply to | #52767 |
[Multipart message — attachments visible in raw view] — view raw
> Hi,
> I am totally new to Python. I noticed that there are many videos showing how to collect data with Python, but I am not sure whether I would be able to accomplish my goal using Python, so that I can start learning.
>
> Here is an example of the target page:
> http://and.medianewsonline.com/hello.html
> In this example, there are 10 articles.
>
> What I need to do is the following:
> 1- Collect the article title, date, source, and contents.
> 2- Export the final results to Excel or a database client. That is, I need everything specified in step 1 in one row, with each item saved in a separate column. For example:
>
> Title1 Date1 Source1 Contents1
> Title2 Date2 Source2 Contents2
>
> I appreciate any advice regarding my case.
>
> Thanks & Regards//

Here is an attempt for you. It uses BeautifulSoup 4. It is written in Python 3.3, so if you want to use Python 2.x you will have to make some small changes, like from urllib import urlopen and probably something with the print statements.

The formatting in columns is left as an exercise for you. I wonder how you would want that with multiparagraph contents.
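The attached script itself is not reproduced in this view. As a small sketch of the kind of Python 2.x change mentioned above (not taken from the attachment), the import can be written so that it works on both versions:

```python
try:
    # Python 3 location
    from urllib.request import urlopen
except ImportError:
    # Python 2 fallback, the import mentioned above
    from urllib import urlopen
```

For the print statements, Python 2.6 and later accept `from __future__ import print_function` as the first statement of a script, which lets the Python 3 style `print(...)` calls run unchanged.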
| From | Comment Holder <commentholder@gmail.com> |
|---|---|
| Date | 2013-08-22 08:03 -0700 |
| Message-ID | <d1390c28-91d2-46f5-aff0-7703b8a165e3@googlegroups.com> |
| In reply to | #52809 |
Dear Piet,

Many thanks for your assistance. It is much appreciated.

I have just installed Python 3.3.2 and BeautifulSoup 4.3.1. I tried running the code, but ran into some syntax errors.

> I wonder how you would want that with multiparagraph contents.

I am looking to save all the paragraphs of an article in one field, so that the later analysis becomes easier.

As I am new, I won't ask for assistance before I get some general idea about Python. I shall dedicate the weekend to this purpose, or at least Sunday. Once I am done, I will post my results back here.

Thanks again & all best//
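One possible way to keep all paragraphs of an article in a single field is to join them before writing the row; the csv module quotes fields that contain commas or line breaks, so the field survives a round trip. This is a sketch only: the function name `article_row`, the blank-line separator, and the sample data are assumptions.

```python
import csv
import io


def article_row(title, date, source, paragraphs):
    """Return one CSV row with all paragraphs joined into a single field."""
    contents = "\n\n".join(p.strip() for p in paragraphs if p.strip())
    return [title, date, source, contents]


# Made-up article used only to exercise the function.
row = article_row(
    "Title1", "Date1", "Source1",
    ["First paragraph.", "  Second paragraph, with a comma.  ", ""],
)

buffer = io.StringIO()
csv.writer(buffer).writerow(row)

# Reading the text back shows the field survives intact, blank line included.
restored = next(csv.reader(io.StringIO(buffer.getvalue(), newline="")))
print(restored[3])
```

If line breaks inside a cell get in the way of the later analysis, joining with a single space instead of a blank line is an equally simple choice.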
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-08-23 01:11 +1000 |
| Message-ID | <mailman.135.1377184301.19984.python-list@python.org> |
| In reply to | #52835 |
On Fri, Aug 23, 2013 at 1:03 AM, Comment Holder <commentholder@gmail.com> wrote:
> As I am new, I won't ask for assistance before I get some general idea about Python. I shall dedicate the weekend for this purpose, or at least Sunday. Once I am done, I will post my results back in here.

Smart move :) I strongly recommend the inbuilt tutorial, if you haven't seen it already:

http://docs.python.org/3/tutorial/

And you're using the current version, which is good. Saves the hassle of figuring out what's different in an old version.

All the best!

ChrisA