| Started by | subhabangalore@gmail.com |
|---|---|
| First post | 2014-06-29 03:49 -0700 |
| Last post | 2014-06-30 23:22 +0000 |
| Articles | 8 — 5 participants |
Writing Multiple files at a times subhabangalore@gmail.com - 2014-06-29 03:49 -0700
Re: Writing Multiple files at a times Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-06-29 12:17 +0100
Re:Writing Multiple files at a times Dave Angel <davea@davea.name> - 2014-06-29 09:33 -0400
Re: Writing Multiple files at a times Roy Smith <roy@panix.com> - 2014-06-29 10:01 -0400
Re: Writing Multiple files at a times subhabangalore@gmail.com - 2014-06-29 10:32 -0700
Re: Writing Multiple files at a times Denis McMahon <denismfmcmahon@gmail.com> - 2014-06-29 19:21 +0000
Re: Writing Multiple files at a times subhabangalore@gmail.com - 2014-06-30 12:23 -0700
Re: Writing Multiple files at a times Denis McMahon <denismfmcmahon@gmail.com> - 2014-06-30 23:22 +0000
| From | subhabangalore@gmail.com |
|---|---|
| Date | 2014-06-29 03:49 -0700 |
| Subject | Writing Multiple files at a times |
| Message-ID | <b8951113-5171-4441-b490-4d731eb56cec@googlegroups.com> |
Dear Group,

I am trying to crawl multiple URLs. As they are coming I want to write them as string, as they are coming, preferably in a queue.

If any one of the esteemed members of the group may kindly help.

Regards,
Subhabrata Banerjee.
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2014-06-29 12:17 +0100 |
| Message-ID | <mailman.11324.1404040636.18130.python-list@python.org> |
| In reply to | #73727 |
On 29/06/2014 11:49, subhabangalore@gmail.com wrote:
> Dear Group,
>
> I am trying to crawl multiple URLs. As they are coming I want to write them as string, as they are coming, preferably in a queue.
>
> If any one of the esteemed members of the group may kindly help.
>
> Regards,
> Subhabrata Banerjee.

https://docs.python.org/3/library/urllib.html#module-urllib
https://pypi.python.org/pypi/requests/2.3.0
https://docs.python.org/3/library/queue.html

--
My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language.
Mark Lawrence
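The three links point at fetching (urllib, requests) and at the queue module. One way they might fit together is sketched below in Python 3: the main loop fetches pages and puts them on a `queue.Queue`, and a single writer thread takes them off and writes each to its own file. The names `writer`, `crawl`, the `fetch` callable and the `pageNNNNN.txt` naming are assumptions made for illustration, not something taken from the thread.

```python
import queue
import threading
from pathlib import Path


def writer(jobs, out_dir):
    """Take (name, text) pairs off the queue and write each to its own file."""
    while True:
        item = jobs.get()
        if item is None:  # sentinel: no more work is coming
            break
        name, text = item
        (out_dir / name).write_text(text, encoding="utf-8")


def crawl(urls, fetch, out_dir):
    """Fetch every URL and hand the result to a single writer thread.

    `fetch` is any callable that takes a URL and returns a string
    (hypothetical here; it could wrap urllib or requests).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    jobs = queue.Queue()
    thread = threading.Thread(target=writer, args=(jobs, out_dir))
    thread.start()
    for n, url in enumerate(urls):
        jobs.put(("page{:05d}.txt".format(n), fetch(url)))
    jobs.put(None)  # tell the writer to stop
    thread.join()
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library, and lets it be tried out with a stand-in function before pointing it at real URLs.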
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2014-06-29 09:33 -0400 |
| Message-ID | <mailman.11325.1404048700.18130.python-list@python.org> |
| In reply to | #73727 |
subhabangalore@gmail.com wrote in message:
> Dear Group,
>
> I am trying to crawl multiple URLs. As they are coming I want to write them as string, as they are coming, preferably in a queue.
>
> If any one of the esteemed members of the group may kindly help.
>

From your subject line, it appears you want to keep multiple files open, and write to each in an arbitrary order. That's no problem, up to the operating system limits.

Define a class that holds the URL information and, for each instance, add an attribute for an output file handle.

Don't forget to close each file when you're done with the corresponding URL.

--
DaveA
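A minimal Python 3 sketch of that suggestion, under stated assumptions: the class name `UrlJob`, its attributes, and the helper `write_interleaved` are hypothetical names chosen for illustration. Each instance holds a URL plus an open output file, data can be written to the instances in any order, and every file is closed at the end.

```python
class UrlJob:
    """Hypothetical holder for one URL and the file its data is written to."""

    def __init__(self, url, path):
        self.url = url
        self.out = open(path, "w", encoding="utf-8")  # output file handle

    def write(self, text):
        self.out.write(text)

    def close(self):
        self.out.close()


def write_interleaved(jobs, chunks):
    """Write (index, text) chunks to the matching job, in any order.

    All files are closed afterwards, even if a write fails.
    """
    try:
        for i, text in chunks:
            jobs[i].write(text)
    finally:
        for job in jobs:
            job.close()
```

The number of `UrlJob` instances that can be open at once is bounded by the operating system's limit on open files, as noted above.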
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2014-06-29 10:01 -0400 |
| Message-ID | <roy-AFAF93.10013729062014@news.panix.com> |
| In reply to | #73729 |
In article <mailman.11325.1404048700.18130.python-list@python.org>, Dave Angel <davea@davea.name> wrote:
> subhabangalore@gmail.com Wrote in message:
> > Dear Group,
> >
> > I am trying to crawl multiple URLs. As they are coming I want to write them
> > as string, as they are coming, preferably in a queue.
> >
> > If any one of the esteemed members of the group may kindly help.
> >
>
> From your subject line, it appears you want to keep multiple files open,
> and write to each in an arbitrary order. That's no problem, up to the
> operating system limits. Define a class that holds the URL information and
> for each instance, add an attribute for an output file handle.
>
> Don't forget to close each file when you're done with the corresponding URL.

One other thing to mention is that if you're doing anything with fetching URLs from Python, you almost certainly want to be using Kenneth Reitz's excellent requests module (http://docs.python-requests.org/). The built-in urllib support in Python works, but requests is so much simpler to use.
| From | subhabangalore@gmail.com |
|---|---|
| Date | 2014-06-29 10:32 -0700 |
| Message-ID | <6bf28329-9983-4024-ac90-374fb11ac854@googlegroups.com> |
| In reply to | #73730 |
On Sunday, June 29, 2014 7:31:37 PM UTC+5:30, Roy Smith wrote:
> In article <mailman.11325.1404048700.18130.python-list@python.org>,
> Dave Angel <davea@davea.name> wrote:
>
> > subhabangalore@gmail.com Wrote in message:
> > > Dear Group,
> > >
> > > I am trying to crawl multiple URLs. As they are coming I want to write them
> > > as string, as they are coming, preferably in a queue.
> > >
> > > If any one of the esteemed members of the group may kindly help.
> > >
> >
> > From your subject line, it appears you want to keep multiple files open,
> > and write to each in an arbitrary order. That's no problem, up to the
> > operating system limits. Define a class that holds the URL information and
> > for each instance, add an attribute for an output file handle.
> >
> > Don't forget to close each file when you're done with the corresponding URL.
>
> One other thing to mention is that if you're doing anything with
> fetching URLs from Python, you almost certainly want to be using Kenneth
> Reitz's excellent requests module (http://docs.python-requests.org/).
> The built-in urllib support in Python works, but requests is so much
> simpler to use.

Dear Group,
Sorry if I miscommunicated.
I am opening multiple URLs with urllib.open. Each URL returns a huge HTML source file. As these files are read, I am trying to concatenate them and put them in one txt file as a string.
From this big txt file I am then trying to take out the HTML body of each URL and to write and store each one separately, with attempts like:

    for i, line in enumerate(file1):
        f = open("/python27/newfile_%i.txt" % i, 'w')
        f.write(line)
        f.close()

This is generally not much of an issue, but I was thinking of some better options.
Regards,
Subhabrata Banerjee.
| From | Denis McMahon <denismfmcmahon@gmail.com> |
|---|---|
| Date | 2014-06-29 19:21 +0000 |
| Message-ID | <lopp03$91i$1@dont-email.me> |
| In reply to | #73734 |
On Sun, 29 Jun 2014 10:32:00 -0700, subhabangalore wrote:
> I am opening multiple URLs with urllib.open, now one Url has huge html
> source files, like that each one has. As these files are read I am
> trying to concatenate them and put in one txt file as string.
> From this big txt file I am trying to take out each html file body of
> each URL and trying to write and store them
OK, let me clarify what I think you said.
First you concatenate all the web pages into a single file.
Then you extract all the page bodies from the single file and save them
as separate files.
This seems a silly way to do things. Why don't you just save each html
body section as you receive it?
This sounds like it should be something as simple as:

    from BeautifulSoup import BeautifulSoup
    import requests

    urlList = [
        "http://something/",
        "http://something/",
        "http://something/",
        # ....... more URLs
    ]

    n = 0
    for url in urlList:
        r = requests.get( url )
        soup = BeautifulSoup( r.content )
        body = soup.find( "body" )
        fp = open( "scraped/body{:0>5d}.htm".format( n ), "w" )
        fp.write( body.prettify() )
        fp.close()  # must be called; a bare fp.close does nothing
        n += 1

will give you:
scraped/body00000.htm
scraped/body00001.htm
scraped/body00002.htm
........
for as many urls as you have in your url list. (make sure the target
directory exists!)
--
Denis McMahon, denismfmcmahon@gmail.com
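
The file-writing half of the loop above could also be written in Python 3 roughly as follows. This is only a sketch: the function name `save_bodies` and its parameters are hypothetical, and it assumes the page bodies have already been fetched and extracted as strings. It uses `enumerate` instead of a manual counter, a `with` block so each file is closed automatically, and `os.makedirs` so the target directory does not have to exist beforehand.

```python
import os


def save_bodies(bodies, out_dir="scraped"):
    """Write each string in `bodies` to out_dir/bodyNNNNN.htm; return the paths."""
    os.makedirs(out_dir, exist_ok=True)  # create the target directory if missing
    paths = []
    for n, body in enumerate(bodies):  # enumerate replaces the manual n += 1
        path = os.path.join(out_dir, "body{:0>5d}.htm".format(n))
        with open(path, "w", encoding="utf-8") as fp:  # closed on leaving the block
            fp.write(body)
        paths.append(path)
    return paths
```

Called as `save_bodies(list_of_body_strings)`, it writes into a `scraped` directory below the current working directory, matching the naming scheme in the post.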
| From | subhabangalore@gmail.com |
|---|---|
| Date | 2014-06-30 12:23 -0700 |
| Message-ID | <29f99276-9609-42cd-9210-5bc5e75aa364@googlegroups.com> |
| In reply to | #73727 |
On Sunday, June 29, 2014 4:19:27 PM UTC+5:30, subhaba...@gmail.com wrote:
> Dear Group,
>
> I am trying to crawl multiple URLs. As they are coming I want to write them as string, as they are coming, preferably in a queue.
>
> If any one of the esteemed members of the group may kindly help.
>
> Regards,
> Subhabrata Banerjee.

Dear Group,
Thank you for your kind suggestion. But I am not being able to sort out,
"fp = open( "scraped/body{:0>5d}.htm".format( n ), "w" ) "
please suggest.
Regards,
Subhabrata Banerjee.
| From | Denis McMahon <denismfmcmahon@gmail.com> |
|---|---|
| Date | 2014-06-30 23:22 +0000 |
| Message-ID | <losrf2$c6i$2@dont-email.me> |
| In reply to | #73758 |
On Mon, 30 Jun 2014 12:23:08 -0700, subhabangalore wrote:
> Thank you for your kind suggestion. But I am not being able to sort out,
> "fp = open( "scraped/body{:0>5d}.htm".format( n ), "w" ) "
> please suggest.
Look up the Python manual for the str.format() method and the open() function.
The line indicated opens a file for writing. Its name is generated by
str.format(), which inserts the number n, formatted to 5 digits with
leading zeroes, into the string "scraped/bodyN.htm" in place of N.
It expects you to have a subdir called "scraped" below the dir you're
executing the code in.
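
As a small Python 3 illustration of how the name in the indicated line is built, the sketch below wraps the same format string in a helper. The helper name `body_filename` and its `directory` parameter are made up for this example. It also checks two properties of the `{:0>5d}` specification: for non-negative integers it gives the same result as the shorter `{:05d}`, and the width of 5 is a minimum, so longer numbers are not truncated.

```python
def body_filename(n, directory="scraped"):
    """Return the file name the quoted line would open for page number n."""
    return "{}/body{:0>5d}.htm".format(directory, n)


# Names for a few page numbers.
names = [body_filename(n) for n in (0, 7, 123)]

# "0>5d" means: pad with "0", right-align, minimum width 5, decimal integer.
same_as_short_form = all(
    "{:0>5d}".format(n) == "{:05d}".format(n) for n in range(1000)
)

# The width is only a minimum; a six-digit number keeps all its digits.
wide = "{:0>5d}".format(123456)
```

Here `names` holds `scraped/body00000.htm`, `scraped/body00007.htm` and `scraped/body00123.htm`.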
Also, this newsgroup is *NOT* a substitute for reading the manual for
basic python functions and methods.
Finally, if you don't understand basic string and file handling in
python, why on earth are you trying to write code that arguably needs a
level of competence in both? Perhaps as your starter project you should
try something simpler, print "hello world" is traditional.
To understand the string formatting, try:

    print("hello {:0>5d} world".format( 5 ))
    print("hello {:0>5d} world".format( 50 ))
    print("hello {:0>5d} world".format( 500 ))
    print("hello {:0>5d} world".format( 5000 ))
    print("hello {:0>5d} world".format( 50000 ))

--
Denis McMahon, denismfmcmahon@gmail.com