
Writing Multiple files at a times

Started by	subhabangalore@gmail.com
First post	2014-06-29 03:49 -0700
Last post	2014-06-30 23:22 +0000
Articles	8 — 5 participants


  Writing Multiple files at a times subhabangalore@gmail.com - 2014-06-29 03:49 -0700
    Re: Writing Multiple files at a times Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-06-29 12:17 +0100
    Re:Writing Multiple files at a times Dave Angel <davea@davea.name> - 2014-06-29 09:33 -0400
      Re: Writing Multiple files at a times Roy Smith <roy@panix.com> - 2014-06-29 10:01 -0400
        Re: Writing Multiple files at a times subhabangalore@gmail.com - 2014-06-29 10:32 -0700
          Re: Writing Multiple files at a times Denis McMahon <denismfmcmahon@gmail.com> - 2014-06-29 19:21 +0000
    Re: Writing Multiple files at a times subhabangalore@gmail.com - 2014-06-30 12:23 -0700
      Re: Writing Multiple files at a times Denis McMahon <denismfmcmahon@gmail.com> - 2014-06-30 23:22 +0000

#73727 — Writing Multiple files at a times

From	subhabangalore@gmail.com
Date	2014-06-29 03:49 -0700
Subject	Writing Multiple files at a times
Message-ID	<b8951113-5171-4441-b490-4d731eb56cec@googlegroups.com>

Dear Group,

I am trying to crawl multiple URLs. As they are coming I want to write them as string, as they are coming, preferably in a queue. 

If any one of the esteemed members of the group may kindly help.

Regards,
Subhabrata Banerjee.


#73728

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2014-06-29 12:17 +0100
Message-ID	<mailman.11324.1404040636.18130.python-list@python.org>
In reply to	#73727

On 29/06/2014 11:49, subhabangalore@gmail.com wrote:
> Dear Group,
>
> I am trying to crawl multiple URLs. As they are coming I want to write them as string, as they are coming, preferably in a queue.
>
> If any one of the esteemed members of the group may kindly help.
>
> Regards,
> Subhabrata Banerjee.
>

https://docs.python.org/3/library/urllib.html#module-urllib
https://pypi.python.org/pypi/requests/2.3.0

https://docs.python.org/3/library/queue.html
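
A minimal sketch of how the queue module linked above might be used to write pages out as they arrive. The page names and contents are hypothetical stand-ins for whatever a crawler actually fetches, and the temporary directory is only there to keep the example self-contained.

```python
import os
import tempfile
import threading

try:
    import queue  # Python 3
except ImportError:
    import Queue as queue  # Python 2


def writer(q, out_dir):
    """Take (name, text) pairs off the queue and write each to its own file."""
    while True:
        item = q.get()
        if item is None:  # sentinel: the producer has no more pages
            break
        name, text = item
        with open(os.path.join(out_dir, name), "w") as fp:
            fp.write(text)


# Hypothetical stand-ins for pages a crawler would have fetched.
pages = [
    ("page0.txt", "<html>first</html>"),
    ("page1.txt", "<html>second</html>"),
]

out_dir = tempfile.mkdtemp()
q = queue.Queue()
t = threading.Thread(target=writer, args=(q, out_dir))
t.start()

for item in pages:  # in a real crawler, put() each page as it is downloaded
    q.put(item)
q.put(None)         # tell the writer to stop
t.join()

written = sorted(os.listdir(out_dir))
```

The fetching side never waits on the disk here: it only calls put(), and the single writer thread drains the queue in arrival order.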

-- 
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.

Mark Lawrence



#73729

From	Dave Angel <davea@davea.name>
Date	2014-06-29 09:33 -0400
Message-ID	<mailman.11325.1404048700.18130.python-list@python.org>
In reply to	#73727


subhabangalore@gmail.com Wrote in message:
> Dear Group,
> 
> I am trying to crawl multiple URLs. As they are coming I want to write them as string, as they are coming, preferably in a queue. 
> 
> If any one of the esteemed members of the group may kindly help.
> 

From your subject line, it appears you want to keep multiple files open, and write to each in an arbitrary order. That's no problem, up to the operating system limits. Define a class that holds the URL information and, for each instance, add an attribute for an output file handle.

Don't forget to close each file when you're done with the corresponding URL. 
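
One possible shape for such a class, as a rough sketch only. The class name UrlOutput, the example.com URLs, and the output file names are all made up for illustration; the text chunks stand in for data read from each URL.

```python
import os
import tempfile


class UrlOutput(object):
    """Pairs one URL with the open file its content is written to (hypothetical name)."""

    def __init__(self, url, path):
        self.url = url
        self.path = path
        self.fp = open(path, "w")  # the output file handle attribute

    def write(self, text):
        self.fp.write(text)

    def close(self):
        self.fp.close()


out_dir = tempfile.mkdtemp()
urls = ["http://example.com/a", "http://example.com/b"]  # placeholder URLs
outputs = [
    UrlOutput(url, os.path.join(out_dir, "out%d.txt" % i))
    for i, url in enumerate(urls)
]

# All files stay open, so chunks can be written in any order.
outputs[1].write("chunk for b\n")
outputs[0].write("chunk for a\n")
outputs[1].write("more for b\n")

# Close each file once its URL is finished.
for out in outputs:
    out.close()
```

With a large number of URLs the operating system's open-file limit applies, so closing each file as soon as its URL is done matters more than it does in this two-file example.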

-- 
DaveA


#73730

From	Roy Smith <roy@panix.com>
Date	2014-06-29 10:01 -0400
Message-ID	<roy-AFAF93.10013729062014@news.panix.com>
In reply to	#73729

In article <mailman.11325.1404048700.18130.python-list@python.org>,
 Dave Angel <davea@davea.name> wrote:

> subhabangalore@gmail.com Wrote in message:
> > Dear Group,
> > 
> > I am trying to crawl multiple URLs. As they are coming I want to write them 
> > as string, as they are coming, preferably in a queue. 
> > 
> > If any one of the esteemed members of the group may kindly help.
> > 
> 
> From your subject line, it appears you want to keep multiple files open,
> and write to each in an arbitrary order. That's no problem, up to the
> operating system limits. Define a class that holds the URL information and
> for each instance, add an attribute for an output file handle.
> 
> Don't forget to close each file when you're done with the corresponding URL.

One other thing to mention is that if you're doing anything with 
fetching URLs from Python, you almost certainly want to be using Kenneth 
Reitz's excellent requests module (http://docs.python-requests.org/).  
The built-in urllib support in Python works, but requests is so much 
simpler to use.
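
For comparison, a small sketch of the same fetch done both ways. The function names and the timeout default are assumptions for illustration; requests is a third-party package and is imported inside its function so the urllib half runs without it.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2


def fetch_with_urllib(url, timeout=10):
    """Fetch a URL with the standard library and return the raw bytes."""
    response = urlopen(url, timeout=timeout)
    try:
        return response.read()
    finally:
        response.close()


def fetch_with_requests(url, timeout=10):
    """Fetch a URL with the third-party requests package and return the raw bytes."""
    import requests  # installed separately, e.g. pip install requests

    r = requests.get(url, timeout=timeout)
    r.raise_for_status()  # turn HTTP error statuses into exceptions
    return r.content
```

The two are close for a plain GET; the difference tends to show up once sessions, headers, encodings, or error handling come into play.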


#73734

From	subhabangalore@gmail.com
Date	2014-06-29 10:32 -0700
Message-ID	<6bf28329-9983-4024-ac90-374fb11ac854@googlegroups.com>
In reply to	#73730

On Sunday, June 29, 2014 7:31:37 PM UTC+5:30, Roy Smith wrote:
> In article <mailman.11325.1404048700.18130.python-list@python.org>,
>  Dave Angel <davea@davea.name> wrote:
>
> > subhabangalore@gmail.com Wrote in message:
> > > Dear Group,
> > >
> > > I am trying to crawl multiple URLs. As they are coming I want to write them
> > > as string, as they are coming, preferably in a queue.
> > >
> > > If any one of the esteemed members of the group may kindly help.
> > >
> >
> > From your subject line, it appears you want to keep multiple files open,
> > and write to each in an arbitrary order. That's no problem, up to the
> > operating system limits. Define a class that holds the URL information and
> > for each instance, add an attribute for an output file handle.
> >
> > Don't forget to close each file when you're done with the corresponding URL.
>
> One other thing to mention is that if you're doing anything with
> fetching URLs from Python, you almost certainly want to be using Kenneth
> Reitz's excellent requests module (http://docs.python-requests.org/).
> The built-in urllib support in Python works, but requests is so much
> simpler to use.

Dear Group,

Sorry if I miscommunicated. 

I am opening multiple URLs with urllib.open, and each URL has a huge HTML source file. As these files are read, I am concatenating them and putting them in one txt file as a string.
From this big txt file I am then trying to take out the HTML body of each URL and write and store each one separately, with attempts like:

for i, line in enumerate(file1):
	f = open("/python27/newfile_%i.txt" %i,'w')
	f.write(line)
	f.close()

Generally not much of an issue, but was thinking of some better options.

Regards,
Subhabrata Banerjee.


#73737

From	Denis McMahon <denismfmcmahon@gmail.com>
Date	2014-06-29 19:21 +0000
Message-ID	<lopp03$91i$1@dont-email.me>
In reply to	#73734

On Sun, 29 Jun 2014 10:32:00 -0700, subhabangalore wrote:

> I am opening multiple URLs with urllib.open, and each URL has a huge
> HTML source file. As these files are read, I am concatenating them and
> putting them in one txt file as a string.
> From this big txt file I am then trying to take out the HTML body of
> each URL and write and store each one separately

OK, let me clarify what I think you said.

First you concatenate all the web pages into a single file.
Then you extract all the page bodies from the single file and save them 
as separate files.

This seems a silly way to do things. Why don't you just save each HTML 
body section as you receive it?

This sounds like it should be something as simple as:

from BeautifulSoup import BeautifulSoup
import requests

urlList = [
    "http://something/",
    "http://something/",
    "http://something/",
    # .......
]

# enumerate() supplies the running file number n for each url
for n, url in enumerate( urlList ):
    r = requests.get( url )
    soup = BeautifulSoup( r.content )
    body = soup.find( "body" )
    # the with block closes the file once the body has been written
    with open( "scraped/body{:0>5d}.htm".format( n ), "w" ) as fp:
        fp.write( body.prettify() )

will give you:

scraped/body00000.htm
scraped/body00001.htm
scraped/body00002.htm
........

for as many urls as you have in your url list. (make sure the target 
directory exists!)
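
A small sketch of the directory check, assuming you would rather have the script create the target directory than fail. The save_bodies helper and the body strings are hypothetical; the strings stand in for already-extracted page bodies, so nothing is fetched here.

```python
import os
import tempfile


def save_bodies(bodies, out_dir):
    """Write each body string to out_dir/bodyNNNNN.htm and return the paths."""
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)  # create the target directory if it is missing
    paths = []
    for n, body in enumerate(bodies):
        path = os.path.join(out_dir, "body{:0>5d}.htm".format(n))
        with open(path, "w") as fp:
            fp.write(body)
        paths.append(path)
    return paths


# Hypothetical page bodies, written under a temporary "scraped" directory.
base = tempfile.mkdtemp()
paths = save_bodies(
    ["<body>one</body>", "<body>two</body>"],
    os.path.join(base, "scraped"),
)
names = [os.path.basename(p) for p in paths]
```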

-- 
Denis McMahon, denismfmcmahon@gmail.com


#73758

From	subhabangalore@gmail.com
Date	2014-06-30 12:23 -0700
Message-ID	<29f99276-9609-42cd-9210-5bc5e75aa364@googlegroups.com>
In reply to	#73727

On Sunday, June 29, 2014 4:19:27 PM UTC+5:30, subhaba...@gmail.com wrote:
> Dear Group,
>
> I am trying to crawl multiple URLs. As they are coming I want to write them as string, as they are coming, preferably in a queue.
>
> If any one of the esteemed members of the group may kindly help.
>
> Regards,
> Subhabrata Banerjee.

Dear Group,

Thank you for your kind suggestion. But I am not being able to sort out,
"fp = open( "scraped/body{:0>5d}.htm".format( n ), "w" ) "
please suggest.

Regards,
Subhabrata Banerjee.


#73766

From	Denis McMahon <denismfmcmahon@gmail.com>
Date	2014-06-30 23:22 +0000
Message-ID	<losrf2$c6i$2@dont-email.me>
In reply to	#73758

On Mon, 30 Jun 2014 12:23:08 -0700, subhabangalore wrote:

> Thank you for your kind suggestion. But I am not being able to sort out,
> "fp = open( "scraped/body{:0>5d}.htm".format( n ), "w" ) "
> please suggest.

Look up the Python manual for the str.format() method and the open() 
function.

The line indicated opens a file for writing. Its name is generated by 
str.format(), which inserts the number n, formatted to 5 digits with 
leading zeroes, into the string "scraped/bodyN.htm" in place of N.

It expects you to have a subdir called "scraped" below the dir you're 
executing the code in.

Also, this newsgroup is *NOT* a substitute for reading the manual for 
basic python functions and methods.

Finally, if you don't understand basic string and file handling in 
python, why on earth are you trying to write code that arguably needs a 
level of competence in both? Perhaps as your starter project you should 
try something simpler, print "hello world" is traditional.

To understand the string formatting, try:

print("hello {:0>5d} world".format( 5 ))
print("hello {:0>5d} world".format( 50 ))
print("hello {:0>5d} world".format( 500 ))
print("hello {:0>5d} world".format( 5000 ))
print("hello {:0>5d} world".format( 50000 ))
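
For reference, a short sketch of what that format spec produces. In "{:0>5d}" the 0 is the fill character, > means right-align, 5 is the minimum width, and d means decimal integer. The file number 7 below is an arbitrary example value.

```python
template = "hello {:0>5d} world"
results = [template.format(n) for n in (5, 50, 500, 5000, 50000)]

for line in results:
    print(line)
# hello 00005 world
# hello 00050 world
# hello 00500 world
# hello 05000 world
# hello 50000 world

# The same spec applied to the file name from the scraping loop.
name = "scraped/body{:0>5d}.htm".format(7)  # 'scraped/body00007.htm'

# A number wider than 5 digits is not truncated; the width is only a minimum.
wide = "{:0>5d}".format(123456)  # '123456'
```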

-- 
Denis McMahon, denismfmcmahon@gmail.com

