Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines()

From	Roy Smith <roy@panix.com>
Newsgroups	comp.lang.python
Subject	Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines()
Date	2013-05-27 16:56 -0400
Organization	PANIX Public Access Internet and UNIX, NYC
Message-ID	<roy-E04E0C.16563127052013@news.panix.com> (permalink)
References	<10be5c62-4c58-4b4f-b00a-82d85ee4ef8e@googlegroups.com>

Show all headers | View raw

In article <10be5c62-4c58-4b4f-b00a-82d85ee4ef8e@googlegroups.com>,
 Bryan Britten <britten.bryan@gmail.com> wrote:

> If I use the following code:
> 
> <code>
> import urllib
> 
> urlStr = "https://stream.twitter.com/1/statuses/sample.json"
> 
> fileHandle = urllib.urlopen(urlStr)
> 
> twtrText = fileHandle.readlines()
> </code>
> 
> It takes hours (upwards of 6 or 7, if not more) to finish computing the last 
> command.

I'm not surprised!  readlines() reads in the ENTIRE file in one gulp.  
That a lot of tweets!

> With that being said, my question is whether there is a more efficient manner 
> to do this.

In general, when reading a large file, you want to iterate over lines of 
the file and process each one.  Something like:

for line in urllib.urlopen(urlStr):
   twtrDict = json.loads(line)

You still need to download and process all the data, but at least you 
don't need to store it in memory all at once.  There is an assumption 
here that there's exactly one json object per line.  If that's not the 
case, things might get a little more complicated.

Thread

Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Bryan Britten <britten.bryan@gmail.com> - 2013-05-27 13:47 -0700
  Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Roy Smith <roy@panix.com> - 2013-05-27 16:56 -0400
    Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Bryan Britten <britten.bryan@gmail.com> - 2013-05-27 14:29 -0700
      Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Denis McMahon <denismfmcmahon@gmail.com> - 2013-05-27 21:35 +0000
      Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Fábio Santos <fabiosantosart@gmail.com> - 2013-05-28 00:36 +0100
  Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Dave Angel <davea@davea.name> - 2013-05-27 19:58 -0400
    Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Bryan Britten <britten.bryan@gmail.com> - 2013-05-27 20:11 -0700
      Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Fábio Santos <fabiosantosart@gmail.com> - 2013-05-28 08:31 +0100
        Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Bryan Britten <britten.bryan@gmail.com> - 2013-05-28 07:32 -0700
        Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Alister <alister.ware@ntlworld.com> - 2013-05-28 17:52 +0000
  Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-05-27 21:40 -0400

csiph-web