Groups > comp.lang.python > #46223 > unrolled thread

Reading *.json from URL - json.loads() versus urllib.urlopen.readlines()

Started by	Bryan Britten <britten.bryan@gmail.com>
First post	2013-05-27 13:47 -0700
Last post	2013-05-27 21:40 -0400
Articles	11 — 7 participants

  Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Bryan Britten <britten.bryan@gmail.com> - 2013-05-27 13:47 -0700
    Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Roy Smith <roy@panix.com> - 2013-05-27 16:56 -0400
      Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Bryan Britten <britten.bryan@gmail.com> - 2013-05-27 14:29 -0700
        Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Denis McMahon <denismfmcmahon@gmail.com> - 2013-05-27 21:35 +0000
        Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Fábio Santos <fabiosantosart@gmail.com> - 2013-05-28 00:36 +0100
    Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Dave Angel <davea@davea.name> - 2013-05-27 19:58 -0400
      Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Bryan Britten <britten.bryan@gmail.com> - 2013-05-27 20:11 -0700
        Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Fábio Santos <fabiosantosart@gmail.com> - 2013-05-28 08:31 +0100
          Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Bryan Britten <britten.bryan@gmail.com> - 2013-05-28 07:32 -0700
          Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Alister <alister.ware@ntlworld.com> - 2013-05-28 17:52 +0000
    Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-05-27 21:40 -0400

#46223 — Reading *.json from URL - json.loads() versus urllib.urlopen.readlines()

From	Bryan Britten <britten.bryan@gmail.com>
Date	2013-05-27 13:47 -0700
Subject	Reading *.json from URL - json.loads() versus urllib.urlopen.readlines()
Message-ID	<10be5c62-4c58-4b4f-b00a-82d85ee4ef8e@googlegroups.com>

Hey, everyone! 

I'm very new to Python and have only been using it for a couple of days, but have some experience in programming (albeit mostly statistical programming in SAS or R) so I'm hoping someone can answer this question in a technical way, but without using an abundant amount of jargon.

The issue I'm having is that I'm trying to pull information from a website to practice Python with, but I'm having trouble getting the data in a timely fashion. If I use the following code:

<code>
import json
import urllib

urlStr = "https://stream.twitter.com/1/statuses/sample.json"

twtrDict = [json.loads(line) for line in urllib.urlopen(urlStr)]
</code>

I get a memory issue. I'm running 32-bit Python 2.7 with 4 gigs of RAM if that helps at all.

If I use the following code:

<code>
import urllib

urlStr = "https://stream.twitter.com/1/statuses/sample.json"

fileHandle = urllib.urlopen(urlStr)

twtrText = fileHandle.readlines()
</code>

It takes hours (upwards of 6 or 7, if not more) to finish computing the last command.

With that being said, my question is whether there is a more efficient manner to do this. I'm worried that if it's taking this long to process the .readlines() command, trying to work with the data is going to be a computational nightmare.

Thanks in advance for any insights or advice!

#46224

From	Roy Smith <roy@panix.com>
Date	2013-05-27 16:56 -0400
Message-ID	<roy-E04E0C.16563127052013@news.panix.com>
In reply to	#46223

In article <10be5c62-4c58-4b4f-b00a-82d85ee4ef8e@googlegroups.com>,
 Bryan Britten <britten.bryan@gmail.com> wrote:

> If I use the following code:
> 
> <code>
> import urllib
> 
> urlStr = "https://stream.twitter.com/1/statuses/sample.json"
> 
> fileHandle = urllib.urlopen(urlStr)
> 
> twtrText = fileHandle.readlines()
> </code>
> 
> It takes hours (upwards of 6 or 7, if not more) to finish computing the last 
> command.

I'm not surprised!  readlines() reads in the ENTIRE file in one gulp.  
That's a lot of tweets!

> With that being said, my question is whether there is a more efficient manner 
> to do this.

In general, when reading a large file, you want to iterate over lines of 
the file and process each one.  Something like:

for line in urllib.urlopen(urlStr):
    twtrDict = json.loads(line)  # process this tweet here; the next line replaces it

You still need to download and process all the data, but at least you 
don't need to store it in memory all at once.  There is an assumption 
here that there's exactly one json object per line.  If that's not the 
case, things might get a little more complicated.
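
A minimal sketch of the line-at-a-time pattern described above. It is written for Python 3 and reads from an in-memory stream instead of the Twitter URL, since that URL needs credentials. The names `iter_json_lines` and `sample` are made up for illustration, and the one-object-per-line layout is the same assumption stated above.

```python
import io
import json


def iter_json_lines(stream):
    """Yield one decoded JSON object per non-blank line of a binary stream."""
    for raw in stream:
        line = raw.strip()
        if line:  # skip blank lines
            yield json.loads(line.decode("utf-8"))


# Stand-in for the network response: three small JSON objects, one per line.
sample = io.BytesIO(b'{"lang": "es"}\n{"lang": "en"}\n\n{"lang": "es"}\n')

counts = {}
for tweet in iter_json_lines(sample):
    # Only the running tally is kept; each tweet is discarded after use.
    counts[tweet["lang"]] = counts.get(tweet["lang"], 0) + 1

print(counts)
```

In Python 2.7 the object returned by `urllib.urlopen(urlStr)` can be iterated the same way, so it could take the place of `sample` here.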

#46226

From	Bryan Britten <britten.bryan@gmail.com>
Date	2013-05-27 14:29 -0700
Message-ID	<a35f5ef7-d458-4ae6-a64a-2690a335b0f4@googlegroups.com>
In reply to	#46224

Try to not sigh audibly as I ask what I'm sure are two asinine questions. 

1) How is this approach different from twtrDict = [json.loads(line) for line in urllib.urlopen(urlStr)]?

2) How do I tell how many JSON objects are on each line?

#46227

From	Denis McMahon <denismfmcmahon@gmail.com>
Date	2013-05-27 21:35 +0000
Message-ID	<ko0jiq$a5h$3@dont-email.me>
In reply to	#46226

On Mon, 27 May 2013 14:29:38 -0700, Bryan Britten wrote:

> Try to not sigh audibly as I ask what I'm sure are two asinine
> questions.
> 
> 1) How is this approach different from twtrDict = [json.loads(line) for
> line in urllib.urlopen(urlStr)]?
> 
> 2) How do I tell how many JSON objects are on each line?

Your code at (1) creates a single list of all the JSON objects, so every one of them stays in memory.

The code you replied to loaded each object, assumed you did something 
with it, and then over-wrote it with the next one.

As for (2) - either inspection, or errors from the json parser.
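
One possible way to let the parser answer (2), sketched for Python 3: `json.JSONDecoder.raw_decode` returns the decoded value together with the index where it stopped, so calling it repeatedly counts the values on a line. The helper name `count_json_objects` is hypothetical.

```python
import json


def count_json_objects(line):
    """Return how many JSON values are concatenated in `line`.

    Raises ValueError if part of the text is not valid JSON.
    """
    decoder = json.JSONDecoder()
    text = line.strip()
    pos, count = 0, 0
    while pos < len(text):
        _, pos = decoder.raw_decode(text, pos)  # returns (value, end index)
        count += 1
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
    return count


print(count_json_objects('{"a": 1}'))
print(count_json_objects('{"a": 1} {"b": 2}'))
```

A line that is cut off in the middle of an object raises `ValueError`, which is the "errors from the json parser" signal mentioned above.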

-- 
Denis McMahon, denismfmcmahon@gmail.com

#46231

From	Fábio Santos <fabiosantosart@gmail.com>
Date	2013-05-28 00:36 +0100
Message-ID	<mailman.2266.1369697823.3114.python-list@python.org>
In reply to	#46226

On 27 May 2013 22:36, "Bryan Britten" <britten.bryan@gmail.com> wrote:
>
> Try to not sigh audibly as I ask what I'm sure are two asinine questions.
>
> 1) How is this approach different from twtrDict = [json.loads(line) for
> line in urllib.urlopen(urlStr)]?
>

The suggested approach made use of generators. Just because you can iterate
over something, that doesn't mean it is all in memory ;)

Check out the difference between range() and xrange() in Python 2.
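
A small sketch of the eager-versus-lazy difference, assuming Python 3 (where `range` is already lazy, like `xrange` in Python 2). The list `lines` is made-up sample data standing in for the downloaded text.

```python
import json
import sys

# Made-up sample data: 10000 tiny JSON documents, one per string.
lines = ['{"id": %d}' % i for i in range(10000)]

# List comprehension: every decoded object exists in memory at once.
all_at_once = [json.loads(line) for line in lines]

# Generator expression: nothing is decoded until a value is requested.
one_at_a_time = (json.loads(line) for line in lines)

print(len(all_at_once))
print(sys.getsizeof(all_at_once) > sys.getsizeof(one_at_a_time))
```

The generator object stays the same small size no matter how many lines feed it, because it holds only its current position rather than the results.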

#46234

From	Dave Angel <davea@davea.name>
Date	2013-05-27 19:58 -0400
Message-ID	<mailman.2268.1369699108.3114.python-list@python.org>
In reply to	#46223

On 05/27/2013 04:47 PM, Bryan Britten wrote:
> Hey, everyone!
>
> I'm very new to Python and have only been using it for a couple of days, but have some experience in programming (albeit mostly statistical programming in SAS or R) so I'm hoping someone can answer this question in a technical way, but without using an abundant amount of jargon.
>
> The issue I'm having is that I'm trying to pull information from a website to practice Python with, but I'm having trouble getting the data in a timely fashion. If I use the following code:
>
> <code>
> import json
> import urllib
>
> urlStr = "https://stream.twitter.com/1/statuses/sample.json"
>
> twtrDict = [json.loads(line) for line in urllib.urlopen(urlStr)]
> </code>
>
> I get a memory issue. I'm running 32-bit Python 2.7 with 4 gigs of RAM if that helps at all.

Which OS?

The first question I'd ask is how big this file is.  I can't tell, since 
it needs a user name & password to actually get the file.  But it's not 
unusual to need at least double that space in memory, and in Windoze 
you're limited to two gig max, regardless of how big your hardware might be.

If you separately fetch the file, then you can experiment with it, 
including cutting it down to a dozen lines, and see if you can deal with 
that much.

How could you fetch it?  With wget, with a browser (and saveAs), with a 
simple loop which uses read(4096) repeatedly and writes each block to a 
local file.  Don't forget to use 'wb', as you don't know yet what line 
endings it might use.
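
A rough sketch of that read(4096) loop, assuming Python 3 and an in-memory stand-in for the download. The function name `save_in_chunks` and the `max_bytes` cap are additions for illustration; the cap is there so the loop stops even if the source never ends.

```python
import io
import os
import tempfile


def save_in_chunks(source, path, chunk_size=4096, max_bytes=None):
    """Copy a binary stream to `path` chunk by chunk; return bytes written."""
    written = 0
    with open(path, "wb") as out:  # binary mode: line endings stay untouched
        # The cap is checked between chunks, so it can overshoot by one chunk.
        while max_bytes is None or written < max_bytes:
            chunk = source.read(chunk_size)
            if not chunk:  # an empty result means end of stream
                break
            out.write(chunk)
            written += len(chunk)
    return written


source = io.BytesIO(b'{"n": 1}\r\n' * 1000)  # 10,000 bytes of sample data
path = os.path.join(tempfile.mkdtemp(), "sample.json")
print(save_in_chunks(source, path, max_bytes=8192))
```

With a real download, `source` would be the object returned by `urlopen`; only one chunk is ever held in memory at a time.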

Once you have an idea what the data looks like, you can answer such 
questions as whether it's json at all, whether the lines each contain a 
single json record, or what.

For all we know, the file might be a few terabytes in size.

-- 
DaveA

#46248

From	Bryan Britten <britten.bryan@gmail.com>
Date	2013-05-27 20:11 -0700
Message-ID	<4db5a3be-d9dc-455c-8e3b-5adebad2dcdd@googlegroups.com>
In reply to	#46234

On Monday, May 27, 2013 7:58:05 PM UTC-4, Dave Angel wrote:
> On 05/27/2013 04:47 PM, Bryan Britten wrote:
> > Hey, everyone!
> >
> > I'm very new to Python and have only been using it for a couple of days, but have some experience in programming (albeit mostly statistical programming in SAS or R) so I'm hoping someone can answer this question in a technical way, but without using an abundant amount of jargon.
> >
> > The issue I'm having is that I'm trying to pull information from a website to practice Python with, but I'm having trouble getting the data in a timely fashion. If I use the following code:
> >
> > <code>
> > import json
> > import urllib
> >
> > urlStr = "https://stream.twitter.com/1/statuses/sample.json"
> >
> > twtrDict = [json.loads(line) for line in urllib.urlopen(urlStr)]
> > </code>
> >
> > I get a memory issue. I'm running 32-bit Python 2.7 with 4 gigs of RAM if that helps at all.
>
> Which OS?

I'm operating on Windows 7.

> The first question I'd ask is how big this file is.  I can't tell, since
> it needs a user name & password to actually get the file.

If you have Twitter, you can just use your log-in information to access the file.

> But it's not unusual to need at least double that space in memory, and in Windoze
> you're limited to two gig max, regardless of how big your hardware might be.
>
> If you separately fetch the file, then you can experiment with it,
> including cutting it down to a dozen lines, and see if you can deal with
> that much.
>
> How could you fetch it?  With wget, with a browser (and saveAs), with a
> simple loop which uses read(4096) repeatedly and writes each block to a
> local file.  Don't forget to use 'wb', as you don't know yet what line
> endings it might use.

I'm not familiar with using read(4096), I'll have to look into that. When I tried to just save the file, my computer just sat in limbo for some time and didn't seem to want to process the command.

> Once you have an idea what the data looks like, you can answer such
> questions as whether it's json at all, whether the lines each contain a
> single json record, or what.

Based on my *extremely* limited knowledge of JSON, that's definitely the type of file this is. Here is a snippet of what is seen when you log in:

{"created_at":"Tue May 28 03:09:23 +0000 2013","id":339216806461972481,"id_str":"339216806461972481","text":"RT @aleon_11: Sigo creyendo que las noches lluviosas me acercan mucho m\u00e1s a ti!","source":"\u003ca href=\"http:\/\/blackberry.com\/twitter\" rel=\"nofollow\"\u003eTwitter for BlackBerry\u00ae\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":310910123,"id_str":"310910123","name":"\u2661","screen_name":"LaMarielita_","location":"","url":null,"description":"MERCADOLOGA & PUBLICISTA EN PROCESO, AMO A MI DIOS & MI FAMILIA\u2665 ME ENCANTA REIRME , MOLESTAR & HABLAR :D BFF, pancho, ale & china :) LY\u2661","protected":false,"followers_count":506,"friends_count":606,"listed_count":1,"created_at":"Sat Jun 04 15:24:19 +0000 2011","favourites_count":207,"utc_offset":-25200,"time_zone":"Mountain Time (US & Canada)","geo_enabled":false,"verified":false,"statuses_count":17241,"lang":"es","contributors_enabled":false,"is_translator":false,"profile_background_color":"FF6699","profile_background_image_url":"http:\/\/a0.twimg.com\/images\/themes\/theme11\/bg.gif","profile_background_image_url_https":"https:\/\/si0.twimg.com\/images\/themes\/theme11\/bg.gif","profile_background_tile":true,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/3720425493\/13a48910e56ca34edeea07ff04075c77_normal.jpeg","profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/3720425493\/13a48910e56ca34edeea07ff04075c77_normal.jpeg","profile_link_color":"B40B43","profile_sidebar_border_color":"CC3366","profile_sidebar_fill_color":"E5507E","profile_text_color":"362720","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweeted_status":{"created_at":"Tue May 28 02:57:40 +0000 2013","id":339213856922537984,"id_str":"339213856922537984","text":"Sigo creyendo que las noches lluviosas me acercan mucho m\u00e1s a ti!","source":"web","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":105252134,"id_str":"105252134","name":"Alejandra Le\u00f3n","screen_name":"aleon_11","location":"Guatemala","url":null,"description":"La vida se disfruta m\u00e1s, cuando no se le pone tanta importancia.","protected":false,"followers_count":143,"friends_count":251,"listed_count":0,"created_at":"Fri Jan 15 20:49:38 +0000 2010","favourites_count":83,"utc_offset":-28800,"time_zone":"Pacific Time (US & Canada)","geo_enabled":false,"verified":false,"statuses_count":1863,"lang":"es","contributors_enabled":false,"is_translator":false,"profile_background_color":"F8F2FC","profile_background_image_url":"http:\/\/a0.twimg.com\/profile_background_images\/811443451\/81abf2f37ee3e37deda396befa7fb557.jpeg","profile_background_image_url_https":"https:\/\/si0.twimg.com\/profile_background_images\/811443451\/81abf2f37ee3e37deda396befa7fb557.jpeg","profile_background_tile":true,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/3578979563\/e973196904e25af5d960f2971616eb61_normal.jpeg","profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/3578979563\/e973196904e25af5d960f2971616eb61_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/105252134\/1364957374","profile_link_color":"F01A1A","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"7AC3EE","profile_text_color":"3D1957","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":2,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[]},"favorited":false,"retweeted":false,"lang":"es"},"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[{"screen_name":"aleon_11","name":"Alejandra Le\u00f3n","id":105252134,"id_str":"105252134","indices":[3,12]}]},"favorited":false,"retweeted":false,"filter_level":"low"}

> For all we know, the file might be a few terabytes in size.
>
> -- 
> DaveA

#46262

From	Fábio Santos <fabiosantosart@gmail.com>
Date	2013-05-28 08:31 +0100
Message-ID	<mailman.2286.1369726304.3114.python-list@python.org>
In reply to	#46248

On 28 May 2013 04:19, "Bryan Britten" <britten.bryan@gmail.com> wrote:
> I'm not familiar with using read(4096), I'll have to look into that. When
> I tried to just save the file, my computer just sat in limbo for some time
> and didn't seem to want to process the command.

That's just file.read with an integer argument. You can read a file by
chunks by repeatedly calling that function until you get the empty string.

> Based on my *extremely* limited knowledge of JSON, that's definitely the
> type of file this is. Here is a snippet of what is seen when you log in:
...
That's json. It's pretty big, but not big enough to stall a slow computer
more than half a second.

-

I've looked for documentation on that method on twitter.

It seems that it's part of the twitter streaming api.

https://dev.twitter.com/docs/streaming-apis

What this means is that the requests aren't supposed to end. They are
supposed to be read gradually, using the lines to split the response into
meaningful chunks. That's why you can't read the data and why your browser
never gets around to download it. Both urlopen and your browser block while
waiting for the request to end.

Here's more info on streaming requests on their docs:

https://dev.twitter.com/docs/streaming-apis/processing

For streaming requests in python, I would point you to the requests
library, but I am not sure it handles streaming requests.
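
As far as I know, the requests library can do this: passing `stream=True` to `requests.get` and then looping over `Response.iter_lines()` hands back lines as they arrive. Below is a stdlib-only sketch of how such a never-ending stream might be consumed, assuming Python 3. `fake_stream` is a hypothetical stand-in for the real response, and the blank lines it emits are an assumption about keep-alive padding in the stream.

```python
import itertools
import json


def fake_stream():
    """Hypothetical stand-in for a never-ending response: yields lines forever."""
    for n in itertools.count():
        if n % 3 == 0:
            yield b""  # assumed blank keep-alive line
        yield json.dumps({"id": n}).encode("utf-8")


def records(lines):
    """Decode each non-empty line lazily, as it arrives."""
    for line in lines:
        if line.strip():
            yield json.loads(line.decode("utf-8"))


# Take the first five records, then stop reading.
first_five = list(itertools.islice(records(fake_stream()), 5))
print([r["id"] for r in first_five])
```

The point is that the reader decides when to stop (here after five records); waiting for the stream to finish on its own would block forever.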

#46291

From	Bryan Britten <britten.bryan@gmail.com>
Date	2013-05-28 07:32 -0700
Message-ID	<31d86773-88d4-43e2-8699-39021a5f27b8@googlegroups.com>
In reply to	#46262

Thanks to everyone for the help and insight. I think for now I'll just back away from this file and go back to something much easier to practice with.

#46316

From	Alister <alister.ware@ntlworld.com>
Date	2013-05-28 17:52 +0000
Message-ID	<w96pt.47153$0L6.13686@fx20.am4>
In reply to	#46262

On Tue, 28 May 2013 08:31:35 +0100, Fábio Santos wrote:

> On 28 May 2013 04:19, "Bryan Britten" <britten.bryan@gmail.com> wrote:
>> I'm not familiar with using read(4096), I'll have to look into that.
>> When I tried to just save the file, my computer just sat in limbo for
>> some time and didn't seem to want to process the command.
> 
> That's just file.read with an integer argument. You can read a file by
> chunks by repeatedly calling that function until you get the empty
> string.
> 
>> Based on my *extremely* limited knowledge of JSON, that's definitely
>> the type of file this is. Here is a snippet of what is seen when you
>> log in:
> ...
> That's json. It's pretty big, but not big enough to stall a slow
> computer more than half a second.
> 
> -
> 
> I've looked for documentation on that method on twitter.
> 
> It seems that it's part of the twitter streaming api.
> 
> https://dev.twitter.com/docs/streaming-apis
> 
> What this means is that the requests aren't supposed to end. They are
> supposed to be read gradually, using the lines to split the response
> into meaningful chunks. That's why you can't read the data and why your
> browser never gets around to download it. Both urlopen and your browser
> block while waiting for the request to end.

Are we overlooking the obvious?
Why not use one of the Python Twitter modules to isolate your app from 
the nitty-gritty details of the Twitter stream?

https://dev.twitter.com/docs/twitter-libraries

-- 
Given sufficient time, what you put off doing today will get done by 
itself.

#46243

From	Dennis Lee Bieber <wlfraed@ix.netcom.com>
Date	2013-05-27 21:40 -0400
Message-ID	<mailman.2274.1369705505.3114.python-list@python.org>
In reply to	#46223

On Mon, 27 May 2013 19:58:05 -0400, Dave Angel <davea@davea.name>
declaimed the following in gmane.comp.python.general:


> unusual to need at least double that space in memory, and in Windoze 
> you're limited to two gig max, regardless of how big your hardware might be.
>
	If the boot config is set for "server mode", WinXP can give 3GB to
a user process.
-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
        wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/
