| Started by | Bryan Britten <britten.bryan@gmail.com> |
|---|---|
| First post | 2013-05-27 13:47 -0700 |
| Last post | 2013-05-27 21:40 -0400 |
| Articles | 11 — 7 participants |
Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Bryan Britten <britten.bryan@gmail.com> - 2013-05-27 13:47 -0700
Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Roy Smith <roy@panix.com> - 2013-05-27 16:56 -0400
Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Bryan Britten <britten.bryan@gmail.com> - 2013-05-27 14:29 -0700
Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Denis McMahon <denismfmcmahon@gmail.com> - 2013-05-27 21:35 +0000
Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Fábio Santos <fabiosantosart@gmail.com> - 2013-05-28 00:36 +0100
Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Dave Angel <davea@davea.name> - 2013-05-27 19:58 -0400
Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Bryan Britten <britten.bryan@gmail.com> - 2013-05-27 20:11 -0700
Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Fábio Santos <fabiosantosart@gmail.com> - 2013-05-28 08:31 +0100
Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Bryan Britten <britten.bryan@gmail.com> - 2013-05-28 07:32 -0700
Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Alister <alister.ware@ntlworld.com> - 2013-05-28 17:52 +0000
Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-05-27 21:40 -0400
| From | Bryan Britten <britten.bryan@gmail.com> |
|---|---|
| Date | 2013-05-27 13:47 -0700 |
| Subject | Reading *.json from URL - json.loads() versus urllib.urlopen.readlines() |
| Message-ID | <10be5c62-4c58-4b4f-b00a-82d85ee4ef8e@googlegroups.com> |
Hey, everyone!

I'm very new to Python and have only been using it for a couple of days, but have some experience in programming (albeit mostly statistical programming in SAS or R) so I'm hoping someone can answer this question in a technical way, but without using an abundant amount of jargon.

The issue I'm having is that I'm trying to pull information from a website to practice Python with, but I'm having trouble getting the data in a timely fashion. If I use the following code:

<code>
import json
import urllib

urlStr = "https://stream.twitter.com/1/statuses/sample.json"

twtrDict = [json.loads(line) for line in urllib.urlopen(urlStr)]
</code>

I get a memory issue. I'm running 32-bit Python 2.7 with 4 gigs of RAM if that helps at all.

If I use the following code:

<code>
import urllib

urlStr = "https://stream.twitter.com/1/statuses/sample.json"

fileHandle = urllib.urlopen(urlStr)

twtrText = fileHandle.readlines()
</code>

It takes hours (upwards of 6 or 7, if not more) to finish computing the last command.

With that being said, my question is whether there is a more efficient manner to do this. I'm worried that if it's taking this long to process the .readlines() command, trying to work with the data is going to be a computational nightmare.

Thanks in advance for any insights or advice!
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-05-27 16:56 -0400 |
| Message-ID | <roy-E04E0C.16563127052013@news.panix.com> |
| In reply to | #46223 |
In article <10be5c62-4c58-4b4f-b00a-82d85ee4ef8e@googlegroups.com>, Bryan Britten <britten.bryan@gmail.com> wrote:

> If I use the following code:
>
> <code>
> import urllib
>
> urlStr = "https://stream.twitter.com/1/statuses/sample.json"
>
> fileHandle = urllib.urlopen(urlStr)
>
> twtrText = fileHandle.readlines()
> </code>
>
> It takes hours (upwards of 6 or 7, if not more) to finish computing the last
> command.

I'm not surprised! readlines() reads in the ENTIRE file in one gulp. That's a lot of tweets!

> With that being said, my question is whether there is a more efficient manner
> to do this.

In general, when reading a large file, you want to iterate over lines of the file and process each one. Something like:

    for line in urllib.urlopen(urlStr):
        twtrDict = json.loads(line)

You still need to download and process all the data, but at least you don't need to store it in memory all at once.

There is an assumption here that there's exactly one JSON object per line. If that's not the case, things might get a little more complicated.
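The per-line approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it is written for Python 3 (the thread uses Python 2.7), it uses an in-memory `io.StringIO` as a stand-in for the URL response because the real stream needs credentials, and `count_by_lang` is a hypothetical helper, not something from the thread.

```python
import io
import json
from collections import Counter

def count_by_lang(lines):
    """Tally the 'lang' field of each record, holding one record at a time."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:               # skip blank lines
            continue
        record = json.loads(line)  # parse exactly one JSON object
        counts[record.get("lang", "unknown")] += 1
        # `record` is rebound on the next pass, so earlier records can be freed
    return counts

# Stand-in for the response body (assumption: one JSON object per line).
fake_response = io.StringIO(
    '{"id": 1, "lang": "es"}\n'
    '{"id": 2, "lang": "en"}\n'
    '\n'
    '{"id": 3, "lang": "es"}\n'
)
counts = count_by_lang(fake_response)
print(counts)
```

Only the running tally survives each iteration, which is what keeps memory use flat compared with collecting every parsed object in a list.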
| From | Bryan Britten <britten.bryan@gmail.com> |
|---|---|
| Date | 2013-05-27 14:29 -0700 |
| Message-ID | <a35f5ef7-d458-4ae6-a64a-2690a335b0f4@googlegroups.com> |
| In reply to | #46224 |
Try to not sigh audibly as I ask what I'm sure are two asinine questions.

1) How is this approach different from twtrDict = [json.loads(line) for line in urllib.urlopen(urlStr)]?

2) How do I tell how many JSON objects are on each line?
| From | Denis McMahon <denismfmcmahon@gmail.com> |
|---|---|
| Date | 2013-05-27 21:35 +0000 |
| Message-ID | <ko0jiq$a5h$3@dont-email.me> |
| In reply to | #46226 |
On Mon, 27 May 2013 14:29:38 -0700, Bryan Britten wrote:

> Try to not sigh audibly as I ask what I'm sure are two asinine
> questions.
>
> 1) How is this approach different from twtrDict = [json.loads(line) for
> line in urllib.urlopen(urlStr)]?
>
> 2) How do I tell how many JSON objects are on each line?

Your code at (1) creates a single list of all the JSON objects.

The code you replied to loaded each object, assumed you did something with it, and then overwrote it with the next one.

As for (2): either inspection, or errors from the JSON parser.

--
Denis McMahon, denismfmcmahon@gmail.com
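Both ways of answering (2) can be tried on small strings. The sketch below assumes Python 3; `count_json_objects` is a hypothetical helper built on the standard library's `json.JSONDecoder.raw_decode`, and the sample lines are made up for illustration.

```python
import json

def count_json_objects(line):
    """Return how many top-level JSON values sit back to back in `line`."""
    decoder = json.JSONDecoder()
    text = line.strip()
    pos, count = 0, 0
    while pos < len(text):
        # raw_decode parses one value starting at `pos` and returns
        # (parsed_object, index_just_past_it)
        _, pos = decoder.raw_decode(text, pos)
        count += 1
        # raw_decode does not skip leading whitespace, so do it here
        while pos < len(text) and text[pos].isspace():
            pos += 1
    return count

one = '{"id": 1}'
two = '{"id": 1} {"id": 2}'

# "Errors from the JSON parser": json.loads rejects a line holding two objects.
try:
    json.loads(two)
    parsed_two = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    parsed_two = False

print(count_json_objects(one), count_json_objects(two), parsed_two)
```

A line with more than one object makes `json.loads` fail with an "extra data" style error, while the helper reports how many objects were actually present.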
| From | Fábio Santos <fabiosantosart@gmail.com> |
|---|---|
| Date | 2013-05-28 00:36 +0100 |
| Message-ID | <mailman.2266.1369697823.3114.python-list@python.org> |
| In reply to | #46226 |
On 27 May 2013 22:36, "Bryan Britten" <britten.bryan@gmail.com> wrote:

> Try to not sigh audibly as I ask what I'm sure are two asinine questions.
>
> 1) How is this approach different from twtrDict = [json.loads(line) for line in urllib.urlopen(urlStr)]?

The suggested approach made use of generators. Just because you can iterate over something, that doesn't mean it is all in memory ;)

Check out the difference between range() and xrange() in Python 2.
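The difference between building everything up front and iterating lazily can be seen on a tiny input. This is a minimal sketch, assuming Python 3 (where `range()` is already lazy in the way Python 2's `xrange()` is) and a made-up three-line payload.

```python
import io
import json

raw = '{"id": 1}\n{"id": 2}\n{"id": 3}\n'

# Eager: the list comprehension parses every line and keeps all the
# results in memory before the next statement runs.
eager = [json.loads(line) for line in io.StringIO(raw)]

# Lazy: the generator expression parses a line only when a value is requested.
lazy = (json.loads(line) for line in io.StringIO(raw))
first = next(lazy)                # parses just the first line
remaining = sum(1 for _ in lazy)  # consumes the rest one at a time

print(len(eager), first, remaining)
```

With three lines the two forms behave alike; with an input that never ends, only the lazy form ever hands anything back.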
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2013-05-27 19:58 -0400 |
| Message-ID | <mailman.2268.1369699108.3114.python-list@python.org> |
| In reply to | #46223 |
On 05/27/2013 04:47 PM, Bryan Britten wrote:

> Hey, everyone!
>
> I'm very new to Python and have only been using it for a couple of days, but have some experience in programming (albeit mostly statistical programming in SAS or R) so I'm hoping someone can answer this question in a technical way, but without using an abundant amount of jargon.
>
> The issue I'm having is that I'm trying to pull information from a website to practice Python with, but I'm having trouble getting the data in a timely fashion. If I use the following code:
>
> <code>
> import json
> import urllib
>
> urlStr = "https://stream.twitter.com/1/statuses/sample.json"
>
> twtrDict = [json.loads(line) for line in urllib.urlopen(urlStr)]
> </code>
>
> I get a memory issue. I'm running 32-bit Python 2.7 with 4 gigs of RAM if that helps at all.

Which OS?

The first question I'd ask is how big this file is. I can't tell, since it needs a user name & password to actually get the file. But it's not unusual to need at least double that space in memory, and in Windoze you're limited to two gig max, regardless of how big your hardware might be.

If you separately fetch the file, then you can experiment with it, including cutting it down to a dozen lines, and see if you can deal with that much.

How could you fetch it? With wget, with a browser (and saveAs), with a simple loop which uses read(4096) repeatedly and writes each block to a local file. Don't forget to use 'wb', as you don't know yet what line endings it might use.

Once you have an idea what the data looks like, you can answer such questions as whether it's json at all, whether the lines each contain a single json record, or what.

For all we know, the file might be a few terabytes in size.

--
DaveA
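The read(4096) loop mentioned above might look something like the sketch below. It assumes Python 3 and uses `io.BytesIO` objects in place of the network response and the local file so it can run anywhere; `copy_in_chunks` is a hypothetical helper name.

```python
import io

def copy_in_chunks(source, dest, chunk_size=4096):
    """Copy `source` to `dest` one block at a time; return the bytes copied."""
    total = 0
    while True:
        block = source.read(chunk_size)
        if not block:      # an empty result means there is no more data
            break
        dest.write(block)  # bytes are written untouched, line endings included
        total += len(block)
    return total

# Stand-ins for the response and the output file (assumptions for the demo).
payload = b'{"id": 1}\r\n' * 1000
source = io.BytesIO(payload)
dest = io.BytesIO()
copied = copy_in_chunks(source, dest)
print(copied)
```

With a real download, `dest` would be a file opened in binary mode, for example `open("sample.json", "wb")` (the file name is made up), so that whatever line endings the server sends are preserved.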
| From | Bryan Britten <britten.bryan@gmail.com> |
|---|---|
| Date | 2013-05-27 20:11 -0700 |
| Message-ID | <4db5a3be-d9dc-455c-8e3b-5adebad2dcdd@googlegroups.com> |
| In reply to | #46234 |
On Monday, May 27, 2013 7:58:05 PM UTC-4, Dave Angel wrote:
> On 05/27/2013 04:47 PM, Bryan Britten wrote:
> > Hey, everyone!
> >
> > I'm very new to Python and have only been using it for a couple of days, but have some experience in programming (albeit mostly statistical programming in SAS or R) so I'm hoping someone can answer this question in a technical way, but without using an abundant amount of jargon.
> >
> > The issue I'm having is that I'm trying to pull information from a website to practice Python with, but I'm having trouble getting the data in a timely fashion. If I use the following code:
> >
> > <code>
> > import json
> > import urllib
> >
> > urlStr = "https://stream.twitter.com/1/statuses/sample.json"
> >
> > twtrDict = [json.loads(line) for line in urllib.urlopen(urlStr)]
> > </code>
> >
> > I get a memory issue. I'm running 32-bit Python 2.7 with 4 gigs of RAM if that helps at all.
>
> Which OS?
I'm operating on Windows 7.
> The first question I'd ask is how big this file is. I can't tell, since
> it needs a user name & password to actually get the file.
If you have Twitter, you can just use your log-in information to access the file.
> But it's not unusual to need at least double that space in memory, and in Windoze
> you're limited to two gig max, regardless of how big your hardware might be.
>
> If you separately fetch the file, then you can experiment with it,
> including cutting it down to a dozen lines, and see if you can deal with
> that much.
>
> How could you fetch it? With wget, with a browser (and saveAs), with a
> simple loop which uses read(4096) repeatedly and writes each block to a
> local file. Don't forget to use 'wb', as you don't know yet what line
> endings it might use.
I'm not familiar with using read(4096), I'll have to look into that. When I tried to just save the file, my computer just sat in limbo for some time and didn't seem to want to process the command.
> Once you have an idea what the data looks like, you can answer such
> questions as whether it's json at all, whether the lines each contain a
> single json record, or what.
Based on my *extremely* limited knowledge of JSON, that's definitely the type of file this is. Here is a snippet of what is seen when you log in:
{"created_at":"Tue May 28 03:09:23 +0000 2013","id":339216806461972481,"id_str":"339216806461972481","text":"RT @aleon_11: Sigo creyendo que las noches lluviosas me acercan mucho m\u00e1s a ti!","source":"\u003ca href=\"http:\/\/blackberry.com\/twitter\" rel=\"nofollow\"\u003eTwitter for BlackBerry\u00ae\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":310910123,"id_str":"310910123","name":"\u2661","screen_name":"LaMarielita_","location":"","url":null,"description":"MERCADOLOGA & PUBLICISTA EN PROCESO, AMO A MI DIOS & MI FAMILIA\u2665 ME ENCANTA REIRME , MOLESTAR & HABLAR :D BFF, pancho, ale & china :) LY\u2661","protected":false,"followers_count":506,"friends_count":606,"listed_count":1,"created_at":"Sat Jun 04 15:24:19 +0000 2011","favourites_count":207,"utc_offset":-25200,"time_zone":"Mountain Time (US & Canada)","geo_enabled":false,"verified":false,"statuses_count":17241,"lang":"es","contributors_enabled":false,"is_translator":false,"profile_background_color":"FF6699","profile_background_image_url":"http:\/\/a0.twimg.com\/images\/themes\/theme11\/bg.gif","profile_background_image_url_https":"https:\/\/si0.twimg.com\/images\/themes\/theme11\/bg.gif","profile_background_tile":true,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/3720425493\/13a48910e56ca34edeea07ff04075c77_normal.jpeg","profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/3720425493\/13a48910e56ca34edeea07ff04075c77_normal.jpeg","profile_link_color":"B40B43","profile_sidebar_border_color":"CC3366","profile_sidebar_fill_color":"E5507E","profile_text_color":"362720","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweeted_status":{"created_at":"Tue May 28 
02:57:40 +0000 2013","id":339213856922537984,"id_str":"339213856922537984","text":"Sigo creyendo que las noches lluviosas me acercan mucho m\u00e1s a ti!","source":"web","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":105252134,"id_str":"105252134","name":"Alejandra Le\u00f3n","screen_name":"aleon_11","location":"Guatemala","url":null,"description":"La vida se disfruta m\u00e1s, cuando no se le pone tanta importancia.","protected":false,"followers_count":143,"friends_count":251,"listed_count":0,"created_at":"Fri Jan 15 20:49:38 +0000 2010","favourites_count":83,"utc_offset":-28800,"time_zone":"Pacific Time (US & Canada)","geo_enabled":false,"verified":false,"statuses_count":1863,"lang":"es","contributors_enabled":false,"is_translator":false,"profile_background_color":"F8F2FC","profile_background_image_url":"http:\/\/a0.twimg.com\/profile_background_images\/811443451\/81abf2f37ee3e37deda396befa7fb557.jpeg","profile_background_image_url_https":"https:\/\/si0.twimg.com\/profile_background_images\/811443451\/81abf2f37ee3e37deda396befa7fb557.jpeg","profile_background_tile":true,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/3578979563\/e973196904e25af5d960f2971616eb61_normal.jpeg","profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/3578979563\/e973196904e25af5d960f2971616eb61_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/105252134\/1364957374","profile_link_color":"F01A1A","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"7AC3EE","profile_text_color":"3D1957","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":2,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"u
ser_mentions":[]},"favorited":false,"retweeted":false,"lang":"es"},"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[{"screen_name":"aleon_11","name":"Alejandra Le\u00f3n","id":105252134,"id_str":"105252134","indices":[3,12]}]},"favorited":false,"retweeted":false,"filter_level":"low"}
> For all we know, the file might be a few terabytes in size.
>
> --
> DaveA
| From | Fábio Santos <fabiosantosart@gmail.com> |
|---|---|
| Date | 2013-05-28 08:31 +0100 |
| Message-ID | <mailman.2286.1369726304.3114.python-list@python.org> |
| In reply to | #46248 |
On 28 May 2013 04:19, "Bryan Britten" <britten.bryan@gmail.com> wrote:

> I'm not familiar with using read(4096), I'll have to look into that. When I tried to just save the file, my computer just sat in limbo for some time and didn't seem to want to process the command.

That's just file.read with an integer argument. You can read a file in chunks by repeatedly calling that function until you get the empty string.

> Based on my *extremely* limited knowledge of JSON, that's definitely the type of file this is. Here is a snippet of what is seen when you log in:
> ...

That's JSON. It's pretty big, but not big enough to stall a slow computer for more than half a second.

-

I've looked for documentation on that method on Twitter. It seems that it's part of the Twitter streaming API.

https://dev.twitter.com/docs/streaming-apis

What this means is that the requests aren't supposed to end. They are supposed to be read gradually, using the lines to split the response into meaningful chunks. That's why you can't read the data and why your browser never gets around to downloading it. Both urlopen and your browser block while waiting for the request to end.

Here's more info on streaming requests in their docs:

https://dev.twitter.com/docs/streaming-apis/processing

For streaming requests in Python, I would point you to the requests library, but I am not sure it handles streaming requests.
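Reading a response that never ends, one line at a time, and stopping after a fixed number of records could be sketched like this. It assumes Python 3; `endless_stream` is a hypothetical generator standing in for the streaming response (the real endpoint needs credentials), and `take_records` is likewise a made-up helper.

```python
import itertools
import json

def endless_stream():
    """Hypothetical stand-in for a streaming response that never finishes."""
    for n in itertools.count(1):
        yield '{"id": %d}\n' % n
        yield '\n'  # blank line between records, to show they are skipped

def take_records(lines, limit):
    """Parse at most `limit` non-blank lines, then stop reading."""
    non_blank = (line for line in lines if line.strip())
    # islice stops pulling from the stream once `limit` lines have been taken
    return [json.loads(line) for line in itertools.islice(non_blank, limit)]

sample = take_records(endless_stream(), 5)
print(sample)
```

Calling `readlines()` or `list()` on such a stream would never return, whereas pulling lines on demand lets the program decide when it has seen enough.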
| From | Bryan Britten <britten.bryan@gmail.com> |
|---|---|
| Date | 2013-05-28 07:32 -0700 |
| Message-ID | <31d86773-88d4-43e2-8699-39021a5f27b8@googlegroups.com> |
| In reply to | #46262 |
Thanks to everyone for the help and insight. I think for now I'll just back away from this file and go back to something much easier to practice with.
| From | Alister <alister.ware@ntlworld.com> |
|---|---|
| Date | 2013-05-28 17:52 +0000 |
| Message-ID | <w96pt.47153$0L6.13686@fx20.am4> |
| In reply to | #46262 |
On Tue, 28 May 2013 08:31:35 +0100, Fábio Santos wrote:

> On 28 May 2013 04:19, "Bryan Britten" <britten.bryan@gmail.com> wrote:
>> I'm not familiar with using read(4096), I'll have to look into that.
>> When I tried to just save the file, my computer just sat in limbo for some
>> time and didn't seem to want to process the command.
>
> That's just file.read with an integer argument. You can read a file by
> chunks by repeatedly calling that function until you get the empty
> string.
>
>> Based on my *extremely* limited knowledge of JSON, that's definitely
>> the type of file this is. Here is a snippet of what is seen when you log in:
>> ...
>
> That's json. It's pretty big, but not big enough to stall a slow
> computer more than half a second.
>
> -
>
> I've looked for documentation on that method on twitter.
>
> It seems that it's part of the twitter streaming api.
>
> https://dev.twitter.com/docs/streaming-apis
>
> What this means is that the requests aren't supposed to end. They are
> supposed to be read gradually, using the lines to split the response
> into meaningful chunks. That's why you can't read the data and why your
> browser never gets around to download it. Both urlopen and your browser
> block while waiting for the request to end.

Are we overlooking the obvious? Why not use one of the Python Twitter modules to isolate your app from the nitty-gritty details of the Twitter stream?

https://dev.twitter.com/docs/twitter-libraries

--
Given sufficient time, what you put off doing today will get done by itself.
| From | Dennis Lee Bieber <wlfraed@ix.netcom.com> |
|---|---|
| Date | 2013-05-27 21:40 -0400 |
| Message-ID | <mailman.2274.1369705505.3114.python-list@python.org> |
| In reply to | #46223 |
On Mon, 27 May 2013 19:58:05 -0400, Dave Angel <davea@davea.name>
declaimed the following in gmane.comp.python.general:
> unusual to need at least double that space in memory, and in Windoze
> you're limited to two gig max, regardless of how big your hardware might be.
>
If the boot config is set for "server mode", WinXP can give 3GB to a
user process.
--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com HTTP://wlfraed.home.netcom.com/