Groups > comp.lang.python > #45904 > unrolled thread
| Started by | logan.c.graham@gmail.com |
|---|---|
| First post | 2013-05-24 10:32 -0700 |
| Last post | 2013-05-25 03:15 -0700 |
| Articles | 12 — 7 participants |
Total Beginner - Extracting Data from a Database Online (Screenshot) logan.c.graham@gmail.com - 2013-05-24 10:32 -0700
Re: Total Beginner - Extracting Data from a Database Online (Screenshot) Dave Angel <davea@davea.name> - 2013-05-24 15:41 -0400
RE: Total Beginner - Extracting Data from a Database Online (Screenshot) Carlos Nepomuceno <carlosnepomuceno@outlook.com> - 2013-05-25 02:36 +0300
Re: Total Beginner - Extracting Data from a Database Online (Screenshot) John Ladasky <john_ladasky@sbcglobal.net> - 2013-05-25 18:33 -0700
Re: Total Beginner - Extracting Data from a Database Online (Screenshot) logan.c.graham@gmail.com - 2013-05-27 17:58 -0700
RE: Total Beginner - Extracting Data from a Database Online (Screenshot) Carlos Nepomuceno <carlosnepomuceno@outlook.com> - 2013-05-28 04:21 +0300
RE: Total Beginner - Extracting Data from a Database Online (Screenshot) Phil Connell <pconnell@gmail.com> - 2013-05-28 07:40 +0100
Re: Total Beginner - Extracting Data from a Database Online (Screenshot) Dave Angel <davea@davea.name> - 2013-05-24 21:16 -0400
Re: Total Beginner - Extracting Data from a Database Online (Screenshot) Chris Angelico <rosuav@gmail.com> - 2013-05-25 13:22 +1000
Re: Total Beginner - Extracting Data from a Database Online (Screenshot) logan.c.graham@gmail.com - 2013-05-25 17:48 -0700
Total Beginner - Extracting Data from a Database Online (Screenshot) "neil.suffield@gmail.com" <neil.suffield@gmail.com> - 2013-05-25 03:13 -0700
Total Beginner - Extracting Data from a Database Online (Screenshot) "neil.suffield@gmail.com" <neil.suffield@gmail.com> - 2013-05-25 03:15 -0700
| From | logan.c.graham@gmail.com |
|---|---|
| Date | 2013-05-24 10:32 -0700 |
| Subject | Total Beginner - Extracting Data from a Database Online (Screenshot) |
| Message-ID | <b3730ef1-90bb-4ef4-8683-239e722aa1da@googlegroups.com> |
Hey guys, I'm learning Python and I'm experimenting with different projects -- I like learning by doing. I'm wondering if you can help me here: http://i.imgur.com/KgvSKWk.jpg What this is is a publicly-accessible webpage that's a simple database of people who have used the website. Ideally what I'd like to end up with is an excel spreadsheet with data from the columns #fb, # vids, fb sent?, # email tm. I'd like to use Python to do it -- crawl the page and extract the data in a usable way. I'd love your input! I'm just a learner.
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2013-05-24 15:41 -0400 |
| Message-ID | <mailman.2076.1369424506.3114.python-list@python.org> |
| In reply to | #45904 |
On 05/24/2013 01:32 PM, logan.c.graham@gmail.com wrote:
> Hey guys,
>
> I'm learning Python
Welcome.
> and I'm experimenting with different projects -- I like learning by doing. I'm wondering if you can help me here:
>
>na
>
> What this is is a publicly-accessible webpage
No, it's just a jpeg file, an image.
> that's a simple database of people who have used the website. Ideally what I'd like to end up with is an excel spreadsheet with data from the columns #fb, # vids, fb sent?, # email tm.
>
> I'd like to use Python to do it -- crawl the page and extract the data in a usable way.
>
But there's no page to crawl. You may have to start by finding an ocr to interpret the image as characters. Or find some other source for your data.
> I'd love your input! I'm just a learner.
>
--
DaveA
| From | Carlos Nepomuceno <carlosnepomuceno@outlook.com> |
|---|---|
| Date | 2013-05-25 02:36 +0300 |
| Message-ID | <mailman.2088.1369438663.3114.python-list@python.org> |
| In reply to | #45904 |
### table_data_extraction.py ###
# Usage: tables[table][row][column]
# tables[0] : 1st table
# tables[1][2] : 3rd row of 2nd table
# tables[3][4][5] : cell content of 6th column of 5th row of 4th table
# len(tables) : quantity of tables
# len(tables[6]) : quantity of rows of 7th table
# len(tables[7][8]): quantity of columns of 9th row of 8th table
import re
import urllib2
#to retrieve the contents of the page
page = urllib2.urlopen("http://example.com/page.html").read().strip()
#to create the tables list
tables=[[re.findall('<TD>(.*?)</TD>',r,re.S) for r in re.findall('<TR>(.*?)</TR>',t,re.S)] for t in re.findall('<TABLE>(.*?)</TABLE>',page,re.S)]
Pretty simple. Good luck!
| From | John Ladasky <john_ladasky@sbcglobal.net> |
|---|---|
| Date | 2013-05-25 18:33 -0700 |
| Message-ID | <cceeff0e-611b-40eb-83a1-e45c37a4f04e@googlegroups.com> |
| In reply to | #45929 |
On Friday, May 24, 2013 4:36:35 PM UTC-7, Carlos Nepomuceno wrote:
> #to create the tables list
> tables=[[re.findall('<TD>(.*?)</TD>',r,re.S) for r in re.findall('<TR>(.*?)</TR>',t,re.S)] for t in re.findall('<TABLE>(.*?)</TABLE>',page,re.S)]
>
>
> Pretty simple.
Two nested list comprehensions, with regex pattern matching?
Logan did say he was a "total beginner." :^)
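For readers new to comprehensions, the one-liner quoted above can be unrolled into plain loops. This is only a sketch, not Carlos's code: it is written for Python 3, the `sample_page` string is made up for illustration, and like the original pattern it only matches upper-case tags that carry no attributes.

```python
import re

# Hypothetical stand-in for the downloaded page.
sample_page = (
    "<TABLE>"
    "<TR><TD>name</TD><TD># fb</TD></TR>"
    "<TR><TD>alice</TD><TD>3</TD></TR>"
    "</TABLE>"
)

tables = []
# Outer loop: one pass per <TABLE>...</TABLE> block.
for table_html in re.findall('<TABLE>(.*?)</TABLE>', sample_page, re.S):
    rows = []
    # Middle loop: one pass per <TR>...</TR> block inside that table.
    for row_html in re.findall('<TR>(.*?)</TR>', table_html, re.S):
        # Innermost step: findall already returns the list of cell texts.
        cells = re.findall('<TD>(.*?)</TD>', row_html, re.S)
        rows.append(cells)
    tables.append(rows)

print(tables[0][1][1])  # 2nd column of the 2nd row of the 1st table
```

Each `for` corresponds to one `for ... in re.findall(...)` clause of the comprehension, read from right to left.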
| From | logan.c.graham@gmail.com |
|---|---|
| Date | 2013-05-27 17:58 -0700 |
| Message-ID | <faa2c4de-d033-4a01-a9ab-79189f39867e@googlegroups.com> |
| In reply to | #46025 |
On Saturday, May 25, 2013 6:33:25 PM UTC-7, John Ladasky wrote:
> On Friday, May 24, 2013 4:36:35 PM UTC-7, Carlos Nepomuceno wrote:
> > #to create the tables list
> > tables=[[re.findall('<TD>(.*?)</TD>',r,re.S) for r in re.findall('<TR>(.*?)</TR>',t,re.S)] for t in re.findall('<TABLE>(.*?)</TABLE>',page,re.S)]
> >
> > Pretty simple.
>
> Two nested list comprehensions, with regex pattern matching?
> Logan did say he was a "total beginner." :^)
Oh goodness, yes, I have no clue.
| From | Carlos Nepomuceno <carlosnepomuceno@outlook.com> |
|---|---|
| Date | 2013-05-28 04:21 +0300 |
| Message-ID | <mailman.2272.1369704073.3114.python-list@python.org> |
| In reply to | #46239 |
----------------------------------------
> Date: Mon, 27 May 2013 17:58:00 -0700
> Subject: Re: Total Beginner - Extracting Data from a Database Online (Screenshot)
> From: logan.c.graham@gmail.com
> To: python-list@python.org
[...]
>
> Oh goodness, yes, I have no clue.
For example:
# to retrieve the contents of all column '# fb' (11th column from the image you sent)
c11 = [tables[0][r][10] for r in range(len(tables[0]))]
#      ----------------                -------------
#      this is the content             this is the quantity
#      of the 11th cell                of rows in table[0]
#      of row 'r'
| From | Phil Connell <pconnell@gmail.com> |
|---|---|
| Date | 2013-05-28 07:40 +0100 |
| Message-ID | <mailman.2282.1369723209.3114.python-list@python.org> |
| In reply to | #46239 |
On 28 May 2013 02:21, "Carlos Nepomuceno" <carlosnepomuceno@outlook.com> wrote:
>
> ----------------------------------------
> > Date: Mon, 27 May 2013 17:58:00 -0700
> > Subject: Re: Total Beginner - Extracting Data from a Database Online (Screenshot)
> > From: logan.c.graham@gmail.com
> > To: python-list@python.org
> [...]
> >
> > Oh goodness, yes, I have no clue.
>
> For example:
>
> # to retrieve the contents of all column '# fb' (11th column from the image you sent)
>
> c11 = [tables[0][r][10] for r in range(len(tables[0]))]
Or rather:
c11 = [row[10] for row in tables[0]]
In most cases, range(len(x)) is a sign that you're doing it wrong :)
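Both spellings build the same list, and once the rows are in that shape they can be written out as CSV, which Excel opens directly (the spreadsheet the original post asked for). A small Python 3 sketch follows; the `tables` data and the column index 1 are invented for illustration (the thread's example uses index 10), and an in-memory buffer stands in for a real output file.

```python
import csv
import io

# Hypothetical data in the tables[table][row][column] layout used above.
tables = [[
    ["name", "# fb"],
    ["alice", "3"],
    ["bob", "7"],
]]

col = 1  # assumed column index for this sample; the thread's example uses 10

# Index-based version versus direct iteration over the rows.
by_index = [tables[0][r][col] for r in range(len(tables[0]))]
by_row = [row[col] for row in tables[0]]
print(by_index == by_row)

# Write the first table as CSV. For a real file one would use
# open("out.csv", "w", newline="") instead of the StringIO buffer.
buffer = io.StringIO()
csv.writer(buffer).writerows(tables[0])
print(buffer.getvalue())
```

Iterating over the rows directly avoids the index bookkeeping and reads closer to the intent: "take one cell from every row".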
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2013-05-24 21:16 -0400 |
| Message-ID | <mailman.2096.1369444611.3114.python-list@python.org> |
| In reply to | #45904 |
On 05/24/2013 07:36 PM, Carlos Nepomuceno wrote:
>
> <SNIP>
> page = urllib2.urlopen("http://example.com/page.html").read().strip()
>
> #to create the tables list
> tables=[[re.findall('<TD>(.*?)</TD>',r,re.S) for r in re.findall('<TR>(.*?)</TR>',t,re.S)] for t in re.findall('<TABLE>(.*?)</TABLE>',page,re.S)]
>
>
> Pretty simple. Good luck!
Only if the page is html, which the OP's was not. It was an image. Try
parsing that with regex.
--
DaveA
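One way to catch this case early is to look at the Content-Type of the response before trying to parse it. A minimal Python 3 sketch; `looks_like_html` is a hypothetical helper written for this example, not part of any library.

```python
def looks_like_html(content_type):
    """Return True if a Content-Type header value names an HTML document."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


print(looks_like_html("text/html; charset=utf-8"))
print(looks_like_html("image/jpeg"))
```

With `urllib.request.urlopen` in Python 3, the header value should be available as `response.headers.get("Content-Type", "")`; a JPEG such as the imgur link would be rejected by this check.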
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-05-25 13:22 +1000 |
| Message-ID | <mailman.2099.1369452182.3114.python-list@python.org> |
| In reply to | #45904 |
On Sat, May 25, 2013 at 3:32 AM, <logan.c.graham@gmail.com> wrote:
> http://i.imgur.com/KgvSKWk.jpg
>
> What this is is a publicly-accessible webpage...
If that's a screenshot of something that we'd be able to access directly, then why not just post a link to the actual thing? More likely I'm thinking it's NOT publicly accessible, which is why it's been censored.
ChrisA
| From | logan.c.graham@gmail.com |
|---|---|
| Date | 2013-05-25 17:48 -0700 |
| Message-ID | <3126e4b6-d685-4ac8-8532-134f46a904ba@googlegroups.com> |
| In reply to | #45945 |
Sorry to be unclear -- it's a screenshot of the webpage, which is publicly accessible, but it contains sensitive information. A bad combination, admittedly, and something that'll be soon fixed.
| From | "neil.suffield@gmail.com" <neil.suffield@gmail.com> |
|---|---|
| Date | 2013-05-25 03:13 -0700 |
| Message-ID | <d17c7b10-67aa-4fd3-bffc-fbd6dab81d71@googlegroups.com> |
| In reply to | #45904 |
If you are talking about accessing a web page, rather than an image, then you want to do what is known as screen scraping. One of the best tools for this is called BeautifulSoup. http://www.crummy.com/software/BeautifulSoup/
| From | "neil.suffield@gmail.com" <neil.suffield@gmail.com> |
|---|---|
| Date | 2013-05-25 03:15 -0700 |
| Message-ID | <29195880-e277-486e-a5dc-77110b093d4e@googlegroups.com> |
| In reply to | #45904 |
If you are talking about accessing a web page, rather than an image, then what you want to do is known as 'screen scraping'. One of the best tools for this is called BeautifulSoup. http://www.crummy.com/software/BeautifulSoup/
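BeautifulSoup is a third-party package, so as a dependency-free illustration of the same idea (using a real HTML parser rather than regular expressions), here is a sketch built on Python 3's standard `html.parser` module. It is not BeautifulSoup's API; `TableExtractor` and `sample_page` are names made up for this example, and nested tables or cells without closing tags are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect cell text as tables[table][row][column]."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names, so <TD> and <td> both arrive as "td".
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self.tables[-1][-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the current cell.
        if self._in_cell:
            self.tables[-1][-1][-1] += data


# Hypothetical page fragment with lower-case tags and an attribute,
# which the <TD>-only regular expressions in this thread would miss.
sample_page = """
<table class="users">
  <tr><th>name</th><th># fb</th></tr>
  <tr><td>alice</td><td> 3 </td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(sample_page)
parser.close()

# Strip surrounding whitespace from every cell.
tables = [[[cell.strip() for cell in row] for row in t] for t in parser.tables]
print(tables)
```

The result has the same `tables[table][row][column]` shape as the regex version earlier in the thread, so the column-extraction examples apply unchanged.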