Groups > comp.lang.python > #45904 > unrolled thread

Total Beginner - Extracting Data from a Database Online (Screenshot)

Started by	logan.c.graham@gmail.com
First post	2013-05-24 10:32 -0700
Last post	2013-05-25 03:15 -0700
Articles	12 — 7 participants

  Total Beginner - Extracting Data from a Database Online (Screenshot) logan.c.graham@gmail.com - 2013-05-24 10:32 -0700
    Re: Total Beginner - Extracting Data from a Database Online (Screenshot) Dave Angel <davea@davea.name> - 2013-05-24 15:41 -0400
    RE: Total Beginner - Extracting Data from a Database Online (Screenshot) Carlos Nepomuceno <carlosnepomuceno@outlook.com> - 2013-05-25 02:36 +0300
      Re: Total Beginner - Extracting Data from a Database Online (Screenshot) John Ladasky <john_ladasky@sbcglobal.net> - 2013-05-25 18:33 -0700
        Re: Total Beginner - Extracting Data from a Database Online (Screenshot) logan.c.graham@gmail.com - 2013-05-27 17:58 -0700
          RE: Total Beginner - Extracting Data from a Database Online (Screenshot) Carlos Nepomuceno <carlosnepomuceno@outlook.com> - 2013-05-28 04:21 +0300
          RE: Total Beginner - Extracting Data from a Database Online (Screenshot) Phil Connell <pconnell@gmail.com> - 2013-05-28 07:40 +0100
    Re: Total Beginner - Extracting Data from a Database Online (Screenshot) Dave Angel <davea@davea.name> - 2013-05-24 21:16 -0400
    Re: Total Beginner - Extracting Data from a Database Online (Screenshot) Chris Angelico <rosuav@gmail.com> - 2013-05-25 13:22 +1000
      Re: Total Beginner - Extracting Data from a Database Online (Screenshot) logan.c.graham@gmail.com - 2013-05-25 17:48 -0700
    Total Beginner - Extracting Data from a Database Online (Screenshot) "neil.suffield@gmail.com" <neil.suffield@gmail.com> - 2013-05-25 03:13 -0700
    Total Beginner - Extracting Data from a Database Online (Screenshot) "neil.suffield@gmail.com" <neil.suffield@gmail.com> - 2013-05-25 03:15 -0700

#45904 — Total Beginner - Extracting Data from a Database Online (Screenshot)

From	logan.c.graham@gmail.com
Date	2013-05-24 10:32 -0700
Subject	Total Beginner - Extracting Data from a Database Online (Screenshot)
Message-ID	<b3730ef1-90bb-4ef4-8683-239e722aa1da@googlegroups.com>

Hey guys,

I'm learning Python and I'm experimenting with different projects -- I like learning by doing. I'm wondering if you can help me here:

http://i.imgur.com/KgvSKWk.jpg

What this is is a publicly-accessible webpage that's a simple database of people who have used the website. Ideally what I'd like to end up with is an excel spreadsheet with data from the columns #fb, # vids, fb sent?, # email tm.

I'd like to use Python to do it -- crawl the page and extract the data in a usable way.

I'd love your input! I'm just a learner.

#45914

From	Dave Angel <davea@davea.name>
Date	2013-05-24 15:41 -0400
Message-ID	<mailman.2076.1369424506.3114.python-list@python.org>
In reply to	#45904

On 05/24/2013 01:32 PM, logan.c.graham@gmail.com wrote:
> Hey guys,
>
> I'm learning Python

Welcome.

> and I'm experimenting with different projects -- I like learning by doing. I'm wondering if you can help me here:
>
> http://i.imgur.com/KgvSKWk.jpg
>
> What this is is a publicly-accessible webpage

No, it's just a jpeg file, an image.

> that's a simple database of people who have used the website. Ideally what I'd like to end up with is an excel spreadsheet with data from the columns #fb, # vids, fb sent?, # email tm.
>
> I'd like to use Python to do it -- crawl the page and extract the data in a usable way.
>

But there's no page to crawl.  You may have to start by finding an OCR
tool to interpret the image as characters.  Or find some other source
for your data.

> I'd love your input! I'm just a learner.
>


-- 
DaveA

#45929

From	Carlos Nepomuceno <carlosnepomuceno@outlook.com>
Date	2013-05-25 02:36 +0300
Message-ID	<mailman.2088.1369438663.3114.python-list@python.org>
In reply to	#45904

### table_data_extraction.py ###
# Usage: tables[table][row][column]
# tables[0]        : 1st table
# tables[1][2]     : 3rd row of 2nd table
# tables[3][4][5]  : cell content of 6th column of 5th row of 4th table
# len(tables)      : quantity of tables
# len(tables[6])   : quantity of rows of 7th table
# len(tables[7][8]): quantity of columns of 9th row of 8th table

import re
import urllib2

#to retrieve the contents of the page
page = urllib2.urlopen("http://example.com/page.html").read().strip()

#to create the tables list
tables=[[re.findall('<TD>(.*?)</TD>',r,re.S) for r in re.findall('<TR>(.*?)</TR>',t,re.S)] for t in re.findall('<TABLE>(.*?)</TABLE>',page,re.S)]


Pretty simple. Good luck!
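
For readers who find the one-liner dense, here is a rough sketch of the same idea unrolled into explicit loops. It is written for Python 3 and runs on a made-up sample string rather than a downloaded page, since the real page was never posted. The name `extract_tables`, the sample HTML, the `re.I` flag and the `[^>]*` parts (which tolerate lowercase tags and tag attributes) are my assumptions, not part of the code above.

```python
import re

# Hypothetical sample page; the real page from the thread was never posted.
SAMPLE_PAGE = """
<table border="1">
  <tr><td>alice</td><td>3</td></tr>
  <tr><td>bob</td><td>5</td></tr>
</table>
"""


def extract_tables(page):
    """Return tables[table][row][column] as nested lists of cell strings."""
    # re.S lets '.' span newlines; re.I (an addition) also matches <td>, <Td>, ...
    flags = re.S | re.I
    tables = []
    for table_html in re.findall(r'<table\b[^>]*>(.*?)</table>', page, flags):
        rows = []
        for row_html in re.findall(r'<tr\b[^>]*>(.*?)</tr>', table_html, flags):
            cells = re.findall(r'<td\b[^>]*>(.*?)</td>', row_html, flags)
            rows.append(cells)
        tables.append(rows)
    return tables


tables = extract_tables(SAMPLE_PAGE)
print(tables[0][1][0])  # first cell of the second row of the first table
```

Regexes like these only cope with simple, well-formed markup (no nested tables, for instance), so treat this as an illustration of the indexing scheme rather than a robust scraper.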

----------------------------------------
> Date: Fri, 24 May 2013 10:32:26 -0700
> Subject: Total Beginner - Extracting Data from a Database Online (Screenshot)
> From: logan.c.graham@gmail.com
> To: python-list@python.org
>
> Hey guys,
>
> I'm learning Python and I'm experimenting with different projects -- I like learning by doing. I'm wondering if you can help me here:
>
> http://i.imgur.com/KgvSKWk.jpg
>
> What this is is a publicly-accessible webpage that's a simple database of people who have used the website. Ideally what I'd like to end up with is an excel spreadsheet with data from the columns #fb, # vids, fb sent?, # email tm.
>
> I'd like to use Python to do it -- crawl the page and extract the data in a usable way.
>
> I'd love your input! I'm just a learner.
> --
> http://mail.python.org/mailman/listinfo/python-list

#46025

From	John Ladasky <john_ladasky@sbcglobal.net>
Date	2013-05-25 18:33 -0700
Message-ID	<cceeff0e-611b-40eb-83a1-e45c37a4f04e@googlegroups.com>
In reply to	#45929

On Friday, May 24, 2013 4:36:35 PM UTC-7, Carlos Nepomuceno wrote:
> #to create the tables list
> tables=[[re.findall('<TD>(.*?)</TD>',r,re.S) for r in re.findall('<TR>(.*?)</TR>',t,re.S)] for t in re.findall('<TABLE>(.*?)</TABLE>',page,re.S)]
> 
> 
> Pretty simple. 

Two nested list comprehensions, with regex pattern matching?

Logan did say he was a "total beginner."  :^)

#46239

From	logan.c.graham@gmail.com
Date	2013-05-27 17:58 -0700
Message-ID	<faa2c4de-d033-4a01-a9ab-79189f39867e@googlegroups.com>
In reply to	#46025

On Saturday, May 25, 2013 6:33:25 PM UTC-7, John Ladasky wrote:
> On Friday, May 24, 2013 4:36:35 PM UTC-7, Carlos Nepomuceno wrote:
> 
> > #to create the tables list
> 
> > tables=[[re.findall('<TD>(.*?)</TD>',r,re.S) for r in re.findall('<TR>(.*?)</TR>',t,re.S)] for t in re.findall('<TABLE>(.*?)</TABLE>',page,re.S)]
> 
> > 
> 
> > 
> 
> > Pretty simple. 
> 
> 
> 
> Two nested list comprehensions, with regex pattern matching?
> 
> 
> 
> Logan did say he was a "total beginner."  :^)



Oh goodness, yes, I have no clue.

#46240

From	Carlos Nepomuceno <carlosnepomuceno@outlook.com>
Date	2013-05-28 04:21 +0300
Message-ID	<mailman.2272.1369704073.3114.python-list@python.org>
In reply to	#46239

----------------------------------------
> Date: Mon, 27 May 2013 17:58:00 -0700
> Subject: Re: Total Beginner - Extracting Data from a Database Online (Screenshot)
> From: logan.c.graham@gmail.com
> To: python-list@python.org
[...]
>
> Oh goodness, yes, I have no clue.

For example:

# to retrieve the contents of all column '# fb' (11th column from the image you sent)

c11 = [tables[0][r][10] for r in range(len(tables[0]))]
#      ----------------                -------------
#      this is the content             this is the quantity
#      of the 11th cell                of rows in tables[0]
#      of row 'r'

#46258

From	Phil Connell <pconnell@gmail.com>
Date	2013-05-28 07:40 +0100
Message-ID	<mailman.2282.1369723209.3114.python-list@python.org>
In reply to	#46239

On 28 May 2013 02:21, "Carlos Nepomuceno" <carlosnepomuceno@outlook.com>
wrote:
>
> ----------------------------------------
> > Date: Mon, 27 May 2013 17:58:00 -0700
> > Subject: Re: Total Beginner - Extracting Data from a Database Online (Screenshot)
> > From: logan.c.graham@gmail.com
> > To: python-list@python.org
> [...]
> >
> > Oh goodness, yes, I have no clue.
>
> For example:
>
> # to retrieve the contents of all column '# fb' (11th column from the image you sent)
>
> c11 = [tables[0][r][10] for r in range(len(tables[0]))]

Or rather:

c11 = [row[10] for row in tables[0]]

In most cases, range(len(x)) is a sign that you're doing it wrong :)
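
To make the comparison concrete, here is a small sketch on made-up data. The `tables` contents and the column position are hypothetical; only the nested-list layout comes from earlier in the thread. It shows the two forms side by side, plus `enumerate()` for the cases where the row number really is needed.

```python
# Hypothetical data in the tables[table][row][column] layout from the thread.
tables = [[["name", "# fb"], ["alice", "3"], ["bob", "5"]]]

# Index-based loop: builds each row lookup from a counter.
by_index = [tables[0][r][1] for r in range(len(tables[0]))]

# Direct iteration: same result, and no counter to get wrong.
by_row = [row[1] for row in tables[0]]

# When the position itself is wanted, enumerate() supplies it.
numbered = [(i, row[1]) for i, row in enumerate(tables[0])]

print(by_index == by_row)
print(numbered)
```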

#45938

From	Dave Angel <davea@davea.name>
Date	2013-05-24 21:16 -0400
Message-ID	<mailman.2096.1369444611.3114.python-list@python.org>
In reply to	#45904

On 05/24/2013 07:36 PM, Carlos Nepomuceno wrote:
>
>      <SNIP>
> page = urllib2.urlopen("http://example.com/page.html").read().strip()
>
> #to create the tables list
> tables=[[re.findall('<TD>(.*?)</TD>',r,re.S) for r in re.findall('<TR>(.*?)</TR>',t,re.S)] for t in re.findall('<TABLE>(.*?)</TABLE>',page,re.S)]
>
>
> Pretty simple. Good luck!

Only if the page is html, which the OP's was not. It was an image.  Try 
parsing that with regex.



-- 
DaveA

#45945

From	Chris Angelico <rosuav@gmail.com>
Date	2013-05-25 13:22 +1000
Message-ID	<mailman.2099.1369452182.3114.python-list@python.org>
In reply to	#45904

On Sat, May 25, 2013 at 3:32 AM,  <logan.c.graham@gmail.com> wrote:
> http://i.imgur.com/KgvSKWk.jpg
>
> What this is is a publicly-accessible webpage...

If that's a screenshot of something that we'd be able to access
directly, then why not just post a link to the actual thing? More
likely I'm thinking it's NOT publicly accessible, which is why it's
been censored.

ChrisA

#46022

From	logan.c.graham@gmail.com
Date	2013-05-25 17:48 -0700
Message-ID	<3126e4b6-d685-4ac8-8532-134f46a904ba@googlegroups.com>
In reply to	#45945

Sorry to be unclear -- it's a screenshot of the webpage, which is publicly accessible, but it contains sensitive information. A bad combination, admittedly, and something that'll be soon fixed.

#45985

From	"neil.suffield@gmail.com" <neil.suffield@gmail.com>
Date	2013-05-25 03:13 -0700
Message-ID	<d17c7b10-67aa-4fd3-bffc-fbd6dab81d71@googlegroups.com>
In reply to	#45904

If you are talking about accessing a web page, rather than an image, then you want to do what is known as screen scraping. 

One of the best tools for this is called BeautifulSoup.

http://www.crummy.com/software/BeautifulSoup/
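
As a rough illustration of what that could look like, here is a minimal sketch, assuming Python 3 with the third-party `beautifulsoup4` package installed and a page that holds a single plain HTML table. The sample HTML and the helper names `table_rows` and `rows_to_csv` are made up for the example. The result is CSV text, which Excel can open, rather than a native spreadsheet file.

```python
import csv
import io

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical stand-in for the real page, which was never posted.
SAMPLE_PAGE = """
<table id="users">
  <tr><th>name</th><th># fb</th></tr>
  <tr><td>alice</td><td>3</td></tr>
  <tr><td>bob</td><td>5</td></tr>
</table>
"""


def table_rows(html):
    """Return the first table in the page as a list of rows of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        for tr in table.find_all("tr")
    ]


def rows_to_csv(rows):
    """Serialize rows to CSV text using the standard csv module."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = table_rows(SAMPLE_PAGE)
print(rows_to_csv(rows))
```

For a real page, the HTML string would come from a download step first, and the selected columns could be filtered out of `rows` before writing.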

#45986

From	"neil.suffield@gmail.com" <neil.suffield@gmail.com>
Date	2013-05-25 03:15 -0700
Message-ID	<29195880-e277-486e-a5dc-77110b093d4e@googlegroups.com>
In reply to	#45904

If you are talking about accessing a web page, rather than an image, then what you want to do is known as 'screen scraping'. 

One of the best tools for this is called BeautifulSoup. 

http://www.crummy.com/software/BeautifulSoup/
