Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!news.mixmin.net!feeds.phibee-telecom.net!newsfeed.xs4all.nl!newsfeed3.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
Date: Tue, 31 Dec 2013 07:19:33 -0800
Subject: Python/Django Extract and append only new links
From: Max Cuban <edzeame@gmail.com>
To: python-list@python.org
Content-Type: multipart/alternative; boundary=e89a8fb2088874105804eed61618
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.4759.1388509472.18130.python-list@python.org>
Lines: 281
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:62934

--e89a8fb2088874105804eed61618
Content-Type: text/plain; charset=ISO-8859-1

I am putting together a project using Python 2.7 and Django 1.5 on Windows 7.
I believe this belongs on the Django group, but I haven't had any help
there, so I figured I would try the Python list.
I have the following view:
views.py:

import re
import urllib2
import urlparse

from bs4 import BeautifulSoup
from django.core.paginator import Paginator, EmptyPage, PageNotAnInteger
from django.shortcuts import render_to_response

def foo():
    site = "http://www.foo.com/portal/jobs"
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    soup = BeautifulSoup(jobpass)
    # Make every relative href absolute before picking out the job links.
    for tag in soup.find_all('a', href=True):
        tag['href'] = urlparse.urljoin('http://www.businessghana.com/portal/',
                                       tag['href'])
    return map(str, soup.find_all('a', href=re.compile('.getJobInfo')))

def example():
    site = "http://example.com"
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    soup = BeautifulSoup(jobpass)
    return map(str, soup.find_all('a', href=re.compile('.display-job')))

# Both sites are scraped once, when this module is first imported.
foo_links = foo()
example_links = example()

def all_links():
    return foo_links + example_links

def display_links(request):
    name = all_links()
    paginator = Paginator(name, 25)
    page = request.GET.get('page')
    try:
        name = paginator.page(page)
    except PageNotAnInteger:
        name = paginator.page(1)
    except EmptyPage:
        name = paginator.page(paginator.num_pages)
    return render_to_response('jobs.html', {'name' : name})

My template looks like this:

<ol>
{% for link in name %}
  <li>{{ link|safe }}</li>
{% endfor %}
</ol>
<div class="pagination">
<span class="step-links">
    {% if name.has_previous %}
        <a href="?page={{ name.previous_page_number }}">Previous</a>
    {% endif %}
    <span class="current">
        Page {{ name.number }} of {{ name.paginator.num_pages }}.
    </span>
    {% if name.has_next %}
        <a href="?page={{ name.next_page_number }}">next</a>
    {% endif %}
</span>
</div>

Right now, every time my code runs, it scrapes all the links on the front
page of the selected sites and presents them paginated *all afresh*.
However, I don't think it's a good idea for the script to read and write all
the previously extracted links all over again, so I would like to check for
and append only new links. I would like to save the previously scraped links
so that over the course of, say, a week, all the links that have appeared on
the front page of these sites will be available on my site as older pages.
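
To make the "append only new links" idea concrete, here is a rough sketch of
the behaviour described above. It is not tied to Django: it assumes a small
SQLite table as the store, and the names open_store, save_new_links and
all_saved_links are hypothetical helpers made up for illustration. In a Django
project the same check could presumably live in a model with a unique field
instead.

```python
import sqlite3
import time


def open_store(path):
    """Open (or create) the SQLite file that remembers scraped links."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " html TEXT UNIQUE,"   # the <a ...> string returned by the scraper
        " first_seen REAL)"    # Unix timestamp of the first sighting
    )
    return conn


def save_new_links(conn, links):
    """Store only links not seen before; return the newly added ones."""
    added = []
    for link in links:
        row = conn.execute(
            "SELECT 1 FROM links WHERE html = ?", (link,)
        ).fetchone()
        if row is None:
            conn.execute(
                "INSERT INTO links (html, first_seen) VALUES (?, ?)",
                (link, time.time()),
            )
            added.append(link)
    conn.commit()
    return added


def all_saved_links(conn):
    """Return every stored link, newest first, ready for pagination."""
    rows = conn.execute("SELECT html FROM links ORDER BY id DESC")
    return [r[0] for r in rows]
```

With something like this, the view would paginate all_saved_links(conn)
rather than the freshly scraped list, and the scrape itself would only call
save_new_links(conn, foo() + example()), so links from earlier runs stay
available as older pages.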

It's my first programming project and I don't know how to incorporate this
logic into my code.

Any help/pointers/references will be greatly appreciated.

regards, Max

--e89a8fb2088874105804eed61618--