Python/Django Extract and append only new links

Date	Tue, 31 Dec 2013 07:19:33 -0800
Subject	Python/Django Extract and append only new links
From	Max Cuban <edzeame@gmail.com>
To	python-list@python.org
Newsgroups	comp.lang.python
Message-ID	<mailman.4759.1388509472.18130.python-list@python.org> (permalink)

I am putting together a project using Python 2.7 and Django 1.5 on Windows 7.
I believe this belongs on the Django group, but I haven't had any help from
there, so I figured I would try the Python list.
I have the following view:
views.py:

import re
import urllib2
import urlparse

from bs4 import BeautifulSoup
from django.core.paginator import Paginator, EmptyPage, PageNotAnInteger
from django.shortcuts import render_to_response


def foo():
    site = "http://www.foo.com/portal/jobs"
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    soup = BeautifulSoup(jobpass)
    # Resolve each href against the portal base URL.
    for tag in soup.find_all('a', href=True):
        tag['href'] = urlparse.urljoin('http://www.businessghana.com/portal/',
                                       tag['href'])
    return map(str, soup.find_all('a', href=re.compile('.getJobInfo')))


def example():
    site = "http://example.com"
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    soup = BeautifulSoup(jobpass)
    return map(str, soup.find_all('a', href=re.compile('.display-job')))


# Both sites are scraped once, when this module is first imported.
foo_links = foo()
example_links = example()


def all_links():
    return foo_links + example_links


def display_links(request):
    name = all_links()
    paginator = Paginator(name, 25)
    page = request.GET.get('page')
    try:
        name = paginator.page(page)
    except PageNotAnInteger:
        # Missing or non-numeric page: show the first page.
        name = paginator.page(1)
    except EmptyPage:
        # Page number out of range: show the last page.
        name = paginator.page(paginator.num_pages)
    return render_to_response('jobs.html', {'name': name})

My template looks like this:

<ol>
{% for link in name %}
  <li>{{ link|safe }}</li>
{% endfor %}
</ol>
<div class="pagination">
  <span class="step-links">
    {% if name.has_previous %}
      <a href="?page={{ name.previous_page_number }}">Previous</a>
    {% endif %}
    <span class="current">
      Page {{ name.number }} of {{ name.paginator.num_pages }}.
    </span>
    {% if name.has_next %}
      <a href="?page={{ name.next_page_number }}">Next</a>
    {% endif %}
  </span>
</div>

Right now, as my code stands, anytime I run it, it scrapes all the links on
the front page of the selected sites and presents them paginated *all afresh*.
However, I don't think it's a good idea for the script to read and write all
the previously extracted links all over again, so I would like to check for
and append only new links. I would like to save the previously scraped links
so that over the course of, say, a week, all the links that have appeared on
the front page of these sites will be available on my site as older pages.
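
A minimal sketch of the check-and-append logic described above, assuming the saved links live in a JSON file. The function names (load_links, merge_new_links, save_links, update_store) and the file-based store are hypothetical and not part of the original code; a Django model with a unique field on the link would be another reasonable place to keep them.

```python
import json
import os


def load_links(path):
    """Return the previously saved links, or an empty list on the first run."""
    if not os.path.exists(path):
        return []
    with open(path) as handle:
        return json.load(handle)


def merge_new_links(stored, scraped):
    """Return (merged, added): stored links followed by the unseen scraped ones."""
    seen = set(stored)
    added = []
    for link in scraped:
        if link not in seen:
            seen.add(link)  # also drops duplicates within one scrape
            added.append(link)
    return stored + added, added


def save_links(path, links):
    """Overwrite the store (hypothetical JSON file) with the merged list."""
    with open(path, "w") as handle:
        json.dump(links, handle)


def update_store(path, scraped):
    """Load, merge, and save; return the full list to hand to the paginator."""
    merged, added = merge_new_links(load_links(path), scraped)
    if added:
        save_links(path, merged)
    return merged
```

With something like this in place, all_links() could return update_store(LINKS_FILE, foo() + example()), where LINKS_FILE is a hypothetical path setting, and the scraping would run on each call rather than once at import. New links land at the end of the list, so the oldest links appear on the first pages; prepending instead would show the newest first. Under Python 2.7, links containing non-ASCII characters would probably need decoding to unicode before comparison, since json.load returns unicode strings.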

It's my first programming project, and I don't know how to incorporate this
logic into my code.

Any help/pointers/references will be greatly appreciated.

regards, Max

Thread

Python/Django Extract and append only new links Max Cuban <edzeame@gmail.com> - 2013-12-31 07:19 -0800
  Re: Python/Django Extract and append only new links Piet van Oostrum <piet@vanoostrum.org> - 2014-01-01 10:30 +0100
