Python/Django Extract and append only new links

Started by	Max Cuban <edzeame@gmail.com>
First post	2013-12-31 07:19 -0800
Last post	2014-01-01 10:30 +0100
Articles	2 — 2 participants

  Python/Django Extract and append only new links Max Cuban <edzeame@gmail.com> - 2013-12-31 07:19 -0800
    Re: Python/Django Extract and append only new links Piet van Oostrum <piet@vanoostrum.org> - 2014-01-01 10:30 +0100

#62934 — Python/Django Extract and append only new links

From	Max Cuban <edzeame@gmail.com>
Date	2013-12-31 07:19 -0800
Subject	Python/Django Extract and append only new links
Message-ID	<mailman.4759.1388509472.18130.python-list@python.org>

I am putting together a project using Python 2.7 and Django 1.5 on Windows 7.
I believe this belongs on the Django group, but I haven't had any help from
there, so I figured I would try the Python list.

I have the following view in views.py:

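import re
import urllib2
import urlparse

from bs4 import BeautifulSoup
from django.core.paginator import Paginator, PageNotAnInteger, EmptyPage
from django.shortcuts import render_to_response

def foo():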
    site = "http://www.foo.com/portal/jobs"
    hdr = {'User-Agent' : 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    soup = BeautifulSoup(jobpass)
    for tag in soup.find_all('a', href = True):
        tag['href'] = urlparse.urljoin('http://www.businessghana.com/portal/',
                                       tag['href'])
    return map(str, soup.find_all('a', href = re.compile('.getJobInfo')))
def example():
    site = "http://example.com"
    hdr = {'User-Agent' : 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    soup = BeautifulSoup(jobpass)
    return map(str, soup.find_all('a', href = re.compile('.display-job')))
# Both sites are scraped once, when this module is first imported.
foo_links = foo()
example_links = example()
def all_links():
    return (foo_links + example_links)
def display_links(request):
    name = all_links()
    paginator = Paginator(name, 25)
    page = request.GET.get('page')
    try:
        name = paginator.page(page)
    except PageNotAnInteger:
        name = paginator.page(1)
    except EmptyPage:
        name = paginator.page(paginator.num_pages)
    return render_to_response('jobs.html', {'name' : name})

My template looks like this:

<ol>
{% for link in name %}
  <li> {{ link|safe }}</li>
{% endfor %}
 </ol>
 <div class="pagination">
<span class= "step-links">
    {% if name.has_previous %}
        <a href="?page={{ name.previous_page_number }}">Previous</a>
    {% endif %}
    <span class = "current">
        Page {{ name.number }} of {{ name.paginator.num_pages}}.
    </span>
    {% if name.has_next %}
        <a href="?page={{ name.next_page_number}}">next</a>
    {% endif %}
</span>
 </div>

Right now, as my code stands, every time I run it, it scrapes all the links on
the front page of the selected sites and presents them paginated *all afresh*.
However, I don't think it's a good idea for the script to read and write all
the previously extracted links all over again, so I would like to check for
and append only new links. I would like to save the previously scraped links
so that over the course of, say, a week, all the links that have appeared on
the front page of these sites will be available on my site as older pages.

It's my first programming project and I don't know how to incorporate this
logic into my code.

Any help/pointers/references will be greatly appreciated.

regards, Max

#62953

From	Piet van Oostrum <piet@vanoostrum.org>
Date	2014-01-01 10:30 +0100
Message-ID	<m2y5306ocn.fsf@cochabamba.vanoostrum.org>
In reply to	#62934

Max Cuban <edzeame@gmail.com> writes:

> I am putting together a project using Python 2.7 and Django 1.5 on
> Windows 7. I believe this belongs on the Django group, but I haven't
> had any help from there, so I figured I would try the Python list.
> I have the following view:
> views.py:

[snip]

> Right now, as my code stands, every time I run it, it scrapes all the
> links on the front page of the selected sites and presents them
> paginated all afresh. However, I don't think it's a good idea for the
> script to read and write all the previously extracted links all over
> again, so I would like to check for and append only new links.
> I would like to save the previously scraped links so that over the
> course of, say, a week, all the links that have appeared on the
> front page of these sites will be available on my site as older pages.
>
> It's my first programming project and I don't know how to incorporate
> this logic into my code.
>
> Any help/pointers/references will be greatly appreciated.
>
> regards, Max

I don't know anything about Django, but I don't think this is a Django
question.

I think the best way would be to put the URLs in a database together with
the time at which they were retrieved. The next time, you could retrieve
the links from the database and, when any are present, sort them by
retrieval time and put them at the end of the list.
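
As a rough sketch of that idea, assuming SQLite through Python's standard
sqlite3 module: the table name "links", its columns, and the three helper
functions are all made up for illustration, and a Django model would serve
the same purpose. Any string can be used as the key, so the stored value
could be a bare URL or the whole scraped tag.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) the link store. Table and column names are hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links ("
        " url TEXT PRIMARY KEY,"
        " retrieved REAL NOT NULL)"
    )
    return conn


def add_new_links(conn, urls, now=None):
    """Store only the URLs that are not in the table yet; return those new ones."""
    if now is None:
        now = time.time()
    new = []
    for url in urls:
        seen = conn.execute(
            "SELECT 1 FROM links WHERE url = ?", (url,)
        ).fetchone()
        if seen is None:
            conn.execute(
                "INSERT INTO links (url, retrieved) VALUES (?, ?)", (url, now)
            )
            new.append(url)
    conn.commit()
    return new


def stored_links(conn):
    """Return all stored URLs, most recently retrieved first."""
    rows = conn.execute("SELECT url FROM links ORDER BY retrieved DESC, rowid")
    return [row[0] for row in rows]
```

With something like this, the view would call add_new_links() on each freshly
scraped list and paginate the result of stored_links(), so older links stay
available and end up on the later pages.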

Now, if you want to do this on a per-user basis, you should also store
user information with each link (and then it would partly be a Django
problem, because you would get the user id from Django).
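
A per-user variant might look roughly like the following, again assuming
SQLite via the standard sqlite3 module. The table "user_links", its columns,
and both helper functions are hypothetical; user_id is assumed to be whatever
integer id the web framework supplies for the logged-in user.

```python
import sqlite3
import time


def open_user_store(path=":memory:"):
    """Open (or create) a per-user link store. All names here are hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS user_links ("
        " user_id INTEGER NOT NULL,"
        " url TEXT NOT NULL,"
        " retrieved REAL NOT NULL,"
        " PRIMARY KEY (user_id, url))"
    )
    return conn


def add_links_for_user(conn, user_id, urls, now=None):
    """Store the URLs this user has not seen yet; return those new ones."""
    if now is None:
        now = time.time()
    new = []
    for url in urls:
        seen = conn.execute(
            "SELECT 1 FROM user_links WHERE user_id = ? AND url = ?",
            (user_id, url),
        ).fetchone()
        if seen is None:
            conn.execute(
                "INSERT INTO user_links (user_id, url, retrieved)"
                " VALUES (?, ?, ?)",
                (user_id, url, now),
            )
            new.append(url)
    conn.commit()
    return new


def links_for_user(conn, user_id):
    """Return one user's stored URLs, most recently retrieved first."""
    rows = conn.execute(
        "SELECT url FROM user_links WHERE user_id = ?"
        " ORDER BY retrieved DESC, rowid",
        (user_id,),
    )
    return [row[0] for row in rows]
```

The composite key (user_id, url) is what lets the same link be recorded once
for each user rather than once overall.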

-- 
Piet van Oostrum <piet@vanoostrum.org>
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]
