To: python-list@python.org
From: Gelonida N <gelonida@gmail.com>
Subject: lxml precaching DTD for document verification.
Date: Sun, 27 Nov 2011 19:57:29 +0100
Newsgroups: comp.lang.python
Message-ID: <mailman.3078.1322420265.27778.python-list@python.org>

Hi,

I'd like to validate some (X)HTML / HTML5 / XML documents from a server.

These documents have a very limited number of different doc types / DTDs.

So what I would like to do is build a small DTD cache, plus some code
that avoids fetching the same DTDs from the net over and over.

What would be the best way to do this?
I guess that the fields of an ElementTree that I have to look at are
docinfo.public_id
docinfo.system_url
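
To make the two fields concrete, here is a minimal sketch (Python 3 and an
inline sample document are assumptions, not part of my setup) that parses a
document without fetching its DTD and prints what lxml reports. Note that
lxml spells the second attribute system_url.

```python
from io import BytesIO

from lxml import etree

# Hypothetical sample document; only its DOCTYPE matters here.
SAMPLE = (
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    b'"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b"<head><title>t</title></head><body/></html>"
)


def doctype_ids(source):
    """Return (public_id, system_url) of a document given as file or path."""
    # The default parser does not load the external DTD, so no network access.
    doc = etree.parse(source)
    return doc.docinfo.public_id, doc.docinfo.system_url


public_id, system_url = doctype_ids(BytesIO(SAMPLE))
print(public_id)
print(system_url)
```

Both values can be None for documents without a DOCTYPE (plain XML or
HTML5, for instance), so a cache lookup has to cope with that.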

The documentation also mentions catalogues, but I don't know how to
use a catalogue, or how to find out what is in my catalogue
and what isn't.
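
For what it's worth, here is a rough sketch of what I understand a catalogue
to be: an OASIS XML catalog file mapping public / system identifiers to local
files. As far as I know libxml2 (and therefore lxml) finds such a file through
the XML_CATALOG_FILES environment variable, which probably has to be set
before the first parse. The helper names (write_catalog, list_catalog,
activate_catalog) are made up for illustration, and Python 3 is assumed.

```python
import os
import xml.etree.ElementTree as ET
from pathlib import Path

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"
ET.register_namespace("", CATALOG_NS)


def write_catalog(path, entries):
    """Write an OASIS XML catalog.

    entries: iterable of (public_id, system_url, local_path) tuples.
    """
    root = ET.Element("{%s}catalog" % CATALOG_NS)
    for public_id, system_url, local_path in entries:
        uri = Path(local_path).resolve().as_uri()
        if public_id:
            ET.SubElement(root, "{%s}public" % CATALOG_NS,
                          {"publicId": public_id, "uri": uri})
        if system_url:
            ET.SubElement(root, "{%s}system" % CATALOG_NS,
                          {"systemId": system_url, "uri": uri})
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


def list_catalog(path):
    """Return {identifier: local uri} for the public/system entries."""
    mapping = {}
    for elem in ET.parse(path).getroot():
        if elem.tag == "{%s}public" % CATALOG_NS:
            mapping[elem.get("publicId")] = elem.get("uri")
        elif elem.tag == "{%s}system" % CATALOG_NS:
            mapping[elem.get("systemId")] = elem.get("uri")
    return mapping


def activate_catalog(path):
    """Point libxml2 at the catalog (assumption: call before any parsing)."""
    os.environ["XML_CATALOG_FILES"] = path
```

With something like this, "what is in my catalogue" would just be
list_catalog(path), and a DTD is covered if its public id or system URL is
one of the keys. This only reads top-level entries; nested groups or
delegated catalogs are ignored.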


Below is a non-working skeleton (first shot):
---------------------------------------------
Would this be the right way?

### functions marked with '???' are not implemented / are the ones
### where I don't know whether they exist already.

import os
import urllib

from lxml import etree

cache_dir = os.path.join(os.environ['HOME'], '.my_dtd_cache')

def get_from_cache(docinfo):
    """ the function which I'd like to implement most efficiently """
    fpi = docinfo.public_id
    uri = docinfo.system_url
    dtd = ???get_from_dtd_cache(fpi, uri)
    if dtd is not None:
        return dtd
    # how can I check what is in my 'catalogue'
    if ???dtd_in_catalogue(??):
        return ???get_dtd_from_catalogue???
    # not cached anywhere yet: download the DTD once and keep it on disk
    dtd_filename = ???create_cache_filename(docinfo)
    (fname, _headers) = urllib.urlretrieve(uri, dtd_filename)
    return etree.DTD(fname)


def check_doc_cached(filename):
    """ function, which should report errors
        if a doc doesn't validate.
    """
    doc = etree.parse(filename)
    dtd = get_from_cache(doc.docinfo)
    rslt = dtd.validate(doc)
    if not rslt:
        print("validate error:")
        print(dtd.error_log.filter_from_errors()[0])
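
To show what I have in mind for the '???' cache helpers, here is a rough
sketch of the download-once part. It assumes Python 3 (urllib.request instead
of urllib), and the names cache_filename and fetch_dtd are invented for
illustration; I don't know whether lxml already offers something equivalent.

```python
import hashlib
import os
import urllib.request

CACHE_DIR = os.path.join(os.path.expanduser("~"), ".my_dtd_cache")


def cache_filename(public_id, system_url, cache_dir=CACHE_DIR):
    """Map a (public id, system URL) pair to a stable file name in the cache."""
    key = "%s\n%s" % (public_id or "", system_url or "")
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest + ".dtd")


def fetch_dtd(public_id, system_url, cache_dir=CACHE_DIR):
    """Return a local path for the DTD, downloading it only on a cache miss."""
    path = cache_filename(public_id, system_url, cache_dir)
    if not os.path.exists(path):
        os.makedirs(cache_dir, exist_ok=True)
        # Download to a temporary name first so that an interrupted
        # transfer never leaves a truncated DTD in the cache.
        tmp_path = path + ".part"
        urllib.request.urlretrieve(system_url, tmp_path)
        os.replace(tmp_path, path)
    return path
```

The validation step would then be roughly
etree.DTD(fetch_dtd(doc.docinfo.public_id, doc.docinfo.system_url)).
One caveat I can already see: the XHTML DTDs pull in entity files
(xhtml-lat1.ent and friends) through relative system identifiers, so storing
a single DTD under a hashed name is probably not enough for them.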