lxml precaching DTD for document verification.

Started by	Gelonida N <gelonida@gmail.com>
First post	2011-11-27 19:57 +0100
Last post	2011-11-28 01:32 +0100
Articles	4 — 3 participants

  lxml precaching DTD for document verification. Gelonida N <gelonida@gmail.com> - 2011-11-27 19:57 +0100
    Re: lxml precaching DTD for document verification. Roy Smith <roy@panix.com> - 2011-11-27 15:29 -0500
      Re: lxml precaching DTD for document verification. John Gordon <gordon@panix.com> - 2011-11-27 21:33 +0000
        Re: lxml precaching DTD for document verification. Gelonida N <gelonida@gmail.com> - 2011-11-28 01:32 +0100

#16295 — lxml precaching DTD for document verification.

From	Gelonida N <gelonida@gmail.com>
Date	2011-11-27 19:57 +0100
Subject	lxml precaching DTD for document verification.
Message-ID	<mailman.3078.1322420265.27778.python-list@python.org>

Hi,

I'd like to verify some (x)html / html5 / xml documents from a server.

These documents have a very limited number of different doc types / DTDs.

So what I would like to do is to build a small DTD cache and some code
that would avoid fetching the DTDs from the net over and over.

What would be the best way to do this?
I guess that the fields of an ElementTree that I have to look at are
docinfo.public_id
docinfo.system_url

There's also mention of a catalogue, but I don't know how to
use a catalogue, or how to find out what is inside my catalogue
and what isn't.
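
(Editor's sketch) In libxml2 terms, a catalogue is an OASIS XML catalog file that maps public identifiers to local copies of a DTD, so "what is inside" is whatever those files list. The snippet below is a minimal, standard-library-only sketch that generates such a file. The function name `build_catalog` and the local path are hypothetical. Pointing the `XML_CATALOG_FILES` environment variable at the written file before the first parse is an assumption that should be checked against the libxml2 and lxml documentation.

```python
import xml.etree.ElementTree as ET

# Namespace used by OASIS XML catalog files.
CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"


def build_catalog(entries):
    """Return an OASIS XML catalog as text.

    `entries` maps a public identifier to the URI of a local DTD copy.
    """
    # Serialize the catalog namespace as the default namespace.
    ET.register_namespace("", CATALOG_NS)
    root = ET.Element("{%s}catalog" % CATALOG_NS)
    for public_id, local_uri in sorted(entries.items()):
        ET.SubElement(root, "{%s}public" % CATALOG_NS,
                      {"publicId": public_id, "uri": local_uri})
    return ET.tostring(root, encoding="unicode")


catalog_text = build_catalog({
    # Hypothetical local path, for illustration only.
    "-//W3C//DTD XHTML 1.0 Strict//EN":
        "file:///home/me/.my_dtd_cache/xhtml1-strict.dtd",
})
print(catalog_text)
```

Reading the same file back with any XML parser is one way to list which public identifiers the catalogue already covers.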


Below is a non-working skeleton (first shot):
---------------------------------------------
Would this be the right way??

### functions with '???' are not implemented / are the ones
### where I don't know whether they exist already.

import os
import urllib

from lxml import etree

cache_dir = os.path.join(os.environ['HOME'], '.my_dtd_cache')

def get_from_cache(docinfo):
    """ the function which I'd like to implement most efficiently """
    fpi = docinfo.public_id
    uri = docinfo.system_url
    dtd = ???get_from_dtd_cache(fpi, uri)
    if dtd is not None:
        return dtd
    # how can I check what is in my 'catalogue'
    if ???dtd_in_catalogue(??):
        return ???get_dtd_from_catalogue???
    dtd_filename = ???create_cache_filename(docinfo)
    (fname, _headers) = urllib.urlretrieve(uri, dtd_filename)
    return etree.DTD(fname)


def check_doc_cached(filename):
    """ function, which should report errors
        if a doc doesn't validate.
    """
    doc = etree.parse(filename)
    dtd = get_from_cache(doc.docinfo)
    rslt = dtd.validate(doc)
    if not rslt:
        print("validate error:")
        print(dtd.error_log.filter_from_errors()[0])
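
(Editor's sketch) The two '???' cache helpers in the skeleton above could look roughly like this. It is a minimal sketch under stated assumptions, not an lxml feature. The names `cache_filename` and `get_dtd_path` are hypothetical, the cache key is simply a hash of the public id and system URL, and the code targets Python 3, where `urlretrieve` lives in `urllib.request`. The returned path could then be handed to `etree.DTD(...)` as in the skeleton.

```python
import hashlib
import os
import tempfile
from urllib.request import urlretrieve


def cache_filename(cache_dir, public_id, system_url):
    """Map a DOCTYPE (public id + system URL) to a stable file name."""
    key = "%s\n%s" % (public_id or "", system_url or "")
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest + ".dtd")


def get_dtd_path(cache_dir, public_id, system_url, fetch=urlretrieve):
    """Return the local path of a DTD, downloading it only on a cache miss.

    `fetch(url, filename)` is injectable so the cache can be tested offline.
    """
    path = cache_filename(cache_dir, public_id, system_url)
    if not os.path.exists(path):
        os.makedirs(cache_dir, exist_ok=True)
        fetch(system_url, path)
    return path


# Offline demonstration with a fake downloader (no network access).
calls = []


def fake_fetch(url, filename):
    calls.append(url)
    with open(filename, "w") as fh:
        fh.write("<!ELEMENT root EMPTY>")


demo_dir = tempfile.mkdtemp()
doctype = ("-//EXAMPLE//DTD Demo//EN", "http://example.invalid/demo.dtd")
first = get_dtd_path(demo_dir, *doctype, fetch=fake_fetch)
second = get_dtd_path(demo_dir, *doctype, fetch=fake_fetch)
print(first == second, len(calls))
```

Hashing both identifiers keeps the file name free of characters that are awkward in paths. The second lookup finds the file on disk and never calls the downloader.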


#16296

From	Roy Smith <roy@panix.com>
Date	2011-11-27 15:29 -0500
Message-ID	<roy-6F0FD0.15291227112011@news.panix.com>
In reply to	#16295

In article <mailman.3078.1322420265.27778.python-list@python.org>,
 Gelonida N <gelonida@gmail.com> wrote:
 
> I'd like to verify some (x)html / / html5 / xml documents from a server.

I'm sure you could roll your own validator with lxml and some DTDs, but 
you would probably save yourself a huge amount of effort by just using 
the validator the W3C provides (http://validator.w3.org/).


#16297

From	John Gordon <gordon@panix.com>
Date	2011-11-27 21:33 +0000
Message-ID	<jauabo$sbr$1@reader1.panix.com>
In reply to	#16296

In <roy-6F0FD0.15291227112011@news.panix.com> Roy Smith <roy@panix.com> writes:

> In article <mailman.3078.1322420265.27778.python-list@python.org>,
>  Gelonida N <gelonida@gmail.com> wrote:
>  
> > I'd like to verify some (x)html / / html5 / xml documents from a server.

> I'm sure you could roll your own validator with lxml and some DTDs, but 
> you would probably save yourself a huge amount of effort by just using 
> the validator the W3C provides (http://validator.w3.org/).

With regards to XML, he may mean that he wants to validate that the
document conforms to a specific format, not just that it is generally
valid XML.  I don't think the w3 validator will do that.

-- 
John Gordon                   A is for Amy, who fell down the stairs
gordon@panix.com              B is for Basil, assaulted by bears
                                -- Edward Gorey, "The Gashlycrumb Tinies"


#16303

From	Gelonida N <gelonida@gmail.com>
Date	2011-11-28 01:32 +0100
Message-ID	<mailman.3081.1322440368.27778.python-list@python.org>
In reply to	#16297

On 11/27/2011 10:33 PM, John Gordon wrote:
> In <roy-6F0FD0.15291227112011@news.panix.com> Roy Smith <roy@panix.com> writes:
> 
>> In article <mailman.3078.1322420265.27778.python-list@python.org>,
>>  Gelonida N <gelonida@gmail.com> wrote:
>>  
>>> I'd like to verify some (x)html / / html5 / xml documents from a server.
> 
>> I'm sure you could roll your own validator with lxml and some DTDs, but 
>> you would probably save yourself a huge amount of effort by just using 
>> the validator the W3C provides (http://validator.w3.org/).

This validator requires that I post the code to some host.
The content that I'd like to verify is intranet content, which I am
not allowed to post to an external site.
> 
> With regards to XML, he may mean that he wants to validate that the
> document conforms to a specific format, not just that it is generally
> valid XML.  I don't think the w3 validator will do that.
> 


Basically I want to integrate this into a django unit test.

I noticed that some of the templates generate documents with
mismatching DTD headers / contents.
All of the HTML code is parsable as XML (if it isn't, it's a bug).

There are also some custom XML files, which have their specific DTDs

So I thought about validating some of the generated html with lxml.

The Django test environment allows running test clients, which are
supposedly much faster than a real HTTP client.
