Groups > comp.lang.python > #16295 > unrolled thread
| Started by | Gelonida N <gelonida@gmail.com> |
|---|---|
| First post | 2011-11-27 19:57 +0100 |
| Last post | 2011-11-28 01:32 +0100 |
| Articles | 4 — 3 participants |
lxml precaching DTD for document verification. Gelonida N <gelonida@gmail.com> - 2011-11-27 19:57 +0100
Re: lxml precaching DTD for document verification. Roy Smith <roy@panix.com> - 2011-11-27 15:29 -0500
Re: lxml precaching DTD for document verification. John Gordon <gordon@panix.com> - 2011-11-27 21:33 +0000
Re: lxml precaching DTD for document verification. Gelonida N <gelonida@gmail.com> - 2011-11-28 01:32 +0100
| From | Gelonida N <gelonida@gmail.com> |
|---|---|
| Date | 2011-11-27 19:57 +0100 |
| Subject | lxml precaching DTD for document verification. |
| Message-ID | <mailman.3078.1322420265.27778.python-list@python.org> |
Hi,
I'd like to verify some (x)html / / html5 / xml documents from a server.
These documents have a very limited number of different doc types / DTDs.
So what I would like to do is build a small DTD cache and some code
that avoids fetching the DTDs from the net over and over.
What would be the best way to do this?
I guess that
the fields of an ElementTree that I have to look at are
docinfo.public_id
docinfo.system_url
There's also mention of a catalogue, but I don't know how to
use a catalogue or how to find out what is inside my catalogue
and what isn't.
Below is a non-working skeleton (first shot):
---------------------------------------------
Would this be the right way?
### functions with '???' are not implemented / are the ones
### where I don't know whether they exist already.
import os
import urllib
from lxml import etree

cache_dir = os.path.join(os.environ['HOME'], '.my_dtd_cache')

def get_from_cache(docinfo):
    """ the function which I'd like to implement most efficiently """
    fpi = docinfo.public_id
    uri = docinfo.system_url
    dtd = ???get_from_dtd_cache(fpi, uri)
    if dtd is not None:
        return dtd
    # how can I check what is in my 'catalogue'
    if ???dtd_in_catalogue(??):
        return ???get_dtd_from_catalogue???
    # cache miss: download the DTD once and store it locally
    dtd_filename = ???create_cache_filename(docinfo)
    (fname, _headers) = urllib.urlretrieve(uri, dtd_filename)
    return etree.DTD(fname)

def check_doc_cached(filename):
    """ function, which should report errors
        if a doc doesn't validate.
    """
    doc = etree.parse(filename)
    dtd = get_from_cache(doc.docinfo)
    rslt = dtd.validate(doc)
    if not rslt:
        print "validate error:"
        print(dtd.error_log.filter_from_errors()[0])
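
One possible way to fill in the '???' cache helpers from the skeleton is sketched below. It is an illustration under stated assumptions, not an established lxml recipe: the names `create_cache_filename`, `fetch_dtd_cached`, the `retrieve` parameter and the hashed file naming are all hypothetical, and it is written for Python 3 (`urllib.request`) rather than the Python 2 `urllib` used in the post. It reads the DOCTYPE's system ID from `docinfo.system_url`, does not consult any XML catalogue, and does not handle DTDs that pull in further files (the XHTML 1.0 DTDs, for example, reference separate entity files that would have to be cached alongside).

```python
import hashlib
import os
import urllib.request

# Hypothetical cache location; any writable directory would do.
CACHE_DIR = os.path.join(os.path.expanduser("~"), ".my_dtd_cache")


def create_cache_filename(public_id, system_url, cache_dir=CACHE_DIR):
    """Map a (public ID, system URL) pair to a stable file name in the cache."""
    key = "\n".join([public_id or "", system_url or ""]).encode("utf-8")
    return os.path.join(cache_dir, hashlib.sha256(key).hexdigest() + ".dtd")


def fetch_dtd_cached(public_id, system_url, cache_dir=CACHE_DIR,
                     retrieve=urllib.request.urlretrieve):
    """Return a local path to the DTD, downloading only on a cache miss."""
    path = create_cache_filename(public_id, system_url, cache_dir)
    if not os.path.exists(path):
        os.makedirs(cache_dir, exist_ok=True)
        tmp_path = path + ".part"
        retrieve(system_url, tmp_path)  # the only network access
        os.replace(tmp_path, path)      # publish the file only once it is complete
    return path


def get_from_cache(docinfo, cache_dir=CACHE_DIR,
                   retrieve=urllib.request.urlretrieve):
    """Build an lxml DTD object for the DOCTYPE of a parsed document."""
    # Imported lazily so the helpers above need only the standard library.
    from lxml import etree
    path = fetch_dtd_cached(docinfo.public_id, docinfo.system_url,
                            cache_dir, retrieve)
    return etree.DTD(path)
```

With helpers like these, `check_doc_cached` from the skeleton could keep calling `get_from_cache(doc.docinfo)` unchanged; the `retrieve` parameter exists only so that the download step can be replaced, for instance in tests.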
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2011-11-27 15:29 -0500 |
| Message-ID | <roy-6F0FD0.15291227112011@news.panix.com> |
| In reply to | #16295 |
In article <mailman.3078.1322420265.27778.python-list@python.org>,
Gelonida N <gelonida@gmail.com> wrote:

> I'd like to verify some (x)html / / html5 / xml documents from a server.

I'm sure you could roll your own validator with lxml and some DTDs, but
you would probably save yourself a huge amount of effort by just using
the validator the W3C provides (http://validator.w3.org/).
| From | John Gordon <gordon@panix.com> |
|---|---|
| Date | 2011-11-27 21:33 +0000 |
| Message-ID | <jauabo$sbr$1@reader1.panix.com> |
| In reply to | #16296 |
In <roy-6F0FD0.15291227112011@news.panix.com> Roy Smith <roy@panix.com> writes:
> In article <mailman.3078.1322420265.27778.python-list@python.org>,
> Gelonida N <gelonida@gmail.com> wrote:
>
> > I'd like to verify some (x)html / / html5 / xml documents from a server.
> I'm sure you could roll your own validator with lxml and some DTDs, but
> you would probably save yourself a huge amount of effort by just using
> the validator the W3C provides (http://validator.w3.org/).
With regards to XML, he may mean that he wants to validate that the
document conforms to a specific format, not just that it is generally
valid XML. I don't think the w3 validator will do that.
--
John Gordon A is for Amy, who fell down the stairs
gordon@panix.com B is for Basil, assaulted by bears
-- Edward Gorey, "The Gashlycrumb Tinies"
| From | Gelonida N <gelonida@gmail.com> |
|---|---|
| Date | 2011-11-28 01:32 +0100 |
| Message-ID | <mailman.3081.1322440368.27778.python-list@python.org> |
| In reply to | #16297 |
On 11/27/2011 10:33 PM, John Gordon wrote:
> In <roy-6F0FD0.15291227112011@news.panix.com> Roy Smith <roy@panix.com> writes:
>
>> In article <mailman.3078.1322420265.27778.python-list@python.org>,
>> Gelonida N <gelonida@gmail.com> wrote:
>>
>>> I'd like to verify some (x)html / / html5 / xml documents from a server.
>
>> I'm sure you could roll your own validator with lxml and some DTDs, but
>> you would probably save yourself a huge amount of effort by just using
>> the validator the W3C provides (http://validator.w3.org/).

This validator requires that I post the code to some host.
The content that I'd like to verify is intranet content, which I am not
allowed to post to an external site.

>
> With regards to XML, he may mean that he wants to validate that the
> document conforms to a specific format, not just that it is generally
> valid XML. I don't think the w3 validator will do that.
>

Basically I want to integrate this into a Django unit test.
I noticed that some of the templates generate documents with
mismatching DTD headers / contents.
All of the HTML code is parsable as XML (if it isn't, it's a bug).
There are also some custom XML files, which have their specific DTDs.
So I thought about validating some of the generated HTML with lxml.
The Django test environment allows running test clients, which are
supposedly much faster than a real HTTP client.
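
A rough sketch of how such a check might be wired into a test case follows. Everything named here is hypothetical (`DTDValidationMixin`, `assertValidDocument`, `load_dtd`, `format_errors`); the only lxml calls assumed are `etree.fromstring`, `getroottree().docinfo`, `DTD.validate` and `error_log.filter_from_errors()`. The DTD lookup is left as a hook, so a local cache or any other source could be plugged in. Documents that use named entities such as `&nbsp;` would probably need a parser that loads the DTD, which this sketch does not set up.

```python
import unittest


def format_errors(error_log, limit=5):
    """Render the first few validation errors as 'line N: message' lines."""
    lines = ["line %s: %s" % (entry.line, entry.message) for entry in error_log]
    shown = lines[:limit]
    if len(lines) > limit:
        shown.append("... and %d more" % (len(lines) - limit))
    return "\n".join(shown)


class DTDValidationMixin:
    """Hypothetical mixin for unittest.TestCase or django.test.TestCase."""

    def load_dtd(self, docinfo):
        """Return an lxml DTD for this DOCTYPE; supply your own lookup here."""
        raise NotImplementedError

    def assertValidDocument(self, content):
        """Fail the test if `content` (bytes) does not validate against its DTD."""
        # Imported lazily so the module can be loaded without lxml installed.
        from lxml import etree
        doc = etree.fromstring(content).getroottree()
        dtd = self.load_dtd(doc.docinfo)
        if not dtd.validate(doc):
            self.fail("DTD validation failed:\n"
                      + format_errors(dtd.error_log.filter_from_errors()))
```

In a Django test case that mixes this in, a check might then read `self.assertValidDocument(self.client.get("/some/url/").content)`, where the URL is a placeholder and the response body is assumed to be bytes.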