From: Gelonida N
To: python-list@python.org
Subject: lxml precaching DTD for document verification.
Date: Sun, 27 Nov 2011 19:57:29 +0100
Newsgroups: comp.lang.python

Hi,

I'd like to verify some (x)html / html5 / xml documents from a server.

These documents use only a very limited number of different doc types / DTDs.

So what I would like to do is build a small DTD cache and some code that avoids fetching the DTDs from the net over and over.

What would be the best way to do this?

I guess that the fields of an ElementTree that I have to look at are
    docinfo.public_id
    docinfo.system_url

There's also mention of a catalogue, but I don't know how to use a catalogue or how to find out what is inside my catalogue and what isn't.

Below is a non-working skeleton (first shot):
---------------------------------------------
Would this be the right way?

### functions with '???' are not implemented / are the ones
### where I don't know whether they exist already.
import os
import urllib
from lxml import etree

cache_dir = os.path.join(os.environ['HOME'], '.my_dtd_cache')

def get_from_cache(docinfo):
    """ the function which I'd like to implement most efficiently """
    fpi = docinfo.public_id
    uri = docinfo.system_url
    dtd = ???get_from_dtd_cache(fpi, uri)
    if dtd is not None:
        return dtd
    # how can I check what is in my 'catalogue'
    if ???dtd_in_catalogue(??):
        return ???get_dtd_from_catalogue???
    # not cached yet: download the DTD once and store it in the cache
    dtd_filename = ???create_cache_filename(docinfo)
    (fname, _headers) = urllib.urlretrieve(uri, dtd_filename)
    return etree.DTD(fname)

def check_doc_cached(filename):
    """ function which should report errors if a doc doesn't validate. """
    doc = etree.parse(filename)
    dtd = get_from_cache(doc.docinfo)
    rslt = dtd.validate(doc)
    if not rslt:
        print("validate error:")
        print(dtd.error_log.filter_from_errors()[0])
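
One possible shape for the two cache helpers that the skeleton leaves open is sketched below. It is only a sketch under assumptions, not an lxml feature: the names `cache_filename` and `get_cached_dtd_path`, the SHA-256 file naming scheme and the `.dtd` / `.part` suffixes are all made up for illustration, and it uses Python 3's `urllib.request` rather than the Python 2 `urllib` calls in the skeleton. It uses only the standard library and returns a file path, so it does not touch the catalogue question at all.

```python
import hashlib
import os
import urllib.request


def cache_filename(public_id, system_uri, cache_dir):
    """Map a (public_id, system_uri) pair to a stable file name in cache_dir."""
    # Hypothetical naming scheme: hash both identifiers so that any
    # characters they contain are safe to use in a file name.
    key = "{}\n{}".format(public_id or "", system_uri or "")
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest + ".dtd")


def get_cached_dtd_path(public_id, system_uri, cache_dir,
                        fetch=urllib.request.urlretrieve):
    """Return the local path of the DTD, downloading it only on a cache miss.

    `fetch` is any callable taking (uri, destination_path); it is a
    parameter so that the download step can be replaced, e.g. in tests.
    """
    path = cache_filename(public_id, system_uri, cache_dir)
    if os.path.exists(path):
        return path  # cache hit: no network access
    if not system_uri:
        return None  # nothing to download from
    os.makedirs(cache_dir, exist_ok=True)
    # Download to a temporary name first so that an interrupted transfer
    # does not leave a truncated file in the cache.
    tmp_path = path + ".part"
    fetch(system_uri, tmp_path)
    os.replace(tmp_path, path)
    return path
```

With the public ID and system URL taken from `docinfo`, the returned path could then be handed to `etree.DTD(path)` as in the skeleton, and a `None` result would mean the document declares no DTD location to fetch. Keeping the already parsed `etree.DTD` objects in a dict keyed by that path would additionally avoid re-parsing the same DTD for every document.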