From: Ian Kelly
Date: Fri, 5 Jun 2015 13:48:55 -0600
Subject: Re: Get html DOM tree by only basic builtin moudles
Newsgroups: comp.lang.python

On Fri, Jun 5, 2015 at 12:10 PM, Wesley wrote:
> Hi Laura,
> Sure, I got special requirement that just parse html file into DOM tree, by only general basic modules, and based on my DOM tree structure, draft an bitmap.
>
> So, could you give me an direction how to get the DOM tree?
> Currently, I just think out to use something like stack, I mean, maybe read the file line by line, adding to a stack data structure(list for example), and, then, got the parent/child relation .etc
>
> I don't know if what I said is easy to achieve, I am just trying.
> Any better suggestions will be great appreciated.

If you want to recreate the same DOM structure that would be created by a browser, the standardized algorithm to do so is very complicated, but you can find it at http://www.w3.org/TR/2011/WD-html5-20110113/parsing.html.

If you're not necessarily seeking perfect fidelity, I would encourage you to try to find some way to incorporate beautifulsoup into your project.
It likely won't produce the same structure that a real browser would, but it should do well enough to scrape from even badly malformed HTML.

I recommend against using an XML parser, because HTML isn't XML, and such a parser may choke even on perfectly valid HTML such as this:

Document First line
Second line
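
To make both points concrete, here is a rough sketch that uses only the standard library (Python 3). It first shows an XML parser rejecting a small HTML document with an unclosed <br>, then builds a simple tree with html.parser.HTMLParser and a stack of open elements, along the lines of the stack idea in the quoted message. The SAMPLE markup and the names Node, TreeBuilder, parse_html, xml_parser_accepts and dump are my own assumptions for illustration, not anything from the thread, and the sketch makes no attempt to follow the HTML5 parsing algorithm, so the tree can differ from what a browser would build.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: fine as HTML, but not well-formed XML,
# because <br> is never closed.
SAMPLE = """<html>
<head><title>Document</title></head>
<body>
First line<br>
Second line
</body>
</html>"""

# Void elements never get an end tag, so they must not stay on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Node:
    """One element; children holds Node objects and text strings."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a naive tree by tracking the currently open elements."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self.stack[-1])
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, together with
        # anything left open inside it; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(text)


def parse_html(text):
    builder = TreeBuilder()
    builder.feed(text)
    builder.close()  # flush any buffered trailing text
    return builder.root


def xml_parser_accepts(text):
    """Return True if ElementTree can parse the text as XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


def dump(node, depth=0):
    """Print the tree, one element or text node per line."""
    for child in node.children:
        if isinstance(child, Node):
            print("  " * depth + "<" + child.tag + ">")
            dump(child, depth + 1)
        else:
            print("  " * depth + repr(child))


if __name__ == "__main__":
    print("XML parser accepts it:", xml_parser_accepts(SAMPLE))
    dump(parse_html(SAMPLE))
```

Note that this only handles the easy cases. It does nothing about implied end tags (a <p> that is closed by the next <p>, for example) or misnested elements, which is exactly where the full algorithm linked above, or a library like beautifulsoup, earns its keep.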