Re: Get html DOM tree by only basic builtin moudles

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	Fri, 5 Jun 2015 13:48:55 -0600
Subject	Re: Get html DOM tree by only basic builtin moudles
To	Python <python-list@python.org>
Newsgroups	comp.lang.python


On Fri, Jun 5, 2015 at 12:10 PM, Wesley <nispray@gmail.com> wrote:
> Hi Laura,
>   Sure, I have a special requirement: parse an HTML file into a DOM tree using only the basic built-in modules, and then draft a bitmap based on my DOM tree structure.
>
>   So, could you give me a direction on how to get the DOM tree?
> Currently, the only idea I have is to use something like a stack. I mean, maybe read the file line by line, adding to a stack data structure (a list, for example), and then derive the parent/child relations, etc.
>
> I don't know if what I said is easy to achieve; I am just trying.
> Any better suggestions will be greatly appreciated.

If you want to recreate the same DOM structure that would be created
by a browser, the standardized algorithm to do so is very complicated,
but you can find it at
http://www.w3.org/TR/2011/WD-html5-20110113/parsing.html.

If you're not necessarily seeking perfect fidelity, I would encourage
you to try to find some way to incorporate BeautifulSoup into your
project. It likely won't produce the same structure that a real
browser would, but it should do well enough to scrape data from even
badly malformed HTML.
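
If a third-party package really is ruled out, the stack idea can be
roughed out on top of the standard library's html.parser module. The
following is only a sketch under my own assumptions: the Node class,
the TreeBuilder class, parse_html, and the VOID_ELEMENTS set are names
I made up for illustration, and the result will not match what a
browser builds for malformed input, since none of the error recovery
from the HTML5 parsing algorithm is implemented.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML ("void" elements).
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "param", "source", "track", "wbr",
}


class Node:
    """Hypothetical minimal DOM node: tag name, attributes, children."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # Node instances or plain strings (text)


class TreeBuilder(HTMLParser):
    """Builds a Node tree using a stack of currently open elements."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]  # innermost open element is last

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self.stack[-1])
        self.stack[-1].children.append(node)
        # Void elements such as <br> are never pushed, so they
        # cannot swallow the content that follows them.
        if tag not in VOID_ELEMENTS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name;
        # a stray end tag with no matching open element is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text between tags
            self.stack[-1].children.append(text)


def parse_html(source):
    """Parse an HTML string and return the synthetic document root."""
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root
```

Each start tag becomes a child of whatever element is on top of the
stack, which is where the parent/child relations come from. Dropping
whitespace-only text is a simplification of mine; keep it if your
bitmap rendering needs it.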

I recommend against using an XML parser, because HTML isn't XML, and
such a parser may choke even on perfectly valid HTML such as this:

<!DOCTYPE html>
<html>
  <head><title>Document</title></head>
  <body>
    First line
    <br>
    Second line
  </body>
</html>
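
To see the failure for yourself, here is a small check using
xml.etree.ElementTree from the standard library. The helper name
xml_parse_error is my own invention for the example. The unclosed
<br> is what should trip the parser, because XML requires every
element to be closed.

```python
import xml.etree.ElementTree as ET

# The valid HTML document from above, unchanged.
DOCUMENT = """<!DOCTYPE html>
<html>
  <head><title>Document</title></head>
  <body>
    First line
    <br>
    Second line
  </body>
</html>
"""


def xml_parse_error(source):
    """Return the ParseError message, or None if source parsed as XML."""
    try:
        ET.fromstring(source)
    except ET.ParseError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(xml_parse_error(DOCUMENT))
```

Writing the tag as <br/> would satisfy the XML parser, but you can't
count on real-world HTML files doing that.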

