Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #92162
| References | <099a955d-134a-46d6-bdba-61ec2b1eb44f@googlegroups.com> <mailman.157.1433422910.13271.python-list@python.org> <7317a823-5ed6-4078-886b-0d3caf897a84@googlegroups.com> |
|---|---|
| From | Ian Kelly <ian.g.kelly@gmail.com> |
| Date | 2015-06-05 13:48 -0600 |
| Subject | Re: Get html DOM tree by only basic builtin moudles |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.206.1433533780.13271.python-list@python.org> (permalink) |
On Fri, Jun 5, 2015 at 12:10 PM, Wesley <nispray@gmail.com> wrote:
> Hi Laura,
> Sure, I got special requirement that just parse html file into DOM tree, by only general basic modules, and based on my DOM tree structure, draft an bitmap.
>
> So, could you give me an direction how to get the DOM tree?
> Currently, I just think out to use something like stack, I mean, maybe read the file line by line, adding to a stack data structure(list for example), and, then, got the parent/child relation .etc
>
> I don't know if what I said is easy to achieve, I am just trying.
> Any better suggestions will be great appreciated.
If you want to recreate the same DOM structure that would be created
by a browser, the standardized algorithm to do so is very complicated,
but you can find it at
http://www.w3.org/TR/2011/WD-html5-20110113/parsing.html.
If you're not necessarily seeking perfect fidelity, I would encourage
you to try to find some way to incorporate beautifulsoup into your
project. It likely won't produce the same structure that a real
browser would, but it should do well enough to scrape from even badly
malformed html.
I recommend against using an XML parser, because HTML isn't XML, and
such a parser may choke even on perfectly valid HTML such as this:
<!DOCTYPE html>
<html>
<head><title>Document</title></head>
<body>
First line
<br>
Second line
</body>
</html>
Back to comp.lang.python | Previous | Next — Previous in thread | Next in thread | Find similar | Unroll thread
Get html DOM tree by only basic builtin moudles Wesley <nispray@gmail.com> - 2015-06-04 04:58 -0700
Re: Get html DOM tree by only basic builtin moudles Laura Creighton <lac@openend.se> - 2015-06-04 15:01 +0200
Re: Get html DOM tree by only basic builtin moudles Wesley <nispray@gmail.com> - 2015-06-05 11:10 -0700
Re: Get html DOM tree by only basic builtin moudles Ian Kelly <ian.g.kelly@gmail.com> - 2015-06-05 13:48 -0600
Re: Get html DOM tree by only basic builtin moudles Wesley <nispray@gmail.com> - 2015-06-05 16:24 -0700
csiph-web