| Started by | Wesley <nispray@gmail.com> |
|---|---|
| First post | 2015-06-04 04:58 -0700 |
| Last post | 2015-06-05 16:24 -0700 |
| Articles | 5 — 3 participants |
Get html DOM tree by only basic builtin modules Wesley <nispray@gmail.com> - 2015-06-04 04:58 -0700
Re: Get html DOM tree by only basic builtin modules Laura Creighton <lac@openend.se> - 2015-06-04 15:01 +0200
Re: Get html DOM tree by only basic builtin modules Wesley <nispray@gmail.com> - 2015-06-05 11:10 -0700
Re: Get html DOM tree by only basic builtin modules Ian Kelly <ian.g.kelly@gmail.com> - 2015-06-05 13:48 -0600
Re: Get html DOM tree by only basic builtin modules Wesley <nispray@gmail.com> - 2015-06-05 16:24 -0700
| From | Wesley <nispray@gmail.com> |
|---|---|
| Date | 2015-06-04 04:58 -0700 |
| Subject | Get html DOM tree by only basic builtin modules |
| Message-ID | <099a955d-134a-46d6-bdba-61ec2b1eb44f@googlegroups.com> |
Hi guys,
I know there are many modules (built-in or not, e.g. BeautifulSoup, xml, lxml, HTMLParser, etc.) that parse HTML files and output the DOM tree. However, is there a good way to get the DOM tree without using those HTML/XML-related modules? I mean, using only general standard modules, e.g. file operations, the re module, etc.
The input file is something like this:
<html>
<head>
<title>DOM Tree test</title>
</head>
<body>
<h1>Header 1</h1>
<p>Hello world!</p>
</body>
</html>
I need the DOM tree, or just something like:
html -- head -- title(DOM Tree test)
html -- body -- h1(Header 1)
html -- body -- p(Hello world!)
Thanks.
Wesley
| From | Laura Creighton <lac@openend.se> |
|---|---|
| Date | 2015-06-04 15:01 +0200 |
| Message-ID | <mailman.157.1433422910.13271.python-list@python.org> |
| In reply to | #92044 |
ElementTree is part of the Python standard library. You are better off using it than rolling your own. (If you were one of the rare people who have some very strange requirements that make you better off writing your own, you wouldn't be asking us. You'd already know.)
https://docs.python.org/2.7/library/xml.etree.elementtree.html#module-xml.etree.ElementTree
https://docs.python.org/3.4/library/xml.etree.elementtree.html
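
As a rough illustration of this suggestion, here is a minimal sketch that feeds the sample from the first post to `xml.etree.ElementTree` and prints one line per leaf element. It assumes Python 3 and input that happens to be well-formed XML; `SAMPLE` and `leaf_paths` are names made up for this example.

```python
import xml.etree.ElementTree as ET

# The sample document from the first post (it is also well-formed XML).
SAMPLE = """<html>
<head>
<title>DOM Tree test</title>
</head>
<body>
<h1>Header 1</h1>
<p>Hello world!</p>
</body>
</html>"""


def leaf_paths(element, ancestors=()):
    """Yield 'a -- b -- c(text)' for every element that has no children."""
    path = ancestors + (element.tag,)
    children = list(element)
    if not children:
        text = (element.text or "").strip()
        yield " -- ".join(path) + "(" + text + ")"
    for child in children:
        yield from leaf_paths(child, path)


root = ET.fromstring(SAMPLE)
lines = list(leaf_paths(root))
print("\n".join(lines))
```

For this particular input the printed lines match the format requested in the first post. HTML that is not also well-formed XML will not parse this way.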
| From | Wesley <nispray@gmail.com> |
|---|---|
| Date | 2015-06-05 11:10 -0700 |
| Message-ID | <7317a823-5ed6-4078-886b-0d3caf897a84@googlegroups.com> |
| In reply to | #92046 |
Hi Laura,
Sure, I have a special requirement: parse an HTML file into a DOM tree using only general, basic modules, and then draw a bitmap based on my DOM tree structure.
So, could you point me in a direction for getting the DOM tree? Currently, my only idea is to use something like a stack: read the file line by line, push what I find onto a stack data structure (a list, for example), and derive the parent/child relations from that.
I don't know whether what I described is easy to achieve; I am just trying. Any better suggestions would be greatly appreciated.
Thanks.
Wesley
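
The stack idea described in this message could look roughly like the sketch below, which uses only the `re` module. It is not a real HTML parser: it assumes Python 3, every tag is explicitly closed, and there are no comments, doctype, scripts, or `<` characters inside attribute values. `TOKEN`, `Node`, `parse`, and `leaf_paths` are hypothetical names chosen for illustration.

```python
import re

SAMPLE = """<html>
<head>
<title>DOM Tree test</title>
</head>
<body>
<h1>Header 1</h1>
<p>Hello world!</p>
</body>
</html>"""

# A token is either a tag (<name ...> or </name>) or a run of text.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|([^<]+)")


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []
        self.text = ""


def parse(markup):
    """Build a tree from markup in which every opened tag is closed."""
    root = Node("#document")
    stack = [root]  # stack[-1] is the innermost element still open
    for closing, name, text in TOKEN.findall(markup):
        if text:
            stack[-1].text += text.strip()
        elif closing:
            if stack[-1].tag != name:
                raise ValueError("unexpected </%s>" % name)
            stack.pop()
        else:
            node = Node(name)
            stack[-1].children.append(node)
            stack.append(node)
    if len(stack) != 1:
        raise ValueError("unclosed <%s>" % stack[-1].tag)
    return root


def leaf_paths(node, ancestors=()):
    """Yield 'a -- b -- c(text)' for every node without children."""
    path = ancestors + (node.tag,)
    if not node.children:
        yield " -- ".join(path) + "(" + node.text + ")"
    for child in node.children:
        yield from leaf_paths(child, path)


html = parse(SAMPLE).children[0]
lines = list(leaf_paths(html))
print("\n".join(lines))
```

Tokenizing the whole file, rather than reading it line by line, avoids special cases when a tag or its text spans several lines.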
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2015-06-05 13:48 -0600 |
| Message-ID | <mailman.206.1433533780.13271.python-list@python.org> |
| In reply to | #92160 |
If you want to recreate the same DOM structure that would be created
by a browser, the standardized algorithm to do so is very complicated,
but you can find it at
http://www.w3.org/TR/2011/WD-html5-20110113/parsing.html.
If you're not necessarily seeking perfect fidelity, I would encourage
you to try to find some way to incorporate BeautifulSoup into your
project. It likely won't produce the same structure that a real
browser would, but it should do well enough to scrape from even badly
malformed HTML.
I recommend against using an XML parser, because HTML isn't XML, and
such a parser may choke even on perfectly valid HTML such as this:
<!DOCTYPE html>
<html>
<head><title>Document</title></head>
<body>
First line
<br>
Second line
</body>
</html>
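
A quick way to see this failure, assuming Python 3 and the standard library's `xml.etree.ElementTree` as the XML parser in question (the variable names are made up for this sketch):

```python
import xml.etree.ElementTree as ET

# Valid HTML: <br> is a void element and needs no closing tag.
VALID_HTML = """<!DOCTYPE html>
<html>
<head><title>Document</title></head>
<body>
First line
<br>
Second line
</body>
</html>"""

try:
    ET.fromstring(VALID_HTML)
    error = None
except ET.ParseError as exc:
    # An XML parser treats <br> as an element that is never closed.
    error = exc

print("XML parse failed:", error)
```

The document is fine as HTML, but the XML parser rejects it because it expects `<br>` to be closed before `</body>`.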
| From | Wesley <nispray@gmail.com> |
|---|---|
| Date | 2015-06-05 16:24 -0700 |
| Message-ID | <33670551-7df5-487a-8b64-ea85dc165596@googlegroups.com> |
| In reply to | #92162 |
Hi,
Hmm, that is really complex.
For now I don't need full error handling. I will assume the HTML is well formed and generate the DOM tree from that.
HTML sample below:
<!DOCTYPE html>
<!-- saved from url=(0026)http://www.opera.com/about -->
<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="description" content="Opera is an independent Scandinavian company that's been in the business of making web browsers since 1994. Read more about Opera Software here.">
<title>About - Opera Software</title>
<link rel="apple-touch-icon" sizes="57x57" href="http://d2jc9zwbrclgz3.cloudfront.net/static-heap/da/dafd15591b35d4f81ca96cf7de6582d705850ff0/apple-touch-icon-57x57.png">
</head>
<body screen_capture_injected="true"><div style="position: fixed; top: 0px; left: 0px; height: 0px; width: 0px; z-index: 9999999;"><div style="position: fixed; top: 100%; height: 0px;"><div style="position: relative;"></div></div></div>
<!-- Google Tag Manager -->
<nav class="business-menu">
<ul>
<li><a data-action-id="header_item" href="http://operamediaworks.com/">Opera Mediaworks</a></li>
</ul>
</nav>
<main role="main" class="generic_landing_page">
<h1>Who we are, what we do</h1> <figure class="visuals">
<img src="./About - Opera Software_files/pro-kompaniyu.jpg" alt="" width="900" height="424">
</figure>
<ul class="blocks col3">
<li>
<h3>Vision</h3>
<p>We strive to develop superior products and services for our users around the world, through state-of-the-art technology, innovation, leadership and partnerships.</p><p><a href="http://www.operasoftware.com/company/vision" target="_self">Find out more</a>.</p>
</li>
<li>
</ul>
</main>
<footer class="ns--hf">
<aside>
<div class="hf--extra">
<h2 class="hf--visuallyhidden">Page language</h2>
<div id="language" class="hf--language hf--hover-enabled hf--popup-container">
<input id="language-toggle" class="hf--popup-toggle hf--visuallyhidden" type="checkbox" aria-haspopup="true">
<label for="language-toggle" class="hf--popup-toggle-label" tabindex="0">
<span class="hf--hide-overflow">
<span class="">Select your language:</span>
<span class="">English</span>
</span>
</label>
</div>
</div>
</aside>
<div class="hf--meta hf--clearfix">
<small class="hf--company">Copyright © 2014 Opera Software ASA. All rights reserved.
<a data-action-id="footer_item" href="http://www.opera.com/privacy">Privacy.</a> <a data-action-id="footer_item" href="http://www.opera.com/terms">Terms of Use.</a>
</small>
</div>
</footer>
</body></html>
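
Even a well-formed page like this one contains a doctype, comments, attributes, and tags that are never closed (`<meta>`, `<link>`, `<img>`, `<input>`), so a plain push/pop stack needs some extra rules. The sketch below is one possible approach, assuming Python 3 and that the standard library's `html.parser` is acceptable for tokenizing. This is a swapped-in technique, not the regex-only approach asked about. `VOID`, `Node`, `TreeBuilder`, and `dump` are hypothetical names, and `SAMPLE` is a shortened excerpt of the page above.

```python
from html.parser import HTMLParser

# Common void elements: they never get a closing tag, so never push them.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

SAMPLE = """<!DOCTYPE html>
<!-- saved from url=(0026)http://www.opera.com/about -->
<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>About - Opera Software</title>
</head>
<body>
<main role="main" class="generic_landing_page">
<h1>Who we are, what we do</h1> <figure class="visuals">
<img src="./About - Opera Software_files/pro-kompaniyu.jpg" alt="" width="900" height="424">
</figure>
</main>
</body></html>"""


class Node:
    def __init__(self, tag, attrs=()):
        self.tag = tag
        self.attrs = dict(attrs)
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]  # stack[-1] is the innermost open element

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, plus anything
        # left open inside it; ignore end tags that match nothing.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def dump(node, depth=0):
    """Return the tree as a list of indented tag names."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines


builder = TreeBuilder()
builder.feed(SAMPLE)
builder.close()
tree = dump(builder.root)
print("\n".join(tree))
```

Comments and the doctype are skipped because the sketch does not override the corresponding handlers. Closing everything down to the nearest matching open element is a simplification that also tolerates the `<li>` left open before `</ul>` in the full sample, but it will not always give the tree a browser would build.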