Groups > comp.lang.python > #92044 > unrolled thread

Get html DOM tree by only basic builtin modules

Started by	Wesley <nispray@gmail.com>
First post	2015-06-04 04:58 -0700
Last post	2015-06-05 16:24 -0700
Articles	5 — 3 participants


  Get html DOM tree by only basic builtin modules Wesley <nispray@gmail.com> - 2015-06-04 04:58 -0700
    Re: Get html DOM tree by only basic builtin modules Laura Creighton <lac@openend.se> - 2015-06-04 15:01 +0200
      Re: Get html DOM tree by only basic builtin modules Wesley <nispray@gmail.com> - 2015-06-05 11:10 -0700
        Re: Get html DOM tree by only basic builtin modules Ian Kelly <ian.g.kelly@gmail.com> - 2015-06-05 13:48 -0600
          Re: Get html DOM tree by only basic builtin modules Wesley <nispray@gmail.com> - 2015-06-05 16:24 -0700

#92044 - Get html DOM tree by only basic builtin modules

From	Wesley <nispray@gmail.com>
Date	2015-06-04 04:58 -0700
Subject	Get html DOM tree by only basic builtin modules
Message-ID	<099a955d-134a-46d6-bdba-61ec2b1eb44f@googlegroups.com>

Hi guys,
  I know there are many modules (built in or not, e.g. beautifulsoup, xml, lxml, htmlparser, etc.) that parse HTML files and output the DOM tree. However, is there a good way to get the DOM tree without using those HTML/XML-related modules? I mean using only general-purpose standard modules, e.g. file operations, the re module, etc.

Input file is something like this:
<html> 
  <head> 
    <title>DOM Tree test</title> 
  </head> 
  <body> 
    <h1>Header 1</h1> 
    <p>Hello world!</p> 
  </body>
</html>

I need the DOM tree, or just something like:
html -- head -- title(DOM Tree test)
html -- body -- h1(Header 1)
html -- body -- p(Hello world!)

Thanks.
Wesley


#92046

From	Laura Creighton <lac@openend.se>
Date	2015-06-04 15:01 +0200
Message-ID	<mailman.157.1433422910.13271.python-list@python.org>
In reply to	#92044

ElementTree is part of the Python standard library. You are better off
using it than rolling your own. (If you were one of the rare people who
have some very strange requirements that make you better off writing your
own, you wouldn't be asking us. You'd already know.)

https://docs.python.org/2.7/library/xml.etree.elementtree.html#module-xml.etree.ElementTree
https://docs.python.org/3.4/library/xml.etree.elementtree.html
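
The sample input from the first post happens to be well-formed XML, so the ElementTree suggestion could look roughly like the sketch below. The helper name `leaf_paths` and the output format are assumptions for illustration (modelled on the output requested in the original post), and the code targets Python 3.

```python
import xml.etree.ElementTree as ET

# Sample input copied from the original post.
SAMPLE = """<html>
  <head>
    <title>DOM Tree test</title>
  </head>
  <body>
    <h1>Header 1</h1>
    <p>Hello world!</p>
  </body>
</html>"""


def leaf_paths(element, ancestors=()):
    """Yield one 'a -- b -- c(text)' line per leaf element (hypothetical helper)."""
    path = ancestors + (element.tag,)
    children = list(element)
    if not children:
        text = (element.text or "").strip()
        yield " -- ".join(path) + "(" + text + ")"
    for child in children:
        yield from leaf_paths(child, path)


root = ET.fromstring(SAMPLE)
lines = list(leaf_paths(root))
for line in lines:
    print(line)
```

This only works because the sample is also valid XML; ordinary HTML with unclosed tags such as `<br>` would make `ET.fromstring` raise `ParseError`.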


#92160

From	Wesley <nispray@gmail.com>
Date	2015-06-05 11:10 -0700
Message-ID	<7317a823-5ed6-4078-886b-0d3caf897a84@googlegroups.com>
In reply to	#92046

Hi Laura,
  Sure, I have a special requirement: parse an HTML file into a DOM tree using only general basic modules, and then draw a bitmap based on that DOM tree structure.

  So, could you give me some direction on how to get the DOM tree?
Currently, my only idea is to use something like a stack: read the file line by line, push each element onto a stack data structure (a list, for example), and from that work out the parent/child relations, etc.

I don't know whether what I described is easy to achieve; I am just trying.
Any better suggestions would be greatly appreciated.

Thanks.
Wesley
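
The stack idea described in this post can be sketched with only the `re` module and a list. This is a minimal illustration, not a real HTML parser: it assumes every opening tag has a matching closing tag, and it knows nothing about comments, doctypes, void elements such as `<br>`, or `>` inside attribute values. The names `TOKEN`, `Node`, `parse` and `leaf_paths` are made up for this example.

```python
import re

# One token is an opening tag, a closing tag, or a run of text between tags.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|([^<]+)")


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.text = ""
        self.children = []


def parse(markup):
    """Build a tree from strictly nested markup using a stack of open elements."""
    root = None
    stack = []
    for closing, tag, text in TOKEN.findall(markup):
        if text:
            # Attach non-blank text to the innermost open element.
            if stack and text.strip():
                stack[-1].text += text.strip()
        elif closing:
            # A closing tag must match the element on top of the stack.
            if not stack or stack[-1].tag != tag:
                raise ValueError("unexpected closing tag </%s>" % tag)
            stack.pop()
        else:
            node = Node(tag)
            if stack:
                stack[-1].children.append(node)  # parent is the top of the stack
            elif root is None:
                root = node
            else:
                raise ValueError("more than one root element")
            stack.append(node)
    if stack:
        raise ValueError("unclosed element <%s>" % stack[-1].tag)
    return root


def leaf_paths(node, ancestors=()):
    """Yield one 'a -- b -- c(text)' line per leaf element."""
    path = ancestors + (node.tag,)
    if not node.children:
        yield " -- ".join(path) + "(" + node.text + ")"
    for child in node.children:
        yield from leaf_paths(child, path)


SAMPLE = """<html>
  <head>
    <title>DOM Tree test</title>
  </head>
  <body>
    <h1>Header 1</h1>
    <p>Hello world!</p>
  </body>
</html>"""

lines = list(leaf_paths(parse(SAMPLE)))
for line in lines:
    print(line)
```

The stack always holds the chain of currently open elements, so the parent of a new element is whatever sits on top when its opening tag is read.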



#92162

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2015-06-05 13:48 -0600
Message-ID	<mailman.206.1433533780.13271.python-list@python.org>
In reply to	#92160


If you want to recreate the same DOM structure that would be created
by a browser, the standardized algorithm to do so is very complicated,
but you can find it at
http://www.w3.org/TR/2011/WD-html5-20110113/parsing.html.

If you're not necessarily seeking perfect fidelity, I would encourage
you to try to find some way to incorporate beautifulsoup into your
project. It likely won't produce the same structure that a real
browser would, but it should do well enough to scrape from even badly
malformed html.

I recommend against using an XML parser, because HTML isn't XML, and
such a parser may choke even on perfectly valid HTML such as this:

<!DOCTYPE html>
<html>
  <head><title>Document</title></head>
  <body>
    First line
    <br>
    Second line
  </body>
</html>
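
The failure mode described above is easy to reproduce. The sketch below feeds that exact document to `xml.etree.ElementTree`, which rejects it because `<br>` is never closed, and then to the standard library's `html.parser.HTMLParser`, which accepts it. Using `html.parser` here is an assumption for the sake of a standard-library-only demonstration (the post itself recommends beautifulsoup), and the class name `TagLogger` is made up.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

DOC = """<!DOCTYPE html>
<html>
  <head><title>Document</title></head>
  <body>
    First line
    <br>
    Second line
  </body>
</html>"""

# An XML parser requires every element to be closed, so <br> is an error.
try:
    ET.fromstring(DOC)
    xml_ok = True
except ET.ParseError as exc:
    xml_ok = False
    print("XML parser rejected the document:", exc)


class TagLogger(HTMLParser):
    """Record every start tag the HTML tokenizer reports (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


logger = TagLogger()
logger.feed(DOC)
logger.close()
print(logger.start_tags)
```

Note that `HTMLParser` only reports tags as events; it does not build a tree, so the nesting logic would still have to be written on top of it.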


#92164

From	Wesley <nispray@gmail.com>
Date	2015-06-05 16:24 -0700
Message-ID	<33670551-7df5-487a-8b64-ea85dc165596@googlegroups.com>
In reply to	#92162


Hi,
  Hmm, it's really complex.
Currently, I don't need to handle every error case; I can assume the HTML is well formed and just generate the DOM tree.

HTML sample below:
<!DOCTYPE html>
<!-- saved from url=(0026)http://www.opera.com/about -->
<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <meta name="description" content="Opera is an independent Scandinavian company that's been in the business of making web browsers since 1994. Read more about Opera Software here.">
  <title>About - Opera Software</title>
  <link rel="apple-touch-icon" sizes="57x57" href="http://d2jc9zwbrclgz3.cloudfront.net/static-heap/da/dafd15591b35d4f81ca96cf7de6582d705850ff0/apple-touch-icon-57x57.png">
</head>
<body screen_capture_injected="true"><div style="position: fixed; top: 0px; left: 0px; height: 0px; width: 0px; z-index: 9999999;"><div style="position: fixed; top: 100%; height: 0px;"><div style="position: relative;"></div></div></div>
<!-- Google Tag Manager -->
<nav class="business-menu">
  <ul>
    <li><a data-action-id="header_item" href="http://operamediaworks.com/">Opera Mediaworks</a></li>
  </ul>
</nav>
<main role="main" class="generic_landing_page">
<h1>Who we are, what we do</h1>  <figure class="visuals">
  <img src="./About - Opera Software_files/pro-kompaniyu.jpg" alt="" width="900" height="424">
</figure>  
<ul class="blocks col3">
<li>
<h3>Vision</h3>
<p>We strive to develop superior products and services for our users around the world, through state-of-the-art technology, innovation, leadership and partnerships.</p><p><a href="http://www.operasoftware.com/company/vision" target="_self">Find out more</a>.</p>
</li>
<li>
</ul>
</main>
<footer class="ns--hf">
<aside>
<div class="hf--extra">
  <h2 class="hf--visuallyhidden">Page language</h2>
  <div id="language" class="hf--language hf--hover-enabled hf--popup-container">
    <input id="language-toggle" class="hf--popup-toggle hf--visuallyhidden" type="checkbox" aria-haspopup="true">
    <label for="language-toggle" class="hf--popup-toggle-label" tabindex="0">
      <span class="hf--hide-overflow">
      <span class="">Select your language:</span>
      <span class="">English</span>
      </span>
    </label>
  </div>
</div>
</aside>
<div class="hf--meta hf--clearfix">
<small class="hf--company">Copyright © 2014 Opera Software ASA. All rights reserved.
<a data-action-id="footer_item" href="http://www.opera.com/privacy">Privacy.</a> <a data-action-id="footer_item" href="http://www.opera.com/terms">Terms of Use.</a>
</small>
</div>
</footer>
</body></html>
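
Even this sample is not strictly nested: `<meta>`, `<link>`, `<img>` and `<input>` never get closing tags, and the `<li>` just before `</ul>` is left open. A stack-based approach using only `re` can tolerate both cases, as in the rough sketch below. It is not the standardized HTML parsing algorithm: it records tag structure only (text and attributes are dropped), assumes no `>` appears inside attribute values, and the names `VOID`, `TOKEN`, `parse` and `outline` are made up for illustration. The `VOID` set is my reading of the void elements in the HTML specification and should be checked against it.

```python
import re

# Elements that never have a closing tag (assumed list, see lead-in).
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}

TOKEN = re.compile(
    r"<!--.*?-->"                              # comment: skipped
    r"|<![^>]*>"                               # doctype: skipped
    r"|<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>"    # opening or closing tag
    r"|([^<]+)",                               # text: skipped in this sketch
    re.DOTALL,
)


def parse(markup):
    """Return a nested dict tree of tag names under a '#document' root."""
    root = {"tag": "#document", "children": []}
    stack = [root]
    for match in TOKEN.finditer(markup):
        closing, tag, _attrs, _text = match.groups()
        if tag is None:
            continue  # comment, doctype, or text
        tag = tag.lower()
        if closing:
            # Close the nearest matching open element, which also closes
            # anything left open inside it; ignore stray closing tags.
            for i in range(len(stack) - 1, 0, -1):
                if stack[i]["tag"] == tag:
                    del stack[i:]
                    break
        else:
            node = {"tag": tag, "children": []}
            stack[-1]["children"].append(node)
            if tag not in VOID:
                stack.append(node)
    return root


def outline(node, depth=0):
    """Return the tree as a list of indented tag names."""
    lines = ["  " * depth + node["tag"]]
    for child in node["children"]:
        lines.extend(outline(child, depth + 1))
    return lines


# Shortened excerpt of the sample above.
EXCERPT = """<!DOCTYPE html>
<!-- saved from url=(0026)http://www.opera.com/about -->
<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <title>About - Opera Software</title>
</head>
<body>
<ul class="blocks col3">
<li>
<h3>Vision</h3>
</li>
<li>
</ul>
</body></html>"""

tree_lines = outline(parse(EXCERPT))
print("\n".join(tree_lines))
```

Popping back to the nearest matching open element is a simplification; a browser applies many more rules, which is why the result can differ from a real DOM.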

