| Header | Value |
|---|---|
| In-Reply-To | <7317a823-5ed6-4078-886b-0d3caf897a84@googlegroups.com> |
| References | <099a955d-134a-46d6-bdba-61ec2b1eb44f@googlegroups.com> <mailman.157.1433422910.13271.python-list@python.org> <7317a823-5ed6-4078-886b-0d3caf897a84@googlegroups.com> |
| From | Ian Kelly <ian.g.kelly@gmail.com> |
| Date | Fri, 5 Jun 2015 13:48:55 -0600 |
| Subject | Re: Get html DOM tree by only basic builtin moudles |
| To | Python <python-list@python.org> |
| Content-Type | text/plain; charset=UTF-8 |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.206.1433533780.13271.python-list@python.org> |
On Fri, Jun 5, 2015 at 12:10 PM, Wesley <nispray@gmail.com> wrote:
> Hi Laura,
> Sure, I got special requirement that just parse html file into DOM tree, by only general basic modules, and based on my DOM tree structure, draft an bitmap.
>
> So, could you give me an direction how to get the DOM tree?
> Currently, I just think out to use something like stack, I mean, maybe read the file line by line, adding to a stack data structure(list for example), and, then, got the parent/child relation .etc
>
> I don't know if what I said is easy to achieve, I am just trying.
> Any better suggestions will be great appreciated.

If you want to recreate the same DOM structure that would be created
by a browser, the standardized algorithm to do so is very complicated,
but you can find it at
http://www.w3.org/TR/2011/WD-html5-20110113/parsing.html.

If you're not necessarily seeking perfect fidelity, I would encourage
you to try to find some way to incorporate BeautifulSoup into your
project. It likely won't produce the same structure that a real
browser would, but it should do well enough to scrape from even badly
malformed HTML.
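
If third-party packages really are ruled out, the stack idea from the
quoted message can be tried with the standard library's html.parser
module. The following is only a rough sketch, not the HTML5
tree-construction algorithm: Node, TreeBuilder, parse_html, and the
VOID_ELEMENTS set are names made up here for illustration, and
malformed input is handled only by ignoring stray end tags.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they are not pushed on the stack.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical minimal DOM node: tag name, attributes, children."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # child Node objects or text strings


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]  # top of the stack is the current parent

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_ELEMENTS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this name.
        # A stray end tag with no matching open element is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Whitespace-only text between tags is dropped in this sketch.
        if data.strip():
            self.stack[-1].children.append(data.strip())


def parse_html(text):
    builder = TreeBuilder()
    builder.feed(text)
    builder.close()
    return builder.root
```

Calling parse_html() on a document returns the "#document" root, and
the parent/child relations can then be walked through each node's
children list. A real browser would build a different tree for many
malformed inputs.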

I recommend against using an XML parser, because HTML isn't XML, and
such a parser may choke even on perfectly valid HTML such as this:

<!DOCTYPE html>
<html>
<head><title>Document</title></head>
<body>
First line
<br>
Second line
</body>
</html>
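
To see the failure concretely, the document above can be fed to the
standard library's xml.etree.ElementTree. This is a small sketch;
HTML_DOC and xml_parse_error are names chosen here for illustration.
The unclosed <br> should make the parser stop with a mismatched-tag
error when it reaches </body>.

```python
import xml.etree.ElementTree as ET

HTML_DOC = """<!DOCTYPE html>
<html>
<head><title>Document</title></head>
<body>
First line
<br>
Second line
</body>
</html>"""


def xml_parse_error(text):
    """Return the XML parser's error message, or None if text is well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        return str(exc)
    return None


# Valid HTML, but not well-formed XML, so an error message is printed.
print(xml_parse_error(HTML_DOC))
```

Writing the line break as <br/> would satisfy the XML parser, but
HTML found in the wild cannot be counted on to do that.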
Get html DOM tree by only basic builtin moudles Wesley <nispray@gmail.com> - 2015-06-04 04:58 -0700
Re: Get html DOM tree by only basic builtin moudles Laura Creighton <lac@openend.se> - 2015-06-04 15:01 +0200
Re: Get html DOM tree by only basic builtin moudles Wesley <nispray@gmail.com> - 2015-06-05 11:10 -0700
Re: Get html DOM tree by only basic builtin moudles Ian Kelly <ian.g.kelly@gmail.com> - 2015-06-05 13:48 -0600
Re: Get html DOM tree by only basic builtin moudles Wesley <nispray@gmail.com> - 2015-06-05 16:24 -0700