


Re: Trying to understand html.parser.HTMLParser

Date Thu, 19 May 2011 23:52:20 +0200
From Karim <karim.liateni@free.fr>
To Andrew Berg <bahamutzero8825@gmail.com>
Subject Re: Trying to understand html.parser.HTMLParser
Newsgroups comp.lang.python
Message-ID <mailman.1804.1305841948.9059.python-list@python.org> (permalink)



On 05/19/2011 11:35 PM, Andrew Berg wrote:
> On 2011.05.16 02:26 AM, Karim wrote:
>> Use regular expressions for bad HTML, or BeautifulSoup (google it). Below
>> is an example that extracts all HTML links:
> Actually, using regex wasn't so bad:
>> import re
>> import urllib.request
>>
>> url = 'http://x264.nl/x264/?dir=./64bit/8bit_depth'
>> # urlopen() returns a bytes object, need to get a normal string
>> page = str(urllib.request.urlopen(url).read(), encoding='utf-8')
>> rev_re = re.compile(r'revision[0-9][0-9][0-9][0-9]')
>> num_re = re.compile(r'[0-9][0-9][0-9][0-9]')
>> # only need the first item since the first listing is the latest revision
>> rev = rev_re.findall(page)[0]
>> num = num_re.findall(rev)[0]  # findall() always returns a list
>> print(num)
> prints out the revision number - 1995. 'revision1995' might be useful,
> so I saved that to rev.
>
> This actually works pretty well for consistently formatted lists. I
> suppose I went about this the wrong way - I thought I needed to parse
> the HTML to get the links and do simple regexes on those, but I can just
> do simple regexes on the entire HTML document.
Good for you!
Use whatever works well and is easy to code; simpler is always better.
For more complicated link searches, where a regex would get too complex and
bug-prone, you can derive from the HTMLParser code I gave, using a max comparison.
Anyway, you have the choice, which is cool: you are not stuck with only one solution.
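
For reference, a minimal sketch of what an HTMLParser version with a max
comparison could look like. It is only an illustration, not the code from my
earlier message: the class name RevisionLinkParser, the helper
latest_revision() and the sample HTML are made up, and it assumes each link's
href contains text like 'revision1995'.

```python
import re
from html.parser import HTMLParser


class RevisionLinkParser(HTMLParser):
    """Collect revision numbers found in the href of <a> tags (hypothetical helper)."""

    # Assumption: revisions appear in links as 'revision' followed by 4 digits.
    REV_RE = re.compile(r'revision([0-9]{4})')

    def __init__(self):
        super().__init__()
        self.revisions = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                match = self.REV_RE.search(value)
                if match:
                    self.revisions.append(int(match.group(1)))


def latest_revision(html):
    """Return the highest revision number among the links, or None if there is none."""
    parser = RevisionLinkParser()
    parser.feed(html)
    parser.close()
    return max(parser.revisions) if parser.revisions else None


# Made-up sample listing, only to show the call.
sample = '''
<ul>
  <li><a href="./64bit/8bit_depth/revision1988/x264.exe">older</a></li>
  <li><a href="./64bit/8bit_depth/revision1995/x264.exe">newest</a></li>
  <li><a href="./64bit/8bit_depth/revision1990/x264.exe">old</a></li>
</ul>
'''
print(latest_revision(sample))
```

Taking the max means the result does not depend on the newest revision being
listed first, which the findall(...)[0] version relies on.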

Cheers
Karim
