Re: Trying to understand html.parser.HTMLParser

Date Tue, 17 May 2011 22:26:13 +0200
From Karim <karim.liateni@free.fr>
To Andrew Berg <bahamutzero8825@gmail.com>
Subject Re: Trying to understand html.parser.HTMLParser
Newsgroups comp.lang.python


On 05/17/2011 03:05 AM, Andrew Berg wrote:
> On 2011.05.16 02:26 AM, Karim wrote:
>> Use a regular expression for bad HTML, or BeautifulSoup (google it). Below is
>> an example that extracts all HTML links:
>>
>> linksList = re.findall('<a href=(.*?)>.*?</a>',htmlSource)
>> for link in linksList:
>>       print link
> I was afraid I might have to use regexes (mostly because I could never
> understand them).
> Even the BeautifulSoup website itself admits it's awful with Python 3 -
> only the admittedly broken 3.1.0 will work with Python 3 at all.
> ElementTree doesn't seem to have been updated in a long time, so I'll
> assume it won't work with Python 3.
> lxml looks promising, but it doesn't say anywhere whether it'll work on
> Python 3 or not, which is puzzling since the latest release was only a
> couple months ago.
>
> Actually, if I'm going to use regex, I might as well try to implement
> Versions* in Python.
>
> Thanks for the answers!
>
> *http://en.totalcmd.pl/download/wfx/net/Versions (original, made for
> Total Commander) and
> https://addons.mozilla.org/en-US/firefox/addon/versions-wfx_versions/
> (clone implemented as a Firefox add-on; it's so wonderful, I even wrote
> the docs for it!)

Andrew,

I wrote a class with HTMLParser to get only one link for a given
project, see below:

    # Python 3 spelling; in Python 2 the module is named HTMLParser.
    from html.parser import HTMLParser


    class ResultsLinkParser(HTMLParser):
        """Class ResultsLinkParser inherits from HTMLParser to extract
        the original 'Submission date' of a bug.
        This customized parser deals with the 'View Defect' HTML
        page from Clear DDTS.
        """
        def __init__(self):
            HTMLParser.__init__(self)
            self._link = None

        def handle_starttag(self, tag, attrs):
            """Implement standard class HTMLParser customizing method."""
            if tag == 'frame':
                try:
                    # attrs is a list of (name, value) pairs.
                    attributes = dict(attrs)
                    if attributes['name'] == 'indexframe':
                        self._link = attributes['src']
                except KeyError as e:
                    print("WARNING: Attribute '{keyname}' from frame tag "
                          "in QueryResult page does not exist!"
                          .format(keyname=e.args[0]))

        def link(self):
            """Return the html link of the query results page."""
            return self._link

You can use it and modify it to get the latest revision: change the tag
and the 'name' attribute check from my example, and add some code that
compares each revision number with the current max and keeps the larger
one to compare against the next value. I'll let you add this little bit
of code: just create self._revision = None in __init__(self) to hold the
current max revision. After parser.feed() you can get the value from
parser._revision, or from a public parser.revision() method.
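
For illustration, here is a minimal sketch of that idea in Python 3. The
class name LatestRevisionParser, the VERSION_RE pattern, and the sample
page are all hypothetical: the sketch assumes the revision shows up in
the href of <a> tags as a dotted number such as "project-1.10.2.tar.gz",
which may not match the page you actually need to parse.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern: assumes hrefs look like "project-1.10.2.tar.gz".
# Adjust it to whatever the real page uses.
VERSION_RE = re.compile(r"-(\d+(?:\.\d+)*)\.")


class LatestRevisionParser(HTMLParser):
    """Keep the highest revision number found in <a href=...> attributes."""

    def __init__(self):
        super().__init__()
        # Highest revision seen so far, stored as a tuple of ints.
        self._revision = None

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is None:
            return
        match = VERSION_RE.search(href)
        if match is None:
            return
        # Compare as integer tuples so that 1.10 sorts above 1.9.
        candidate = tuple(int(part) for part in match.group(1).split("."))
        if self._revision is None or candidate > self._revision:
            self._revision = candidate

    def revision(self):
        """Return the highest revision as a dotted string, or None."""
        if self._revision is None:
            return None
        return ".".join(str(part) for part in self._revision)


# Made-up sample input, only to show the calls.
page = """
<a href="project-1.9.zip">old</a>
<a href="project-1.10.2.tar.gz">new</a>
<a href="index.html">home</a>
"""
parser = LatestRevisionParser()
parser.feed(page)
print(parser.revision())
```

Revisions are compared as tuples of integers rather than as strings,
because a plain string comparison would rank "1.9" above "1.10".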

Cheers
Karim
