
Re: ElementTree: can't figure out a mismached-tag error

Started by	"F.R." <anthra.norell@bluewin.ch>
First post	2013-07-11 14:25 +0200
Last post	2013-07-11 05:49 -0700
Articles	2 — 2 participants


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: ElementTree: can't figure out a mismached-tag error "F.R." <anthra.norell@bluewin.ch> - 2013-07-11 14:25 +0200
    Re: ElementTree: can't figure out a mismached-tag error fronagzen@gmail.com - 2013-07-11 05:49 -0700

#50437 — Re: ElementTree: can't figure out a mismached-tag error

From	"F.R." <anthra.norell@bluewin.ch>
Date	2013-07-11 14:25 +0200
Subject	Re: ElementTree: can't figure out a mismached-tag error
Message-ID	<mailman.4582.1373545582.3114.python-list@python.org>

On 07/11/2013 10:59 AM, F.R. wrote:
> Hi all,
>
> I haven't been able to get up to speed with XML. I do examples from 
> the tutorials and experiment with variations. Time and time again I 
> fail with error messages I can't make sense of. Here's the latest 
> one. The url is "http://finance.yahoo.com/q?s=XIDEQ&ql=0". Ubuntu 
> 12.04 LTS, Python 2.7.3 (default, Aug  1 2012, 05:16:07) [GCC 4.6.3]
>
> >>> import xml.etree.ElementTree as ET
> >>> tree = ET.parse('q?s=XIDEQ')  # output of wget http://finance.yahoo.com/q?s=XIDEQ&ql=0
> Traceback (most recent call last):
>   File "<pyshell#69>", line 1, in <module>
>     tree = ET.parse('q?s=XIDEQ')
>   File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1183, in parse
>     tree.parse(source, parser)
>   File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
>     parser.feed(data)
>   File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1643, in feed
>     self._raiseerror(v)
>   File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1507, in _raiseerror
>     raise err
> ParseError: mismatched tag: line 9, column 2
>
> Below are the first nine lines. The line numbers and the space after 
> them are hand-edited in. Three dots stand for sections cut out to 
> shorten long lines. Line 6 is a bunch of "meta" statements, each of 
> which I show on a separate line in order to preserve the angled 
> brackets. On all lines the angled brackets have been preserved. The 
> mismatched character is the slash of the closing tag </head>. What 
> could be wrong with it? And if something is wrong, what about fault 
> tolerance?
>
> 1 <!DOCTYPE html PUBLIC "-//W3C//DTD  . . . /strict.dtd">
> 2 <html lang="en-US">
> 3 <head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">
> 4 <title>XIDEQ: Summary for EXIDE TECH NEW- Yahoo! Finance</title>
> 5 <meta name="description" xml:space="default" content="View the basic 
> XIDEQ . . .
> 6 . . . other companies."><meta name="keywords" content="XIDEQ, EXIDE 
> TECH . . .">
>   <meta property="fb:app_id" content="118155468215844">
>   <meta property="fb:admins" content="503762770,100001149693905">
>   <meta property="og:type" content="company">
>   <meta property="og:site_name" content="Yahoo! Finance">
>   <meta property="og:title" content="Exide Technologies">
>   <meta property="og:image" content="http://l.yimg.com/a/p/fi/31/09/00.jpg">
>   <meta property="og:url" content="http://finance.yahoo.com/q?s=XIDEQ">
>   <meta property="og:description" content="View the basic XIDEQ . . .
> 7 other companies."><link rel="canonical" href="http://finance.yahoo.com/q?s=XIDEQ">
> 8 <link rel="stylesheet" href="http://l.yimg.com/zz/ . . . type="text/css">
> 9 </head>
>    ^
>     Mismatch!
>
> Thanks for suggestions
>
> Frederic
>
Thank you all!

I was a little apprehensive it could be a silly mistake. And so it was. 
I have BeautifulSoup somewhere. Having had no urgent need for it I 
remember shirking the learning curve.
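
For anyone landing here later: the replies that pointed out the mistake aren't shown in this view, but presumably it was feeding an HTML page to an XML parser. HTML allows tags like <meta> to stay unclosed; XML does not, so the parser still has <meta> open when it reaches </head> and reports a mismatch. A minimal sketch of that, assuming Python 3 and a made-up cut-down page (PAGE, xml_error, TagCollector and html_tags are names invented for this illustration, not anything from the thread):

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical cut-down page: <meta> is never closed, which HTML allows.
PAGE = """<html>
<head><meta charset="utf-8">
<title>demo</title>
</head>
<body></body>
</html>"""


def xml_error(text):
    """Return the ParseError message, or None if text is well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return str(err)
    return None


class TagCollector(HTMLParser):
    """Tolerant stdlib HTML parser that just records the start tags it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_tags(text):
    """Feed text to the tolerant parser and return the start tags in order."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags


print(xml_error(PAGE))   # the strict XML parser complains at </head>
print(html_tags(PAGE))   # the HTML parser walks through the same text
```

The stdlib html.parser is only used here to show that a tolerant parser has no trouble with the same input; for real scraping, BeautifulSoup or lxml.html is the more comfortable choice.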

lxml seems to be a package with these components (from help(lxml)):

PACKAGE CONTENTS
     ElementInclude
     _elementpath
     builder
     cssselect
     doctestcompare
     etree
     html (package)
     isoschematron (package)
     objectify
     pyclasslookup
     sax
     usedoctest

I would start with "from lxml import html" and see what comes out.

Break time now. Thanks again!

Frederic


#50439

From	fronagzen@gmail.com
Date	2013-07-11 05:49 -0700
Message-ID	<a34ae124-2686-4f12-96b2-33734391797d@googlegroups.com>
In reply to	#50437

On Thursday, July 11, 2013 8:25:13 PM UTC+8, F.R. wrote:
> I would start with "from lxml import html" and see what comes out.

from lxml.html import parse

# target_url can be a URL or a local filename
target_url = "http://finance.yahoo.com/q?s=XIDEQ&ql=0"
root = parse(target_url).getroot()

This gets you the root node of the element tree parsed from the URL. Conveniently, the lxml HTML parser can fetch the web page itself. If you want to control things like the socket timeout, though, you'll have to request the URL with urllib and then feed the response to the parser.
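
A rough sketch of that second route, assuming Python 3 (where urlopen lives in urllib.request) and that lxml is installed. The helper names fetch_bytes and parse_root are made up for this example, and the 10-second default is an arbitrary choice:

```python
from urllib.request import urlopen


def fetch_bytes(url, timeout=10.0):
    """Download url with an explicit socket timeout and return the raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def parse_root(data):
    """Parse already-downloaded HTML bytes and return the root element."""
    # Imported here so fetch_bytes() stays usable without lxml installed.
    from lxml import html
    return html.fromstring(data)


# Usage (needs network access and lxml):
# root = parse_root(fetch_bytes("http://finance.yahoo.com/q?s=XIDEQ&ql=0", timeout=5))
```

Splitting the download from the parsing also makes it easy to save the bytes to disk or retry on a timeout before handing anything to the parser.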

