Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!news.mixmin.net!eweka.nl!hq-usenetpeers.eweka.nl!xlned.com!feeder7.xlned.com!news2.euro.net!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <51DE73E4.6040007@bluewin.ch>
References: <51DE73E4.6040007@bluewin.ch>
Date: Thu, 11 Jul 2013 10:08:04 +0100
Subject: Re: ElementTree: can't figure out a mismached-tag error
From: =?ISO-8859-1?Q?F=E1bio_Santos?= <fabiosantosart@gmail.com>
To: "F.R." <anthra.norell@bluewin.ch>
Content-Type: multipart/alternative; boundary=90e6ba3098a85c8e9d04e138bbfd
Cc: python-list@python.org
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.4578.1373533692.3114.python-list@python.org>
Lines: 171
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:50430

--90e6ba3098a85c8e9d04e138bbfd
Content-Type: text/plain; charset=ISO-8859-1

On 11 Jul 2013 10:04, "F.R." <anthra.norell@bluewin.ch> wrote:
>
> Hi all,
>
> I haven't been able to get up to speed with XML. I do examples from the
> tutorials and experiment with variations. Time and time again I fail with
> error messages I can't make sense of. Here's the latest one. The url is
> "http://finance.yahoo.com/q?s=XIDEQ&ql=0". Ubuntu 12.04 LTS, Python 2.7.3
> (default, Aug  1 2012, 05:16:07) [GCC 4.6.3]
>
> >>> import xml.etree.ElementTree as ET
> >>> tree = ET.parse('q?s=XIDEQ')  # output of wget http://finance.yahoo.com/q?s=XIDEQ&ql=0
> Traceback (most recent call last):
>   File "<pyshell#69>", line 1, in <module>
>     tree = ET.parse('q?s=XIDEQ')
>   File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1183, in parse
>     tree.parse(source, parser)
>   File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
>     parser.feed(data)
>   File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1643, in feed
>     self._raiseerror(v)
>   File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1507, in _raiseerror
>     raise err
> ParseError: mismatched tag: line 9, column 2
>
> Below are the first nine lines. The line numbers and the following space
> are hand-edited in. Three dots stand for sections cut out to fit long
> lines. Line 6 is a bunch of "meta" statements, all of which I show on a
> separate line each in order to preserve the angle brackets. On all lines
> the angle brackets have been preserved. The mismatched character is the
> slash of the closing tag </head>. What could be wrong with it? And if it
> is, what about fault tolerance?
>
> 1 <!DOCTYPE html PUBLIC "-//W3C//DTD  . . . /strict.dtd">
> 2 <html lang="en-US">
> 3 <head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">
> 4 <title>XIDEQ: Summary for EXIDE TECH NEW- Yahoo! Finance</title>
> 5 <meta name="description" xml:space="default" content="View the basic XIDEQ . . .
> 6 . . . other companies."><meta name="keywords" content="XIDEQ, EXIDE TECH . . .">
>   <meta property="fb:app_id" content="118155468215844">
>   <meta property="fb:admins" content="503762770,100001149693905">
>   <meta property="og:type" content="company">
>   <meta property="og:site_name" content="Yahoo! Finance">
>   <meta property="og:title" content="Exide Technologies">
>   <meta property="og:image" content="http://l.yimg.com/a/p/fi/31/09/00.jpg">
>   <meta property="og:url" content="http://finance.yahoo.com/q?s=XIDEQ">
>   <meta property="og:description" content="View the basic XIDEQ . . .
> 7 other companies."><link rel="canonical" href="http://finance.yahoo.com/q?s=XIDEQ">
> 8 <link rel="stylesheet" href="http://l.yimg.com/zz/ . . . type="text/css">
> 9 </head>
>    ^
>     Mismatch!
>
> Thanks for suggestions
>
> Frederic

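That is not XML. It is HTML. You get a mismatched tag because the <meta> and
<link> tags don't need closing in HTML, but in XML every single tag needs
closing. The XML parser still considers the last <link> open when it reaches
</head>, hence the mismatch.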
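
A small sketch of that failure, assuming Python 3 and a made-up five-line
document (not the Yahoo page itself) with the same shape: unclosed tags
inside <head>. The `snippet` and `fixed` names are only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal document: <meta> and <link> are never closed,
# which is legal HTML but not well-formed XML.
snippet = """<html>
<head>
<meta charset="utf-8">
<link rel="stylesheet" href="style.css">
</head>
</html>"""

error = None
try:
    ET.fromstring(snippet)
except ET.ParseError as exc:
    # The XML parser still has <link> open when it meets </head>.
    error = exc
    print("ParseError:", exc)

# Closing the two elements the XML way makes the same markup parse.
fixed = snippet.replace('">', '"/>')
root = ET.fromstring(fixed)
print([child.tag for child in root.find("head")])
```

Patching the markup like this only works for a toy input; on a real page
the unclosed tags are just the first of many things an XML parser rejects.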

Use an HTML parser. I strongly recommend BeautifulSoup. The standard
library's xml.etree has no HTML parser, but lxml, which offers a compatible
etree API, does.
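
If installing BeautifulSoup is not an option, here is a rough sketch using
only the standard library's html.parser (Python 3; on 2.7 the module is
called HTMLParser). The `MetaCollector` class and the shortened `page`
string are made up for illustration, loosely following the lines you
posted.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect the attributes of every <meta> tag that is seen."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; unclosed tags are fine.
        if tag == "meta":
            self.metas.append(dict(attrs))


# Shortened stand-in for the downloaded page, with unclosed <meta>/<link>.
page = """<html lang="en-US">
<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>XIDEQ: Summary</title>
<meta property="og:type" content="company">
<meta property="og:title" content="Exide Technologies">
<link rel="canonical" href="http://finance.yahoo.com/q?s=XIDEQ">
</head>
</html>"""

collector = MetaCollector()
collector.feed(page)
collector.close()

# Keep only the <meta property="..."> entries as a name -> content mapping.
og = {m["property"]: m["content"] for m in collector.metas if "property" in m}
print(og)
```

With BeautifulSoup installed, something like
BeautifulSoup(page, "html.parser").find_all("meta") should give you the
same tags with less code.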

--90e6ba3098a85c8e9d04e138bbfd--