Re: Parsing XML RSS feed byte stream for <item> tag

From	John Gordon <gordon@panix.com>
Newsgroups	comp.lang.python
Subject	Re: Parsing XML RSS feed byte stream for <item> tag
Date	2013-02-07 21:00 +0000
Organization	PANIX Public Access Internet and UNIX, NYC
Message-ID	<kf14lm$kt4$1@reader1.panix.com>
References	<16828a11-6c7c-4ab6-b406-6b8819883b5e@googlegroups.com>


In <16828a11-6c7c-4ab6-b406-6b8819883b5e@googlegroups.com> darrel.rendell@gmail.com writes:

> def pageReader(url):
>     try:
>         readPage = urllib2.urlopen(url)
>     except urllib2.URLError, e:
>         # print 'We failed to reach a server.'
>         # print 'Reason: ', e.reason
>         return 404
>     except urllib2.HTTPError, e:
>         # print('The server couldn\'t fulfill the request.')
>         # print('Error code: ', e.code)
>         return 404
>     else:
>         outputPage = readPage.read()
>     return outputPage

> To recreate my error, simply call the above function with an argument
> similar to:

> http://www.cert.org/nav/cert_announcements.rss

> You'll see I'm trying to return the first child.

The above code produces no output at all.  The pageReader() function is
defined but never called.

If we add a few lines at the bottom:

if __name__ == '__main__':
    print pageReader('http://www.cert.org/nav/cert_announcements.rss')

Then we get some output:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">

<channel>
<title>CERT Announcements</title>
<link>http://www.cert.org/nav/whatsnew.html</link>
<language>en-us</language>
<description>Announcements: What's New on the CERT web site</description>

<item>
<title>New Blog Entry: Common Sense Guide to Mitigating Insider Threats - Best Practice 16 (of 19)</title>
<link>http://www.cert.org/blogs/insider_threat/2013/02/common_sense_guide_to_mitigating_insider_threats_-_best_practice_16_of_19.html</link>
<description>This sixteenth of 19 blog posts about the fourth edition of the Common Sense Guide to Mitigating Insider Threats describes Practice 16: Develop a formalized insider threat program.</description>
<pubDate>Wed, 06 Feb 2013 06:38:07 -0500</pubDate>
</item>

...

> As I've said, BeautifulSoup fails to find both pubDate and Link, which are
> crucial to my app.

> Any advice would be greatly appreciated.

You haven't included the BeautifulSoup code which attempts to parse the XML,
so it's impossible to say exactly what the error is.

However, I have a guess: you said you're trying to return the first
child.  Based on the above output, the first child is the <channel>
element, not an <item> element.  Perhaps that's the issue?
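
To illustrate that guess, here is a minimal sketch. Since I haven't seen
your BeautifulSoup code, it uses the standard library's
xml.etree.ElementTree instead, and it is written for Python 3 rather than
the Python 2 of your post. SAMPLE_FEED is a trimmed copy of the output
above, and first_child_tag() and item_fields() are names I made up for
illustration:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the feed output shown above (descriptions and
# <language> omitted); encoded to bytes to mimic what read() returns.
SAMPLE_FEED = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">

<channel>
<title>CERT Announcements</title>
<link>http://www.cert.org/nav/whatsnew.html</link>

<item>
<title>New Blog Entry: Common Sense Guide to Mitigating Insider Threats - Best Practice 16 (of 19)</title>
<link>http://www.cert.org/blogs/insider_threat/2013/02/common_sense_guide_to_mitigating_insider_threats_-_best_practice_16_of_19.html</link>
<pubDate>Wed, 06 Feb 2013 06:38:07 -0500</pubDate>
</item>

</channel>
</rss>
"""


def first_child_tag(xml_bytes):
    """Return the tag name of the root element's first child (hypothetical helper)."""
    root = ET.fromstring(xml_bytes)
    return root[0].tag


def item_fields(xml_bytes):
    """Return a (link, pubDate) pair for every <item>, however deeply nested."""
    root = ET.fromstring(xml_bytes)
    return [
        (item.findtext("link"), item.findtext("pubDate"))
        for item in root.iter("item")
    ]


if __name__ == "__main__":
    # The first child of <rss> is <channel>, not <item>.
    print(first_child_tag(SAMPLE_FEED))
    # Searching for <item> explicitly finds the link and pubDate.
    for link, pub_date in item_fields(SAMPLE_FEED):
        print(link, pub_date)
```

The point is only that asking for the first child lands on <channel>,
while searching for <item> by name reaches the <link> and <pubDate> you
want. Whether the same thing is going wrong in your BeautifulSoup code, I
can't tell without seeing it.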

-- 
John Gordon                   A is for Amy, who fell down the stairs
gordon@panix.com              B is for Basil, assaulted by bears
                                -- Edward Gorey, "The Gashlycrumb Tinies"

