nutch-dev mailing list archives

From "Nick Lothian (JIRA)" <>
Subject [jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
Date Tue, 13 Feb 2007 22:35:05 GMT


Nick Lothian commented on NUTCH-444:

I'm a developer on the ROME project and I have contributed some patches to FeedParser. I've
also been a long-time lurker on the Nutch lists.

To clear up a couple of misconceptions:

The current version of FeedParser is Kevin Burton's one available from
It does have Atom 1.0 support.

ROME has only a single dependency: JDOM.

Both FeedParser & ROME load the feed into a DOM before working on it. FeedParser exposes
a SAX-like API, while ROME exposes objects. My tests (a while ago now, but probably still
reasonable) showed little performance difference between the two libraries (See
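
To illustrate the two API styles only, here is a minimal sketch. It is neither FeedParser's nor ROME's actual API: it uses just the JDK's built-in DOM parser, and the names FeedApiStyles, ItemListener and Item are made up for this example. Both entry points load the whole feed into a DOM first; one then pushes items to a callback, the other returns objects.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class FeedApiStyles {

    /** Callback style (hypothetical interface): the parser pushes each item to a listener. */
    public interface ItemListener {
        void onItem(String title, String link);
    }

    /** Object style (hypothetical class): the parser hands back plain value objects. */
    public static final class Item {
        public final String title;
        public final String link;

        Item(String title, String link) {
            this.title = title;
            this.link = link;
        }
    }

    // Both styles start the same way: the entire feed is read into an in-memory DOM tree.
    static Document loadDom(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        return factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Returns the trimmed text of the first descendant with the given tag, or "" if absent.
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    /** Walks the DOM and reports every RSS <item> to the listener. */
    public static void parseWithCallbacks(String xml, ItemListener listener) throws Exception {
        NodeList items = loadDom(xml).getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            listener.onItem(childText(item, "title"), childText(item, "link"));
        }
    }

    /** Builds a list of Item objects; implemented here on top of the callback variant. */
    public static List<Item> parseToObjects(String xml) throws Exception {
        List<Item> result = new ArrayList<>();
        parseWithCallbacks(xml, (title, link) -> result.add(new Item(title, link)));
        return result;
    }

    public static void main(String[] args) throws Exception {
        String rss = "<rss version=\"2.0\"><channel><title>Example</title>"
                + "<item><title>First</title><link>http://example.com/1</link></item>"
                + "<item><title>Second</title><link>http://example.com/2</link></item>"
                + "</channel></rss>";
        parseWithCallbacks(rss, (title, link) -> System.out.println(title + " -> " + link));
        for (Item item : parseToObjects(rss)) {
            System.out.println(item.title + " @ " + item.link);
        }
    }
}
```

Since both variants pay the same cost of building the DOM up front, the choice between them in this sketch is about convenience, not memory or speed.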

I don't understand nutch.newbie's comments about different Atom & RSS mappings. I'm not
aware of any issues with the mapping of Author. There are some docs on mappings at, and

I'd HIGHLY recommend not writing your own custom feed parser. It's a much bigger job than
you'd expect. In particular, the difficulties of dealing with the bizarre things seen in real-world
feeds should not be underestimated.

Apache Abdera  ( is another option if anyone is just interested
in Atom parsing.

> Possibly use a different library to parse RSS feed for improved performance and compatibility
> ---------------------------------------------------------------------------------------------
>                 Key: NUTCH-444
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.9.0
>         Attachments: parse-feed-v2.tar.bz2, parse-feed.tar.bz2
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to jdom
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome 
> - Informa
> - custom implementation based on Stax
> - ??

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
