nutch-dev mailing list archives

Subject RE: [jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
Date Tue, 13 Feb 2007 22:45:38 GMT
I am using ROME in a modified version of the feedparse plugin.
It is pretty straightforward and easy.
We had issues with ROME 0.8 and Atom or some dates. ROME 0.9 resolved
these.

-----Original Message-----
From: Nick Lothian (JIRA)
Sent: Tuesday, February 13, 2007 2:35 PM
Subject: [jira] Commented: (NUTCH-444) Possibly use a different library
to parse RSS feed for improved performance and compatibility


Nick Lothian commented on NUTCH-444:

I'm a developer on the ROME project and I have done some patches to
FeedParser. I've also been a long-time lurker on the Nutch lists.

To clear up a couple of misconceptions:

The current version of FeedParser is Kevin Burton's one. It does have Atom 1.0 support.

ROME only has a single dependency: JDOM.

Both FeedParser & ROME load the feed into a DOM before working on it.
FeedParser exposes a SAX-like API, while ROME exposes objects. My tests
(a while ago now, but probably still reasonable) showed little
performance difference between the two libraries (See and

I don't understand nutch.newbie's comments about different Atom & RSS
mappings. I'm not aware of any issues with the mapping of Author. There
are some docs on mappings at, and

I'd HIGHLY recommend not writing your own custom feed parser. It's a
much bigger job than you'd expect. In particular the difficulties of
dealing with the bizarre things seen in real-world feeds should not be
underestimated.

Apache Abdera is another option
if anyone is just interested in Atom parsing.

> Possibly use a different library to parse RSS feed for improved 
> performance and compatibility
> ---------------------------------------------------------------------------------------------
>                 Key: NUTCH-444
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.9.0
>         Attachments: parse-feed-v2.tar.bz2, parse-feed.tar.bz2
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current
> library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the 
> feed to jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome
> - Informa
> - custom implementation based on Stax
> - ??
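
For the Stax option in the list above, here is a minimal sketch of what a
streaming parse could look like, using only the JDK's javax.xml.stream API.
This is not code from the attached parse-feed archives or from either
library. The class name StaxFeedTitles and the method itemTitles are made
up for illustration, and it only pulls out item/entry titles. The point is
that the reader walks the feed event by event and never builds a DOM, which
is the memory concern raised in the issue.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, for illustration only; not part of Nutch, ROME or FeedParser.
class StaxFeedTitles {

    // Returns the titles of RSS <item> and Atom <entry> elements in document order.
    static List<String> itemTitles(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Feeds come from untrusted sources, so do not process DTDs.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        // Deliver adjacent text and CDATA sections as a single text event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, true);

        List<String> titles = new ArrayList<>();
        boolean inItem = false;
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT) {
                        String name = reader.getLocalName();
                        if (name.equals("item") || name.equals("entry")) {
                            inItem = true;
                        } else if (inItem && name.equals("title")) {
                            // Assumes a text-only title; getElementText() throws
                            // if the element contains child elements (e.g. XHTML).
                            titles.add(reader.getElementText().trim());
                        }
                    } else if (event == XMLStreamConstants.END_ELEMENT) {
                        String name = reader.getLocalName();
                        if (name.equals("item") || name.equals("entry")) {
                            inItem = false;
                        }
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Malformed feeds are common; surface the failure to the caller.
            throw new IllegalArgumentException("Feed is not well-formed XML", e);
        }
        return titles;
    }
}
```

Calling StaxFeedTitles.itemTitles(xml) on an RSS 2.0 or Atom document
returns the titles in document order. A real feed parser also has to deal
with dates, encodings, and malformed markup, which this sketch ignores.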

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
