nutch-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
Date Sat, 10 Feb 2007 06:14:05 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471955 ]

Chris A. Mattmann commented on NUTCH-444:
-----------------------------------------

Hi Renaud,

 In fact, Rome does appear to be quite easy to use, given the above coding example. If I recall
correctly, the main issue that I had with it before was the large number of external libraries
that it required in order to run (which may no longer be the case). Additionally, I recall
there being an issue with the fact that Rome loaded the entire RSS structure into memory;
commons-feedparser, on the other hand, uses a SAX-based approach, which I really liked.
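
 To illustrate the memory point, here is a minimal sketch of event-driven (SAX) parsing that
collects RSS item titles without ever building a document tree. It uses only the JDK's built-in
SAX parser, not the commons-feedparser API itself; the class name RssTitleSketch and the method
itemTitles are hypothetical names chosen for this example.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Nutch or commons-feedparser.
class RssTitleSketch {

    /** Returns the text of every <title> found inside an <item>, in document order. */
    static List<String> itemTitles(byte[] rssBytes) throws Exception {
        final List<String> titles = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            private boolean inItem = false;
            private StringBuilder text = null; // non-null only while inside an item's <title>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                // The default factory is not namespace-aware, so qName carries the element name.
                if ("item".equals(qName)) {
                    inItem = true;
                } else if (inItem && "title".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per element, so append rather than assign.
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("title".equals(qName) && text != null) {
                    titles.add(text.toString().trim());
                    text = null;
                } else if ("item".equals(qName)) {
                    inItem = false;
                }
            }
        };

        // Only the current event and the collected titles are held in memory.
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(rssBytes), handler);
        return titles;
    }
}
```

 The trade-off is that the handler has to track its own state (here, the inItem flag), which
is part of why tree-based libraries tend to feel easier to use.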

 So, those were some of the deterrents when I originally evaluated the technologies circa
May 2005. I'm not against adapting the current parse-rss plugin, or alternatively writing
a parse-rss++ that utilizes a different underlying feedparser technology. I just need to be
convinced that it makes sense. A lack of active development is not, by itself, a valid reason for
switching libraries -- I've seen a number of really nice implementations and projects that produced
an awesome piece of software only to have developers abandon active development on it (I won't
name names, but they're out there if you look). This doesn't take away from the fact that
the software works, is proven, and suits the needs of the developers that use it.

  In any case, I'll take the lead on shepherding anything produced out of this into the sources.
I look forward to working with you all.

Cheers,
  Chris



> Possibly use a different library to parse RSS feed for improved performance and compatibility
> ---------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-444
>                 URL: https://issues.apache.org/jira/browse/NUTCH-444
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>            Priority: Minor
>             Fix For: 0.9.0
>
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome 
> - Informa
> - custom implementation based on Stax
> - ??
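
For the Stax-based alternative listed above, a rough sketch of what a pull-parsing loop might
look like, using the JDK's javax.xml.stream API. This is not an existing Nutch plugin; the class
name StaxFeedSketch and the method entryTitles are hypothetical. It treats both RSS <item> and
Atom <entry> elements as entries, and it assumes titles contain plain text only.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, shown only to illustrate the pull-parsing style.
class StaxFeedSketch {

    /** Returns the title of each RSS <item> or Atom <entry>, in document order. */
    static List<String> entryTitles(byte[] feedBytes) throws Exception {
        List<String> titles = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance()
            .createXMLStreamReader(new ByteArrayInputStream(feedBytes));
        try {
            boolean inEntry = false;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    // getLocalName() drops any namespace prefix, so Atom's
                    // namespaced <entry> and plain RSS <item> are both matched.
                    String name = reader.getLocalName();
                    if ("item".equals(name) || "entry".equals(name)) {
                        inEntry = true;
                    } else if (inEntry && "title".equals(name)) {
                        // Assumption: text-only titles. getElementText() throws
                        // if the element has child elements (e.g. XHTML titles).
                        titles.add(reader.getElementText().trim());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("item".equals(name) || "entry".equals(name)) {
                        inEntry = false;
                    }
                }
            }
        } finally {
            reader.close();
        }
        return titles;
    }
}
```

Like SAX, this reads the feed as a stream rather than converting it to a tree first, but the
caller drives the loop instead of reacting to callbacks.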

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

