nutch-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
Date Mon, 26 Feb 2007 00:01:05 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12475794 ]

Chris A. Mattmann commented on NUTCH-444:
-----------------------------------------

Hi Nick,

 Thanks for your insightful comments on this issue. I think I can summarize the discussion
so far as follows:

1. Folks are seeing limitations in the version of commons-feedparser (0.6) used by parse-rss
in the Nutch trunk.
2. There are alternatives to feedparser in the form of ROME, Informa, Abdera, etc.
3. There is a newer, maintained version of Kevin Burton's feed parser that alleviates some
of the limitations of the feedparser (0.6) used in the Nutch trunk.
4. We shouldn't be developing our own feed-parsing solution.

 Did I miss anything? If not, here is what I'm thinking. Perhaps we should write a transparency
layer into the parse-rss plugin that selects between different RSS parsing backends, such as
ROME or feedparser. It probably wouldn't be too hard to write a simple transparency interface,
at least to begin with. The interface would provide methods to retrieve channels and items, and
would support arbitrary metadata retrieval from the underlying structures. Would this meet
everyone's needs? If not, I have an alternate suggestion: at the very least, we could
upgrade the version of commons-feedparser in parse-rss to the latest version from
Kevin Burton. I'd also be willing to hear other suggestions...
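
 To make the transparency-layer idea a bit more concrete, here is a rough sketch of what such
an interface might look like. Every name in it (FeedBackendSketch, Backend, Registry, Channel,
Item, LineBackend) is hypothetical and is not existing Nutch, ROME, or feedparser API. The
LineBackend is only a stand-in that reads "title|link" lines so the example runs on its own;
a real backend would wrap ROME or feedparser behind the same interface.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: all types below are hypothetical, not part of Nutch or any feed library.
class FeedBackendSketch {

    /** Backend-neutral item: title, link, and arbitrary key/value metadata. */
    static final class Item {
        final String title;
        final String link;
        final Map<String, String> metadata = new HashMap<>();

        Item(String title, String link) {
            this.title = title;
            this.link = link;
        }
    }

    /** Backend-neutral channel holding its items and its own metadata. */
    static final class Channel {
        final String title;
        final List<Item> items = new ArrayList<>();
        final Map<String, String> metadata = new HashMap<>();

        Channel(String title) {
            this.title = title;
        }
    }

    /** The transparency interface: each parsing library would get one implementation. */
    interface Backend {
        List<Channel> parse(byte[] content, Charset charset);
    }

    /** Selects a backend by name, e.g. from a plugin configuration property. */
    static final class Registry {
        private final Map<String, Backend> backends = new HashMap<>();

        void register(String name, Backend backend) {
            backends.put(name, backend);
        }

        Backend select(String name) {
            Backend backend = backends.get(name);
            if (backend == null) {
                throw new IllegalArgumentException("Unknown feed backend: " + name);
            }
            return backend;
        }
    }

    /** Stand-in backend (assumption for illustration): one "title|link" item per line. */
    static final class LineBackend implements Backend {
        @Override
        public List<Channel> parse(byte[] content, Charset charset) {
            Channel channel = new Channel("stub-channel");
            channel.metadata.put("backend", "line-stub");
            for (String line : new String(content, charset).split("\n")) {
                String[] parts = line.trim().split("\\|", 2);
                if (parts.length == 2) {
                    channel.items.add(new Item(parts[0], parts[1]));
                }
            }
            return Collections.singletonList(channel);
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        registry.register("stub", new LineBackend());

        byte[] feed = "First|http://example.org/1\nSecond|http://example.org/2\n"
                .getBytes(StandardCharsets.UTF_8);
        for (Channel channel : registry.select("stub").parse(feed, StandardCharsets.UTF_8)) {
            for (Item item : channel.items) {
                System.out.println(channel.title + ": " + item.title + " -> " + item.link);
            }
        }
    }
}
```

 With something along these lines, parse-rss would only talk to Backend, and which library
sits behind it would be a matter of which name is registered and selected.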

Cheers,
  Chris


> Possibly use a different library to parse RSS feed for improved performance and compatibility
> ---------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-444
>                 URL: https://issues.apache.org/jira/browse/NUTCH-444
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: parse-feed-v2.tar.bz2, parse-feed.tar.bz2
>
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current library (feedparser)
> has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to jdom
>   first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome 
> - Informa
> - custom implementation based on Stax
> - ??

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

