nutch-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
Date Sat, 10 Feb 2007 18:04:08 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472005 ]

Chris A. Mattmann commented on NUTCH-444:
-----------------------------------------

Nutch Newbie:

From the commons-feedparser site: http://jakarta.apache.org/commons/sandbox/feedparser/

"Jakarta FeedParser is a Java RSS/Atom parser designed to elegantly support all versions
of RSS (0.9, 0.91, 0.92, 1.0, and 2.0), Atom 0.5 (and future versions) as well as easy ad
hoc extension and RSS 1.0 modules capability."

According to this site, commons-feedparser does in fact support Atom. The statistics you
present above make no mention of the Atom versions of the feeds within the 84,746. For
instance, how many of those are Atom 0.5 feeds? How many are >0.5?
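
One rough way to get such a breakdown is to look only at the root element of each feed.
The sketch below does this in Java with StAX, so the feed is never loaded into a full
document tree. The class name FeedVersionSniffer and the labels it returns are hypothetical
and not part of Nutch, commons-feedparser, or Rome. It assumes Atom 1.0 feeds use the
http://www.w3.org/2005/Atom namespace and that pre-1.0 drafts carry a version attribute on
the feed element, as Atom 0.3 does.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class FeedVersionSniffer {

    // Namespace used by Atom 1.0 (RFC 4287).
    static final String ATOM_10_NS = "http://www.w3.org/2005/Atom";

    /**
     * Returns a coarse, hypothetical label for the feed format.
     * Only the first start element (the root) is inspected.
     */
    public static String detect(String xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Feeds come from arbitrary sites, so do not process DTDs or external entities.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));

            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue; // skip comments, processing instructions, whitespace
                }
                String name = reader.getLocalName();
                String ns = reader.getNamespaceURI();
                String version = reader.getAttributeValue(null, "version");

                if ("feed".equals(name)) {
                    if (ATOM_10_NS.equals(ns)) {
                        return "atom-1.0";
                    }
                    // Assumption: older drafts state their version in an attribute.
                    return "atom-" + (version != null ? version : "pre-1.0");
                }
                if ("rss".equals(name)) {
                    return "rss-" + (version != null ? version : "unknown");
                }
                if ("RDF".equals(name)) {
                    return "rss-rdf"; // RDF-based RSS (0.90 or 1.0), not told apart here
                }
                return "unknown";
            }
            return "unknown";
        } catch (XMLStreamException e) {
            return "malformed"; // not well-formed XML up to the root element
        }
    }
}
```

Running detect over each fetched feed and counting the labels in a map would give the
per-version numbers asked for above.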

Additionally, as I mentioned above, commons-feedparser did not require the large number of
external libraries that Rome required to run when I looked at them both. Is this still the
case?



> Possibly use a different library to parse RSS feed for improved performance and compatibility
> ---------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-444
>                 URL: https://issues.apache.org/jira/browse/NUTCH-444
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: parse-feed.tar.bz2
>
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome 
> - Informa
> - custom implementation based on Stax
> - ??

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

