nutch-dev mailing list archives

From "Doğacan Güney" <doga...@gmail.com>
Subject Re: [jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
Date Mon, 26 Feb 2007 08:56:34 GMT
On 2/26/07, Chris A. Mattmann (JIRA) <jira@apache.org> wrote:
>
>     [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12475794 ]
>
> Chris A. Mattmann commented on NUTCH-444:
> -----------------------------------------
>
> Hi Nick,
>
>  Thanks for your insightful comments on this issue. I think I can summarize the discussions on this issue to the following:
>
> 1. Folks are seeing limitations in the version of commons-feedparser (0.6) used by parse-rss in the Nutch trunk
> 2. There are alternatives to feedparser in the form of ROME, informa, abdera, etc.
> 3. There is a newer, maintained version of Kevin Burton's feed parser that alleviates some of the limitations of feedparser (0.6) used in the Nutch trunk
> 4. We shouldn't be developing our own feedparsing solution
>

I have been using ROME lately and it has a really nice feature:
modules. With the necessary modules installed, ROME can extract iTunes
podcast, MediaRSS, etc. information from feeds, which also makes it
very useful for building a video/audio search engine. I haven't looked
into feedparser in detail, but it doesn't seem to have anything
comparable. So ROME has a big plus here for me.

Anyway, if the decision is to write an abstraction layer, I think
RSSChannel and RSSItem classes (with an extra metadata field) will
provide a good starting point.
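
To make that concrete, here is a rough sketch of what such a pair of
classes might look like. Only the names RSSChannel and RSSItem and the
idea of an extra metadata field come from the suggestion above; every
field, constructor, and method below is a hypothetical choice for
illustration and does not mirror the existing parse-rss code or any
library's API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: all members are illustrative assumptions.
final class RSSItem {
    private final String title;
    private final String link;
    private final String description;
    // The "extra metadata field": library-specific extras (for example
    // values a ROME module extracted) stored as plain key/value pairs.
    private final Map<String, String> metadata = new LinkedHashMap<>();

    RSSItem(String title, String link, String description) {
        this.title = title;
        this.link = link;
        this.description = description;
    }

    String getTitle() { return title; }
    String getLink() { return link; }
    String getDescription() { return description; }

    void addMetadata(String key, String value) { metadata.put(key, value); }

    // Read-only view so callers cannot modify the item's metadata directly.
    Map<String, String> getMetadata() {
        return Collections.unmodifiableMap(metadata);
    }
}

// Hypothetical sketch: a channel is a titled, linked list of items.
final class RSSChannel {
    private final String title;
    private final String link;
    private final List<RSSItem> items = new ArrayList<>();

    RSSChannel(String title, String link) {
        this.title = title;
        this.link = link;
    }

    String getTitle() { return title; }
    String getLink() { return link; }

    void addItem(RSSItem item) { items.add(item); }

    // Read-only view of the items in insertion order.
    List<RSSItem> getItems() {
        return Collections.unmodifiableList(items);
    }
}
```

Each feed library would then need only a small adapter that fills
these objects, and the rest of the parse plugin would never see
library-specific types.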

>
> Cheers,
>   Chris
>
>
> > Possibly use a different library to parse RSS feed for improved performance and compatibility
> > ---------------------------------------------------------------------------------------------
> >
> >                 Key: NUTCH-444
> >                 URL: https://issues.apache.org/jira/browse/NUTCH-444
> >             Project: Nutch
> >          Issue Type: Improvement
> >          Components: fetcher
> >    Affects Versions: 0.9.0
> >            Reporter: Renaud Richardet
> >         Assigned To: Chris A. Mattmann
> >            Priority: Minor
> >             Fix For: 0.9.0
> >
> >         Attachments: parse-feed-v2.tar.bz2, parse-feed.tar.bz2
> >
> >
> > As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current library (feedparser) has the following issues:
> > - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to jdom first
> > - no support for Atom 1.0
> > - there has been no development in the last year
> > Alternatives are:
> > - Rome
> > - Informa
> > - custom implementation based on Stax
> > - ??
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>


-- 
Doğacan Güney