nutch-dev mailing list archives

From "nutch.newbie (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
Date Sat, 10 Feb 2007 04:51:05 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471952 ]

nutch.newbie commented on NUTCH-444:
------------------------------------

Renaud :

Thanks for moving the discussion here. First, to answer your question: yes, it's based on the MIME
type detection problem. The goal of the trial was to see if one could make a feed-only search
site, i.e. just feeds, but I didn't succeed. I will give it a go over the weekend.

Dogcan:

Yes, one could just replace feedparser with Rome or StAX and submit it back here, or use it
internally. My point in raising this was to see how others view it, and whether others
have run into problems and what their experience was. Gal pointed out Rome (at least it
is still being developed) and StAX, and you pointed out that you are doing something with
Rome. I just wanted to know what others think and what their experience is, that's all. Yes, you are
correct, I posted it in the wrong place, NUTCH-443. But NUTCH-443 started off as someone having
trouble with RSS, and in my view it is important to discuss the issue, since we are using feedparser,
which is not going to solve the original problem if one tries to create an RSS-only search engine.
Otherwise NUTCH-443 would not have surfaced in the first place.

I am looking forward to the day when I can use Nutch just to build an RSS feed search engine, so
Dogcan, I am very interested in your Rome implementation. Maybe you can post the code here so that I
can participate.

> Possibly use a different library to parse RSS feed for improved performance and compatibility
> ---------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-444
>                 URL: https://issues.apache.org/jira/browse/NUTCH-444
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>            Priority: Minor
>             Fix For: 0.9.0
>
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome 
> - Informa
> - custom implementation based on Stax
> - ??
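
To make the "custom implementation based on Stax" option concrete, here is a rough sketch of a streaming read that never builds a JDOM tree, so memory use does not grow with feed size beyond the collected titles. It uses only the standard javax.xml.stream API from the JDK. The class name StaxFeedTitles and the method itemTitles are made up for illustration and are not part of Nutch, feedparser, or Rome. It assumes plain-text titles; an Atom title carrying nested XHTML markup would make getElementText() throw.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, for illustration only; not part of Nutch.
public class StaxFeedTitles {

    /**
     * Streams through a feed and collects the text of every element whose
     * local name is "title" and that is nested inside an RSS <item> or an
     * Atom <entry>. Channel-level and feed-level titles are skipped.
     */
    public static List<String> itemTitles(Reader in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Feeds come from untrusted sites, so turn off DTDs and external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        XMLStreamReader reader = factory.createXMLStreamReader(in);
        List<String> titles = new ArrayList<String>();
        boolean inItem = false;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("item") || name.equals("entry")) {
                        inItem = true;
                    } else if (inItem && name.equals("title")) {
                        // Assumes text-only content; leaves the cursor on </title>.
                        titles.add(reader.getElementText().trim());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("item") || name.equals("entry")) {
                        inItem = false;
                    }
                }
            }
        } finally {
            reader.close();
        }
        return titles;
    }

    public static void main(String[] args) throws XMLStreamException {
        String rss = "<rss><channel><title>Feed</title>"
                + "<item><title>First</title></item>"
                + "<item><title>Second</title></item>"
                + "</channel></rss>";
        for (String title : itemTitles(new StringReader(rss))) {
            System.out.println(title);
        }
    }
}
```

Matching on local names lets the same loop cover RSS 2.0 items and Atom 1.0 entries, but a real parser plugin would also need links, dates, and content, which is where a library such as Rome may save effort.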

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

