Mailing-List: contact nutch-dev-help@lucene.apache.org; run by ezmlm
Reply-To: nutch-dev@lucene.apache.org
Message-ID: <18916490.1171083065798.JavaMail.jira@brutus>
Date: Fri, 9 Feb 2007 20:51:05 -0800 (PST)
From: "nutch.newbie (JIRA)"
To: nutch-dev@lucene.apache.org
Subject: [jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
In-Reply-To: <22610746.1171064585654.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

    [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471952 ]

nutch.newbie commented on NUTCH-444:
------------------------------------

Renaud: Thanks for moving the discussion here. To answer your question first: yes, it is based on a MIME type detection problem. The goal of the trial was to see whether one could build a feed-only search site, i.e. one that indexes just feeds, but I didn't succeed. I will give it another go over the weekend.

Dogcan: Yes, one could just replace feedparser with Rome or StAX and either submit it back here or use it internally. My point in raising it was to see how others feel about it; maybe others have run into problems and can share their experience. Gal pointed out Rome (at least it is still being developed) and StAX, and you mentioned that you are doing something with Rome. I just wanted to know what others think and what their experience has been, that's all.

Yes, you are correct, I posted it in the wrong place, NUTCH-443. But NUTCH-443 started off as someone having trouble with RSS, and in my view it is important to discuss the issue, because the library we are using (feedparser) is not going to solve the original problem if one tries to build an RSS-only search engine. NUTCH-443 would not have surfaced in the first place otherwise.

I am looking forward to the day when I can use Nutch purely as an RSS feed search engine, so Dogcan, I am very interested in your Rome implementation. Maybe you can post the code here so that I can participate.

> Possibly use a different library to parse RSS feed for improved performance and compatibility
> ---------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-444
>                 URL: https://issues.apache.org/jira/browse/NUTCH-444
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>            Priority: Minor
>             Fix For: 0.9.0
>
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome
> - Informa
> - custom implementation based on Stax
> - ??

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
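
For illustration only, the "custom implementation based on Stax" alternative listed in the issue could look roughly like the sketch below. It is not code from Nutch, feedparser, or any patch attached to NUTCH-444: the class `StaxFeedSketch` and the method `itemTitles` are hypothetical names, and it assumes a plain RSS 2.0 feed in which each `<item>` has a text-only `<title>`. It uses only the standard `javax.xml.stream` API and reads the feed as a sequence of events instead of converting it to a JDOM tree first.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper, not part of Nutch: streams item titles out of an RSS 2.0 feed. */
public class StaxFeedSketch {

    /** Returns the title text of every item, reading the feed event by event. */
    public static List<String> itemTitles(String rssXml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Feeds come from untrusted hosts, so turn off DTDs and external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);

        List<String> titles = new ArrayList<>();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(rssXml));
        try {
            boolean inItem = false;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("item".equals(name)) {
                        inItem = true;
                    } else if (inItem && "title".equals(name)) {
                        // Assumption: the title is text-only; getElementText() throws otherwise.
                        titles.add(reader.getElementText().trim());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "item".equals(reader.getLocalName())) {
                    inItem = false;
                }
            }
        } finally {
            reader.close();
        }
        return titles;
    }

    public static void main(String[] args) throws XMLStreamException {
        String feed = "<rss version=\"2.0\"><channel><title>Example</title>"
                + "<item><title>First</title></item>"
                + "<item><title>Second</title></item>"
                + "</channel></rss>";
        System.out.println(itemTitles(feed));
    }
}
```

The sketch still collects the extracted titles in a list, so memory grows with what is kept, not with the size of the whole document tree. Atom 1.0 uses `entry` elements in place of `item`, so a real parser would need to handle both vocabularies, which this sketch does not attempt.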