From: "Chris A. Mattmann (JIRA)"
To: nutch-dev@lucene.apache.org
Reply-To: nutch-dev@lucene.apache.org
Date: Sun, 25 Feb 2007 16:01:05 -0800 (PST)
Subject: [jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
Message-ID: <10005042.1172448065810.JavaMail.jira@brutus>
In-Reply-To: <22610746.1171064585654.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12475794 ]

Chris A. Mattmann commented on NUTCH-444:
-----------------------------------------

Hi Nick,

Thanks for your insightful comments on this issue. I think I can summarize the discussion on this issue as follows:

1. Folks are seeing limitations in the version of commons-feedparser (0.6) used by parse-rss in the Nutch trunk
2. There are alternatives to feedparser in the form of ROME, informa, abdera, etc.
3. There is a newer, maintained version of Kevin Burton's feed parser that alleviates some of the limitations of feedparser (0.6) used in the Nutch trunk
4. We shouldn't be developing our own feed-parsing solution

Did I miss anything? If not, then I'm thinking the following. Perhaps we should write a transparency layer into the parse-rss plugin to select between different RSS parsing backends, such as ROME or feedparser. It probably wouldn't be too hard to write a simple transparency interface, at least to begin with. The interface would provide methods to retrieve channels and items, and would support arbitrary metadata retrieval from the underlying structures.

Would this meet everyone's needs? If not, then I have an alternate suggestion. Perhaps, at the very least, we should upgrade the version of commons-feedparser in parse-rss to the latest version from Kevin Burton?

I'd also be willing to hear other suggestions...

Cheers,
  Chris

> Possibly use a different library to parse RSS feed for improved performance and compatibility
> ---------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-444
>                 URL: https://issues.apache.org/jira/browse/NUTCH-444
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: parse-feed-v2.tar.bz2, parse-feed.tar.bz2
>
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current library (feedparser) has the following issues:
>   - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to jdom first
>   - no support for Atom 1.0
>   - there has been no development in the last year
> Alternatives are:
>   - Rome
>   - Informa
>   - custom implementation based on Stax
>   - ??

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
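
For illustration only, here is a minimal sketch of what the transparency layer proposed in the comment might look like. Nothing below comes from the Nutch source or from the attached patches: the names FeedBackendSketch, FeedBackend, FeedChannel, FeedItem, HasMetadata, StubBackend and select are all hypothetical, and the stub backend returns canned data instead of calling ROME or feedparser, so no real library API is assumed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a backend-neutral feed API; not Nutch code.
public class FeedBackendSketch {

    /** Arbitrary metadata lookup shared by channels and items. */
    public interface HasMetadata {
        /** Returns the value stored under key, or null if absent. */
        String getMetadata(String key);
    }

    /** One feed entry, independent of the library that parsed it. */
    public interface FeedItem extends HasMetadata {
        String getTitle();
        String getLink();
    }

    /** One channel and the items it contains. */
    public interface FeedChannel extends HasMetadata {
        String getTitle();
        List<FeedItem> getItems();
    }

    /** The transparency layer: one adapter per parsing library. */
    public interface FeedBackend {
        // Error handling is omitted; a real adapter would have to report parse failures.
        List<FeedChannel> parse(byte[] content);
    }

    static final class SimpleItem implements FeedItem {
        private final String title;
        private final String link;
        private final Map<String, String> meta = new HashMap<>();
        SimpleItem(String title, String link) { this.title = title; this.link = link; }
        SimpleItem with(String key, String value) { meta.put(key, value); return this; }
        public String getTitle() { return title; }
        public String getLink() { return link; }
        public String getMetadata(String key) { return meta.get(key); }
    }

    static final class SimpleChannel implements FeedChannel {
        private final String title;
        private final List<FeedItem> items = new ArrayList<>();
        private final Map<String, String> meta = new HashMap<>();
        SimpleChannel(String title) { this.title = title; }
        void add(FeedItem item) { items.add(item); }
        public String getTitle() { return title; }
        public List<FeedItem> getItems() { return Collections.unmodifiableList(items); }
        public String getMetadata(String key) { return meta.get(key); }
    }

    /** Stand-in backend: ignores its input and returns canned data. */
    public static final class StubBackend implements FeedBackend {
        public List<FeedChannel> parse(byte[] content) {
            SimpleChannel channel = new SimpleChannel("Example channel");
            channel.add(new SimpleItem("First post", "http://example.com/1")
                    .with("author", "someone"));
            return Collections.<FeedChannel>singletonList(channel);
        }
    }

    // Registry of available backends, keyed by a configuration name.
    private static final Map<String, FeedBackend> BACKENDS = new HashMap<>();
    static {
        BACKENDS.put("stub", new StubBackend());
    }

    /** Selects a backend by name; unknown names are rejected. */
    public static FeedBackend select(String name) {
        FeedBackend backend = BACKENDS.get(name);
        if (backend == null) {
            throw new IllegalArgumentException("Unknown feed backend: " + name);
        }
        return backend;
    }

    public static void main(String[] args) {
        FeedBackend backend = select("stub");
        for (FeedChannel channel : backend.parse(new byte[0])) {
            System.out.println("channel: " + channel.getTitle());
            for (FeedItem item : channel.getItems()) {
                System.out.println("  item: " + item.getTitle() + " -> " + item.getLink()
                        + " (author=" + item.getMetadata("author") + ")");
            }
        }
    }
}
```

Under this sketch, the plugin would only ever see FeedChannel and FeedItem, so supporting ROME or a newer feedparser would mean registering another FeedBackend adapter under a new name. How the backend name would be configured in parse-rss is left open here.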