Message-ID: <3450591.1171126986155.JavaMail.jira@brutus>
Date: Sat, 10 Feb 2007 09:03:06 -0800 (PST)
From: "nutch.newbie (JIRA)"
To: nutch-dev@lucene.apache.org
Subject: [jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
In-Reply-To: <4050960.1170874445502.JavaMail.jira@brutus>

[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471998 ]

nutch.newbie commented on NUTCH-443:
------------------------------------

Hi,

I swapped the entries in parse-plugins.xml the following way .. (and turned off magic detection), hoping that parse-rss would pick up the doc first and not return an NPE.

Out of 25 RSS URLs, with 1 round of fetch, I managed to escape dedup with only 4 docs being indexed. All the other 21 docs throw an NPE:

Error parsing: http://rss.cnn.com/rss/cnn_warpcnn.rss: failed(2,200): java.lang.NullPointerException
Error parsing: http://rss.cnn.com/rss/cnn_ac360blog.rss: failed(2,200): java.lang.NullPointerException
Error parsing: http://rss.cnn.com/rss/cnn_marquee.rss: failed(2,200): java.lang.NullPointerException
Error parsing: http://rss.cnn.com/rss/cnn_gupta.rss: failed(2,200): java.lang.NullPointerException

I must be doing something silly. There must be a way to tell Nutch to index using plugin X. I thought you do that by turning magic detection off and using parse-plugins.xml, no? Am I missing something? Please let me know.

I am going to try parse-feed now to see what happens. I will post any issues regarding that in NUTCH-444.

Cheers

> allow parsers to return multiple Parse object, this will speed up the rss parser
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-443
>                 URL: https://issues.apache.org/jira/browse/NUTCH-443
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff
>
>
> allow Parser#parse to return a Map. This way, the RSS parser can return multiple parse objects, that will all be indexed separately. Advantage: no need to fetch all feed-items separately.
> see the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html
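
For readers following the issue description, here is a minimal Java sketch of the idea of a parser returning a Map with one entry per feed item, so that each item could be indexed as its own document. Everything in it is an assumption made for illustration: MultiParseSketch, FeedItem, ParseResult and parseFeed are hypothetical names, not Nutch classes, and this is not the code in the attached patches.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: none of these types are Nutch's real classes. */
public class MultiParseSketch {

    /** Hypothetical stand-in for one entry of an RSS feed. */
    static final class FeedItem {
        final String link;
        final String title;
        final String text;

        FeedItem(String link, String title, String text) {
            this.link = link;
            this.title = title;
            this.text = text;
        }
    }

    /** Hypothetical stand-in for the parse output of a single document. */
    static final class ParseResult {
        final String title;
        final String text;

        ParseResult(String title, String text) {
            this.title = title;
            this.text = text;
        }
    }

    /**
     * Returns one parse result per feed item, keyed by the item's own URL,
     * instead of a single result for the whole feed.
     */
    static Map<String, ParseResult> parseFeed(String feedUrl, List<FeedItem> items) {
        // LinkedHashMap keeps the items in feed order.
        Map<String, ParseResult> parses = new LinkedHashMap<>();
        for (FeedItem item : items) {
            if (item.link == null) {
                // An item without a link has no URL to key it by, so skip it.
                continue;
            }
            parses.put(item.link, new ParseResult(item.title, item.text));
        }
        return parses;
    }

    public static void main(String[] args) {
        List<FeedItem> items = Arrays.asList(
                new FeedItem("http://example.com/a", "Item A", "text of item A"),
                new FeedItem("http://example.com/b", "Item B", "text of item B"));
        Map<String, ParseResult> parses = parseFeed("http://example.com/feed.rss", items);
        // Each map entry could then be handed to the indexer as a separate document.
        for (Map.Entry<String, ParseResult> e : parses.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().title);
        }
    }
}
```

Keying the map by item URL is one possible choice. It lets the items be indexed under their own URLs without fetching each one separately, which is the advantage the issue description names.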