From: Sami Siren <ssiren@gmail.com>
Date: Tue, 03 Jul 2007 09:48:05 +0300
To: nutch-user@lucene.apache.org
Subject: Re: IOException using feed plugin - NUTCH-444

Kai_testing Middleton wrote:
> I hope someone can suggest a method to proceed with this RuntimeException I'm getting.

Recheck that you have the scoring plugin (scoring-opic) enabled properly in your Nutch configuration: in the snippet you gave below it did not exist, and the PluginRepository log you showed did not have it registered either.
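Something like the following should work. This is an untested sketch based on the value from your own snippet; note that scoring-opic has to appear unbroken inside the value, and the whole value must be one logical line:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|feed</value>
  <description>Regular expression naming plugin directory names to include.</description>
</property>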
--
Sami Siren

> java.lang.RuntimeException: No scoring plugins - at least one scoring plugin is required!
>         at org.apache.nutch.scoring.ScoringFilters.<init>(ScoringFilters.java:87)
>         at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:61)
>         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
>         at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
>         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:170)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
>
> As far as I can tell I'm using NUTCH-444 out-of-the-box since I have a nightly build.
>
> --Kai M.
>
> ----- Original Message ----
> From: Kai_testing Middleton
> To: nutch-user@lucene.apache.org
> Sent: Friday, June 29, 2007 5:24:57 PM
> Subject: Re: IOException using feed plugin - NUTCH-444
>
> The exception is:
> java.lang.RuntimeException: No scoring plugins - at least one scoring plugin is required!
>
> I note that my nutch-site.xml does contain a reference to scoring-opic, so I wonder why it would give that exception.
>
> --Kai M.
>
> ----- Original Message ----
> From: Kai_testing Middleton
> To: nutch-user@lucene.apache.org
> Sent: Friday, June 29, 2007 11:36:11 AM
> Subject: Re: IOException using feed plugin - NUTCH-444
>
> Here is the more detailed stack trace:
> java.lang.RuntimeException: No scoring plugins - at least one scoring plugin is required!
>         at org.apache.nutch.scoring.ScoringFilters.<init>(ScoringFilters.java:87)
>         at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:61)
>         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
>         at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
>         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:170)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
>
> In fact, here is a complete hadoop.log for the command I attempt:
>
> nutch crawl /usr/tmp/lee_urls.txt -dir /usr/tmp/lee_apollo -depth 2 2>&1 | tee crawl.log
>
> 2007-06-29 11:28:58,785 INFO crawl.Crawl - crawl started in: /usr/tmp/lee_apollo
> 2007-06-29 11:28:58,788 INFO crawl.Crawl - rootUrlDir = /usr/tmp/lee_urls.txt
> 2007-06-29 11:28:58,789 INFO crawl.Crawl - threads = 10
> 2007-06-29 11:28:58,790 INFO crawl.Crawl - depth = 2
> 2007-06-29 11:28:58,925 INFO crawl.Injector - Injector: starting
> 2007-06-29 11:28:58,925 INFO crawl.Injector - Injector: crawlDb: /usr/tmp/lee_apollo/crawldb
> 2007-06-29 11:28:58,925 INFO crawl.Injector - Injector: urlDir: /usr/tmp/lee_urls.txt
> 2007-06-29 11:28:58,926 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
> 2007-06-29 11:28:59,936 INFO plugin.PluginRepository - Plugins: looking in: /usr/local/nutch-2007-06-27_06-52-44/plugins
> 2007-06-29 11:29:00,253 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
> 2007-06-29 11:29:00,253 INFO plugin.PluginRepository - Registered Plugins:
> 2007-06-29 11:29:00,253 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
> 2007-06-29 11:29:00,253 INFO plugin.PluginRepository - Site Query Filter (query-site)
> 2007-06-29 11:29:00,253 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
> 2007-06-29 11:29:00,253 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
> 2007-06-29 11:29:00,253 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
> 2007-06-29 11:29:00,260 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
> 2007-06-29 11:29:00,260 INFO plugin.PluginRepository - Feed Parse/Index/Query Plug-in (feed)
> 2007-06-29 11:29:00,260 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
> 2007-06-29 11:29:00,261 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
> 2007-06-29 11:29:00,261 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
> 2007-06-29 11:29:00,261 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
> 2007-06-29 11:29:00,261 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
> 2007-06-29 11:29:00,261 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
> 2007-06-29 11:29:00,261 INFO plugin.PluginRepository - HTTP Framework (lib-http)
> 2007-06-29 11:29:00,261 INFO plugin.PluginRepository - XML Libraries (lib-xml)
> 2007-06-29 11:29:00,261 INFO plugin.PluginRepository - URL Query Filter (query-url)
> 2007-06-29 11:29:00,261 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
> 2007-06-29 11:29:00,261 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
> 2007-06-29 11:29:00,261 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
> 2007-06-29 11:29:00,261 INFO plugin.PluginRepository - Registered Extension-Points:
> 2007-06-29 11:29:00,261 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
> 2007-06-29 11:29:00,261 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> 2007-06-29 11:29:00,261 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
> 2007-06-29 11:29:00,261 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> 2007-06-29 11:29:00,261 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
> 2007-06-29 11:29:00,262 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
> 2007-06-29 11:29:00,262 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
> 2007-06-29 11:29:00,262 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> 2007-06-29 11:29:00,262 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
> 2007-06-29 11:29:00,262 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
> 2007-06-29 11:29:00,262 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
> 2007-06-29 11:29:00,262 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
> 2007-06-29 11:29:00,367 WARN mapred.LocalJobRunner - job_w7bra3
> java.lang.RuntimeException: No scoring plugins - at least one scoring plugin is required!
>         at org.apache.nutch.scoring.ScoringFilters.<init>(ScoringFilters.java:87)
>         at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:61)
>         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
>         at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
>         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:170)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
>
> ----- Original Message ----
> From: Doğacan Güney
> To: nutch-user@lucene.apache.org
> Sent: Friday, June 29, 2007 12:45:36 AM
> Subject: Re: IOException using feed plugin - NUTCH-444
>
> Hi,
>
> On 6/29/07, Kai_testing Middleton wrote:
>> I have tried the NUTCH-444 "feed" plugin to enable spidering of RSS feeds:
>>   /nutch-2007-06-27_06-52-44/plugins/feed
>> (that's a recent nightly build of nutch).
>>
>> When I attempt a crawl I get an IOException:
>>
>> $ nutch crawl /usr/tmp/lee_urls.txt -dir /usr/tmp/lee_apollo -depth 2
>> crawl started in: /usr/tmp/lee_apollo
>> rootUrlDir = /usr/tmp/lee_urls.txt
>> threads = 10
>> depth = 2
>> Injector: starting
>> Injector: crawlDb: /usr/tmp/lee_apollo/crawldb
>> Injector: urlDir: /usr/tmp/lee_urls.txt
>> Injector: Converting injected urls to crawl db entries.
>> Exception in thread "main" java.io.IOException: Job failed!
>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>>         at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
>> 3.14 real         1.92 user         0.30 sys
>
> This stack trace is not useful. It is only the JobTracker (or LocalJobRunner) reporting back to us that the job has failed. If you are running in a distributed environment, check your tasktracker logs; if you are running locally, check logs/hadoop.log.
>
>> The seed URL is:
>>   http://www.mt-olympus.com/apollo/feed/
>>
>> I enabled the feed plugin via this property in nutch-site.xml:
>>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opi
>> c|urlnormalizer-(pass|regex|basic)|feed</value>
>>   <description>Regular expression naming plugin directory names to
>>   include. Any plugin not matching this expression is excluded.
>>   In any case you need at least include the nutch-extensionpoints plugin. By
>>   default Nutch includes crawling just HTML and plain text via HTTP,
>>   and basic indexing and search plugins. In order to use HTTPS please enable
>>   protocol-httpclient, but be aware of possible intermittent problems with the
>>   underlying commons-httpclient library.
>>   </description>
>> </property>
>>
>> As a sanity check, when I take out "feed" from above, it no longer throws an exception (but it also doesn't fetch anything):
>>
>> $ nutch crawl /usr/tmp/lee_urls.txt -dir /usr/tmp/lee_apollo -depth 2 2>&1 | tee crawl.log
>> crawl started in: /usr/tmp/lee_apollo
>> rootUrlDir = /usr/tmp/lee_urls.txt
>> threads = 10
>> depth = 2
>> Injector: starting
>> Injector: crawlDb: /usr/tmp/lee_apollo/crawldb
>> Injector: urlDir: /usr/tmp/lee_urls.txt
>> Injector: Converting injected urls to crawl db entries.
>> Injector: Merging injected urls into crawl db.
>> Injector: done
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: starting
>> Generator: segment: /usr/tmp/lee_apollo/segments/20070628155854
>> Generator: filtering: false
>> Generator: topN: 2147483647
>> Generator: jobtracker is 'local', generating exactly one partition.
>> Generator: Partitioning selected urls by host, for politeness.
>> Generator: done.
>> Fetcher: starting
>> Fetcher: segment: /usr/tmp/lee_apollo/segments/20070628155854
>> Fetcher: threads: 10
>> fetching http://www.mt-olympus.com/apollo/feed/
>> Fetcher: done
>> CrawlDb update: starting
>> CrawlDb update: db: /usr/tmp/lee_apollo/crawldb
>> CrawlDb update: segments: [/usr/tmp/lee_apollo/segments/20070628155854]
>> CrawlDb update: additions allowed: true
>> CrawlDb update: URL normalizing: true
>> CrawlDb update: URL filtering: true
>> CrawlDb update: Merging segment data into db.
>> CrawlDb update: done
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: starting
>> Generator: segment: /usr/tmp/lee_apollo/segments/20070628155907
>> Generator: filtering: false
>> Generator: topN: 2147483647
>> Generator: jobtracker is 'local', generating exactly one partition.
>> Generator: 0 records selected for fetching, exiting ...
>> Stopping at depth=1 - no more URLs to fetch.
>> LinkDb: starting
>> LinkDb: linkdb: /usr/tmp/lee_apollo/linkdb
>> LinkDb: URL normalize: true
>> LinkDb: URL filter: true
>> LinkDb: adding segment: /usr/tmp/lee_apollo/segments/20070628155854
>> LinkDb: done
>> Indexer: starting
>> Indexer: linkdb: /usr/tmp/lee_apollo/linkdb
>> Indexer: adding segment: /usr/tmp/lee_apollo/segments/20070628155854
>> Indexing [http://www.mt-olympus.com/apollo/feed/] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@114b82b (null)
>> Optimizing index.
>> merging segments _ram_0 (1 docs) into _0 (1 docs)
>> org.apache.lucene.index.IndexFileDeleter@1f99eea Thread-36: now checkpoint "segments_2" [isCommit = true]
>> org.apache.lucene.index.IndexFileDeleter@1f99eea Thread-36: IncRef "_0.fnm": pre-incr count is 0
>> org.apache.lucene.index.IndexFileDeleter@1f99eea Thread-36: IncRef "_0.fdx": pre-incr count is 0
>> org.apache.lucene.index.IndexFileDeleter@1f99eea Thread-36: IncRef "_0.fdt": pre-incr count is 0
>> org.apache.lucene.index.IndexFileDeleter@1f99eea Thread-36: IncRef "_0.tii": pre-incr count is 0
>> org.apache.lucene.index.IndexFileDeleter@1f99eea Thread-36: IncRef "_0.tis": pre-incr count is 0
>> org.apache.lucene.index.IndexFileDeleter@1f99eea Thread-36: IncRef "_0.frq": pre-incr count is 0
>> org.apache.lucene.index.IndexFileDeleter@1f99eea Thread-36: IncRef "_0.prx": pre-incr count is 0
>> org.apache.lucene.index.IndexFileDeleter@1f99eea Thread-36: IncRef "_0.nrm": pre-incr count is 0
>> org.apache.lucene.index.IndexFileDeleter@1f99eea Thread-36: deleteCommits: now remove commit "segments_1"
>> org.apache.lucene.index.IndexFileDeleter@1f99eea Thread-36: DecRef "segments_1": pre-decr count is 1
>> org.apache.lucene.index.IndexFileDeleter@1f99eea Thread-36: delete "segments_1"
>> Indexer: done
>> Dedup: starting
>> Dedup: adding indexes in: /usr/tmp/lee_apollo/indexes
>> Dedup: done
>> merging indexes to: /usr/tmp/lee_apollo/index
>> Adding /usr/tmp/lee_apollo/indexes/part-00000
>> done merging
>> crawl finished: /usr/tmp/lee_apollo
>> 30.45 real         8.40 user         2.26 sys
>>
>> ----- Original Message ----
>> From: Doğacan Güney
>> To: nutch-user@lucene.apache.org
>> Sent: Wednesday, June 27, 2007 10:59:52 PM
>> Subject: Re: Possibly use a different library to parse RSS feed for improved performance and compatibility
>>
>> On 6/28/07, Kai_testing Middleton wrote:
>>> I am choosing to use NUTCH-444 for my RSS functionality. Doğacan commented on how to do this; he wrote:
>>>
>>>   ...if you need the functionality of NUTCH-444, I would suggest
>>>   trying a nightly version of Nutch. Because NUTCH-444 by itself is not
>>>   enough. You also need two patches from NUTCH-443 and probably
>>>   NUTCH-504.
>>>
>>> I have a couple of newbie questions about the mechanics of installing this.
>>>
>>> Prefatory comments: I have already installed another patch (for NUTCH-505), so I think I already have a nightly build (I'm guessing trunk == nightly?). These were the steps I did:
>>>
>>> $ svn co http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
>>> $ cd nutch
>>> $ wget https://issues.apache.org/jira/secure/attachment/12360411/NUTCH-505_draft_v2.patch
>>> $ patch -p0 < NUTCH-505_draft_v2.patch
>>> $ ant clean && ant
>>>
>>> ---
>>>
>>> Now I need NUTCH-443, NUTCH-504 and NUTCH-444.
>>> Here's my guess:
>>>
>>> $ cd nutch
>>> $ wget http://issues.apache.org/jira/secure/attachment/12359953/NUTCH_443_reopened_v3.patch
>>> $ patch -p0 < NUTCH_443_reopened_v3.patch
>>> $ wget http://issues.apache.org/jira/secure/attachment/12350644/parse-map-core-draft-v1.patch
>>> $ patch -p0 < parse-map-core-draft-v1.patch
>>> $ wget http://issues.apache.org/jira/secure/attachment/12350634/parse-map-core-untested.patch
>>> $ patch -p0 < parse-map-core-untested.patch
>>> $ wget http://issues.apache.org/jira/secure/attachment/12357183/redirect_and_index.patch
>>> $ patch -p0 < redirect_and_index.patch
>>> $ wget http://issues.apache.org/jira/secure/attachment/12357300/redirect_and_index_v2.patch
>>> $ patch -p0 < redirect_and_index_v2.patch
>>>
>>> I'm really guessing on the above ... continuing:
>>>
>>> $ wget http://issues.apache.org/jira/secure/attachment/12360361/NUTCH-504_v2.patch
>>> $ patch -p0 < NUTCH-504_v2.patch
>>> $ wget http://issues.apache.org/jira/secure/attachment/12360348/parse_in_fetchers.patch
>>> $ patch -p0 < parse_in_fetchers.patch
>>>
>>> ... that felt like less of a guess, but now:
>>>
>>> $ wget http://issues.apache.org/jira/secure/attachment/12357192/NUTCH-444.patch
>>> $ patch -p0 < NUTCH-444.patch
>>> $ wget http://issues.apache.org/jira/secure/attachment/12350820/parse-feed.tar.bz2
>>> $ tar xjvf parse-feed.tar.bz2
>>>
>>> What do I do with this newly created parse-feed directory?
>>>
>>> So then I would do:
>>>
>>> $ ant clean && ant
>>>
>>> Wait a minute: do I have this whole thing wrong? Maybe Doğacan means that the nightly builds ALREADY contain NUTCH-443 and NUTCH-504, so that I would do this:
>>>
>>> $ wget http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/lastStableBuild/artifact/trunk/build/nutch-2007-06-27_06-52-44.tar.gz
>>> $ tar xvzf nutch-2007-06-27_06-52-44.tar.gz
>>> $ cd nutch-2007-06-27_06-52-44
>>>
>>> then this business:
>>>
>>> $ wget http://issues.apache.org/jira/secure/attachment/12357192/NUTCH-444.patch
>>> $ patch -p0 < NUTCH-444.patch
>>> $ wget http://issues.apache.org/jira/secure/attachment/12350820/parse-feed.tar.bz2
>>> $ tar xjvf parse-feed.tar.bz2
>>>
>>> What do I do with this newly created parse-feed directory?
>>>
>>> So then I would do:
>>>
>>> $ ant clean && ant
>>>
>>> I guess this is why "release engineer" is a job in and of itself!
>>> Please advise.
>>
>> If you downloaded the nightly build of 27th June, it contains the feed plugin already (the plugin is called "feed", not "parse-feed"; parse-feed was an older plugin and was never committed. In my earlier comment, I meant to write parse-rss but wrote parse-feed). So you don't have to apply any patches or anything. Just download a recent nightly build and you are good to go :).
>>
>> You can also check out trunk from svn and it will work too.
>>
>>> --Kai Middleton
>>>
>>> ----- Original Message ----
>>> From: Doğacan Güney
>>> To: nutch-user@lucene.apache.org
>>> Sent: Friday, June 22, 2007 1:39:12 AM
>>> Subject: Re: Possibly use a different library to parse RSS feed for improved performance and compatibility
>>>
>>> On 6/21/07, Kai_testing Middleton wrote:
>>>> I am a new nutch user and the ability to crawl RSS feeds is critical to my mission. Do I understand from this (lengthy) discussion that in order to get the new RSS functionality I need to either a) download one of the nightly builds and run ant, or b) download and apply a patch (NUTCH-444.patch, I gather)?
>>> Nutch 0.9 can already parse RSS feeds (via the parse-feed plugin).
>>> However, if you need the functionality of NUTCH-444, I would suggest
>>> trying a nightly version of Nutch, because NUTCH-444 by itself is not
>>> enough: you also need two patches from NUTCH-443 and probably
>>> NUTCH-504. If you are worried about stability, nightlies of Nutch are
>>> generally pretty stable.
>>>
>>> --
>>> Doğacan Güney
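To summarize the thread, here is an untested sketch of the whole procedure. The nightly URL and file name are the ones quoted in the messages above; substitute whatever the current nightly is:

$ # grab and unpack a recent nightly, which already contains the feed plugin
$ wget http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/lastStableBuild/artifact/trunk/build/nutch-2007-06-27_06-52-44.tar.gz
$ tar xvzf nutch-2007-06-27_06-52-44.tar.gz
$ cd nutch-2007-06-27_06-52-44
$
$ # edit conf/nutch-site.xml so that the plugin.includes value contains both
$ # scoring-opic and feed, with each plugin id unbroken, then sanity-check it:
$ grep -A 2 'plugin.includes' conf/nutch-site.xml
$
$ # crawl as before; if the injector fails again, look in logs/hadoop.log
$ nutch crawl /usr/tmp/lee_urls.txt -dir /usr/tmp/lee_apollo -depth 2 2>&1 | tee crawl.log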