From: jibjoice <sudarat_jib@hotmail.com>
To: hadoop-user@lucene.apache.org
Date: Wed, 2 Jan 2008 02:08:09 -0800 (PST)
Subject: Re: Nutch crawl problem

I am crawling "http://lucene.apache.org", and in conf/crawl-urlfilter.txt I set:

+^http://([a-z0-9]*\.)*apache.org/

When I run the command "bin/nutch crawl urls -dir crawled -depth 3", it fails with this error:

- crawl started in: crawled
- rootUrlDir = urls
- threads = 10
- depth = 3
- Injector: starting
- Injector: crawlDb: crawled/crawldb
- Injector: urlDir: urls
- Injector: Converting injected urls to crawl db entries.
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /user/nutch/urls
        at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
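The injector reads the seed list from HDFS, so the /user/nutch/urls path in the error has to exist there before the crawl starts. For the second run I put the seeds into an inputs directory instead, roughly like this (a sketch from memory; seed.txt is a placeholder name, not necessarily the exact file I used):

# local file with the start URL (seed.txt is a placeholder name)
echo "http://lucene.apache.org" > seed.txt
# relative paths resolve under /user/nutch because the crawl runs as user nutch
bin/hadoop dfs -mkdir inputs
bin/hadoop dfs -put seed.txt inputs/seed.txt
# confirm the injector will see it
bin/hadoop dfs -ls inputs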
The second run then gets through the injector:

-bash-3.1$ bin/nutch crawl inputs -dir crawled -depth 3
- crawl started in: crawled
- rootUrlDir = inputs
- threads = 10
- depth = 3
- Injector: starting
- Injector: crawlDb: crawled/crawldb
- Injector: urlDir: inputs
- Injector: Converting injected urls to crawl db entries.
- Total input paths to process : 1
- Running job: job_0001
- map 0% reduce 0%
- map 100% reduce 0%
- map 100% reduce 16%
- map 100% reduce 58%
- map 100% reduce 100%
- Job complete: job_0001
- Counters: 6
- Map-Reduce Framework
- Map input records=3
- Map output records=1
- Map input bytes=25
- Map output bytes=55
- Reduce input records=1
- Reduce output records=1
- Injector: Merging injected urls into crawl db.
- Total input paths to process : 2
- Running job: job_0002
- map 0% reduce 0%
- Task Id : task_0002_m_000000_0, Status : FAILED
task_0002_m_000000_0: - Plugins: looking in: /nutch/search/build/plugins
task_0002_m_000000_0: - Plugin Auto-activation mode: [true]
task_0002_m_000000_0: - Registered Plugins:
task_0002_m_000000_0: - the nutch core extension points (nutch-extensionpoints)
task_0002_m_000000_0: - Basic Query Filter (query-basic)
task_0002_m_000000_0: - Basic URL Normalizer (urlnormalizer-basic)
task_0002_m_000000_0: - Basic Indexing Filter (index-basic)
task_0002_m_000000_0: - Html Parse Plug-in (parse-html)
task_0002_m_000000_0: - Basic Summarizer Plug-in (summary-basic)
task_0002_m_000000_0: - Site Query Filter (query-site)
task_0002_m_000000_0: - HTTP Framework (lib-http)
task_0002_m_000000_0: - Text Parse Plug-in (parse-text)
task_0002_m_000000_0: - Regex URL Filter (urlfilter-regex)
task_0002_m_000000_0: - Pass-through URL Normalizer (urlnormalizer-pass)
task_0002_m_000000_0: - Http Protocol Plug-in (protocol-http)
task_0002_m_000000_0: - Regex URL Normalizer (urlnormalizer-regex)
task_0002_m_000000_0: - OPIC Scoring Plug-in (scoring-opic)
task_0002_m_000000_0: - CyberNeko HTML Parser (lib-nekohtml)
task_0002_m_000000_0: - JavaScript Parser (parse-js)
task_0002_m_000000_0: - URL Query Filter (query-url)
task_0002_m_000000_0: - Regex URL Filter Framework (lib-regex-filter)
task_0002_m_000000_0: - Registered Extension-Points:
task_0002_m_000000_0: - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
task_0002_m_000000_0: - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
task_0002_m_000000_0: - Nutch Protocol (org.apache.nutch.protocol.Protocol)
task_0002_m_000000_0: - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
task_0002_m_000000_0: - Nutch URL Filter (org.apache.nutch.net.URLFilter)
task_0002_m_000000_0: - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
task_0002_m_000000_0: - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
task_0002_m_000000_0: - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
task_0002_m_000000_0: - Nutch Content Parser (org.apache.nutch.parse.Parser)
task_0002_m_000000_0: - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
task_0002_m_000000_0: - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
task_0002_m_000000_0: - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
task_0002_m_000000_0: - found resource crawl-urlfilter.txt at file:/nutch/search/conf/crawl-urlfilter.txt
- map 50% reduce 0%
- map 100% reduce 0%
- map 100% reduce 8%
- map 100% reduce 25%
- map 100% reduce 58%
- map 100% reduce 100%
- Job complete: job_0002
- Counters: 6
- Map-Reduce Framework
- Map input records=3
- Map output records=1
- Map input bytes=63
- Map output bytes=55
- Reduce input records=1
- Reduce output records=1
- Injector: done
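One map attempt in job_0002 failed, but the retry succeeded, so the injector still completed. To see why an attempt fails I have been reading the task logs on the node that ran it, something like this (a sketch; the exact log layout depends on where hadoop.log.dir points in my install, and /nutch/search is my install directory):

# on the node that ran the failed attempt
cd /nutch/search/logs
# per-task output, if the userlogs directory exists in this version
ls userlogs/task_0002_m_000000_0/
# the tasktracker log also records the failure reason
grep task_0002_m_000000_0 hadoop-*-tasktracker-*.log

The same information should also be reachable through the jobtracker web UI (port 50030 by default).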
- Generator: Selecting best-scoring urls due for fetch.
- Generator: starting
- Generator: segment: crawled/segments/25510102165746
- Generator: filtering: false
- Generator: topN: 2147483647
- Total input paths to process : 2
- Running job: job_0003
- map 0% reduce 0%
- map 50% reduce 0%
- map 100% reduce 0%
- map 100% reduce 8%
- map 100% reduce 16%
- map 100% reduce 58%
- map 100% reduce 100%
- Job complete: job_0003
- Counters: 6
- Map-Reduce Framework
- Map input records=3
- Map output records=1
- Map input bytes=62
- Map output bytes=80
- Reduce input records=1
- Reduce output records=1
- Generator: Partitioning selected urls by host, for politeness.
- Total input paths to process : 2
- Running job: job_0004
- map 0% reduce 0%
- map 50% reduce 0%
- map 100% reduce 0%
- Task Id : task_0004_r_000000_0, Status : FAILED
- Task Id : task_0004_r_000001_0, Status : FAILED
- map 100% reduce 8%
- map 100% reduce 0%
- Task Id : task_0004_r_000000_1, Status : FAILED
- Task Id : task_0004_r_000001_1, Status : FAILED
- map 100% reduce 8%
- map 100% reduce 0%
- Task Id : task_0004_r_000000_2, Status : FAILED

I am running hadoop-0.12.2, nutch-0.9, and Java JDK 1.6.0. Why do the reduce tasks of job_0004 keep failing? I have not been able to solve this for a month.
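Since every reduce attempt in job_0004 fails the same way, my guess is that the reduces cannot copy the map output between nodes, so I have been checking hostname resolution and the cluster files on each machine (a sketch of the checks, assuming a multi-node cluster; the task id below is one of the failed attempts):

# each node should resolve every other node's name consistently
hostname
cat /etc/hosts
# the slave list the cluster was started with
cat conf/slaves
# the actual failure reason, from the tasktracker that ran a failed reduce
grep task_0004_r_000000_0 logs/hadoop-*-tasktracker-*.log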