hadoop-common-user mailing list archives

From jibjoice <sudarat_...@hotmail.com>
Subject Re: Nutch crawl problem
Date Thu, 03 Jan 2008 01:18:43 GMT

I am crawling "http://lucene.apache.org". In conf/crawl-urlfilter.txt I set
"+^http://([a-z0-9]*\.)*apache.org/". When I run "bin/nutch crawl
urls -dir crawled -depth 3" I get this error:

- crawl started in: crawled 
- rootUrlDir = urls 
- threads = 10 
- depth = 3 
- Injector: starting 
- Injector: crawlDb: crawled/crawldb 
- Injector: urlDir: urls 
- Injector: Converting injected urls to crawl db entries. 
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
Input path doesnt exist : /user/nutch/urls 
        at
org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138) 
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326) 
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543) 
        at org.apache.nutch.crawl.Injector.inject(Injector.java:162) 
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:115) 
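The `InvalidInputException` above says `/user/nutch/urls` does not exist: when Nutch runs on a Hadoop cluster, the seed-url directory must be present in HDFS under the user's home directory, not only on the local disk. A minimal sketch of preparing it, assuming you run from the Nutch install dir as user `nutch` (`dfs -put` / `dfs -ls` are the upload and list commands in 0.12-era Hadoop):

```shell
# Create a local seed list with the site to crawl
mkdir -p urls
echo "http://lucene.apache.org/" > urls/seed.txt
grep -q "lucene.apache.org" urls/seed.txt && echo "seed file written"

# Copy it into HDFS so it shows up as /user/nutch/urls, then verify.
# These two lines need a running cluster, so they may fail elsewhere.
bin/hadoop dfs -put urls urls 2>/dev/null || echo "cluster not reachable here"
bin/hadoop dfs -ls urls 2>/dev/null || true
```

The second run below ("inputs" instead of "urls") gets past this point, which suggests that directory did exist in HDFS.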
-bash-3.1$ bin/nutch crawl inputs -dir crawled -depth 3 
- crawl started in: crawled 
- rootUrlDir = inputs 
- threads = 10 
- depth = 3 
- Injector: starting 
- Injector: crawlDb: crawled/crawldb 
- Injector: urlDir: inputs 
- Injector: Converting injected urls to crawl db entries. 
- Total input paths to process : 1 
- Running job: job_0001 
-  map 0% reduce 0% 
-  map 100% reduce 0% 
-  map 100% reduce 16% 
-  map 100% reduce 58% 
-  map 100% reduce 100% 
- Job complete: job_0001 
- Counters: 6 
-   Map-Reduce Framework 
-     Map input records=3 
-     Map output records=1 
-     Map input bytes=25 
-     Map output bytes=55 
-     Reduce input records=1 
-     Reduce output records=1 
- Injector: Merging injected urls into crawl db. 
- Total input paths to process : 2 
- Running job: job_0002 
-  map 0% reduce 0% 
- Task Id : task_0002_m_000000_0, Status : FAILED 
task_0002_m_000000_0: - Plugins: looking in: /nutch/search/build/plugins 
task_0002_m_000000_0: - Plugin Auto-activation mode: [true] 
task_0002_m_000000_0: - Registered Plugins: 
task_0002_m_000000_0: -         the nutch core extension points
(nutch-extensionpoints) 
task_0002_m_000000_0: -         Basic Query Filter (query-basic) 
task_0002_m_000000_0: -         Basic URL Normalizer (urlnormalizer-basic) 
task_0002_m_000000_0: -         Basic Indexing Filter (index-basic) 
task_0002_m_000000_0: -         Html Parse Plug-in (parse-html) 
task_0002_m_000000_0: -         Basic Summarizer Plug-in (summary-basic) 
task_0002_m_000000_0: -         Site Query Filter (query-site) 
task_0002_m_000000_0: -         HTTP Framework (lib-http) 
task_0002_m_000000_0: -         Text Parse Plug-in (parse-text) 
task_0002_m_000000_0: -         Regex URL Filter (urlfilter-regex) 
task_0002_m_000000_0: -         Pass-through URL Normalizer
(urlnormalizer-pass) 
task_0002_m_000000_0: -         Http Protocol Plug-in (protocol-http) 
task_0002_m_000000_0: -         Regex URL Normalizer (urlnormalizer-regex) 
task_0002_m_000000_0: -         OPIC Scoring Plug-in (scoring-opic) 
task_0002_m_000000_0: -         CyberNeko HTML Parser (lib-nekohtml) 
task_0002_m_000000_0: -         JavaScript Parser (parse-js) 
task_0002_m_000000_0: -         URL Query Filter (query-url) 
task_0002_m_000000_0: -         Regex URL Filter Framework
(lib-regex-filter) 
task_0002_m_000000_0: - Registered Extension-Points: 
task_0002_m_000000_0: -         Nutch Summarizer
(org.apache.nutch.searcher.Summarizer) 
task_0002_m_000000_0: -         Nutch URL Normalizer
(org.apache.nutch.net.URLNormalizer) 
task_0002_m_000000_0: -         Nutch Protocol
(org.apache.nutch.protocol.Protocol) 
task_0002_m_000000_0: -         Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer) 
task_0002_m_000000_0: -         Nutch URL Filter
(org.apache.nutch.net.URLFilter) 
task_0002_m_000000_0: -         Nutch Indexing Filter
(org.apache.nutch.indexer.IndexingFilter) 
task_0002_m_000000_0: -         Nutch Online Search Results Clustering
Plugin (org.apache.nutch.clustering.OnlineClusterer) 
task_0002_m_000000_0: -         HTML Parse Filter
(org.apache.nutch.parse.HtmlParseFilter) 
task_0002_m_000000_0: -         Nutch Content Parser
(org.apache.nutch.parse.Parser) 
task_0002_m_000000_0: -         Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter) 
task_0002_m_000000_0: -         Nutch Query Filter
(org.apache.nutch.searcher.QueryFilter) 
task_0002_m_000000_0: -         Ontology Model Loader
(org.apache.nutch.ontology.Ontology) 
task_0002_m_000000_0: - found resource crawl-urlfilter.txt at
file:/nutch/search/conf/crawl-urlfilter.txt 
-  map 50% reduce 0% 
-  map 100% reduce 0% 
-  map 100% reduce 8% 
-  map 100% reduce 25% 
-  map 100% reduce 58% 
-  map 100% reduce 100% 
- Job complete: job_0002 
- Counters: 6 
-   Map-Reduce Framework 
-     Map input records=3 
-     Map output records=1 
-     Map input bytes=63 
-     Map output bytes=55 
-     Reduce input records=1 
-     Reduce output records=1 
- Injector: done 
- Generator: Selecting best-scoring urls due for fetch. 
- Generator: starting 
- Generator: segment: crawled/segments/25510102165746 
- Generator: filtering: false 
- Generator: topN: 2147483647 
- Total input paths to process : 2 
- Running job: job_0003 
-  map 0% reduce 0% 
-  map 50% reduce 0% 
-  map 100% reduce 0% 
-  map 100% reduce 8% 
-  map 100% reduce 16% 
-  map 100% reduce 58% 
-  map 100% reduce 100% 
- Job complete: job_0003 
- Counters: 6 
-   Map-Reduce Framework 
-     Map input records=3 
-     Map output records=1 
-     Map input bytes=62 
-     Map output bytes=80 
-     Reduce input records=1 
-     Reduce output records=1 
- Generator: Partitioning selected urls by host, for politeness. 
- Total input paths to process : 2 
- Running job: job_0004 
-  map 0% reduce 0% 
-  map 50% reduce 0% 
-  map 100% reduce 0% 
- Task Id : task_0004_r_000000_0, Status : FAILED 
- Task Id : task_0004_r_000001_0, Status : FAILED 
-  map 100% reduce 8% 
-  map 100% reduce 0% 
- Task Id : task_0004_r_000000_1, Status : FAILED 
- Task Id : task_0004_r_000001_1, Status : FAILED 
-  map 100% reduce 8% 
-  map 100% reduce 0% 
- Task Id : task_0004_r_000000_2, Status : FAILED 

I am using hadoop-0.12.2, nutch-0.9 and Java JDK 1.6.0. What is causing this? I
have not been able to solve it for a month.
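The client-side job log only shows "Status : FAILED" for the reduce attempts of job_0004; the actual stack trace is written to the per-task log directory on the node that ran each attempt. A sketch of where to look, assuming the default log layout of that Hadoop era and the `/nutch/search` install path visible in the plugin log above (adjust `HADOOP_LOG_DIR` if yours differs):

```shell
# Per-attempt logs (stdout/stderr/syslog) live under logs/userlogs on the
# TaskTracker node that executed the attempt, not on the client machine.
LOGDIR=${HADOOP_LOG_DIR:-/nutch/search/logs}/userlogs

# List the attempt directories, then dump the syslog of one failed reduce
ls "$LOGDIR" 2>/dev/null
cat "$LOGDIR"/task_0004_r_000000_0/syslog 2>/dev/null \
  || echo "attempt did not run on this node; check the other TaskTrackers"
```

Repeated reduce-side failures like these often turn out to be an environment problem (e.g. hostname resolution between nodes) rather than a Nutch configuration error, so the exception in the task syslog is the key piece of information to post.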

-- 
View this message in context: http://www.nabble.com/Nutch-crawl-problem-tp14327978p14589912.html
Sent from the Hadoop Users mailing list archive at Nabble.com.

