nutch-user mailing list archives

From मनोज <Manoj> <manojonem...@gmail.com>
Subject Problem of InvalidException in Nutch : Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException
Date Thu, 24 Nov 2011 10:24:10 GMT
Hi,
I am facing a problem with Apache Nutch 1.3. The output is given below.
Please help. Thanks in advance.
manoj@cmk-manoj-ossd:~/tools/software/nutch/runtime/local$ bin/nutch crawl urls -dir crawl -depth 3 -topN 5
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 5
Injector: starting at 2011-11-24 15:45:15
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-11-24 15:45:17, elapsed: 00:00:02
Generator: starting at 2011-11-24 15:45:17
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20111124154519
Generator: finished at 2011-11-24 15:45:21, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2011-11-24 15:45:21
Fetcher: segment: crawl/segments/20111124154519
Fetcher: threads: 10
QueueFeeder finished: total 1 records + hit by time limit :0
fetching http://nutch.apache.org/
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-11-24 15:45:43, elapsed: 00:00:22
ParseSegment: starting at 2011-11-24 15:45:43
ParseSegment: segment: crawl/segments/20111124154519
ParseSegment: finished at 2011-11-24 15:45:44, elapsed: 00:00:01
CrawlDb update: starting at 2011-11-24 15:45:44
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20111124154519]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-11-24 15:45:46, elapsed: 00:00:01
Generator: starting at 2011-11-24 15:45:46
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20111124154548
Generator: finished at 2011-11-24 15:45:49, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2011-11-24 15:45:49
Fetcher: segment: crawl/segments/20111124154548
Fetcher: threads: 10
QueueFeeder finished: total 5 records + hit by time limit :0
fetching http://nutch.apache.org/wiki.html
fetching http://www.apache.org/
fetching http://www.eu.apachecon.com/c/aceu2009/
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=2
* queue: http://nutch.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1322129755709
  now           = 1322129751077
  0. http://nutch.apache.org/mailing_lists.html
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 1
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1322129750061
  now           = 1322129751078
  0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=2
* queue: http://nutch.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1322129755709
  now           = 1322129752078
  0. http://nutch.apache.org/mailing_lists.html
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 1
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1322129750061
  now           = 1322129752079
  0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=2
* queue: http://nutch.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1322129755709
  now           = 1322129753080
  0. http://nutch.apache.org/mailing_lists.html
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 1
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1322129750061
  now           = 1322129753080
  0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=2
* queue: http://nutch.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1322129755709
  now           = 1322129754081
  0. http://nutch.apache.org/mailing_lists.html
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 1
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1322129750061
  now           = 1322129754081
  0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=2
* queue: http://nutch.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1322129755709
  now           = 1322129755083
  0. http://nutch.apache.org/mailing_lists.html
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 1
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1322129750061
  now           = 1322129755083
  0. http://www.apache.org/dyn/closer.cgi/nutch/
fetching http://nutch.apache.org/mailing_lists.html
-activeThreads=10, spinWaiting=7, fetchQueues.totalSize=1
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 1
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1322129750061
  now           = 1322129756083
  0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 1
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1322129750061
  now           = 1322129757084
  0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 1
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1322129750061
  now           = 1322129758085
  0. http://www.apache.org/dyn/closer.cgi/nutch/
fetch of http://www.eu.apachecon.com/c/aceu2009/ failed with:
java.net.UnknownHostException: www.eu.apachecon.com
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=1
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 1
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1322129750061
  now           = 1322129759085
  0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1322129764028
  now           = 1322129760086
  0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1322129764028
  now           = 1322129761086
  0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1322129764028
  now           = 1322129762087
  0. http://www.apache.org/dyn/closer.cgi/nutch/
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1322129764028
  now           = 1322129763088
  0. http://www.apache.org/dyn/closer.cgi/nutch/
fetching http://www.apache.org/dyn/closer.cgi/nutch/
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-activeThreads=3, spinWaiting=2, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-11-24 15:46:06, elapsed: 00:00:17
ParseSegment: starting at 2011-11-24 15:46:06
ParseSegment: segment: crawl/segments/20111124154548
ParseSegment: finished at 2011-11-24 15:46:08, elapsed: 00:00:01
CrawlDb update: starting at 2011-11-24 15:46:08
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20111124154548]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-11-24 15:46:09, elapsed: 00:00:01
Generator: starting at 2011-11-24 15:46:09
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20111124154611
Generator: finished at 2011-11-24 15:46:12, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2011-11-24 15:46:12
Fetcher: segment: crawl/segments/20111124154611
Fetcher: threads: 10
fetching http://hadoop.apache.org/
fetching http://nutch.apache.org/index.html
fetching http://www.apache.org/licenses/
fetching http://forrest.apache.org/
QueueFeeder finished: total 5 records + hit by time limit :0
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1322129777960
  now           = 1322129774091
  0. http://www.apache.org/foundation/sponsorship.html
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1322129777960
  now           = 1322129775092
  0. http://www.apache.org/foundation/sponsorship.html
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1322129777960
  now           = 1322129776092
  0. http://www.apache.org/foundation/sponsorship.html
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=1
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1322129777960
  now           = 1322129777093
  0. http://www.apache.org/foundation/sponsorship.html
fetching http://www.apache.org/foundation/sponsorship.html
-finishing thread FetcherThread, activeThreads=9
-activeThreads=9, spinWaiting=6, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-11-24 15:46:23, elapsed: 00:00:11
ParseSegment: starting at 2011-11-24 15:46:23
ParseSegment: segment: crawl/segments/20111124154611
ParseSegment: finished at 2011-11-24 15:46:25, elapsed: 00:00:01
CrawlDb update: starting at 2011-11-24 15:46:25
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20111124154611]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-11-24 15:46:26, elapsed: 00:00:01
LinkDb: starting at 2011-11-24 15:46:26
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124154548
LinkDb: adding segment: file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124154611
LinkDb: adding segment: file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124152057
LinkDb: adding segment: file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124154415
LinkDb: adding segment: file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124154519
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124152057/parse_data
Input path does not exist: file:/home/manoj/tools/software/nutch/runtime/local/crawl/segments/20111124154415/parse_data
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:142)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
manoj@cmk-manoj-ossd:~/tools/software/nutch/runtime/local$ vi urls/nutch
manoj@cmk-manoj-ossd:~/tools/software/nutch/runtime/local$
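
The two paths named in the exception (segments 20111124152057 and 20111124154415) were not generated by this run, which only created 20111124154519, 20111124154548 and 20111124154611. They are presumably left over from earlier runs into the same crawl directory and have no parse_data subdirectory. A minimal sketch for listing such segments, assuming the crawl/segments layout shown in the log; the helper name find_incomplete_segments is hypothetical and not part of Nutch:

```python
from pathlib import Path


def find_incomplete_segments(segments_dir, required=("parse_data",)):
    """Return names of segment directories missing any required subdirectory."""
    incomplete = []
    for segment in sorted(Path(segments_dir).iterdir()):
        if not segment.is_dir():
            continue  # ignore stray files next to the segment directories
        if any(not (segment / name).is_dir() for name in required):
            incomplete.append(segment.name)
    return incomplete


if __name__ == "__main__":
    # "crawl/segments" is the path taken from the log above; adjust as needed.
    segments = Path("crawl/segments")
    if segments.is_dir():
        for name in find_incomplete_segments(segments):
            print("incomplete segment:", name)
```

Any segment this reports could be moved aside or parsed again before LinkDb is run; which of the two is appropriate depends on whether the data from the earlier runs is still needed.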



-- 
Thanks & Regards

Manoj


India

Office :  022 27565303/4/5  Ext: 313

Mobile : +919323582145
http://twitter.com/aapkamanoj ,  http://aapkamanoj.blogspot.com/
