nutch-user mailing list archives

From Dinçer Kavraal <dkavr...@gmail.com>
Subject Re: Error "Input path does not exist" when crawling
Date Mon, 01 Aug 2011 11:24:32 GMT
Hi Christian,

The URLs do not actually matter; the same URLs will do. Just run the crawl
once more, exactly as in the first run. I am not out of disk space
(especially not in tmp), and I can sometimes complete a crawl this way
without problems (yes, I have some other problems, such as redirection).

But once I have hit this error, rerunning the crawl with:
# bin/nutch crawl -dir crawlIntoDir urlsDir -depth 2 -threads 25
gives the same error again.

One more thing: could you share your stats, as in:

$ bin/nutch readdb crawl-dir/crawldb -stats
CrawlDb statistics start: crawl-dir/crawldb
Statistics for CrawlDb: crawl-dir/crawldb
TOTAL urls: 956
retry 0: 956
min score: 0.0
avg score: 0.009015691
max score: 1.339
status 1 (db_unfetched): 790
status 2 (db_fetched): 126
status 4 (db_redir_temp): 19
status 5 (db_redir_perm): 21
CrawlDb statistics: done


and

$ bin/nutch readseg -list crawl-dir/segments/*
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20110730005815  3          2011-07-30T00:58:18  2011-07-30T00:58:18  3        3
20110730005828  163        2011-07-30T00:58:30  2011-07-30T01:05:32  201      123


When I got that error, the segment list showed that one or more segments
had not finished properly. As you can see, my segments now look fine. What
about yours?
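
A rough way to spot an unfinished segment without reading the table by eye
is to check each segment directory for the subdirectories a fully fetched and
parsed segment normally has. This is only a sketch: it assumes the usual
Nutch 1.x segment layout, and incomplete_segments is a hypothetical helper
name, not something that ships with Nutch.

```shell
#!/bin/sh
# Sketch only: report segments that lack any subdirectory normally present
# after generate, fetch and parse have all completed.
# Assumption: standard Nutch 1.x segment layout (the six names below).
# incomplete_segments is a hypothetical helper, not a Nutch command.
incomplete_segments() {
    segments_dir=$1
    for seg in "$segments_dir"/*/; do
        # Skip the unexpanded pattern when the directory has no segments.
        [ -d "$seg" ] || continue
        for sub in crawl_generate crawl_fetch content \
                   crawl_parse parse_data parse_text; do
            if [ ! -d "$seg$sub" ]; then
                echo "$(basename "$seg"): missing $sub"
            fi
        done
    done
}
```

Calling it as `incomplete_segments crawl-dir/segments` prints one line per
missing subdirectory and prints nothing when every segment looks complete.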

Dinçer


2011/8/1 Christian Weiske <christian.weiske@netresearch.de>

> Hello Dinçer,
>
>
> > > Somewhere during the crawling process I get an error that stops
> > > everything:
> > >
> > > Exception in thread "main"
> > > org.apache.hadoop.mapred.InvalidInputException: Input path does not
> > > exist:
> > > file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801090707
>
> > I have had same problem in one of my instances. Let's dig together, at
> > least. I have tried to re-crawl the url list into same crawl directory
> > (crawl-301 in your case) and got the same error, will you confirm for
> > your case?
>
> How do you re-crawl the list? Is there a specific URL list in the
> segment?
>
> --
> Viele Grüße
> Christian Weiske
>
