nutch-user mailing list archives

From Tom Gardner <t...@tomg.com>
Subject InvalidInputException: Input path does not exist
Date Thu, 03 Sep 2009 17:23:17 GMT
Hello,

I'm trying to get whole-web crawling working, but I'm getting this error in
the final indexing steps:

LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/parse_data
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/parse_text
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/crawl_fetch
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/crawl_generate
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/content
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/crawl_parse
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/data/crawl/segments/20090903093154/parse_data/parse_data
Input path does not exist: file:/data/crawl/segments/20090903093154/parse_text/parse_data
Can anyone help? This is Nutch 1.0 on a clean install on Fedora Linux.

I've reproduced the same error with the nutch-2009-09-03_05-18-47 build as
well.
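
For what it's worth, the segment itself looks intact: the six "adding segment" lines above show that all of the standard segment subdirectories exist. A quick listing along these lines (illustrative; the path is taken from the log) should confirm it:

ls /data/crawl/segments/20090903093154
# content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text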

My script and full error output are below.

Thanks


------------------------------- Nutch Script -------------------------------

#!/bin/bash
export JAVA_HOME=/usr/local/jdk

# Clean up from last run
/bin/rm -rf crawl seed
mkdir seed
# Copy list of urls to the seed directory
cp urls seed/urls.txt
# Injects urls in the 'seed' directory into the crawldb
/usr/local/nutch/bin/nutch inject crawl/crawldb seed
# Generate fetch list, fetch and parse content
/usr/local/nutch/bin/nutch generate crawl/crawldb crawl/segments
echo DONE GENERATE 1
# The above command will generate a new segment directory
# under crawl/segments that at this point contains files that
# store the url(s) to be fetched. In the following commands
# we need the latest segment dir as a parameter, so we'll store
# it in an environment variable:
SEGMENT=`ls -d crawl/segments/2* | tail -1`
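# (illustrative: with the run logged below, this resolves to
# something like crawl/segments/20090903093154)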
echo SEGMENT 1: $SEGMENT
# Now launch the fetcher that actually goes to get the content
/usr/local/nutch/bin/nutch fetch $SEGMENT -noParsing
echo DONE FETCH 1
# Next, parse the content
/usr/local/nutch/bin/nutch parse $SEGMENT
echo DONE PARSE 1
# Then update the Nutch crawldb. The updatedb command will
# store all new urls discovered during the fetch and parse of
# the previous segment into the Nutch database so they can be
# fetched later. Nutch also stores information about the
# pages that were fetched so the same urls won't be fetched
# again and again.
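# (-filter runs the configured URLFilters and -normalize the
# URLNormalizers over the urls as they are merged into the db)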
/usr/local/nutch/bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
echo DONE UPDATEDB 1
# Now the database has entries for all of the pages referenced by the initial set

# Now we generate a new segment, optionally limited to the top-scoring 1000 pages
# /usr/local/nutch/bin/nutch generate crawl/crawldb crawl/segments -topN 1000
/usr/local/nutch/bin/nutch generate crawl/crawldb crawl/segments
echo DONE GENERATE 2
# reset SEGMENT
SEGMENT=`ls -d crawl/segments/2* | tail -1`
echo SEGMENT 2: $SEGMENT
# Now re-launch the fetcher to get the content
/usr/local/nutch/bin/nutch fetch $SEGMENT -noParsing
echo DONE FETCH 2
# Next, parse the content
/usr/local/nutch/bin/nutch parse $SEGMENT
echo DONE PARSE 2
# update the db
/usr/local/nutch/bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
echo DONE UPDATEDB 2

# Fetch another round
# /usr/local/nutch/bin/nutch generate crawl/crawldb crawl/segments -topN 1000
/usr/local/nutch/bin/nutch generate crawl/crawldb crawl/segments
echo DONE GENERATE 3
# reset SEGMENT
SEGMENT=`ls -d crawl/segments/2* | tail -1`
echo SEGMENT 3: $SEGMENT
# Now re-launch the fetcher to get the content
/usr/local/nutch/bin/nutch fetch $SEGMENT -noParsing
echo DONE FETCH 3
# Next, parse the content
/usr/local/nutch/bin/nutch parse $SEGMENT
echo DONE PARSE 3
# update the db
/usr/local/nutch/bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
echo DONE UPDATEDB 3

#
# We now index what we've gotten
#
# Before indexing we first invert all of the links,
# so that we may index incoming anchor text with the pages.
/usr/local/nutch/bin/nutch invertlinks crawl/linkdb -dir crawl/segments/*
# Then index
/usr/local/nutch/bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
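
For reference, after a clean run I'd expect crawl/ to end up containing roughly this (illustrative layout):

crawl/crawldb
crawl/linkdb
crawl/indexes
crawl/segments/<three timestamped dirs, e.g. 20090903093154>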

------------------------------- Nutch Errors -------------------------------

-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-activeThreads=2, spinWaiting=2, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: done
DONE FETCH 3
DONE PARSE 3
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20090903093336]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
DONE UPDATEDB 3
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/parse_data
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/parse_text
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/crawl_fetch
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/crawl_generate
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/content
LinkDb: adding segment: file:/data/crawl/segments/20090903093154/crawl_parse
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/data/crawl/segments/20090903093154/parse_data/parse_data
Input path does not exist: file:/data/crawl/segments/20090903093154/parse_text/parse_data
Input path does not exist: file:/data/crawl/segments/20090903093154/crawl_fetch/parse_data
Input path does not exist: file:/data/crawl/segments/20090903093154/crawl_generate/parse_data
Input path does not exist: file:/data/crawl/segments/20090903093154/content/parse_data
Input path does not exist: file:/data/crawl/segments/20090903093154/crawl_parse/parse_data
 at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
 at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
 at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
 at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
 at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)
Indexer: starting
Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/data/crawl/linkdb/current
 at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
 at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
 at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
 at org.apache.nutch.indexer.Indexer.index(Indexer.java:72)
 at org.apache.nutch.indexer.Indexer.run(Indexer.java:92)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.nutch.indexer.Indexer.main(Indexer.java:101)
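
One pattern that may be relevant: LinkDb is treating each subdirectory of the first segment (parse_data, parse_text, crawl_fetch, ...) as if it were a segment itself, and then looking for a parse_data directory inside each of them. My script hands invertlinks a shell glob, so after expansion the command begins (illustrative; only the first expanded path shown):

/usr/local/nutch/bin/nutch invertlinks crawl/linkdb -dir crawl/segments/20090903093154 ...

i.e. the argument to -dir ends up being a single segment rather than the segments directory. The Indexer failure afterwards looks like fallout from that: crawl/linkdb/current doesn't exist because the LinkDb job never completed.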
