nutch-user mailing list archives

From Bryan Woliner <bryan.woli...@gmail.com>
Subject Basic Whole-Web Crawl Question: Problem running fetch for the first time
Date Wed, 06 Jul 2005 14:32:15 GMT
Hi,

I was able to crawl/index/search a couple of sites using the "intranet crawl" 
instructions in the tutorial. I am now trying to go through the whole-web 
crawl instructions in the tutorial and only got through a few steps before I 
ran into an error the first time I called bin/nutch fetch.

(Note: the file urlsWW, used in the inject statement below, contains only 
one URL for testing purposes, so it currently reads: 
http://www.democracynow.org)

Here is what happened: 

Bryan@bryanscomputer /usr/local/nutch-0.6
$ mkdir db2

Bryan@bryanscomputer /usr/local/nutch-0.6
$ mkdir segments2

Bryan@bryanscomputer /usr/local/nutch-0.6
$ bin/nutch admin db2 -create
050705 234131 No NutchFileSystem indicated, so defaulting to local fs.
050705 234131 loading file:/C:/cygwin/usr/local/nutch-0.6/conf/nutch-
default.xm

050705 234132 loading file:/C:/cygwin/usr/local/nutch-0.6/conf/nutch-
site.xml
050705 234132 Created webdb at LocalFS,db2

Bryan@bryanscomputer /usr/local/nutch-0.6
$ bin/nutch inject db2 -urlfile urlsWW
050705 234332 loading file:/C:/cygwin/usr/local/nutch-0.6/conf/nutch-
default.xm

050705 234333 loading file:/C:/cygwin/usr/local/nutch-0.6/conf/nutch-
site.xml
050705 234333 No NutchFileSystem indicated, so defaulting to local fs.
050705 234333 Starting URL processing
050705 234333 Using URL filter: net.nutch.net.RegexURLFilter
050705 234333 found resource regex-urlfilter.txt at 
file:/C:/cygwin/usr/local/n
tch-0.6/conf/regex-urlfilter.txt
050705 234333 Using URL normalizer: net.nutch.net.BasicUrlNormalizer
050705 234333 Added 1 pages
050705 234333 Processing pagesByURL: Sorted 1 instructions in 0.0 seconds.
050705 234333 Processing pagesByURL: Sorted Infinity instructions/second
050705 234333 Processing pagesByURL: Merged to new DB containing 1 records 
in 0
0 seconds
050705 234333 Processing pagesByURL: Merged Infinity records/second
050705 234333 Processing pagesByMD5: Sorted 1 instructions in 0.0 seconds.
050705 234333 Processing pagesByMD5: Sorted Infinity instructions/second
050705 234333 Processing pagesByMD5: Merged to new DB containing 1 records 
in 0
0 seconds
050705 234333 Processing pagesByMD5: Merged Infinity records/second
050705 234333 Processing linksByMD5: Copied file (0 bytes) in 0.015 secs.
050705 234333 Processing linksByURL: Copied file (0 bytes) in 0.016 secs.

Bryan@bryanscomputer /usr/local/nutch-0.6
$ bin/nutch generate db2 segments2
050705 234455 No NutchFileSystem indicated, so defaulting to local fs.
050705 234455 FetchListTool started
050705 234455 loading file:/C:/cygwin/usr/local/nutch-0.6/conf/nutch-
default.xm

050705 234455 loading file:/C:/cygwin/usr/local/nutch-0.6/conf/nutch-
site.xml
050705 234456 Processing pagesByURL: Sorted 1 instructions in 0.015 seconds.
050705 234456 Processing pagesByURL: Sorted 66.66666666666667instructions/seco
d
050705 234456 Processing pagesByURL: Merged to new DB containing 1 records 
in 0
0 seconds
050705 234456 Processing pagesByURL: Merged Infinity records/second
050705 234456 Processing pagesByMD5: Sorted 1 instructions in 0.0 seconds.
050705 234456 Processing pagesByMD5: Sorted Infinity instructions/second
050705 234456 Processing pagesByMD5: Merged to new DB containing 1 records 
in 0
0 seconds
050705 234456 Processing pagesByMD5: Merged Infinity records/second
050705 234456 Processing linksByMD5: Copied file (0 bytes) in 0.016 secs.
050705 234456 Processing linksByURL: Copied file (0 bytes) in 0.015 secs.
050705 234456 Processing segments2\20050705234455\fetchlist.unsorted: Sorted 
1
ntries in 0.0 seconds.
050705 234456 Processing segments2\20050705234455\fetchlist.unsorted: Sorted 
In
inity entries/second
050705 234456 Overall processing: Sorted 1 entries in 0.0 seconds.
050705 234456 Overall processing: Sorted 0.0 entries/second
050705 234456 FetchListTool completed

Bryan@bryanscomputer /usr/local/nutch-0.6
$ s1='ls -d segments/2* | tail -1'

Bryan@bryanscomputer /usr/local/nutch-0.6
$ echo $s1
ls -d segments/20050701222333 | tail -1

Bryan@bryanscomputer /usr/local/nutch-0.6
$ bin/nutch fetch $s1
050705 234611 loading file:/C:/cygwin/usr/local/nutch-0.6/conf/nutch-
default.xm

050705 234612 loading file:/C:/cygwin/usr/local/nutch-0.6/conf/nutch-
site.xml
050705 234612 No NutchFileSystem indicated, so defaulting to local fs.
Exception in thread "main" java.io.IOException: File does not exist
at net.nutch.fs.LocalFileSystem.open(LocalFileSystem.java:77)
at net.nutch.io.SequenceFile$Reader.<init>(SequenceFile.java:143)
at net.nutch.io.SequenceFile$Reader.<init>(SequenceFile.java:136)
at net.nutch.io.MapFile$Reader.<init>(MapFile.java:171)
at net.nutch.io.MapFile$Reader.<init>(MapFile.java:160)
at net.nutch.io.ArrayFile$Reader.<init>(ArrayFile.java:37)
at net.nutch.fetcher.Fetcher.<init>(Fetcher.java:235)
at net.nutch.fetcher.Fetcher.main(Fetcher.java:413)

Bryan@bryanscomputer /usr/local/nutch-0.6
$

Any suggestions are much appreciated,
Bryan
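
One possible cause, offered as a sketch rather than a confirmed diagnosis: the transcript assigns s1 with single quotes, so the variable holds the text of the ls pipeline rather than its output. The echo output is consistent with this, since it prints the pipeline text with only the glob expanded. The path also refers to segments, while the generate step wrote to segments2. The tutorial step most likely intends command substitution (backticks or $(...)). The example below illustrates the difference using made-up segment directories inside a temporary directory; the variable names literal and newest are hypothetical.

```shell
#!/bin/sh
# Sketch: single quotes vs. command substitution when picking the newest segment.
# The segment directory names are invented for this demonstration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p segments2/20050701222333 segments2/20050705234455

# Single quotes store the command text itself; nothing is executed.
literal='ls -d segments2/2* | tail -1'

# $(...) (equivalent to backticks) runs the pipeline and stores its output.
newest=$(ls -d segments2/2* | tail -1)

printf 'literal: %s\n' "$literal"
printf 'newest:  %s\n' "$newest"
```

If this is indeed the problem, assigning s1 with command substitution against segments2 and then running bin/nutch fetch $s1 would pass the fetcher a single existing segment directory instead of the words of the ls pipeline.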
