nutch-user mailing list archives

From zzeran <zze...@gmail.com>
Subject Nutch Crash during db update
Date Wed, 02 Sep 2009 08:53:15 GMT

Hi,

I'm new to Nutch and so far very impressed!

I've been investigating Nutch for the past two weeks and now I've started
fetching pages from the internet (I'm doing a specific crawl on a few
selected domains).

I'm running Nutch on Hadoop across two machines, both on Ubuntu 9.04, with
the crawl data stored in the DFS.

I've tried running Nutch several times, but every time it crashes after
4-5 hours of crawling (even after I've formatted the DFS and restarted the
crawl).

I've created a loop in a shell script that executes all the crawling
phases. On the iteration that caused the crash (after 4-5 hours) I'm getting
the following:

Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20090901230006
Generator: filtering: true
Generator: topN: 10000
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
processing segment /user/nutch/crawl/segments/20090901230006
Fetcher: starting
Fetcher: segment: /user/nutch/crawl/segments/20090901230006
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [/user/nutch/crawl/segments/20090901230006]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:197)
    at java.io.DataInputStream.readFully(DataInputStream.java:169)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
    at org.apache.hadoop.mapred.Child.main(Child.java:158)

java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:197)
    at java.io.DataInputStream.readFully(DataInputStream.java:169)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
    at org.apache.hadoop.mapred.Child.main(Child.java:158)

java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:197)
    at java.io.DataInputStream.readFully(DataInputStream.java:169)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
    at org.apache.hadoop.mapred.Child.main(Child.java:158)

CrawlDb update: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
    at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:94)
    at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:189)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:150)
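
One pass of such a loop can be sketched as follows. This is an illustrative
Python rendering rather than the actual shell script: the function names, the
`bin/nutch` path, and passing the segment name in by hand are assumptions,
while the crawl directories and the topN value are taken from the log above.

```python
import subprocess


def crawl_round_commands(crawl_dir, segment, top_n=10000, nutch="bin/nutch"):
    """Return the commands for one generate/fetch/updatedb pass.

    Assumption: in a real loop the segment name is only known after the
    generate step finishes; it is passed in here to keep the sketch short.
    """
    crawldb = f"{crawl_dir}/crawldb"
    segments_dir = f"{crawl_dir}/segments"
    return [
        [nutch, "generate", crawldb, segments_dir, "-topN", str(top_n)],
        [nutch, "fetch", segment],
        [nutch, "updatedb", crawldb, segment],
    ]


def run_round(commands, run=subprocess.run):
    """Run each phase in order.

    check=True makes subprocess.run raise CalledProcessError on a non-zero
    exit status, so a failed phase stops the round instead of letting the
    next phase work on incomplete data.
    """
    for cmd in commands:
        run(cmd, check=True)
```

Stopping at the first failing phase, as `run_round` does, keeps a later
iteration from running against a segment whose earlier phase did not finish.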


Any ideas?

Thanks,
Eran
-- 
View this message in context: http://www.nabble.com/Nutch-Crash-during-db-update-tp25253922p25253922.html
Sent from the Nutch - User mailing list archive at Nabble.com.

