nutch-user mailing list archives

From vishal vachhani <vishal...@gmail.com>
Subject Re: Nutch Crash during db update
Date Wed, 02 Sep 2009 09:59:16 GMT
I have also seen this exception, but I have not been able to figure out
exactly why it happens. When I dump my segment using "readseg", it also
throws an exception, so I suspect that my segment got corrupted. Please
try to dump your segments and check whether they can be dumped.

Let me know if you are able to solve the problem.
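
For reference, the dump can be attempted like this; a minimal sketch,
assuming the standard bin/nutch layout (the segment path is just an
example taken from the log below):

  # Dump the suspect segment; a corrupt segment usually fails here
  # with the same kind of read exception.
  bin/nutch readseg -dump crawl/segments/20090901230006 /tmp/segdump

  # The underlying SequenceFile parts can also be probed directly
  # (the part path below is an assumption about the segment layout).
  bin/hadoop dfs -text /user/nutch/crawl/segments/20090901230006/crawl_fetch/part-00000/data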


On Wed, Sep 2, 2009 at 2:23 PM, zzeran <zzeran@gmail.com> wrote:

>
> Hi,
>
> I'm new to Nutch and so far very impressed!
>
> I've been investigating Nutch for the past two weeks and now I've started
> fetching pages from the internet (I'm doing a specific crawl on a few
> selected domains).
>
> I'm running Nutch with Hadoop on two machines (Ubuntu 9.04), using the
> DFS.
>
> I've tried running Nutch several times, but every time it crashes after
> 4-5 hours of crawling (even after I've formatted the DFS and restarted
> the crawl).
>
> I've created a "loop" in a shell script that executes all the crawling
> phases. On the loop iteration that caused the crash (after 4-5 hours) I'm
> getting the following:
>
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20090901230006
> Generator: filtering: true
> Generator: topN: 10000
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> processing segment /user/nutch/crawl/segments/20090901230006
> Fetcher: starting
> Fetcher: segment: /user/nutch/crawl/segments/20090901230006
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [/user/nutch/crawl/segments/20090901230006]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> java.io.EOFException
>    at java.io.DataInputStream.readFully(DataInputStream.java:197)
>    at java.io.DataInputStream.readFully(DataInputStream.java:169)
>    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
>    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
>    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
>    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
>    at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
>    at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58)
>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
>    at org.apache.hadoop.mapred.Child.main(Child.java:158)
>
> [the same java.io.EOFException stack trace repeats twice more]
>
> CrawlDb update: java.io.IOException: Job failed!
>    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
>    at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:94)
>    at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:189)
>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>    at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:150)
>
>
> Any ideas?
>
> Thanks,
> Eran
>
>
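
For reference, the script loop described in the quoted message would
typically chain the standard Nutch phases. A minimal sketch, assuming the
usual bin/nutch layout (paths, topN, and the way the newest segment is
picked are assumptions, not the actual script):

  # One crawl cycle on DFS; paths and topN mirror the quoted log.
  bin/nutch generate crawl/crawldb crawl/segments -topN 10000
  # Pick the newest segment just generated; assumes "hadoop dfs -ls"
  # prints the path in its last column and that the timestamped
  # segment names sort chronologically.
  segment=`bin/hadoop dfs -ls crawl/segments | tail -1 | awk '{print $NF}'`
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment

If the fetch step dies partway through writing the segment, the partly
written SequenceFiles can produce exactly the EOFException seen at the
updatedb step, which would fit the corrupted-segment suspicion above.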


-- 
Thanks and Regards,
Vishal Vachhani
