nutch-user mailing list archives

From zzeran <zze...@gmail.com>
Subject Re: Nutch Crash during db update
Date Wed, 02 Sep 2009 10:32:40 GMT

Hi Vishal,

Thanks for your help.

I tried dumping the segment as you suggested, and indeed I got the
following error message:

SegmentReader: dump segment: /user/nutch/crawl/segments/20090901230006
java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:197)
	at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer
ava:63)
	at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:1
)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:193
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:206
	at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFile
cordReader.java:76)
	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(Map
sk.java:192)
	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.j
a:176)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
	at org.apache.hadoop.mapred.Child.main(Child.java:158)

java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:197)
	at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer
ava:63)
	at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:1
)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:193
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:206
	at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFile
cordReader.java:76)
	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(Map
sk.java:192)
	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.j
a:176)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
	at org.apache.hadoop.mapred.Child.main(Child.java:158)

java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:197)
	at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer
ava:63)
	at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:1
)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:193
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:206
	at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFile
cordReader.java:76)
	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(Map
sk.java:192)
	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.j
a:176)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
	at org.apache.hadoop.mapred.Child.main(Child.java:158)

Exception in thread "main" java.io.IOException: Job failed!
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
	at org.apache.nutch.segment.SegmentReader.dump(SegmentReader.java:225)
	at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:564)

So, according to what you've said, the segment got corrupted. Is this really
the case? Why was it corrupted? Is there any way I can avoid it?
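
To narrow down which file in the segment is damaged, one option is to copy
the segment out of the DFS and check each file's header locally. The first
trace in this thread fails inside SequenceFile$Reader.init, which suggests
the reader could not even read the start of a file (for example, a
zero-length part file). Below is a minimal Python sketch under these
assumptions: the segment has already been copied to a local directory, every
non-hidden file in it is a SequenceFile, and a valid file starts with the
bytes SEQ followed by a version byte. The function names are hypothetical,
and this check does not detect truncation later in a file, such as the
failure in Reader.next from the readseg dump above.

```python
import os

# Assumption: Hadoop SequenceFiles start with the bytes "SEQ" plus one version byte.
SEQ_MAGIC = b"SEQ"


def check_sequence_file(path):
    """Return a short status string for one local file (header check only)."""
    if os.path.getsize(path) == 0:
        return "empty"
    with open(path, "rb") as f:
        header = f.read(4)
    if len(header) < 4:
        return "truncated header"
    if header[:3] != SEQ_MAGIC:
        return "bad magic"
    return "header ok"


def scan_segment(local_segment_dir):
    """Walk a local copy of a segment and collect files whose header looks wrong.

    Hidden files (such as local checksum files) and names starting with "_"
    are skipped; that filter is an assumption about the local copy.
    """
    problems = {}
    for root, _dirs, files in os.walk(local_segment_dir):
        for name in files:
            if name.startswith(".") or name.startswith("_"):
                continue
            path = os.path.join(root, name)
            status = check_sequence_file(path)
            if status != "header ok":
                problems[path] = status
    return problems


if __name__ == "__main__":
    import sys

    # Usage (hypothetical local path): python check_segment.py ./20090901230006
    for path, status in sorted(scan_segment(sys.argv[1]).items()):
        print(status, path)
```

If this reports an empty or truncated file, the fetch step probably did not
finish writing that part, and the segment is better regenerated than merged
into the crawldb.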

Thanks,
Eran


vishal vachhani wrote:
> 
> I have also seen this exception. However, I have not been able to figure
> out exactly why it happens. When I dump my segment using "readseg", it
> also throws an exception, so I suspect that my segment got corrupted. Please
> try to dump your segments and check whether they get dumped or not.
> 
> Let me know if you are able to solve the problem.
> 
> 
> On Wed, Sep 2, 2009 at 2:23 PM, zzeran <zzeran@gmail.com> wrote:
> 
>>
>> Hi,
>>
>> I'm new to Nutch and so far very impressed!
>>
>> I've been investigating Nutch for the past two weeks and now I've started
>> fetching pages from the internet (I'm doing a specific crawl on a few
>> selected domains).
>>
>> I'm running Nutch using Hadoop on two machines, using the DFS and using
>> Ubuntu 9.04.
>>
>> I've tried running Nutch several times, but every time it seems to crash
>> after 4-5 hours of crawling (even after I've formatted the DFS and
>> restarted the crawl).
>>
>> I've created a "loop" in a shell script that executes all the crawling
>> phases. On the iteration that caused the crash (after 4-5 hours) I'm
>> getting the following:
>>
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: starting
>> Generator: segment: crawl/segments/20090901230006
>> Generator: filtering: true
>> Generator: topN: 10000
>> Generator: Partitioning selected urls by host, for politeness.
>> Generator: done.
>> processing segment /user/nutch/crawl/segments/20090901230006
>> Fetcher: starting
>> Fetcher: segment: /user/nutch/crawl/segments/20090901230006
>> Fetcher: done
>> CrawlDb update: starting
>> CrawlDb update: db: crawl/crawldb
>> CrawlDb update: segments: [/user/nutch/crawl/segments/20090901230006]
>> CrawlDb update: additions allowed: true
>> CrawlDb update: URL normalizing: true
>> CrawlDb update: URL filtering: true
>> CrawlDb update: Merging segment data into db.
>> java.io.EOFException
>>    at java.io.DataInputStream.readFully(DataInputStream.java:197)
>>    at java.io.DataInputStream.readFully(DataInputStream.java:169)
>>    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
>>    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
>>    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
>>    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
>>    at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
>>    at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58)
>>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
>>    at org.apache.hadoop.mapred.Child.main(Child.java:158)
>>
>> java.io.EOFException
>>    at java.io.DataInputStream.readFully(DataInputStream.java:197)
>>    at java.io.DataInputStream.readFully(DataInputStream.java:169)
>>    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
>>    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
>>    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
>>    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
>>    at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
>>    at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58)
>>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
>>    at org.apache.hadoop.mapred.Child.main(Child.java:158)
>>
>> java.io.EOFException
>>    at java.io.DataInputStream.readFully(DataInputStream.java:197)
>>    at java.io.DataInputStream.readFully(DataInputStream.java:169)
>>    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
>>    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
>>    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
>>    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
>>    at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
>>    at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58)
>>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
>>    at org.apache.hadoop.mapred.Child.main(Child.java:158)
>>
>> CrawlDb update: java.io.IOException: Job failed!
>>    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
>>    at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:94)
>>    at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:189)
>>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>    at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:150)
>>
>>
>> Any ideas?
>>
>> Thanks,
>> Eran
>> --
>> View this message in context:
>> http://www.nabble.com/Nutch-Crash-during-db-update-tp25253922p25253922.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
> 
> 
> -- 
> Thanks and Regards,
> Vishal Vachhani
> 
> 

-- 
View this message in context: http://www.nabble.com/Nutch-Crash-during-db-update-tp25253922p25255115.html
Sent from the Nutch - User mailing list archive at Nabble.com.

