hadoop-common-user mailing list archives

From: mark angelillo <li...@snooth.com>
Subject: Jobtracker showing jobs which seem dead
Date: Tue, 05 May 2009 19:41:02 GMT
Hi all,

I've got an installation of Hadoop up and running with a Nutch crawler,
and recently the jobs have all been halting in the middle of the
reduce phase. This is on Hadoop 0.19.1.

Here's what I'm seeing in the datanode logs (there were a few of these
errors, but the last one was almost a day ago):

2009-05-04 17:02:24,889 ERROR datanode.DataNode - DatanodeRegistration(10.9.17.206:50010, storageID=DS-1024739802-10.9.17.206-50010-1238445482034, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.9.17.206:50010 remote=/10.9.17.206:50537]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
        at java.lang.Thread.run(Thread.java:619)

I searched for the error message and it turned up a few potential bugs
involving HBase, but I don't think that's in play here, since I can't
find any mention of HBase in the configuration files for our setup. If
it turns out I do need to change HBase-related settings, would that
involve creating an hbase-site.xml config file in the Hadoop conf
directory, or do those properties go directly in hadoop-site.xml?
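
For what it's worth, the 480000 millis in the exception matches the
default DataNode write timeout, so I'm wondering whether raising or
disabling that would help. Here's a sketch of what I have in mind for
hadoop-site.xml, assuming dfs.datanode.socket.write.timeout is the
relevant property and that a value of 0 disables the timeout (that's
my understanding, not something I've verified):

  <property>
    <!-- DataNode socket write timeout in milliseconds.
         480000 (8 minutes) is the default; 0 should disable it. -->
    <name>dfs.datanode.socket.write.timeout</name>
    <value>0</value>
  </property>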

Otherwise, I can't seem to track down what might be causing this. All
of the status information about the job that I can find reports that
it's fine and normal, but it hasn't progressed in almost a day now.
(It should be a 3-5 hour job when all goes well, and it used to be.)
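
In case it helps, this is roughly how I've been checking on the job
from the command line (the job ID below is just a placeholder, not
the actual ID):

  # list jobs currently known to the JobTracker
  bin/hadoop job -list
  # show map/reduce completion and counters for one job
  bin/hadoop job -status job_200905041700_0001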

Ideas? Can I provide more info?

Thanks,
Mark
