hadoop-hdfs-user mailing list archives

From Matthew Foley <ma...@yahoo-inc.com>
Subject Re: What's datanode doing when logging 'Verification succeeded for blk_.' ?
Date Tue, 31 May 2011 19:44:03 GMT
The "Verification succeeded" messages are from a Datanode background housekeeping task, DataBlockScanner,
which attempts to discover any replicas that have become corrupt.  If it finds one (which
should be rare), it tells the Namenode the replica has become corrupted, and the NN will re-replicate
it from a good copy on another DN.
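
To make the idea concrete, here is a minimal sketch of the kind of check such a scanner performs: recompute a checksum for each fixed-size chunk of a replica and compare it with the value recorded when the block was written. This is not the actual DataBlockScanner code. The class name, method names, and the 512-byte chunk size are assumptions chosen for illustration.

```java
import java.util.zip.CRC32;

// Illustrative only: a hypothetical stand-in for a replica verification pass.
class ReplicaScanSketch {
    // Assumed chunk size for this sketch; a real cluster's value comes from its configuration.
    static final int BYTES_PER_CHUNK = 512;

    // Compute one CRC32 value per chunk of the replica data.
    static long[] checksums(byte[] data) {
        int chunks = (data.length + BYTES_PER_CHUNK - 1) / BYTES_PER_CHUNK;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            int off = i * BYTES_PER_CHUNK;
            int len = Math.min(BYTES_PER_CHUNK, data.length - off);
            CRC32 crc = new CRC32();
            crc.update(data, off, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // Return the index of the first chunk whose checksum does not match the
    // stored value, or -1 if every chunk verifies ("Verification succeeded").
    static int firstCorruptChunk(byte[] data, long[] stored) {
        long[] actual = checksums(data);
        if (actual.length != stored.length) {
            // A length mismatch is treated as corruption at the first uncovered chunk.
            return Math.min(actual.length, stored.length);
        }
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] != stored[i]) {
                return i;
            }
        }
        return -1;
    }
}
```

In this sketch a result of -1 corresponds to the "Verification succeeded" case; any other result is the point at which a real scanner would report the replica to the Namenode as corrupt.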

DataBlockScanner may consume up to 100% of one CPU core on the DN, but no more.  It is very
unlikely to have caused the DN to become unable to do its high-priority work, like sending
heartbeats and responding to Clients.  Unless you're running DN on single-core boxes, look
to network problems or Namenode overload as more likely explanations for the problem.
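
For a sense of scale, the sketch below works out how long a DN must stay silent before the NN removes it. It assumes the 0.20-era rule of twice the recheck interval plus ten heartbeat intervals, with assumed defaults of a 3 second heartbeat and a 5 minute recheck interval. Check the values and the formula against your own version and configuration; the class and method names here are hypothetical.

```java
// Illustrative only: hypothetical helper, not part of Hadoop.
class HeartbeatExpirySketch {
    // Assumed rule: expire = 2 * recheck interval + 10 * heartbeat interval.
    static long expireIntervalMs(long heartbeatIntervalSec, long recheckIntervalMs) {
        return 2 * recheckIntervalMs + 10 * heartbeatIntervalSec * 1000;
    }
}
```

Under those assumptions, `HeartbeatExpirySketch.expireIntervalMs(3, 5 * 60 * 1000)` gives 630000 ms, about ten and a half minutes. A DN would have to miss heartbeats for that whole window, which is why a single busy core is an unlikely cause.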

One other possibility: were the "lost heartbeat" logs from the startup of a large cluster?
In v20, before the set of startup performance improvements that a few of us made over the
first few months of this year, it was not uncommon for the NN to get swamped during startup
of a large cluster, and to start losing heartbeats and removing healthy nodes.  This was directly
addressed in trunk and 20-security by HDFS-1541 (patch by Hairong Kuang).


On May 31, 2011, at 4:10 AM, Joey Echeverria wrote:

How much memory do you have on your DataNode? Is it possible that
you're swapping?


On Mon, May 30, 2011 at 11:09 PM, ccxixicc <ccxixicc@foxmail.com> wrote:
> Hi, all
> I found that the NameNode often loses heartbeats from DataNodes:
> org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost
> heartbeat from
> org.apache.hadoop.net.NetworkTopology: Removing a node:
> /default-rack/
> meanwhile NN logs:
> org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock:
> blockMap updated: is added to blk_16634224072...
> And DN logs:
> org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
> succeeded for blk_1820616086..
> There are no DFSClients running and I am not doing anything. What are the NN and DN doing?
> CPU is at almost 100%. Is this why the NN lost heartbeats from the DN?
> Thanks.

Joseph Echeverria
Cloudera, Inc.
