hadoop-common-dev mailing list archives

From Uma Maheswara Rao G <mahesw...@huawei.com>
Subject Blocks are getting corrupted under very high load
Date Tue, 22 Nov 2011 11:16:38 GMT
Hi All,

I have backported HDFS-1779 to our Hadoop version, which is based on the 0.20-append branch.

We are running a load test, as usual. (We want to ensure the reliability of the system under
heavy loads.)
My cluster has 8 DataNodes and one NameNode.
Each machine has 16 CPUs and 12 hard disks of 2 TB each.
The clients run on the same machines as the DataNodes.
The clients upload tar files of 3-4 blocks each, from 50 threads.
The block size is 256 MB and the replication factor is 3.
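
For a rough sense of scale, the figures above can be turned into numbers. This is only a back-of-the-envelope sketch: it assumes the 50 threads are the total across all clients, that each thread has one file in flight at a time, and that 1 GB = 1024 MB. None of these is stated in the setup.

```python
# Back-of-the-envelope sizing from the figures quoted in this message.
DATANODES = 8
DISKS_PER_NODE = 12
DISK_TB = 2
BLOCK_MB = 256
REPLICATION = 3
THREADS = 50              # assumption: total across all clients
BLOCKS_PER_FILE = (3, 4)  # "3-4 blocks" per tar file

raw_capacity_tb = DATANODES * DISKS_PER_NODE * DISK_TB


def file_footprint_mb(blocks):
    """Data in MB written cluster-wide for one file, counting all replicas."""
    return blocks * BLOCK_MB * REPLICATION


low, high = (file_footprint_mb(b) for b in BLOCKS_PER_FILE)

# Assumption: one file in flight per thread.
in_flight_gb = (THREADS * low / 1024, THREADS * high / 1024)

print(f"raw capacity: {raw_capacity_tb} TB")
print(f"per file incl. replicas: {low}-{high} MB")
print(f"in flight: {in_flight_gb[0]:.1f}-{in_flight_gb[1]:.1f} GB")
```

Under these assumptions the cluster has 192 TB of raw disk, and each tar file puts roughly 2.25-3 GB on disk once all three replicas are written.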

Everything looks fine under normal load.
When the load is increased, a lot of errors occur, including many pipeline failures.
All of that is expected, except for a strange case affecting a few blocks.

Some blocks (around 30) are missing, according to the fsck report.
When I try to read those files, the read fails, saying that there are no Datanodes for the block.
Analysing the logs, we found that for these blocks pipeline recovery happened and the write
then succeeded to a single Datanode.
The Datanode also reported the block to the Namenode in a blockReceived command.
After some time (say, 30 minutes), the Datanode is restarted.
The BBW (BlocksBeingWritten) report sent by the DN immediately after the restart also includes
these finalized blocks, which shows that they are in the blocksBeingWritten folder.
In many of these cases, the generation timestamp reported in the BBW report is the old one.

The Namenode rejects these blocks from the BBW report, saying that the file is already closed.
The Namenode then asks the Datanode to invalidate the blocks, and the Datanode does so.
When deleting the blocks, the Datanode also prints the path from the blocksBeingWritten
directory (again with the previous generation timestamp).
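
The sequence above can be replayed in a toy model to show why the block ends up missing. This is not HDFS code: the class, its methods, and the block id and generation stamps are hypothetical and only mirror the behaviour seen in the logs (blockReceived, a BBW report carrying the old stamp, then a block report without the block).

```python
# Toy model of the observed sequence. NOT HDFS code: every name here is
# hypothetical and only mirrors the behaviour described in the logs.
from dataclasses import dataclass, field


@dataclass
class ToyNameNode:
    # block id -> generation stamp, for blocks belonging to closed files
    closed_blocks: dict = field(default_factory=dict)
    # block id -> set of DataNodes known to hold a valid replica
    locations: dict = field(default_factory=dict)

    def block_received(self, dn, block_id, gen_stamp):
        self.closed_blocks[block_id] = gen_stamp
        self.locations.setdefault(block_id, set()).add(dn)

    def bbw_report(self, dn, replicas):
        """Return the being-written replicas the DataNode must invalidate."""
        # File already closed: the being-written replica is rejected.
        return [(b, gs) for b, gs in replicas if b in self.closed_blocks]

    def block_report(self, dn, replicas):
        """Replace what is known about dn with what it reports as finalized."""
        reported = {b for b, gs in replicas if self.closed_blocks.get(b) == gs}
        for block_id in self.closed_blocks:
            holders = self.locations.setdefault(block_id, set())
            if block_id in reported:
                holders.add(dn)
            else:
                holders.discard(dn)

    def missing_blocks(self):
        return sorted(b for b in self.closed_blocks if not self.locations.get(b))


nn = ToyNameNode()
nn.block_received("dn1", block_id=1001, gen_stamp=2)  # after pipeline recovery
rejected = nn.bbw_report("dn1", [(1001, 1)])          # after restart: old stamp, under BBW
nn.block_report("dn1", [])                            # nothing found under current
print(rejected, nn.missing_blocks())
```

In this model the only replica is reported from the wrong directory with the wrong stamp, gets invalidated, and never appears in a block report, so the block has no remaining location.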

This looks very strange to me.
Does this mean that the finalized block file and meta file (which are written in the current
folder) are getting lost after the DN restart?
If so, the Namenode will not receive these blocks' information in the block report sent
from the Datanodes.
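
One way to check this on an affected Datanode is to look at where the block and meta files actually sit on disk. The sketch below is a hypothetical helper, not part of Hadoop. It takes the `current` and `blocksBeingWritten` directory names from this message and assumes the usual `blk_<id>` and `blk_<id>_<genstamp>.meta` file names; the data directory passed in is whatever the Datanode is configured to use.

```python
import os
import re
import tempfile

# Assumed file-name patterns: blk_<id> and blk_<id>_<genstamp>.meta
BLOCK_FILE = re.compile(r"^blk_(-?\d+)$")
META_FILE = re.compile(r"^blk_(-?\d+)_(\d+)\.meta$")


def locate_replicas(data_dir, subdirs=("current", "blocksBeingWritten")):
    """Map block id -> list of (subdir, has_data_file, generation_stamps)."""
    found = {}
    for sub in subdirs:
        per_block = {}
        for _root, _dirs, files in os.walk(os.path.join(data_dir, sub)):
            for name in files:
                m = BLOCK_FILE.match(name)
                if m:
                    per_block.setdefault(int(m.group(1)), [False, []])[0] = True
                    continue
                m = META_FILE.match(name)
                if m:
                    entry = per_block.setdefault(int(m.group(1)), [False, []])
                    entry[1].append(int(m.group(2)))
        for block_id, (has_data, stamps) in per_block.items():
            found.setdefault(block_id, []).append((sub, has_data, sorted(stamps)))
    return found


# Demo on a throwaway directory that imitates the situation described above:
# the replica is present only under blocksBeingWritten, with the old stamp.
with tempfile.TemporaryDirectory() as d:
    os.makedirs(os.path.join(d, "current"))
    bbw = os.path.join(d, "blocksBeingWritten")
    os.makedirs(bbw)
    for name in ("blk_1001", "blk_1001_1.meta"):
        open(os.path.join(bbw, name), "w").close()
    demo = locate_replicas(d)

print(demo)
```

Running `locate_replicas` on a real data directory before and after a restart would show whether a finalized block ever reached `current`, or whether only the `blocksBeingWritten` copy with the old generation stamp exists.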


