hadoop-common-dev mailing list archives

From: Todd Lipcon <t...@cloudera.com>
Subject: Re: Blocks are getting corrupted under very high load
Date: Wed, 23 Nov 2011 00:57:37 GMT
Can you look on the DN in question and see whether it was successfully
finalized when the write finished? It doesn't sound like a successful
write -- a successful write should have moved the block out of the bbw
directory into current/.
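
For reference, a minimal sketch of one way to run that check on the DataNode
host (the data directory path and block ID are placeholders, not values from
this thread; on 0.20, finalized blocks may also sit in subdir*/ directories
under current/):

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only: scan a DataNode data directory for files belonging to
    // one block ID and report whether they live under current/ (finalized) or
    // blocksBeingWritten/ (still open).
    public class FindBlockLocation {

        // Recursively collect files whose names start with the block ID;
        // current/ may contain subdir*/ directories holding finalized blocks.
        static void collect(File dir, String blockId, List<File> out) {
            File[] entries = dir.listFiles();
            if (entries == null) {
                return;
            }
            for (File e : entries) {
                if (e.isDirectory()) {
                    collect(e, blockId, out);
                } else if (e.getName().startsWith(blockId)) {
                    out.add(e);
                }
            }
        }

        public static void main(String[] args) {
            String dataDir = args.length > 0 ? args[0] : "/data1/dfs/data"; // placeholder data dir
            String blockId = args.length > 1 ? args[1] : "blk_1234567890";  // placeholder block ID
            for (String sub : new String[] {"current", "blocksBeingWritten"}) {
                List<File> found = new ArrayList<File>();
                collect(new File(dataDir, sub), blockId, found);
                if (found.isEmpty()) {
                    System.out.println(sub + ": no files for " + blockId);
                } else {
                    for (File f : found) {
                        System.out.println(sub + ": " + f.getName() + " (" + f.length() + " bytes)");
                    }
                }
            }
        }
    }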

-Todd

On Tue, Nov 22, 2011 at 3:16 AM, Uma Maheswara Rao G
<maheswara@huawei.com> wrote:
> Hi All,
>
>
>
> I have backported HDFS-1779 to our Hadoop version, which is based on the 0.20-append branch.
>
> We are running a load test, as usual. (We want to ensure the reliability of the system under heavy loads.)
> My cluster has 8 DataNodes and a NameNode.
> Each machine has 16 CPUs and 12 hard disks, each with 2TB capacity.
> Clients run on the same machines as the DataNodes.
> Clients upload tar files containing 3-4 blocks each, from 50 threads.
> The block size is 256MB and the replication factor is 3 (a single upload is sketched below).
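
A rough sketch of what each upload amounts to with those settings (the NameNode
URI and file paths are placeholders, not taken from the test setup):

    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    // Sketch of a single client upload as described above: one tar file written
    // with a 256MB block size and replication factor 3. Paths are placeholders.
    public class UploadTar {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://namenode:9000"); // placeholder NameNode URI
            FileSystem fs = FileSystem.get(conf);

            InputStream in = new FileInputStream("/tmp/data.tar");         // placeholder local tar
            FSDataOutputStream out = fs.create(new Path("/load/data.tar"), // placeholder HDFS path
                    true,                // overwrite
                    64 * 1024,           // buffer size
                    (short) 3,           // replication factor
                    256L * 1024 * 1024); // 256MB block size
            try {
                IOUtils.copyBytes(in, out, conf); // copies and closes both streams
            } finally {
                IOUtils.closeStream(in);
                IOUtils.closeStream(out);
            }
        }
    }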
>
> Everything looks fine under normal load.
> When the load is increased, a lot of errors occur.
> Many pipeline failures also happen.
> All of this is fine, except for the strange case of a few blocks.
>
> Some blocks (around 30) are missing, as the FSCK report shows.
> When I try to read those files, the read fails saying there are no DataNodes for the block.
> Analysing the logs, we found that for these blocks a pipeline recovery happened and the write succeeded to a single DataNode.
> That DataNode also reported the block to the NameNode in a blockReceived command.
> After some time (say, 30 minutes), the DataNode was restarted.
> The BBW (BlocksBeingWritten) report sent by the DataNode immediately after restart also includes these finalized blocks, which means the block files are sitting in the blocksBeingWritten folder.
> In many cases, the generation timestamp reported in the BBW report is the old timestamp (as sketched below, the stamp a DataNode has on disk can be read off the block's meta file name).
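
As an aside, the generation stamp a DataNode has on disk is encoded in the meta
file name (blk_<blockId>_<genStamp>.meta), so it can be cross-checked with
something like this sketch (the directory path is a placeholder):

    import java.io.File;

    // Illustrative only: list the meta files in a directory and print the
    // generation stamp encoded in each name (blk_<blockId>_<genStamp>.meta).
    public class PrintGenStamps {
        public static void main(String[] args) {
            String dir = args.length > 0 ? args[0] : "/data1/dfs/data/blocksBeingWritten"; // placeholder
            File[] files = new File(dir).listFiles();
            if (files == null) {
                return;
            }
            for (File f : files) {
                String name = f.getName();
                if (name.startsWith("blk_") && name.endsWith(".meta")) {
                    String core = name.substring(0, name.length() - ".meta".length());
                    int sep = core.lastIndexOf('_');
                    System.out.println("block: " + core.substring(0, sep)
                            + "  generation stamp: " + core.substring(sep + 1));
                }
            }
        }
    }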
>
> The NameNode rejects that block from the BBW report, saying the file is already closed.
> The NameNode then asks the DataNode to invalidate the blocks, and the DataNode does so.
> When deleting the blocks, the DataNode also prints the path from the BlocksBeingWritten directory (and the previous generation timestamp).
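
A simplified illustration of the handling described above (not the actual
NameNode code; the names and values are made up): a replica reported as
being-written for a file that is already closed, or carrying an older
generation stamp, gets rejected and scheduled for invalidation.

    // Simplified illustration only, not FSNamesystem code.
    public class BbwReportHandling {

        // Decide whether a replica from the blocksBeingWritten report should be
        // invalidated on the reporting DataNode.
        static boolean shouldInvalidate(boolean fileAlreadyClosed,
                                        long reportedGenStamp,
                                        long finalizedGenStamp) {
            // A being-written replica for a closed file is stale, as is a replica
            // with an older generation stamp than the finalized block.
            return fileAlreadyClosed || reportedGenStamp < finalizedGenStamp;
        }

        public static void main(String[] args) {
            // Matches the case in this thread: file closed, old stamp on disk.
            System.out.println(shouldInvalidate(true, 1001L, 1005L)); // prints: true
        }
    }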
>
> This looks very strange to me.
> Does this mean that the finalized block file and meta file (which are written to the current folder) are getting lost after the DataNode restart?
> That would explain why the NameNode does not receive these blocks' information in the block report sent by the DataNodes.
>
> Regards,
>
> Uma
>



-- 
Todd Lipcon
Software Engineer, Cloudera
