hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Todd Lipcon <t...@cloudera.com>
Subject Re: Blocks are getting corrupted under very high load
Date Wed, 23 Nov 2011 23:37:48 GMT
On Wed, Nov 23, 2011 at 1:23 AM, Uma Maheswara Rao G
<maheswara@huawei.com> wrote:
> Yes, Todd,  block after restart is small and  genstamp also lesser.
>   Here complete machine reboot happend. The boards are configured like, if it is not
getting any CPU cycles  for 480secs, it will reboot himself.
>  kernal.hung_task_timeout_secs = 480 sec.

So sounds like the following happened:
- while writing file, the pipeline got reduced down to 1 node due to
timeouts from the other two
- soon thereafter (before more replicas were made), that last replica
kernel-paniced without syncing the data
- on reboot, the filesystem lost some edits from its ext3 journal, and
the block got moved back into the RBW directly, with truncated data
- hdfs did "the right thing" - at least what the algorithms say it
should do, because it had gotten a commitment for a later replica

If you have a build which includes HDFS-1539, you could consider
setting dfs.datanode.synconclose to true, which would have prevented
this problem.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera

Mime
View raw message