hadoop-mapreduce-user mailing list archives

From Vinayak Borkar <vinay...@gmail.com>
Subject Re: HDFS openforwrite CORRUPT -> HEALTHY
Date Tue, 07 Oct 2014 17:41:02 GMT
Trying again since I did not get a reply. Please let me know if I should 
use a different forum to ask this question.

Thanks,
Vinayak



On 10/4/14, 8:45 PM, Vinayak Borkar wrote:
> Hi,
>
>
> I was experimenting with HDFS to probe the limits of its fault
> tolerance. Here is what I observed.
>
> I am using HDFS from Hadoop 2.2. I started the NameNode and then a
> single DataNode, and began writing to a DFS file from a Java client,
> calling hsync() periodically. After some time, I powered off the
> machine running this test (not a clean shutdown; an abrupt power-off).
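>
> Roughly, the writer looked like the following (the file name, record
> contents, and write interval here are illustrative, not the exact
> values I used):
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FSDataOutputStream;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>
>   public class SyncWriter {
>     public static void main(String[] args) throws Exception {
>       FileSystem fs = FileSystem.get(new Configuration());
>       // The file is intentionally kept open for write; the test ends
>       // with a power-off, never a close().
>       FSDataOutputStream out = fs.create(new Path("/test/test.log"));
>       byte[] record = "log record\n".getBytes("UTF-8");
>       while (true) {
>         out.write(record);
>         out.hsync();          // force the data down to the DataNode
>         Thread.sleep(100L);   // illustrative write interval
>       }
>     }
>   }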
>
> Once the system came back up, the HDFS processes were running, and
> HDFS was out of safe mode, I ran fsck on the DFS filesystem with the
> -openforwrite -files -blocks options. The invocation was roughly:
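>
>   hdfs fsck / -openforwrite -files -blocks
>
> and here is the output: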
>
>
> /test/test.log 388970 bytes, 1 block(s), OPENFORWRITE:  MISSING 1 blocks
> of total size 388970 B
> 0.
> BP-1471648347-10.211.55.100-1412458980748:blk_1073743243_2420{blockUCState=UNDER_CONSTRUCTION,
> primaryNodeIndex=-1,
> replicas=[ReplicaUnderConstruction[[DISK]DS-e5bed5ae-1fa9-45ed-8d4c-8006919b4d9c:NORMAL|RWR]]}
> len=388970 MISSING!
>
> Status: CORRUPT
>   Total size:    7214119 B
>   Total dirs:    54
>   Total files:    232
>   Total symlinks:        0
>   Total blocks (validated):    214 (avg. block size 33710 B)
>    ********************************
>    CORRUPT FILES:    1
>    MISSING BLOCKS:    1
>    MISSING SIZE:        388970 B
>    ********************************
>   Minimally replicated blocks:    213 (99.53271 %)
>   Over-replicated blocks:    0 (0.0 %)
>   Under-replicated blocks:    213 (99.53271 %)
>   Mis-replicated blocks:        0 (0.0 %)
>   Default replication factor:    3
>   Average block replication:    0.9953271
>   Corrupt blocks:        0
>   Missing replicas:        426 (66.35514 %)
>   Number of data-nodes:        1
>   Number of racks:        1
> FSCK ended at Sat Oct 04 23:09:40 EDT 2014 in 47 milliseconds
>
>
> I let the system sit for a while (about 15-20 minutes) and reran
> fsck. Surprisingly, the output was very different; the corruption was
> magically gone:
>
> /test/test.log 1859584 bytes, 1 block(s):  Under replicated
> BP-1471648347-10.211.55.100-1412458980748:blk_1073743243_2421. Target
> Replicas is 3 but found 1 replica(s).
> 0. BP-1471648347-10.211.55.100-1412458980748:blk_1073743243_2421
> len=1859584 repl=1
>
> Status: HEALTHY
>   Total size:    8684733 B
>   Total dirs:    54
>   Total files:    232
>   Total symlinks:        0
>   Total blocks (validated):    214 (avg. block size 40582 B)
>   Minimally replicated blocks:    214 (100.0 %)
>   Over-replicated blocks:    0 (0.0 %)
>   Under-replicated blocks:    214 (100.0 %)
>   Mis-replicated blocks:        0 (0.0 %)
>   Default replication factor:    3
>   Average block replication:    1.0
>   Corrupt blocks:        0
>   Missing replicas:        428 (66.666664 %)
>   Number of data-nodes:        1
>   Number of racks:        1
> FSCK ended at Sat Oct 04 23:24:23 EDT 2014 in 63 milliseconds
>
>
> The filesystem under path '/' is HEALTHY
>
>
>
> So my question is this: what just happened? How did the NameNode
> recover that missing block, and why did it take about 15 minutes? Is
> there some kind of lease on the file (because it was still open for
> write) that expired after 15-20 minutes? Can someone with knowledge
> of HDFS internals please shed some light on what could be going on,
> or point me to the sections of the code that would answer my
> questions? Also, is there a way to speed this process up, say by
> triggering the expiration of the lease (assuming it is a lease)?
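>
> For example, I noticed that DistributedFileSystem exposes a
> recoverLease(Path) call; would something along these lines (an
> untested sketch on my part) force the recovery to happen immediately?
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.hdfs.DistributedFileSystem;
>
>   public class ForceRecovery {
>     public static void main(String[] args) throws Exception {
>       FileSystem fs = FileSystem.get(new Configuration());
>       // Untested: ask the NameNode to start lease recovery for the
>       // file now, instead of waiting for the lease to expire.
>       DistributedFileSystem dfs = (DistributedFileSystem) fs;
>       boolean closed = dfs.recoverLease(new Path("/test/test.log"));
>       System.out.println("file closed after recovery: " + closed);
>     }
>   }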
>
> Thanks,
> Vinayak

