hadoop-hdfs-issues mailing list archives

From "Varun Sharma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4721) Speed up lease/block recovery when DN fails and a block goes into recovery
Date Thu, 25 Apr 2013 00:23:13 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641221#comment-13641221 ]

Varun Sharma commented on HDFS-4721:
------------------------------------

Here are the remaining messages - looking at them, there are messages 40 minutes later when I
bring back the dead datanode. I think it reports the block, and a recovery is then performed
since the block is still in the recovery queue (see the sketch after the log excerpt below).

2013-04-24 06:57:14,373 INFO BlockStateChange: BLOCK* processReport: blk_-2482251885029951704_11942
on 10.168.12.138:50010 size 7039284 does not belong to any file
2013-04-24 06:57:14,373 INFO BlockStateChange: BLOCK* InvalidateBlocks: add blk_-2482251885029951704_11942
to 10.168.12.138:50010
2013-04-24 06:57:17,240 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* InvalidateBlocks:
ask 10.168.12.138:50010 to delete [blk_-121400693146753449_11986, blk_7815495529310756756_10715,
blk_4125941153395778345_10713, blk_7979989947202390292_11938, blk_-2482251885029951704_11942,
blk_-2834772731171489244_10711]
2013-04-24 09:14:25,284 INFO BlockStateChange: BLOCK* processReport: blk_-2482251885029951704_11942
on 10.170.6.131:50010 size 7039284 does not belong to any file
2013-04-24 09:14:25,284 INFO BlockStateChange: BLOCK* InvalidateBlocks: add blk_-2482251885029951704_11942
to 10.170.6.131:50010
2013-04-24 09:14:26,916 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* InvalidateBlocks:
ask 10.170.6.131:50010 to delete [blk_-6242914570577158362_12305, blk_7396709163981662539_11419,
blk_-121400693146753449_11986, blk_7815495529310756756_10716, blk_8175754220082115190_12303,
blk_1204694577977643985_12307, blk_4125941153395778345_10718, blk_7979989947202390292_11938,
blk_-2482251885029951704_11942, blk_-3317357101836432862_12390, blk_-5206526708499881023_11940,
blk_-2834772731171489244_10717]
2013-04-24 16:38:26,254 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=BP-889095791-10.171.1.40-1366491606582:blk_-2482251885029951704_11942,
newgenerationstamp=12012, newlength=7044280, newtargets=[10.170.15.97:50010], closeFile=true,
deleteBlock=false)
2013-04-24 16:38:26,255 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block (=BP-889095791-10.171.1.40-1366491606582:blk_-2482251885029951704_11942)
not found
2013-04-24 16:38:26,255 INFO org.apache.hadoop.ipc.Server: IPC Server handler 55 on 8020,
call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization from
10.170.15.97:44875: error: java.io.IOException: Block (=BP-889095791-10.171.1.40-1366491606582:blk_-2482251885029951704_11942)
not found
java.io.IOException: Block (=BP-889095791-10.171.1.40-1366491606582:blk_-2482251885029951704_11942)
not found
2013-04-24 16:38:26,255 INFO BlockStateChange: BLOCK* addBlock: block blk_-2482251885029951704_12012
on 10.170.15.97:50010 size 7044280 does not belong to any file
2013-04-24 16:38:26,255 INFO BlockStateChange: BLOCK* InvalidateBlocks: add blk_-2482251885029951704_12012
to 10.170.15.97:50010
2013-04-24 16:38:28,766 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* InvalidateBlocks:
ask 10.170.15.97:50010 to delete [blk_-121400693146753449_12233, blk_-2482251885029951704_12012,
blk_7979989947202390292_11989]
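
To make the failure path in the log easier to follow, here is a minimal, hypothetical sketch of
what the NameNode appears to be doing. The class, fields, and method bodies below are
illustrative assumptions, not the actual FSNamesystem/BlockManager code: once the file has
already been closed via another replica, the block no longer maps to any file, so the late
commitBlockSynchronization() from the revived datanode fails with "Block not found" and the
reported replica is queued for invalidation.

import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the NameNode-side behavior seen above;
// names and structure are assumptions for illustration only.
class BlockRecoverySketch {

    // blockId -> file path; a missing entry means "does not belong to any file"
    private final Map<Long, String> blockToFile = new HashMap<>();

    // datanode -> blocks scheduled for deletion (the "InvalidateBlocks" queue)
    private final Map<String, Set<Long>> invalidateBlocks = new HashMap<>();

    // Called when the revived datanode finally finishes the stale recovery.
    void commitBlockSynchronization(long blockId, long newGenStamp) throws IOException {
        if (!blockToFile.containsKey(blockId)) {
            // The file was already closed through another replica, so the
            // block is gone from the namespace -- the error seen at 16:38:26.
            throw new IOException("Block (=" + blockId + ") not found");
        }
        // ... otherwise the block would be finalized with newGenStamp ...
    }

    // Called when a datanode block report mentions a block the namespace no
    // longer knows about ("does not belong to any file").
    void processReport(String datanode, long blockId) {
        if (!blockToFile.containsKey(blockId)) {
            invalidateBlocks.computeIfAbsent(datanode, d -> new HashSet<>()).add(blockId);
        }
    }
}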
                
> Speed up lease/block recovery when DN fails and a block goes into recovery
> --------------------------------------------------------------------------
>
>                 Key: HDFS-4721
>                 URL: https://issues.apache.org/jira/browse/HDFS-4721
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 2.0.3-alpha
>            Reporter: Varun Sharma
>             Fix For: 2.0.4-alpha
>
>         Attachments: 4721-hadoop2.patch, 4721-trunk.patch, 4721-trunk-v2.patch, 4721-v2.patch,
4721-v3.patch, 4721-v4.patch, 4721-v5.patch, 4721-v6.patch, 4721-v7.patch, 4721-v8.patch
>
>
> This was observed while doing HBase WAL recovery. HBase uses append to write to its
> write-ahead log. So initially the pipeline is set up as
> DN1 --> DN2 --> DN3
> This WAL needs to be read when DN1 fails, since DN1 is on the same node as the HBase
> regionserver that owns the WAL.
> HBase first recovers the lease on the WAL file. During lease recovery, we choose DN1 as the
> primary DN to perform block recovery, even though DN1 has failed and is not heartbeating any
> more.
> Avoiding the stale DN1 would speed up recovery and reduce HBase MTTR. There are two options
> (a sketch of both follows the quoted description).
> a) Ride on HDFS-3703: if stale node detection is turned on, we do not choose stale datanodes
> (typically ones that have not heartbeated for 20-30 seconds) as primary DN(s).
> b) We sort the replicas by last heartbeat and always pick the one that sent the most recent
> heartbeat.
> Going to the dead datanode increases lease + block recovery time, since the block goes into
> the UNDER_RECOVERY state even though no one is actively recovering it. Please let me know if
> this makes sense and, if so, whether we should move forward with a) or b).
> Thanks
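
For reference, here is a minimal sketch of how proposals a) and b) above could pick the primary
datanode for block recovery. The Replica type and the chooser class are assumptions made purely
for illustration; they are not the actual BlockManager/DatanodeManager code from the attached
patches.

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Illustrative only: hypothetical types, not HDFS internals.
class PrimaryDatanodeChooser {

    static class Replica {
        final String datanode;
        final long lastHeartbeatMillis;

        Replica(String datanode, long lastHeartbeatMillis) {
            this.datanode = datanode;
            this.lastHeartbeatMillis = lastHeartbeatMillis;
        }
    }

    // Option a): rely on stale-node detection (HDFS-3703 style) and skip any
    // replica whose datanode has not heartbeated within staleIntervalMillis.
    static Optional<Replica> chooseNonStale(List<Replica> replicas,
                                            long nowMillis,
                                            long staleIntervalMillis) {
        return replicas.stream()
                .filter(r -> nowMillis - r.lastHeartbeatMillis < staleIntervalMillis)
                .findFirst();
    }

    // Option b): sort by last heartbeat and always pick the freshest replica.
    static Optional<Replica> chooseMostRecent(List<Replica> replicas) {
        return replicas.stream()
                .max(Comparator.comparingLong(r -> r.lastHeartbeatMillis));
    }
}

Either way, a DN1 that has stopped heartbeating would no longer be handed the recovery, which is
what shortens lease/block recovery and the HBase MTTR.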

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
