hadoop-hdfs-issues mailing list archives

From "Hanisha Koneru (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-11576) Block recovery will fail indefinitely if recovery time > heartbeat interval
Date Tue, 04 Apr 2017 23:41:41 GMT

    [ https://issues.apache.org/jira/browse/HDFS-11576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15956061#comment-15956061 ]

Hanisha Koneru commented on HDFS-11576:
---------------------------------------

Thanks for this fix [~lukmajercak].
One comment/thought:
In UnderRecoveryBlock#addRecoveryAttempt, before a new BlockRecoveryAttempt is added, any
existing entry for it is removed from recoveryTimeouts.
bq. recoveryTimeouts.remove(newTimeout);
The remove operation iterates over the entries in the corresponding bucket of the hash table.
This can be avoided if, instead of adding a new element, the method sets a new timeout on
the previous element when one exists.
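
To make the suggestion concrete, here is a minimal sketch of the two variants. It is not the code from the attached patches: the class RecoveryTimeoutTracker, the Attempt entry and the use of a plain java.util.HashMap keyed by block ID are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch only; not the UnderRecoveryBlock code from the patch. */
class RecoveryTimeoutTracker {
    /** Entry with a mutable deadline so it can be refreshed in place. */
    static final class Attempt {
        final long blockId;
        long timeoutAtMillis;

        Attempt(long blockId, long timeoutAtMillis) {
            this.blockId = blockId;
            this.timeoutAtMillis = timeoutAtMillis;
        }
    }

    // Assumed layout: one entry per block ID.
    private final Map<Long, Attempt> attempts = new HashMap<>();

    /** Variant A: remove the old entry, then insert a freshly allocated one. */
    void addByReplacing(long blockId, long timeoutAtMillis) {
        attempts.remove(blockId);
        attempts.put(blockId, new Attempt(blockId, timeoutAtMillis));
    }

    /** Variant B: reuse the previous entry if it exists and only move its deadline. */
    void addByUpdating(long blockId, long timeoutAtMillis) {
        Attempt existing = attempts.get(blockId);
        if (existing != null) {
            existing.timeoutAtMillis = timeoutAtMillis;
        } else {
            attempts.put(blockId, new Attempt(blockId, timeoutAtMillis));
        }
    }

    /** A block with no recorded attempt is treated as timed out. */
    boolean isTimedOut(long blockId, long nowMillis) {
        Attempt a = attempts.get(blockId);
        return a == null || nowMillis >= a.timeoutAtMillis;
    }

    int size() {
        return attempts.size();
    }
}
```

In this sketch both variants leave exactly one entry per block. Variant B still has to look the entry up, so what it saves is the unlink, the re-insert and the allocation, not the lookup itself.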



> Block recovery will fail indefinitely if recovery time > heartbeat interval
> ---------------------------------------------------------------------------
>
>                 Key: HDFS-11576
>                 URL: https://issues.apache.org/jira/browse/HDFS-11576
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs, namenode
>    Affects Versions: 2.7.1, 2.7.2, 2.7.3, 3.0.0-alpha1, 3.0.0-alpha2
>            Reporter: Lukas Majercak
>            Assignee: Lukas Majercak
>            Priority: Critical
>         Attachments: HDFS-11576.001.patch, HDFS-11576.002.patch, HDFS-11576.repro.patch
>
>
> Block recovery will fail indefinitely if the time to recover a block is always longer than the heartbeat interval. Scenario:
> 1. DN sends heartbeat 
> 2. NN sends a recovery command to DN, recoveryID=X
> 3. DN starts recovery
> 4. DN sends another heartbeat
> 5. NN sends a recovery command to DN, recoveryID=X+1
> 6. DN calls commitBlockSynchronization on the NN after succeeding with the first recovery, which fails because X < X+1
> ... 
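
The scenario quoted above can be reproduced with a toy model. This is only a sketch under stated assumptions: ToyNameNode and recoveryEverSucceeds are hypothetical names, not HDFS classes; the NN is reduced to a single counter that issues a new recovery ID on every heartbeat; the DN is assumed to start a fresh recovery for each command; and a commit is accepted only if its ID is still the latest one issued.

```java
/** Toy model of the reported scenario; all names are hypothetical, not HDFS code. */
class RecoveryRaceSketch {
    /** Stand-in for the NN: remembers only the latest recovery ID it issued. */
    static final class ToyNameNode {
        private long latestRecoveryId = 0;

        /** Called on every DN heartbeat while the block is under recovery. */
        long issueRecoveryCommand() {
            return ++latestRecoveryId;
        }

        /** Simplified check: a commit carrying a superseded ID is rejected. */
        boolean commitBlockSynchronization(long recoveryId) {
            return recoveryId == latestRecoveryId;
        }
    }

    /**
     * Returns true if any recovery commits successfully within the given
     * number of heartbeats. Recoveries still in flight at the end are ignored.
     */
    static boolean recoveryEverSucceeds(long heartbeatMs, long recoveryMs, int heartbeats) {
        ToyNameNode nn = new ToyNameNode();
        long[] issuedIds = new long[heartbeats];
        int nextToFinish = 0;
        for (int hb = 0; hb < heartbeats; hb++) {
            long now = hb * heartbeatMs;
            // Commit every earlier recovery that has finished by now, before
            // this heartbeat triggers the next command (ties favor the commit).
            while (nextToFinish < hb && nextToFinish * heartbeatMs + recoveryMs <= now) {
                if (nn.commitBlockSynchronization(issuedIds[nextToFinish])) {
                    return true;
                }
                nextToFinish++;
            }
            issuedIds[hb] = nn.issueRecoveryCommand();
        }
        return false;
    }
}
```

With a 3000 ms heartbeat and a 5000 ms recovery, every commit in this model arrives after a newer ID has been issued, so the method returns false no matter how many heartbeats are simulated. With a 1000 ms recovery the first commit is accepted.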



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


