hadoop-hdfs-issues mailing list archives

From "Vinayakumar B (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-11674) reserveSpaceForReplicas is not released if append request failed due to mirror down and replica recovered
Date Thu, 11 May 2017 05:01:04 GMT

    [ https://issues.apache.org/jira/browse/HDFS-11674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005886#comment-16005886 ]

Vinayakumar B commented on HDFS-11674:
--------------------------------------

bq. Could you please clarify how this part works? getBlockLocations sorts the blocks by network
distance from the caller, randomizing replicas at the same distance. So lastBlock.getLocations()\[2\]
may be the first replica in the pipeline some times.

In the below part of the code, block locations are queried first and then explicitly set as the pipeline for test purposes. Also note that there is no 'sorting on distance' done for append calls; it is currently done only for the 'getBlockLocations()' call. Maybe that could be done in a follow-up Jira.
{code:java}
/*
 * Reset the pipeline for the append in such a way that, datanode which is
 * down is one of the mirror, not the first datanode.
 */
HdfsBlockLocation blockLocation = (HdfsBlockLocation) fs.getClient()
    .getBlockLocations(file.toString(), 0, BLOCK_SIZE)[0];
LocatedBlock lastBlock = blockLocation.getLocatedBlock();
// ...
DFSTestUtil.setPipeline((DFSOutputStream) os.getWrappedStream(),
  lastBlock);{code}

bq. I ran this test 5 times and it timed out once waiting for the file to be closed. I didn't
debug it further though.
I will also check again; I am not sure what's wrong. But I am sure that it's not because of the current change or test. Could you paste the console logs if possible?
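The leak described in the quoted issue below (reserved space incremented when the append pipeline is set up, but never decremented after the failed append and replica recovery) can be sketched with a minimal model. This is a hypothetical, simplified class for illustration only; the method names mirror the DataNode-side accounting, but this is not the actual FsVolumeImpl code.

{code:java}
/**
 * Minimal model (hypothetical, not the real FsVolumeImpl code) of the
 * reserved-space accounting leak in HDFS-11674: space is reserved when a
 * replica is created for append, but the failed-append recovery path never
 * releases it.
 */
public class ReservedSpaceModel {
  private long reservedForReplicas = 0;

  // Called when a DN accepts a replica for append: reserve the remaining bytes.
  void reserveSpaceForReplicas(long bytes) {
    reservedForReplicas += bytes;
  }

  // Buggy recovery path: the replica is recovered but reserved space is kept.
  void recoverReplicaBuggy() {
    // bug: no call to releaseReservedSpace(...)
  }

  // Fixed recovery path: release the reservation when the replica is recovered.
  void recoverReplicaFixed(long bytes) {
    releaseReservedSpace(bytes);
  }

  void releaseReservedSpace(long bytes) {
    reservedForReplicas -= bytes;
  }

  long getReserved() {
    return reservedForReplicas;
  }

  public static void main(String[] args) {
    ReservedSpaceModel dn = new ReservedSpaceModel();
    dn.reserveSpaceForReplicas(1024);
    dn.recoverReplicaBuggy();
    // reservation is never released: repeated failed appends accumulate
    // until all of the DN's space sits in reservedForReplicas.
    System.out.println("after buggy recovery: " + dn.getReserved());

    dn.recoverReplicaFixed(1024);
    System.out.println("after fixed recovery: " + dn.getReserved());
  }
}
{code}

The fix is essentially to make the recovery path behave like {{recoverReplicaFixed}}: release the reserved bytes when the replica is recovered, so the counter returns to zero instead of growing with every failed append.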

> reserveSpaceForReplicas is not released if append request failed due to mirror down and replica recovered
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-11674
>                 URL: https://issues.apache.org/jira/browse/HDFS-11674
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>            Reporter: Vinayakumar B
>            Assignee: Vinayakumar B
>            Priority: Critical
>              Labels: release-blocker
>         Attachments: HDFS-11674-01.patch, HDFS-11674-02.patch
>
>
> Scenario:
> 1. 3-node cluster with "dfs.client.block.write.replace-datanode-on-failure.policy" as DEFAULT.
> Block is written with x data.
> 2. One of the Datanodes, NOT the first DN, is down.
> 3. Client tries to append data to the block and fails since one DN is down.
> 4. Client calls recoverLease() on the file.
> 5. Successful recovery happens.
> Issue:
> 1. DNs which the client was connected to before encountering the mirror down will have reservedSpaceForReplicas incremented, BUT never decremented.
> 2. So in the long run all of the DN's space will end up in reservedSpaceForReplicas, resulting in OutOfSpace errors.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

