Date: Mon, 6 Jan 2014 08:53:59 +0000 (UTC)
From: "Binglin Chang (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Updated] (HDFS-4273) Problem in DFSInputStream read retry logic may cause early failure

    [ https://issues.apache.org/jira/browse/HDFS-4273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Binglin Chang updated HDFS-4273:
--------------------------------
    Attachment: HDFS-4273.v7.patch

Updated patch; changes:
1. Rebased to current trunk.
2. A local DN entry in deadNodes can now expire; once the local DN expires, it is removed from deadNodes.
3. Set the static constant LOCAL_DEADNODE_EXPIRE_MILLISECONDS to 10 minutes, so a local DN expires after 10 minutes, after which read operations will try to use this local DN if possible.
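A minimal sketch of the expiry idea in change 2 and 3 above (illustrative only, not the actual patch code; the class and method names here are invented for the example, only LOCAL_DEADNODE_EXPIRE_MILLISECONDS comes from the patch description):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: deadNodes entries remember when a node was marked dead, and a
// *local* datanode's entry expires after a fixed window, so later reads
// may retry the local node instead of avoiding it forever.
public class DeadNodeTracker {
    // Mirrors the constant described in the patch: 10 minutes.
    static final long LOCAL_DEADNODE_EXPIRE_MILLISECONDS = 10 * 60 * 1000L;

    // datanode id -> time (ms) at which it was added to deadNodes
    private final Map<String, Long> deadNodes = new HashMap<>();

    void addToDeadNodes(String datanodeId, long nowMillis) {
        deadNodes.put(datanodeId, nowMillis);
    }

    // A dead local node is skipped only while its entry is still fresh;
    // once the window has passed the entry is removed, allowing a retry.
    boolean isDead(String datanodeId, boolean isLocal, long nowMillis) {
        Long addedAt = deadNodes.get(datanodeId);
        if (addedAt == null) {
            return false;
        }
        if (isLocal && nowMillis - addedAt >= LOCAL_DEADNODE_EXPIRE_MILLISECONDS) {
            deadNodes.remove(datanodeId); // expired: eligible for retry
            return false;
        }
        return true;
    }
}
```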
Assuming failure is fast when connecting to a dead local DN, the performance impact of the extra retry should be small. We can make LOCAL_DEADNODE_EXPIRE_MILLISECONDS configurable by adding it to dfsclient.conf, if someone thinks it necessary.

> Problem in DFSInputStream read retry logic may cause early failure
> ------------------------------------------------------------------
>
>                 Key: HDFS-4273
>                 URL: https://issues.apache.org/jira/browse/HDFS-4273
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.0.2-alpha
>            Reporter: Binglin Chang
>            Assignee: Binglin Chang
>            Priority: Minor
>         Attachments: HDFS-4273-v2.patch, HDFS-4273.patch, HDFS-4273.v3.patch, HDFS-4273.v4.patch, HDFS-4273.v5.patch, HDFS-4273.v6.patch, HDFS-4273.v7.patch, TestDFSInputStream.java
>
>
> Assume the following call logic:
> {noformat}
> readWithStrategy()
>   -> blockSeekTo()
>   -> readBuffer()
>      -> reader.doRead()
>      -> seekToNewSource()  add currentNode to deadNodes, wishing to get a different datanode
>         -> blockSeekTo()
>            -> chooseDataNode()
>               -> block missing, clear deadNodes and pick the currentNode again
>         seekToNewSource() returns false
>      readBuffer() re-throws the exception, quitting the loop
> readWithStrategy() gets the exception, and may fail the read call before MaxBlockAcquireFailures retries have been attempted.
> {noformat}
> Some issues with this logic:
> 1. The seekToNewSource() logic is broken because it may clear deadNodes in the middle of a read.
> 2. The variable "int retries=2" in readWithStrategy() seems to conflict with MaxBlockAcquireFailures; should it be removed?

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)