Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hdfs-issues@hadoop.apache.org
Date: Mon, 6 Jan 2014 07:56:00 +0000 (UTC)
From: "Binglin Chang (JIRA)" <jira@apache.org>
To: hdfs-issues@hadoop.apache.org
Message-ID: <JIRA.12618889.1354719825559.46079.1388994960784@arcas>
In-Reply-To: <JIRA.12618889.1354719825559@arcas>
References: <JIRA.12618889.1354719825559@arcas>
Subject: [jira] [Commented] (HDFS-4273) Problem in DFSInputStream read retry
 logic may cause early failure
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HDFS-4273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862833#comment-13862833 ] 

Binglin Chang commented on HDFS-4273:
-------------------------------------

Hi LiuLei, this is indeed an issue.
By my understanding, HBase mostly uses pread to read HFile data from multiple threads, but deadNodes is not per-thread, so the broader issue is that changing deadNodes in one thread affects the logic of the other threads. This JIRA does not fix that broader issue.
I am thinking of a workaround: if a DN in deadNodes is local, we give it an expiry time and remove it from deadNodes once that time has passed. This is not perfect, but it should solve the issue most of the time.
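
To make the workaround concrete, here is a minimal sketch of a dead-node set whose local entries expire. It is only an illustration under my own assumptions, not the actual DFSInputStream code: the class name ExpiringDeadNodes, the String datanode identifiers, the isLocal flag and the injectable clock are all hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch of expiring dead-node entries; not the real DFSInputStream code. */
public class ExpiringDeadNodes {
    // Datanode identifier -> time (ms) at which the entry expires.
    // Long.MAX_VALUE means the entry never expires (used for non-local datanodes).
    private final ConcurrentHashMap<String, Long> deadNodes = new ConcurrentHashMap<>();
    private final long localExpiryMillis;
    private final LongSupplier clock; // injectable so the expiry can be tested

    public ExpiringDeadNodes(long localExpiryMillis, LongSupplier clock) {
        this.localExpiryMillis = localExpiryMillis;
        this.clock = clock;
    }

    /** Marks a datanode as dead; only local datanodes get an expiry time. */
    public void addDeadNode(String dn, boolean isLocal) {
        long expiresAt = isLocal ? clock.getAsLong() + localExpiryMillis : Long.MAX_VALUE;
        deadNodes.put(dn, expiresAt);
    }

    /** Returns true if the datanode is still considered dead, dropping expired entries. */
    public boolean isDead(String dn) {
        Long expiresAt = deadNodes.get(dn);
        if (expiresAt == null) {
            return false;
        }
        if (clock.getAsLong() >= expiresAt) {
            // Remove only the entry we observed; if another thread re-added the
            // node in the meantime, its fresh entry is kept for the next call.
            deadNodes.remove(dn, expiresAt);
            return false;
        }
        return true;
    }
}
```

A caller could construct it with something like new ExpiringDeadNodes(60000L, System::currentTimeMillis), where the 60 second expiry is just a placeholder value. Since the map is shared, an expired local DN becomes eligible again for every reading thread at once, which is the point of the workaround.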


> Problem in DFSInputStream read retry logic may cause early failure
> ------------------------------------------------------------------
>
>                 Key: HDFS-4273
>                 URL: https://issues.apache.org/jira/browse/HDFS-4273
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.0.2-alpha
>            Reporter: Binglin Chang
>            Assignee: Binglin Chang
>            Priority: Minor
>         Attachments: HDFS-4273-v2.patch, HDFS-4273.patch, HDFS-4273.v3.patch, HDFS-4273.v4.patch, HDFS-4273.v5.patch, HDFS-4273.v6.patch, TestDFSInputStream.java
>
>
> Assume the following call sequence:
> {noformat} 
> readWithStrategy()
>   -> blockSeekTo()
>   -> readBuffer()
>      -> reader.doRead()
>      -> seekToNewSource() adds currentNode to deadNodes, hoping to get a different datanode
>         -> blockSeekTo()
>            -> chooseDataNode()
>               -> block is missing, so deadNodes is cleared and currentNode is picked again
>         seekToNewSource() returns false
>      readBuffer() re-throws the exception and quits the loop
> readWithStrategy() gets the exception and may fail the read call before MaxBlockAcquireFailures attempts have been made.
> {noformat} 
> Some issues with this logic:
> 1. The seekToNewSource() logic is broken because deadNodes may be cleared in the middle of it.
> 2. The variable "int retries=2" in readWithStrategy seems to conflict with MaxBlockAcquireFailures. Should it be removed?


--
This message was sent by Atlassian JIRA
(v6.1.5#6160)