Date: Mon, 6 Jan 2014 08:53:59 +0000 (UTC)
From: "Binglin Chang (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Updated] (HDFS-4273) Problem in DFSInputStream read retry logic may cause early failure

    [ https://issues.apache.org/jira/browse/HDFS-4273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Binglin Chang updated HDFS-4273:
--------------------------------
    Attachment: HDFS-4273.v7.patch

Updated patch; changes:
1. Rebased to current trunk.
2. A local DN entry in deadNodes can now expire; once the local DN expires, it is removed from deadNodes.
3. Set the static constant LOCAL_DEADNODE_EXPIRE_MILLISECONDS to 10 minutes, so a local DN expires after 10 minutes, after which read operations will try to use this local DN if possible.
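A minimal sketch of the expiry idea in change 2 and 3 above (illustrative only, not the actual patch code; the class and method names here are invented for the example, only LOCAL_DEADNODE_EXPIRE_MILLISECONDS comes from the patch description):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: deadNodes entries remember when a node was marked dead, and a
// *local* datanode's entry expires after a fixed window, so later reads
// may retry the local node instead of avoiding it forever.
public class DeadNodeTracker {
    // Mirrors the constant described in the patch: 10 minutes.
    static final long LOCAL_DEADNODE_EXPIRE_MILLISECONDS = 10 * 60 * 1000L;

    // datanode id -> time (ms) at which it was added to deadNodes
    private final Map<String, Long> deadNodes = new HashMap<>();

    void addToDeadNodes(String datanodeId, long nowMillis) {
        deadNodes.put(datanodeId, nowMillis);
    }

    // A dead local node is skipped only while its entry is still fresh;
    // once the window has passed the entry is removed, allowing a retry.
    boolean isDead(String datanodeId, boolean isLocal, long nowMillis) {
        Long addedAt = deadNodes.get(datanodeId);
        if (addedAt == null) {
            return false;
        }
        if (isLocal && nowMillis - addedAt >= LOCAL_DEADNODE_EXPIRE_MILLISECONDS) {
            deadNodes.remove(datanodeId); // expired: eligible for retry
            return false;
        }
        return true;
    }
}
```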
Assuming failure is fast when connecting to a dead local DN, the performance impact of the extra retry should be small. We can make LOCAL_DEADNODE_EXPIRE_MILLISECONDS configurable by adding it to dfsclient.conf, if someone thinks it necessary.

> Problem in DFSInputStream read retry logic may cause early failure
> ------------------------------------------------------------------
>
>                 Key: HDFS-4273
>                 URL: https://issues.apache.org/jira/browse/HDFS-4273
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.0.2-alpha
>            Reporter: Binglin Chang
>            Assignee: Binglin Chang
>            Priority: Minor
>         Attachments: HDFS-4273-v2.patch, HDFS-4273.patch, HDFS-4273.v3.patch, HDFS-4273.v4.patch, HDFS-4273.v5.patch, HDFS-4273.v6.patch, HDFS-4273.v7.patch, TestDFSInputStream.java
>
>
> Assume the following call logic:
> {noformat}
> readWithStrategy()
>   -> blockSeekTo()
>   -> readBuffer()
>      -> reader.doRead()
>      -> seekToNewSource()  add currentNode to deadNodes, wishing to get a different datanode
>         -> blockSeekTo()
>            -> chooseDataNode()
>               -> block missing, clear deadNodes and pick the currentNode again
>         seekToNewSource() returns false
>      readBuffer() re-throws the exception, quitting the loop
> readWithStrategy() gets the exception, and may fail the read call before MaxBlockAcquireFailures retries have been attempted.
> {noformat}
> Some issues with this logic:
> 1. The seekToNewSource() logic is broken because it may clear deadNodes in the middle of a read.
> 2. The variable "int retries=2" in readWithStrategy() seems to conflict with MaxBlockAcquireFailures; should it be removed?

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)