From: "Todd Lipcon (Created) (JIRA)"
To: hdfs-issues@hadoop.apache.org
Date: Wed, 28 Sep 2011 04:15:45 +0000 (UTC)
Message-ID: <645186125.2300.1317183345821.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Created] (HDFS-2378) recoverBlock timeout in DFSClient should be longer

recoverBlock timeout in DFSClient should be longer
--------------------------------------------------

                 Key: HDFS-2378
                 URL: https://issues.apache.org/jira/browse/HDFS-2378
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: hdfs client
    Affects Versions: 0.20.206.0, 0.23.0
            Reporter: Todd Lipcon
            Assignee: Todd Lipcon
            Priority: Critical
             Fix For: 0.20.206.0, 0.23.0


In a failure scenario when one of the datanodes in a pipeline has "frozen" (eg hard swapping or disk controller issues) we sometimes see timeouts in the call to recoverBlock(). This is because recoverBlock's implementation sends several RPCs internally (to the NN and to other nodes in the pipeline) with the same timeout. Since the timeouts are equal, the "outer" call times out first. The retry then fails since recovery is already in progress, or already finished.

The best fix would be to make recoverBlock idempotent so the retry doesn't fail, but in the absence of that we can likely fix this issue by increasing the timeout to be equal to the sum of the timeouts of the underlying recovery calls.
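
To make the timeout arithmetic above concrete, here is a minimal sketch in Java. It is not DFSClient code: the class RecoveryTimeouts, its methods, the 60-second value, and the count of two nested RPCs are all hypothetical and chosen only for illustration. It shows why an outer call with the same timeout as a stalled inner RPC gives up first, and how summing the inner timeouts (the workaround proposed in this issue) avoids that.

```java
/**
 * Hypothetical helper, NOT part of DFSClient or any Hadoop API.
 * It only illustrates the timeout arithmetic described in this issue.
 */
final class RecoveryTimeouts {
    private RecoveryTimeouts() {}

    /**
     * Proposed workaround: the outer recoverBlock timeout is the sum of the
     * timeouts of the RPCs that the recovery issues internally.
     */
    static long outerTimeoutMillis(long... innerTimeoutsMillis) {
        long total = 0L;
        for (long t : innerTimeoutsMillis) {
            if (t < 0) {
                throw new IllegalArgumentException("negative timeout: " + t);
            }
            total = Math.addExact(total, t); // fail loudly on overflow
        }
        return total;
    }

    /**
     * The outer call's timer starts before any inner RPC's timer, so an outer
     * timeout that is less than or equal to a stalled inner timeout fires first.
     */
    static boolean outerExpiresFirst(long outerMillis, long stalledInnerMillis) {
        return outerMillis <= stalledInnerMillis;
    }

    public static void main(String[] args) {
        long rpcTimeout = 60_000L; // assumed value, for illustration only

        // Current behaviour: outer and inner calls share one timeout.
        System.out.println("equal timeouts, outer expires first: "
                + outerExpiresFirst(rpcTimeout, rpcTimeout));

        // Workaround: assume two nested RPCs (one to the NN, one to another
        // datanode in the pipeline), each using rpcTimeout.
        long outer = outerTimeoutMillis(rpcTimeout, rpcTimeout);
        System.out.println("summed outer timeout (ms): " + outer);
        System.out.println("summed timeout, outer expires first: "
                + outerExpiresFirst(outer, rpcTimeout));
    }
}
```

With the summed timeout, a single frozen datanode no longer causes the client's outer call to give up before the recovery's own RPC has timed out. As noted above, this is only a stopgap; an idempotent recoverBlock would make the retry safe regardless of timing.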