Message-ID: <38251534.1214947725365.JavaMail.jira@brutus>
Date: Tue, 1 Jul 2008 14:28:45 -0700 (PDT)
From: "Raghu Angadi (JIRA)"
To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-3673) Deadlock in Datanode RPC servers
In-Reply-To: <871530531.1214872244961.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/HADOOP-3673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12609714#action_12609714 ]

Raghu Angadi commented on HADOOP-3673:
--------------------------------------

A thin
client is better. And the DFS client has never gotten thinner in the past. The protocol needs to be really, really stable before there is such a client. Regarding porting, most likely a combination of a very thin wrapper over a somewhat fatter (Java?) library is what might emerge in the future.

For 0.18, a simple and straightforward fix is better. The RPC server feature might not be too intrusive for 0.18 either.

> Care has to be taken to ensure that responses from calls from the same connection are sequentialized and processed in order.

No ordering is required.

> Deadlock in Datanode RPC servers
> --------------------------------
>
>                 Key: HADOOP-3673
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3673
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.0
>            Reporter: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.18.0
>
>
> There is a deadlock scenario in the way Lease Recovery is triggered using the Datanode RPC server via HADOOP-3283.
> Each Datanode has dfs.datanode.handler.count handler threads (default of 3). These handler threads are used to support the generation-stamp-dance protocol as described in HADOOP-1700.
> Let me try to explain the scenario with an example. Suppose a cluster has two datanodes. Also, let's assume that dfs.datanode.handler.count is set to 1. Suppose that there are two clients, each writing to a separate file with a replication factor of 2. Let's assume that both clients encounter an IO error and trigger the generation-stamp-dance protocol. The first client may invoke recoverBlock on the first datanode while the second client may invoke recoverBlock on the second datanode. Now, each datanode will try to make a getBlockMetaDataInfo() call to the other datanode. But since each datanode has only one server handler thread, and that thread is already occupied by recoverBlock, both calls will block for eternity. Deadlock!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
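
The two-datanode scenario quoted above can be reproduced in miniature. What follows is a minimal sketch and not Hadoop code: the Node class, the recoverBlock and getBlockMetaDataInfo stand-ins, and the 2-second timeout used to detect the hang are all assumptions made for illustration. Each "datanode" is only a fixed-size thread pool playing the role of its RPC handler threads.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class HandlerDeadlockSketch {

    /** Hypothetical stand-in for a datanode: only its pool of RPC handler threads. */
    static final class Node {
        final ExecutorService handlers;

        Node(int handlerCount) {
            handlers = Executors.newFixedThreadPool(handlerCount);
        }
    }

    /** Stand-in for getBlockMetaDataInfo(): trivial work that needs a free handler on the target. */
    static Future<String> getBlockMetaDataInfo(Node target) {
        return target.handlers.submit(() -> "meta");
    }

    /** Stand-in for recoverBlock(): holds a handler on self while waiting on a call to peer. */
    static Future<String> recoverBlock(Node self, Node peer, CountDownLatch bothStarted) {
        return self.handlers.submit(() -> {
            bothStarted.countDown();
            bothStarted.await(); // make sure both recoveries hold a handler before calling out
            return getBlockMetaDataInfo(peer).get(); // waits for a free handler on the peer
        });
    }

    /** Returns true if the two cross-calling recoveries fail to finish within the timeout. */
    static boolean recoveryDeadlocks(int handlerCount) throws InterruptedException {
        Node dn1 = new Node(handlerCount);
        Node dn2 = new Node(handlerCount);
        CountDownLatch bothStarted = new CountDownLatch(2);
        try {
            Future<String> r1 = recoverBlock(dn1, dn2, bothStarted);
            Future<String> r2 = recoverBlock(dn2, dn1, bothStarted);
            r1.get(2, TimeUnit.SECONDS); // assumed timeout: a hang is treated as deadlock
            r2.get(2, TimeUnit.SECONDS);
            return false;
        } catch (TimeoutException e) {
            return true;
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            dn1.handlers.shutdownNow(); // interrupt any stuck handlers so the JVM can exit
            dn2.handlers.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("1 handler deadlocks: " + recoveryDeadlocks(1));
        System.out.println("2 handlers deadlock: " + recoveryDeadlocks(2));
    }
}
```

With one handler per node the sketch hangs until the timeout; with two handlers the same pair of recoveries completes. Raising the handler count only moves the threshold, though: as far as this model goes, the hang returns once the number of concurrent recoveries on each node reaches the handler count.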