Date: Thu, 24 Sep 2009 22:43:16 -0700 (PDT)
From: "Todd Lipcon (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] Resolved: (HDFS-262) On a busy cluster, it is possible for the client to believe it cannot fetch a block when the client or datanodes are running slowly

     [ https://issues.apache.org/jira/browse/HDFS-262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon resolved HDFS-262.
------------------------------

    Resolution: Cannot Reproduce
      Assignee: Todd Lipcon

Hi Jim,

I believe this behavior is already implemented. The client sleeps for 3 seconds, then calls openInfo() once, which causes the block locations to be refreshed. Resolving - feel free to reopen if I misunderstood.
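For illustration, here is a rough sketch of the refresh-and-retry pattern described above. It is not the actual DFSClient source: apart from the 3-second sleep and the openInfo()-style location refresh it mirrors, every class, method, and constant below is a hypothetical stand-in.

{code}
// Hedged sketch only -- not the real DFSClient code. It illustrates the
// behavior described in the comment above: when every known replica of a
// block has failed, back off briefly, refresh the block locations from the
// namenode (analogous to openInfo()), and try again a bounded number of
// times before surfacing the failure.
class BlockFetchRetrySketch {
    // Assumed limit; the real client uses a configurable failure threshold.
    private static final int MAX_BLOCK_ACQUIRE_FAILURES = 3;

    String fetchBlock() throws java.io.IOException {
        int failures = 0;
        while (true) {
            try {
                return readFromSomeReplica();        // may throw if every replica is slow or unreachable
            } catch (java.io.IOException ie) {
                if (++failures >= MAX_BLOCK_ACQUIRE_FAILURES) {
                    throw ie;                        // retries exhausted: give up as before
                }
                sleepQuietly(3000);                  // back off for 3 seconds
                refreshBlockLocations();             // ask the namenode for fresh locations
            }
        }
    }

    // Stubs standing in for the real DFSClient internals (hypothetical).
    private String readFromSomeReplica() throws java.io.IOException {
        throw new java.io.IOException("all replicas failed");
    }

    private void refreshBlockLocations() {
        // would re-fetch the block's locations from the namenode, as openInfo() does
    }

    private static void sleepQuietly(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
{code}

Bounding the number of refreshes keeps a genuinely missing block from hanging the reader indefinitely, while still giving a busy cluster a chance to recover.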
> On a busy cluster, it is possible for the client to believe it cannot fetch a block when the client or datanodes are running slowly
> -----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-262
>                 URL: https://issues.apache.org/jira/browse/HDFS-262
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>         Environment: 100 node cluster, fedora, 1TB disk per machine available for HDFS (two spindles), 16GB RAM, 8 cores,
>                      running datanode, TaskTracker, HBaseRegionServer and the task being executed by the TaskTracker.
>            Reporter: Jim Kellerman
>            Assignee: Todd Lipcon
>
> On a heavily loaded node, the communication between a DFSClient and a datanode can time out or fail, leading DFSClient to believe the datanode is non-responsive even though the datanode is, in fact, healthy. The client may run through all the retries for that datanode, leading DFSClient to mark the datanode "dead".
> This can continue as DFSClient iterates through the other datanodes for the block it is looking for, and then DFSClient will declare that it can't find any servers for that block (even though all n datanodes, where n is the replication factor, are healthy but slow and have valid copies of the block).
> It is also possible that the process running the DFSClient is too slow and misses (or times out on) responses from the datanode, resulting in the DFSClient believing that the datanode is dead.
> Another possibility is that the block has been moved from one or more datanodes since DFSClient$DFSInputStream.chooseDataNode() found the locations of the block.
> When the retries for each datanode and all datanodes are exhausted, DFSClient$DFSInputStream.chooseDataNode() issues the warning:
> {code}
> if (nodes == null || nodes.length == 0) {
>   LOG.info("No node available for block: " + blockInfo);
> }
> LOG.info("Could not obtain block " + block.getBlock() + " from any node: " + ie);
> {code}
> It would be an improvement, and would not impact performance under normal conditions, if, when DFSClient decides that it cannot find the block anywhere, it retried finding the block by calling
> {code}
> private static LocatedBlocks callGetBlockLocations()
> {code}
> *once*, to attempt to recover from machine(s) being too busy, or from the block having been relocated since the initial call to callGetBlockLocations(). If the second attempt to find the block, based on what the namenode told DFSClient, also fails, then issue the messages and give up by throwing the exception it does today.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
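For reference, here is a minimal editorial sketch of the change the report proposes: when the replica list is exhausted, refetch the block locations from the namenode exactly once before logging and throwing. None of this is HDFS source; refetchLocationsFromNamenode() merely stands in for the callGetBlockLocations() call named above (whose real signature returns LocatedBlocks), and every other name is hypothetical.

{code}
// Editorial sketch, not HDFS code: give a busy cluster or a relocated block
// one extra chance by asking the namenode for fresh locations a single time
// before giving up the way the client does today.
class ProposedLookupSketch {
    private boolean refetchedOnce = false;            // hypothetical one-shot flag

    java.util.List<String> locateBlock(long offset) throws java.io.IOException {
        java.util.List<String> nodes = cachedLocations(offset);
        if ((nodes == null || nodes.isEmpty()) && !refetchedOnce) {
            refetchedOnce = true;
            nodes = refetchLocationsFromNamenode(offset);   // one extra round-trip to the namenode
        }
        if (nodes == null || nodes.isEmpty()) {
            // Same give-up path as today once the single retry is spent.
            throw new java.io.IOException("Could not obtain block at offset " + offset + " from any node");
        }
        return nodes;
    }

    // Stubs standing in for the real client internals (hypothetical, simplified types).
    private java.util.List<String> cachedLocations(long offset) {
        return java.util.Collections.emptyList();
    }

    private java.util.List<String> refetchLocationsFromNamenode(long offset) {
        return java.util.Collections.emptyList();
    }
}
{code}

As the comment above notes, the shipped DFSClient already behaves this way, which is why the issue was closed as Cannot Reproduce.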