Mailing-List: contact hadoop-dev-help@lucene.apache.org; run by ezmlm
Reply-To: hadoop-dev@lucene.apache.org
Message-ID: <2580890.1151429910871.JavaMail.jira@brutus>
Date: Tue, 27 Jun 2006 17:38:30 +0000 (GMT+00:00)
From: "Milind Bhandarkar (JIRA)"
To: hadoop-dev@lucene.apache.org
Subject: [jira] Updated: (HADOOP-318) Progress in writing a DFS file does not count towards Job progress and can make the task timeout
In-Reply-To: <3721347.1151007329825.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

     [ http://issues.apache.org/jira/browse/HADOOP-318?page=all ]

Milind Bhandarkar updated HADOOP-318:
-------------------------------------

    Attachment: hadoop-datanode-allocation.patch

This is an updated patch for this issue. With it, tasks are no longer killed with the "task reported no progress for 600 seconds" error while they are in fact making progress.

It is really a datanode allocation patch. Each datanode sends additional load data to the namenode, indicating how many blocks it is currently writing or reading. When choosing datanodes for a new block, the namenode takes this load into consideration and discards datanodes whose load is more than twice the average. This is in addition to the requirement that the datanode has enough space to store min_num_blocks.

With this patch, I never see the "no progress for 600 seconds, killing task" error. As a result, on my 240-node cluster, the randomwriter times went down from 3997 seconds to 2404 seconds.

This patch also includes the file-writing progress patch, so please discard the two patches I submitted earlier.

> Progress in writing a DFS file does not count towards Job progress and can make the task timeout
> ------------------------------------------------------------------------------------------------
>
>          Key: HADOOP-318
>          URL: http://issues.apache.org/jira/browse/HADOOP-318
>      Project: Hadoop
>         Type: Bug
>   Components: mapred
>     Versions: 0.3.2
>  Environment: all, but especially on big busy clusters
>     Reporter: Milind Bhandarkar
>     Assignee: Milind Bhandarkar
>      Fix For: 0.4.0
>  Attachments: hadoop-datanode-allocation.patch, hadoop-latency-new.patch, hadoop-latency.patch
>
> When a task writes to a DFS file, depending on how busy the cluster is, it can time out after 10 minutes (the default), because progress in writing a DFS file does not count as progress of the task. The solution (patch is forthcoming) is to provide a way to call back the reporter to report task progress from DFSOutputStream.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
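
The datanode selection rule described in the comment above can be sketched roughly as follows. This is an illustration only, not the code in hadoop-datanode-allocation.patch: the class `DatanodeLoadFilter`, the `Node` type, and its fields are hypothetical names, and the sketch assumes the average load is taken over all reporting datanodes, which the message does not specify.

```java
import java.util.ArrayList;
import java.util.List;

public class DatanodeLoadFilter {

    /** Hypothetical snapshot of what one datanode reports to the namenode. */
    static final class Node {
        final String name;
        final int activeTransfers; // blocks currently being written or read
        final long freeBytes;      // remaining capacity on the datanode

        Node(String name, int activeTransfers, long freeBytes) {
            this.name = name;
            this.activeTransfers = activeTransfers;
            this.freeBytes = freeBytes;
        }
    }

    /**
     * Returns the names of datanodes that may receive a new block:
     * they must have room for minNumBlocks blocks and their load must
     * not exceed twice the average load (assumption: the average is
     * computed over all nodes passed in).
     */
    static List<String> eligible(List<Node> nodes, int minNumBlocks, long blockSize) {
        List<String> result = new ArrayList<>();
        if (nodes.isEmpty()) {
            return result;
        }
        long totalLoad = 0;
        for (Node n : nodes) {
            totalLoad += n.activeTransfers;
        }
        double averageLoad = (double) totalLoad / nodes.size();
        long requiredBytes = minNumBlocks * blockSize;

        for (Node n : nodes) {
            boolean hasSpace = n.freeBytes >= requiredBytes;
            boolean notOverloaded = n.activeTransfers <= 2.0 * averageLoad;
            if (hasSpace && notOverloaded) {
                result.add(n.name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(
            new Node("dn1", 1, 1000),
            new Node("dn2", 2, 1000),
            new Node("dn3", 1, 1000),
            new Node("dn4", 12, 1000), // far busier than the rest
            new Node("dn5", 0, 100));  // idle but nearly full
        // Average load is 3.2, so the cutoff is 6.4; 400 bytes are required.
        System.out.println(eligible(nodes, 4, 100)); // prints [dn1, dn2, dn3]
    }
}
```

In this sketch the busy node dn4 is skipped even though it has space, which is the behaviour the comment credits with spreading writes away from overloaded datanodes.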