Mailing-List: contact hadoop-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hadoop-dev@lucene.apache.org
Message-ID: <2580890.1151429910871.JavaMail.jira@brutus>
Date: Tue, 27 Jun 2006 17:38:30 +0000 (GMT+00:00)
From: "Milind Bhandarkar (JIRA)" <jira@apache.org>
To: hadoop-dev@lucene.apache.org
Subject: [jira] Updated: (HADOOP-318) Progress in writing a DFS file does
 not count towards Job progress and can make the task timeout
In-Reply-To: <3721347.1151007329825.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

     [ http://issues.apache.org/jira/browse/HADOOP-318?page=all ]

Milind Bhandarkar updated HADOOP-318:
-------------------------------------

    Attachment: hadoop-datanode-allocation.patch

This is an updated patch for this issue. With it, tasks that are making progress no longer fail with the error "task reported no progress for 600 seconds". It is in fact a datanode allocation patch: each datanode sends additional load data to the namenode indicating how many blocks it is currently writing or reading. When choosing datanodes for a new block, the namenode takes this load into consideration and discards datanodes whose load is more than twice the average.

This is in addition to the requirement that the datanode has enough space to store min_num_blocks.
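
To illustrate the selection rule described above, here is a minimal sketch in Java. It is not the code from the attached patch: the class DatanodeChooser, the NodeInfo fields, and the choice to average the load over all reporting datanodes are assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and structure are NOT taken from the patch.
class DatanodeChooser {

    /** Assumed per-datanode state as reported to the namenode. */
    static final class NodeInfo {
        final String name;
        final long freeBytes;     // remaining capacity on the datanode
        final int activeBlocks;   // blocks currently being written or read

        NodeInfo(String name, long freeBytes, int activeBlocks) {
            this.name = name;
            this.freeBytes = freeBytes;
            this.activeBlocks = activeBlocks;
        }
    }

    /**
     * Returns the datanodes that have room for minNumBlocks blocks and whose
     * load is at most twice the average load of all reporting datanodes.
     */
    static List<NodeInfo> eligible(List<NodeInfo> nodes, long blockSize, int minNumBlocks) {
        List<NodeInfo> result = new ArrayList<>();
        if (nodes.isEmpty()) {
            return result;
        }
        double totalLoad = 0;
        for (NodeInfo n : nodes) {
            totalLoad += n.activeBlocks;
        }
        double averageLoad = totalLoad / nodes.size();
        long requiredBytes = blockSize * minNumBlocks;

        for (NodeInfo n : nodes) {
            boolean hasSpace = n.freeBytes >= requiredBytes;
            boolean notOverloaded = n.activeBlocks <= 2.0 * averageLoad;
            if (hasSpace && notOverloaded) {
                result.add(n);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<NodeInfo> nodes = new ArrayList<>();
        nodes.add(new NodeInfo("dn1", 1000, 1));
        nodes.add(new NodeInfo("dn2", 1000, 1));
        nodes.add(new NodeInfo("dn3", 1000, 10)); // load far above average
        nodes.add(new NodeInfo("dn4", 100, 1));   // not enough free space
        for (NodeInfo n : eligible(nodes, 100L, 5)) {
            System.out.println(n.name);
        }
    }
}
```

In this sketch a busy datanode is skipped rather than queued behind, so a new block is steered towards less loaded datanodes.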

With this patch, I never see the "no progress for 600 seconds, killing task" error. As a result, on my 240-node cluster, the randomwriter times went down from 3997 seconds to 2404 seconds.

This patch includes the file-writing progress patch as well, so please discard the two patches I submitted earlier.

> Progress in writing a DFS file does not count towards Job progress and can make the task timeout
> ------------------------------------------------------------------------------------------------
>
>          Key: HADOOP-318
>          URL: http://issues.apache.org/jira/browse/HADOOP-318
>      Project: Hadoop
>         Type: Bug
>   Components: mapred
>     Versions: 0.3.2
>  Environment: all, but especially on big busy clusters
>     Reporter: Milind Bhandarkar
>     Assignee: Milind Bhandarkar
>      Fix For: 0.4.0
>  Attachments: hadoop-datanode-allocation.patch, hadoop-latency-new.patch, hadoop-latency.patch
>
> When a task writes to a DFS file, depending on how busy the cluster is, it can time out after 10 minutes (the default), because progress in writing a DFS file does not count as progress of the task. The solution (patch is forthcoming) is to provide a way to call back the reporter to report task progress from DFSOutputStream.
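
The callback idea in the issue description can be sketched roughly as follows. This is only an illustration of the approach and not the patch itself: ProgressCallback and ProgressReportingStream are hypothetical names, and a plain java.io stream stands in for DFSOutputStream.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch; these types are NOT part of Hadoop or the patch.
class ProgressDemo {

    /** Assumed hook through which a stream tells the task it is still alive. */
    interface ProgressCallback {
        void progress();
    }

    /** Wraps any output stream and reports progress after every write call. */
    static final class ProgressReportingStream extends FilterOutputStream {
        private final ProgressCallback callback;

        ProgressReportingStream(OutputStream out, ProgressCallback callback) {
            super(out);
            this.callback = callback;
        }

        @Override
        public void write(int b) throws IOException {
            out.write(b);
            callback.progress();
        }

        @Override
        public void write(byte[] buf, int off, int len) throws IOException {
            out.write(buf, off, len);
            callback.progress();
        }
    }

    public static void main(String[] args) throws IOException {
        int[] calls = {0};
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream os = new ProgressReportingStream(sink, () -> calls[0]++)) {
            os.write(new byte[] {1, 2, 3});
            os.write(4);
        }
        System.out.println("bytes written: " + sink.size());
        System.out.println("progress reports: " + calls[0]);
    }
}
```

In a real task the callback would forward to the task's reporter, so that a slow but steady file write keeps resetting the timeout.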

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira