hadoop-common-dev mailing list archives

From "Milind Bhandarkar (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-318) Progress in writing a DFS file does not count towards Job progress and can make the task timeout
Date Tue, 27 Jun 2006 17:38:30 GMT
     [ http://issues.apache.org/jira/browse/HADOOP-318?page=all ]

Milind Bhandarkar updated HADOOP-318:
-------------------------------------

    Attachment: hadoop-datanode-allocation.patch

This is an updated patch for this issue that eliminates the "task reported no progress
for 600 seconds" errors that occurred even when progress was being made. It is, in fact, a datanode
allocation patch. Each datanode sends additional load data to the namenode indicating how many blocks
it is currently writing or reading. The namenode, when choosing datanodes for a new block, takes this
load into consideration and discards datanodes whose load is more than twice the average.

This is in addition to the requirement that the datanode has enough space to store min_num_blocks.
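
To illustrate the selection policy described above, here is a minimal sketch in Java. The class,
method, and field names (DatanodeAllocationSketch, NodeInfo, chooseCandidates) are hypothetical
stand-ins, not the identifiers used in the attached patch; only the two conditions (enough free
space for min_num_blocks and load no more than twice the cluster average) come from the description:

import java.util.ArrayList;
import java.util.List;

// Sketch of the allocation heuristic described above; names are illustrative only.
class DatanodeAllocationSketch {

    // Per-datanode state a heartbeat might carry: free space plus current block-transfer load.
    static class NodeInfo {
        final String name;
        final long remainingBytes;   // free space reported by the datanode
        final int activeTransfers;   // blocks currently being written or read

        NodeInfo(String name, long remainingBytes, int activeTransfers) {
            this.name = name;
            this.remainingBytes = remainingBytes;
            this.activeTransfers = activeTransfers;
        }
    }

    // Pick candidate datanodes for a new block: require enough free space for
    // minNumBlocks blocks, and discard nodes whose load is more than twice the
    // cluster-wide average load.
    static List<NodeInfo> chooseCandidates(List<NodeInfo> nodes, long blockSize, int minNumBlocks) {
        double totalLoad = 0;
        for (NodeInfo n : nodes) {
            totalLoad += n.activeTransfers;
        }
        double avgLoad = nodes.isEmpty() ? 0 : totalLoad / nodes.size();

        List<NodeInfo> candidates = new ArrayList<>();
        for (NodeInfo n : nodes) {
            boolean hasSpace = n.remainingBytes >= (long) minNumBlocks * blockSize;
            boolean notOverloaded = n.activeTransfers <= 2 * avgLoad;
            if (hasSpace && notOverloaded) {
                candidates.add(n);
            }
        }
        return candidates;
    }
}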

With this patch, I never see the "no progress for 600 seconds, killing task" error. As a result,
on my 240-node cluster, the randomwriter times went down from 3997 seconds to 2404 seconds.

This patch includes the file-writing progress patch as well, so please discard the two patches
I submitted earlier.
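
For context, the file-writing progress part follows the idea in the issue description below: long DFS
writes should count as task progress so the task is not killed for inactivity. The following is only a
simplified sketch of that idea; the Progressable interface and the wrapper class here are illustrative
stand-ins, not the actual changes made to DFSOutputStream in the patch:

import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Illustrative sketch: a wrapper stream that invokes a progress callback on every
// write, so bytes written to DFS keep the task "alive" with the framework.
class ProgressReportingOutputStream extends FilterOutputStream {

    // Simplified progress callback; in the real framework the reporter comes from the task.
    interface Progressable {
        void progress();
    }

    private final Progressable reporter;

    ProgressReportingOutputStream(OutputStream out, Progressable reporter) {
        super(out);
        this.reporter = reporter;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);
        reporter.progress();   // count bytes written as task progress
    }
}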

> Progress in writing a DFS file does not count towards Job progress and can make the task timeout
> ------------------------------------------------------------------------------------------------
>
>          Key: HADOOP-318
>          URL: http://issues.apache.org/jira/browse/HADOOP-318
>      Project: Hadoop
>         Type: Bug
>   Components: mapred
>     Versions: 0.3.2
>  Environment: all, but especially on big busy clusters
>     Reporter: Milind Bhandarkar
>     Assignee: Milind Bhandarkar
>      Fix For: 0.4.0
>  Attachments: hadoop-datanode-allocation.patch, hadoop-latency-new.patch, hadoop-latency.patch
>
> When a task writes to a DFS file, depending on how busy the cluster is, it can time out
> after 10 minutes (the default), because progress in writing a DFS file does not count
> as progress of the task. The solution (patch is forthcoming) is to provide a way for
> DFSOutputStream to call back the reporter and report task progress.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

