hadoop-common-dev mailing list archives

From "Devaraj Das (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-1651) Some improvements in progress reporting
Date Wed, 25 Jul 2007 12:06:34 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1651:
--------------------------------

    Attachment: 1651.patch

In addition to what is mentioned in the description of the issue, I also added a check for
ping failures before the task kills itself. The logic in the patch is: if the progress report
fails three times consecutively, then try ping, and if ping also fails three times, then
kill the task. The reason for adding this is that I notice a few 65 deaths each time a large
sort is run (500/900 nodes). Most of the time, these deaths happen because the progress
report gets an exception. This slightly different behavior will improve that case. It might
increase the time before a task discovers that a tasktracker is dead, but not significantly.
It helps us eliminate the false negatives.
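
The escalation described above can be sketched roughly as follows. This is only an illustration of the decision logic, not the code in 1651.patch; the names UmbilicalCall, UmbilicalHealthCheck, tryUpTo and shouldKillTask are hypothetical, and the real patch works against TaskUmbilicalProtocol.

```java
/** Hypothetical stand-in for one umbilical RPC (a progress report or a ping). */
interface UmbilicalCall {
    /** Returns true if the call went through; may throw on a communication error. */
    boolean call() throws Exception;
}

/** Illustrative helper only; not taken from 1651.patch. */
final class UmbilicalHealthCheck {
    /** "Three times", as stated in the comment above. */
    static final int MAX_ATTEMPTS = 3;

    /** Tries the call up to the given number of attempts; true on the first success. */
    static boolean tryUpTo(UmbilicalCall c, int attempts) {
        for (int i = 0; i < attempts; i++) {
            try {
                if (c.call()) {
                    return true;
                }
            } catch (Exception e) {
                // An exception counts as one failed attempt.
            }
        }
        return false;
    }

    /**
     * Returns true only if the progress report failed three times in a row
     * and the ping then also failed three times in a row.
     */
    static boolean shouldKillTask(UmbilicalCall reportProgress, UmbilicalCall ping) {
        if (tryUpTo(reportProgress, MAX_ATTEMPTS)) {
            return false; // progress got through, the tasktracker is reachable
        }
        return !tryUpTo(ping, MAX_ATTEMPTS);
    }
}
```

With this shape, a task needs six consecutive failed calls before it gives up, which is where the extra detection delay mentioned above comes from.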

> Some improvements in progress reporting
> ---------------------------------------
>
>                 Key: HADOOP-1651
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1651
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.14.0
>
>         Attachments: 1651.patch
>
>
> Some improvements that can be made:
> 1) The progress reporting interval can be made slightly larger. It is currently 1 second.
> Propose to make it 3 seconds to reduce the load on the TaskTracker.
> 2) Progress reports can potentially be missed. In the loop, if the first attempt at
> reporting progress doesn't go through, it is not retried. The next communication will be
> a 'ping'.
> 3) If there is an exception while reporting progress or doing a ping, the client should
> sleep for some time before retrying.
> 4) The TaskUmbilicalProtocol client can always stay connected to the server. Currently,
> the default idle timeout on the IPC client is set to 1000 msec (this means that the client
> will disconnect if the connection has been idle for 1000 msec). This might lead to
> unnecessary tearing-down/setting-up of connections for the TaskUmbilicalProtocol and can be
> avoided by having a high idle timeout for this protocol. The idea behind having the idle
> timeout was to not hold on to server connections unnecessarily and hence be more scalable
> when there are thousands of clients, which applies especially to the protocols involving
> the JobTracker and the NameNode. We don't run into scalability issues with the
> TaskUmbilicalProtocol since it is limited to a few Tasks and the corresponding TaskTracker.
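
Items 1 to 3 of the description could look something like the sketch below. It is a minimal illustration under stated assumptions, not Hadoop's actual reporting loop: ProgressReporter, its nested Umbilical interface, and the 1000 msec retry sleep are all hypothetical. Only the 3 second interval comes from the description.

```java
/** Illustrative reporter; names and structure are assumptions, not Hadoop code. */
final class ProgressReporter {
    /** Hypothetical, simplified stand-in for the task-to-TaskTracker protocol. */
    interface Umbilical {
        void progress(float value) throws java.io.IOException;
        void ping() throws java.io.IOException;
    }

    static final long REPORT_INTERVAL_MS = 3000; // item 1: proposed interval
    static final long RETRY_SLEEP_MS = 1000;     // item 3: assumed back-off value

    private final Umbilical umbilical;
    private boolean progressPending;
    private float latestProgress;

    ProgressReporter(Umbilical umbilical) {
        this.umbilical = umbilical;
    }

    /** Called by the task whenever it has new progress to report. */
    synchronized void setProgress(float value) {
        latestProgress = value;
        progressPending = true;
    }

    /**
     * Runs one iteration of the loop and returns how long the caller
     * should sleep before the next iteration.
     */
    synchronized long runOnce() {
        try {
            if (progressPending) {
                umbilical.progress(latestProgress);
                // Item 2: the flag is cleared only after the report succeeds,
                // so a failed report is retried instead of becoming a ping.
                progressPending = false;
            } else {
                umbilical.ping();
            }
            return REPORT_INTERVAL_MS;
        } catch (java.io.IOException e) {
            return RETRY_SLEEP_MS; // item 3: sleep before retrying
        }
    }
}
```

A caller would loop on `Thread.sleep(reporter.runOnce())`. Returning the delay instead of sleeping inside the method keeps the logic easy to exercise without waiting.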

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

