hadoop-common-dev mailing list archives

From "Raghu Angadi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1874) lost task trackers -- jobs hang
Date Wed, 19 Sep 2007 01:12:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528625 ]

Raghu Angadi commented on HADOOP-1874:
--------------------------------------

hmm.. I was wondering if the following would happen when I was submitting the server throttle
hack (looks like it does :( ):

- The server gets backed up at one moment, and it reads more slowly from the client.
- It looks like if it does not receive anything from a client for 2 min, it closes the
connection. I was not sure yesterday when the client closes a connection.
- Ideally, what the server should do in that case is not process _any_ more RPCs from that connection.
But since there is still readable data on the closed socket, it patiently reads and executes
RPCs whose results are going to be thrown away. Any such unnecessary work will result in a bad
feedback loop of ever-increasing load, since the client retries the same RPCs on a different socket.
I wonder what 'netstat' would have shown in this case on the namenode. My guess is that there
should be a LOT of these exceptions while writing the reply.
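
To make the last point concrete, here is a minimal sketch of the idea (plain Java, not the
actual org.apache.hadoop.ipc.Server code; Call, Connection and callQueue are made-up
stand-ins): a handler loop that drops queued calls whose connection has already been closed
instead of executing them and discarding the reply.

// Rough illustrative sketch only; names are hypothetical, not Hadoop's real classes.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class HandlerSketch implements Runnable {

    static class Connection {
        volatile boolean closed;                  // set by the listener when it closes the socket
        boolean isClosed() { return closed; }
    }

    static class Call {
        final Connection connection;
        final Runnable work;                      // the deserialized RPC to execute
        Call(Connection connection, Runnable work) {
            this.connection = connection;
            this.work = work;
        }
    }

    final BlockingQueue<Call> callQueue = new LinkedBlockingQueue<Call>();

    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                Call call = callQueue.take();
                // If the connection was closed (e.g. after the 2 min idle timeout),
                // the client has already retried on another socket; executing this
                // call would only add load and its reply would be thrown away.
                if (call.connection.isClosed()) {
                    continue;
                }
                call.work.run();                  // process the RPC and write the reply
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}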

Let me know if you want to try an updated server throttle patch.


> lost task trackers -- jobs hang
> -------------------------------
>
>                 Key: HADOOP-1874
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1874
>             Project: Hadoop
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 0.15.0
>            Reporter: Christian Kunz
>            Assignee: Devaraj Das
>            Priority: Blocker
>         Attachments: lazy-dfs-ops.1.patch, lazy-dfs-ops.2.patch, lazy-dfs-ops.4.patch, lazy-dfs-ops.patch, server-throttle-hack.patch
>
>
> This happens on a 1400-node cluster using a recent nightly build patched with HADOOP-1763
> (which fixes a previous 'lost task tracker' issue), running a c++-pipes job with 4200 maps and
> 2800 reduces. The task trackers start to get lost in high numbers near the end of the job.
> Similar non-pipes jobs do not show the same problem, but it is unclear whether the problem is
> related to c++-pipes. It could also be dfs overload when reduce tasks close and validate all
> newly created dfs files. I see dfs client rpc timeout exceptions, but this alone does not
> explain the escalation in losing task trackers.
> I also noticed that the job tracker becomes rather unresponsive, with rpc timeout and
> call queue overflow exceptions. The Job Tracker is running with 60 handlers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

