hadoop-common-user mailing list archives

From Raghu Angadi <rang...@yahoo-inc.com>
Subject Re: TaskTrackers disengaging from JobTracker
Date Thu, 30 Oct 2008 16:29:59 GMT
Devaraj Das wrote:
>> I wrote a patch to address the NPE in JobTracker.killJob() and compiled
>> it against TRUNK. I've put this on the cluster and it's now been holding
>> steady for the last hour or so.. so that plus whatever other differences
>> there are between 18.1 and TRUNK may have fixed things. (I'll submit the
>> patch to the JIRA as soon as it finishes cranking against the JUnit tests)
>>
> 
> Aaron, I don't think this is a solution to the problem you are seeing. The
> IPC handlers are tolerant to exceptions. In particular, they must not die in
> the event of an exception during RPC processing. Could you please get a
> stack trace of the JobTracker threads (without your patch) when the TTs are
> unable to talk to it. Access the url http://<jt-host>:<jt-info-port>/stacks
> That will tell us what the handlers are up to.
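
For anyone who wants to capture that page from a script rather than a browser, here is a minimal sketch. It assumes only what the quoted message states, namely that the JobTracker serves a thread dump at /stacks on its info port. The class name JtStacks, the helper names, and the timeout values are illustrative, not part of Hadoop.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hadoop: fetches the JobTracker /stacks page.
public class JtStacks {

    // Builds http://<jt-host>:<jt-info-port>/stacks from the two placeholders.
    static String stacksUrl(String jtHost, int jtInfoPort) {
        return "http://" + jtHost + ":" + jtInfoPort + "/stacks";
    }

    // Downloads the page body as text. Timeouts are arbitrary assumptions so
    // that the client fails with an exception instead of blocking forever.
    static String fetch(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(15000);
        StringBuilder out = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: JtStacks <jt-host> <jt-info-port>");
            return;
        }
        System.out.print(fetch(stacksUrl(args[0], Integer.parseInt(args[1]))));
    }
}
```

Run it with the host and info port of your own JobTracker and redirect the output to a file, so the dump can be attached to a JIRA or a reply on the list.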

Devaraj forwarded the stack traces that Aaron sent. As he suspected, there 
is a deadlock in the RPC server. I will file a blocker for 0.18 and above. 
This deadlock is more likely on a busy network.
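
As a general illustration of how a JVM can report a deadlock without anyone reading a dump by hand, here is a small sketch using the standard java.lang.management API. It is not the RPC server code and says nothing about the specific bug. It inspects only the JVM it runs in, and it finds only cycles on monitors and java.util.concurrent locks, so a thread stuck in some other kind of wait would not show up. The class and method names are illustrative.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists threads of the current JVM that are deadlocked.
public class DeadlockCheck {

    // Returns one description per deadlocked thread, or an empty list.
    static List<String> describeDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // findDeadlockedThreads() returns null when no deadlock is detected.
        long[] ids = mx.findDeadlockedThreads();
        List<String> out = new ArrayList<>();
        if (ids == null) {
            return out;
        }
        for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
            if (info == null) {
                continue; // the thread exited between the two calls
            }
            out.add(info.getThreadName() + " waits for " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> found = describeDeadlocks();
        if (found.isEmpty()) {
            System.out.println("no deadlocked threads detected");
        } else {
            for (String line : found) {
                System.out.println(line);
            }
        }
    }
}
```

A check like this would have to run inside the server process, for example from a periodic monitoring thread, to be of any use against a live daemon.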

Raghu.

