Subject: Re: TaskTrackers disengaging from JobTracker
From: Devaraj Das
To: core-user@hadoop.apache.org
Date: Thu, 30 Oct 2008 09:32:19 +0530
On 10/30/08 3:13 AM, "Aaron Kimball" wrote:
> The system load and memory consumption on the JT are both very close to
> "idle" states -- it's not overworked, I don't think.
>
> I may have an idea of the problem, though. Digging back up a ways into the
> JT logs, I see this:
>
> 2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 4 on 9001, call killJob(job_200810290855_0025) from
> 10.1.143.245:48253: error: java.io.IOException:
> java.lang.NullPointerException
> java.io.IOException: java.lang.NullPointerException
>     at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
>     at java.lang.reflect.Method.invoke(Method.java:599)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
>
> This exception is then repeated for all the IPC server handlers. So I think
> the problem is that all the handler threads are dying one by one due to this
> NPE.

This should not happen. The IPC handler catches Throwable and handles it.
Could you give more details, such as the kind of jobs (long/short) you are
running, how many tasks they have, etc.?

> Is this something I can fix myself, or is a patch available?
>
> - Aaron
>
> On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy wrote:
>
>> It's possible that the JobTracker is under duress and unable to respond to
>> the TaskTrackers... what do the JobTracker logs say?
>>
>> Arun
>>
>> On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:
>>
>>> Hi all,
>>>
>>> I'm working with a 30-node Hadoop cluster that has just started
>>> demonstrating some weird behavior. It ran without incident for a few
>>> weeks... and now:
>>>
>>> The cluster will run smoothly for 90--120 minutes or so, handling jobs
>>> continually during this time. Then suddenly all 29 TaskTrackers will get
>>> disconnected from the JobTracker. All the tracker daemon processes are
>>> still running on each machine, but the JobTracker will say "0 nodes
>>> available" on the web status screen. Restarting MapReduce fixes this for
>>> another 90--120 minutes.
>>>
>>> This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763,
>>> but we're running on 0.18.1.
>>>
>>> I found this in a TaskTracker log:
>>>
>>> 2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught
>>> exception: java.io.IOException: Call failed on local exception
>>>     at java.lang.Throwable.<init>(Throwable.java:67)
>>>     at org.apache.hadoop.ipc.Client.call(Client.java:718)
>>>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>>>     at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
>>>     at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
>>>     at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
>>>     at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
>>>     at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
>>> Caused by: java.io.IOException: Connection reset by peer
>>>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>>>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
>>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
>>>     at sun.nio.ch.IOUtil.read(IOUtil.java:207)
>>>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>>>     at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>>>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
>>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
>>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
>>>     at java.io.FilterInputStream.read(FilterInputStream.java:127)
>>>     at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272)
>>>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
>>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
>>>     at java.io.DataInputStream.readInt(DataInputStream.java:381)
>>>     at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
>>>     at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)
>>>
>>> As well as a few of these warnings:
>>>
>>> 2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON
>>> THREADS ((40-40+0)<1) on SocketListener0@0.0.0.0:50060
>>> 2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF
>>> THREADS: SocketListener0@0.0.0.0:50060
>>>
>>> The NameNode and DataNodes are completely fine. It can't be a DNS issue,
>>> because all DNS is served through /etc/hosts files. The NameNode and
>>> JobTracker are on the same machine.
>>>
>>> Any help is appreciated.
>>> Thanks,
>>> - Aaron Kimball
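[Editor's note] The disagreement above -- Aaron suspects the handler threads are "dying one by one due to this NPE", while Devaraj replies that the IPC handler catches Throwable and so should survive -- can be sketched with a toy example. This is NOT Hadoop's actual Server code; it only illustrates why a per-call catch (Throwable) is what keeps a handler thread alive after an uncaught NullPointerException in a call:

```java
// Toy sketch (not Hadoop's Server class): a handler that wraps each call
// in catch (Throwable) survives a per-call NPE; an unguarded handler dies
// on the first one -- the failure mode Aaron suspects.
public class HandlerSketch {
    public static void main(String[] args) throws InterruptedException {
        // Simulates a call like killJob() throwing a NullPointerException.
        Runnable badCall = () -> {
            throw new NullPointerException("simulated killJob NPE");
        };

        final StringBuilder log = new StringBuilder();

        // Guarded handler: the catch clause keeps the loop (and thread) alive.
        Thread guarded = new Thread(() -> {
            for (int i = 0; i < 3; i++) {
                try {
                    badCall.run();
                } catch (Throwable t) {
                    log.append("handled call ").append(i).append('\n');
                }
            }
        });

        // Unguarded handler: the first NPE escapes run() and the thread dies.
        Thread unguarded = new Thread(() -> {
            for (int i = 0; i < 3; i++) {
                badCall.run();
                log.append("never reached\n");
            }
        });
        // Silence the default uncaught-exception stack trace for the demo.
        unguarded.setUncaughtExceptionHandler((t, e) -> {});

        guarded.start();
        unguarded.start();
        guarded.join();
        unguarded.join();

        System.out.print(log); // only the guarded handler's three lines appear
    }
}
```

If the 0.18.1 handlers really do catch Throwable as Devaraj says, the JobTracker-side NPE alone would not explain the handler pool draining, which is why his follow-up asks for more detail about the jobs being run.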