From: Devaraj Das
To: core-user@hadoop.apache.org
Date: Thu, 30 Oct 2008 10:24:18 +0530
Subject: Re: TaskTrackers disengaging from JobTracker
In-Reply-To: <49093A55.4040708@cs.washington.edu>

> I wrote a patch to address the NPE in JobTracker.killJob() and compiled
> it against TRUNK. I've put this on the cluster and it's now been holding
> steady for the last hour or so.. so that plus whatever other differences
> there are between 18.1 and TRUNK may have fixed things. (I'll submit the
> patch to the JIRA as soon as it finishes cranking against the JUnit tests)

Aaron, I don't think this is a solution to the problem you are seeing. The
IPC handlers are tolerant to exceptions. In particular, they must not die
in the event of an exception during RPC processing. Could you please get a
stack trace of the JobTracker threads (without your patch) when the TTs are
unable to talk to it. Access the URL
http://<jobtracker host>:<web port>/stacks. That will tell us what the
handlers are up to.
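A minimal sketch of pulling that stacks page follows, assuming a placeholder
host name and the usual JobTracker web UI port (50030); adjust both to
whatever mapred.job.tracker.http.address is set to on this cluster.

    // Dump the JobTracker's /stacks servlet to stdout.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class DumpJobTrackerStacks {
      public static void main(String[] args) throws Exception {
        // Placeholder address; pass the real <jobtracker host>:<port> as arg 0.
        String hostPort = (args.length > 0)
            ? args[0] : "jobtracker.example.com:50030";
        URL url = new URL("http://" + hostPort + "/stacks");
        BufferedReader in =
            new BufferedReader(new InputStreamReader(url.openStream()));
        String line;
        while ((line = in.readLine()) != null) {
          System.out.println(line);  // one stack trace per server thread
        }
        in.close();
      }
    }

A plain wget or browser request against the same URL works just as well; the
point is to capture the handler threads' stacks at the moment the TTs cannot
get through.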
> - Aaron
>
> Devaraj Das wrote:
>>
>> On 10/30/08 3:13 AM, "Aaron Kimball" wrote:
>>
>>> The system load and memory consumption on the JT are both very close to
>>> "idle" states -- it's not overworked, I don't think.
>>>
>>> I may have an idea of the problem, though. Digging back up a ways into
>>> the JT logs, I see this:
>>>
>>> 2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server
>>> handler 4 on 9001, call killJob(job_200810290855_0025) from
>>> 10.1.143.245:48253: error: java.io.IOException:
>>> java.lang.NullPointerException
>>> java.io.IOException: java.lang.NullPointerException
>>>   at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843)
>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
>>>   at java.lang.reflect.Method.invoke(Method.java:599)
>>>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
>>>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
>>>
>>> This exception is then repeated for all the IPC server handlers. So I
>>> think the problem is that all the handler threads are dying one by one
>>> due to this NPE.
>>
>> This should not happen. The IPC handler catches Throwable and handles
>> that. Could you give more details like the kind of jobs (long/short) you
>> are running, how many tasks they have, etc.?
>>
>>> Is this something I can fix myself, or is a patch available?
>>>
>>> - Aaron
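For illustration only, a hypothetical sketch of the kind of null guard such
a fix might add. This is not Aaron's actual patch and not the 0.18
JobTracker source; the method signature, the jobs map, and the LOG field are
all assumed for the sake of the example.

    // Fragment meant to slot into a JobTracker-like class (names assumed).
    public synchronized void killJob(JobID jobid) throws IOException {
      JobInProgress job = jobs.get(jobid);
      if (job == null) {
        // Unknown or already-retired job id: log and return instead of
        // dereferencing null, so the call becomes a no-op rather than
        // surfacing an NPE to the IPC handler.
        LOG.info("killJob(): unknown job " + jobid + ", ignoring request");
        return;
      }
      job.kill();
    }

Even with such a guard in place, Devaraj's point stands: an IPC handler
thread is expected to survive the exception either way, so the guard removes
the log noise but should not by itself explain trackers dropping off.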
>>> On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy wrote:
>>>
>>>> It's possible that the JobTracker is under duress and unable to respond
>>>> to the TaskTrackers... what do the JobTracker logs say?
>>>>
>>>> Arun
>>>>
>>>> On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm working with a 30 node Hadoop cluster that has just started
>>>>> demonstrating some weird behavior. It's run without incident for a
>>>>> few weeks.. and now:
>>>>>
>>>>> The cluster will run smoothly for 90-120 minutes or so, handling jobs
>>>>> continually during this time. Then suddenly it will be the case that
>>>>> all 29 TaskTrackers will get disconnected from the JobTracker. All the
>>>>> tracker daemon processes are still running on each machine, but the
>>>>> JobTracker will say "0 nodes available" on the web status screen.
>>>>> Restarting MapReduce fixes this for another 90-120 minutes.
>>>>>
>>>>> This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763,
>>>>> but we're running on 0.18.1.
>>>>>
>>>>> I found this in a TaskTracker log:
>>>>>
>>>>> 2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker:
>>>>> Caught exception: java.io.IOException: Call failed on local exception
>>>>>   at java.lang.Throwable.<init>(Throwable.java:67)
>>>>>   at org.apache.hadoop.ipc.Client.call(Client.java:718)
>>>>>   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>>>>>   at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
>>>>>   at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
>>>>>   at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
>>>>>   at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
>>>>>   at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
>>>>> Caused by: java.io.IOException: Connection reset by peer
>>>>>   at sun.nio.ch.FileDispatcher.read0(Native Method)
>>>>>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
>>>>>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
>>>>>   at sun.nio.ch.IOUtil.read(IOUtil.java:207)
>>>>>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>>>>>   at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>>>>>   at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
>>>>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
>>>>>   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
>>>>>   at java.io.FilterInputStream.read(FilterInputStream.java:127)
>>>>>   at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272)
>>>>>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
>>>>>   at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
>>>>>   at java.io.DataInputStream.readInt(DataInputStream.java:381)
>>>>>   at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
>>>>>   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)
>>>>>
>>>>> As well as a few of these warnings:
>>>>>
>>>>> 2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON
>>>>> THREADS ((40-40+0)<1) on SocketListener0@0.0.0.0:50060
>>>>> 2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF
>>>>> THREADS: SocketListener0@0.0.0.0:50060
>>>>>
>>>>> The NameNode and DataNodes are completely fine. Can't be a DNS issue,
>>>>> because all DNS is served through /etc/hosts files. The NameNode and
>>>>> JobTracker are on the same machine.
>>>>>
>>>>> Any help is appreciated.
>>>>>
>>>>> Thanks,
>>>>> - Aaron Kimball
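On the "LOW ON THREADS" / "OUT OF THREADS" warnings quoted above: they come
from the TaskTracker's embedded Jetty listener on port 50060, whose pool size
is governed by tasktracker.http.threads (the 40 in the message matches the
usual default), while the JobTracker serves heartbeats from an IPC pool sized
by mapred.job.tracker.handler.count. A minimal sketch that prints what a
cluster's configuration resolves these to; the property names and fallback
defaults used here are as commonly documented for 0.18.x and should be
checked against the cluster's own hadoop-default.xml.

    // Print the two thread-pool settings most relevant to these symptoms.
    import org.apache.hadoop.conf.Configuration;

    public class PrintMapredThreadSettings {
      public static void main(String[] args) {
        // new Configuration() picks up hadoop-default.xml and
        // hadoop-site.xml from the classpath.
        Configuration conf = new Configuration();
        System.out.println("tasktracker.http.threads = "
            + conf.getInt("tasktracker.http.threads", 40));
        System.out.println("mapred.job.tracker.handler.count = "
            + conf.getInt("mapred.job.tracker.handler.count", 10));
      }
    }

Raising either value is a common mitigation on busy clusters, though it would
not by itself explain handler threads that have actually died.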