Subject: Re: TaskTrackers disengaging from JobTracker
From: Devaraj Das
To: core-user@hadoop.apache.org
Date: Thu, 30 Oct 2008 09:32:19 +0530
On 10/30/08 3:13 AM, "Aaron Kimball" wrote:
> The system load and memory consumption on the JT are both very close to
> "idle" states -- it's not overworked, I don't think.
>
> I may have an idea of the problem, though. Digging back up a ways into the
> JT logs, I see this:
>
> 2008-10-29 11:24:05,502 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 4 on 9001, call killJob(job_200810290855_0025) from
> 10.1.143.245:48253: error: java.io.IOException:
> java.lang.NullPointerException
> java.io.IOException: java.lang.NullPointerException
>     at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
>     at java.lang.reflect.Method.invoke(Method.java:599)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
>
> This exception is then repeated for all the IPC server handlers. So I think
> the problem is that all the handler threads are dying one by one due to this
> NPE.

This should not happen. The IPC handler catches Throwable and handles it.
Could you give more details, such as the kind of jobs (long/short) you are
running, how many tasks they have, etc.?

> Is this something I can fix myself, or is a patch available?
>
> - Aaron
>
> On Wed, Oct 29, 2008 at 12:55 PM, Arun C Murthy wrote:
>
>> It's possible that the JobTracker is under duress and unable to respond to
>> the TaskTrackers... what do the JobTracker logs say?
>>
>> Arun
>>
>> On Oct 29, 2008, at 12:33 PM, Aaron Kimball wrote:
>>
>>> Hi all,
>>>
>>> I'm working with a 30-node Hadoop cluster that has just started
>>> demonstrating some weird behavior. It ran without incident for a few
>>> weeks... and now:
>>>
>>> The cluster will run smoothly for 90--120 minutes or so, handling jobs
>>> continually during this time. Then suddenly all 29 TaskTrackers will get
>>> disconnected from the JobTracker. All the tracker daemon processes are
>>> still running on each machine, but the JobTracker will say "0 nodes
>>> available" on the web status screen. Restarting MapReduce fixes this for
>>> another 90--120 minutes.
>>>
>>> This looks similar to https://issues.apache.org/jira/browse/HADOOP-1763,
>>> but we're running on 0.18.1.
>>>
>>> I found this in a TaskTracker log:
>>>
>>> 2008-10-29 09:49:03,021 ERROR org.apache.hadoop.mapred.TaskTracker: Caught
>>> exception: java.io.IOException: Call failed on local exception
>>>     at java.lang.Throwable.<init>(Throwable.java:67)
>>>     at org.apache.hadoop.ipc.Client.call(Client.java:718)
>>>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>>>     at org.apache.hadoop.mapred.$Proxy1.heartbeat(Unknown Source)
>>>     at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1045)
>>>     at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:928)
>>>     at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
>>>     at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2352)
>>> Caused by: java.io.IOException: Connection reset by peer
>>>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>>>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
>>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
>>>     at sun.nio.ch.IOUtil.read(IOUtil.java:207)
>>>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>>>     at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>>>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
>>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
>>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
>>>     at java.io.FilterInputStream.read(FilterInputStream.java:127)
>>>     at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:272)
>>>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
>>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:248)
>>>     at java.io.DataInputStream.readInt(DataInputStream.java:381)
>>>     at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
>>>     at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)
>>>
>>> As well as a few of these warnings:
>>>
>>> 2008-10-29 01:44:20,161 INFO org.mortbay.http.SocketListener: LOW ON
>>> THREADS ((40-40+0)<1) on SocketListener0@0.0.0.0:50060
>>> 2008-10-29 01:44:20,166 WARN org.mortbay.http.SocketListener: OUT OF
>>> THREADS: SocketListener0@0.0.0.0:50060
>>>
>>> The NameNode and DataNodes are completely fine. It can't be a DNS issue,
>>> because all DNS is served through /etc/hosts files. The NameNode and
>>> JobTracker are on the same machine.
>>>
>>> Any help is appreciated.
>>> Thanks,
>>> - Aaron Kimball
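[Editor's note] The disagreement above -- Aaron suspects the handler threads are "dying one by one due to this NPE", while Devaraj replies that the IPC handler catches Throwable and so should survive -- can be sketched with a toy example. This is NOT Hadoop's actual Server code; it only illustrates why a per-call catch (Throwable) is what keeps a handler thread alive after an uncaught NullPointerException in a call:

```java
// Toy sketch (not Hadoop's Server class): a handler that wraps each call
// in catch (Throwable) survives a per-call NPE; an unguarded handler dies
// on the first one -- the failure mode Aaron suspects.
public class HandlerSketch {
    public static void main(String[] args) throws InterruptedException {
        // Simulates a call like killJob() throwing a NullPointerException.
        Runnable badCall = () -> {
            throw new NullPointerException("simulated killJob NPE");
        };

        final StringBuilder log = new StringBuilder();

        // Guarded handler: the catch clause keeps the loop (and thread) alive.
        Thread guarded = new Thread(() -> {
            for (int i = 0; i < 3; i++) {
                try {
                    badCall.run();
                } catch (Throwable t) {
                    log.append("handled call ").append(i).append('\n');
                }
            }
        });

        // Unguarded handler: the first NPE escapes run() and the thread dies.
        Thread unguarded = new Thread(() -> {
            for (int i = 0; i < 3; i++) {
                badCall.run();
                log.append("never reached\n");
            }
        });
        // Silence the default uncaught-exception stack trace for the demo.
        unguarded.setUncaughtExceptionHandler((t, e) -> {});

        guarded.start();
        unguarded.start();
        guarded.join();
        unguarded.join();

        System.out.print(log); // only the guarded handler's three lines appear
    }
}
```

If the 0.18.1 handlers really do catch Throwable as Devaraj says, the JobTracker-side NPE alone would not explain the handler pool draining, which is why his follow-up asks for more detail about the jobs being run.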