Subject: Re: tasktracker keep recevied KillJobAction and then delete unknown job while using hive
From: alo alt
Date: Wed, 1 Feb 2012 09:12:28 +0100
To: user@hive.apache.org
Cc: 佘晓彬
Message-Id: <0FDC2DE9-D554-4EBA-A0EA-38855CC1AC4A@gmail.com>

Hi,

+ hdfs-user (bcc'd)

Which JRE version are you using?

- Alex

--
Alexander Lorenz
http://mapredit.blogspot.com

On Feb 1, 2012, at 8:16 AM, Xiaobin She wrote:

> Hi,
>
> I'm using Hive to do some log analysis, and I have encountered a problem.
>
> My cluster has 3 nodes: one for the NameNode/JobTracker and the other two for DataNode/TaskTracker.
>
> One of the TaskTrackers repeatedly receives KillJobAction and then deletes unknown jobs.
>
> The logs look like:
>
> 2012-01-31 00:35:37,640 INFO org.apache.hadoop.mapred.TaskTracker: Received 'KillJobAction' for job: job_201201301055_0381
> 2012-01-31 00:35:37,640 WARN org.apache.hadoop.mapred.TaskTracker: Unknown job job_201201301055_0381 being deleted.
> 2012-01-31 00:36:22,697 INFO org.apache.hadoop.mapred.TaskTracker: Received 'KillJobAction' for job: job_201201301055_0383
> 2012-01-31 00:36:22,698 WARN org.apache.hadoop.mapred.TaskTracker: Unknown job job_201201301055_0383 being deleted.
> 2012-01-31 01:05:34,108 INFO org.apache.hadoop.mapred.TaskTracker: Received 'KillJobAction' for job: job_201201301055_0384
> 2012-01-31 01:05:34,108 WARN org.apache.hadoop.mapred.TaskTracker: Unknown job job_201201301055_0384 being deleted.
> 2012-01-31 01:07:43,280 INFO org.apache.hadoop.mapred.TaskTracker: Received 'KillJobAction' for job: job_201201301055_0385
> 2012-01-31 01:07:43,280 WARN org.apache.hadoop.mapred.TaskTracker: Unknown job job_201201301055_0385 being deleted.
>
> This happens occasionally. When it does, this TaskTracker does nothing but keep receiving KillJobAction and deleting unknown jobs, so performance drops.
>
> To work around this, I have to restart the cluster, but obviously that is not a good solution.
>
> These jobs are eventually run on the other TaskTracker, where they run well and succeed.
>
> Has anybody encountered this problem, and could you give me some advice?
>
> Occasionally there are also error logs like:
>
> 2012-01-31 13:11:40,183 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 55837: readAndProcess threw exception java.io.IOException: Connection reset by peer. Count of bytes read: 0
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcher.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:175)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
>         at org.apache.hadoop.ipc.Server.channelRead(Server.java:1211)
>         at org.apache.hadoop.ipc.Server.access$2300(Server.java:77)
>         at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:799)
>         at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:419)
>         at org.apache.hadoop.ipc.Server$Listener.run(Server.java:328)
> 2012-01-31 13:11:40,211 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201201311041_0071_r_-1096994286 exited. Number of tasks it ran: 0
> 2012-01-31 13:11:40,214 INFO org.apache.hadoop.mapred.TaskTracker: Killing unknown JVM jvm_201201311041_0071_r_-386575334
> 2012-01-31 13:11:40,221 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 55837: readAndProcess threw exception java.io.IOException: Connection reset by peer. Count of bytes read: 0
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcher.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:175)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
>         at org.apache.hadoop.ipc.Server.channelRead(Server.java:1211)
>         at org.apache.hadoop.ipc.Server.access$2300(Server.java:77)
>         at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:799)
>         at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:419)
>         at org.apache.hadoop.ipc.Server$Listener.run(Server.java:328)
>
> Is there some connection between these two errors?
>
> Thank you very much!
>
> xiaobin
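
To gauge how often the KillJobAction / unknown-job pattern quoted above occurs on a TaskTracker, the two messages can be matched per job ID. What follows is only a minimal Python sketch under stated assumptions, not something taken from Hadoop or Hive: the regular expressions are inferred solely from the log lines quoted in this thread, and the names summarize, unknown_kills and SAMPLE are hypothetical.

```python
import re
from collections import Counter

# Patterns inferred from the TaskTracker lines quoted in the thread
# (assumption: other Hadoop versions may word these messages differently).
KILL_RE = re.compile(r"Received 'KillJobAction' for job: (job_\d+_\d+)")
UNKNOWN_RE = re.compile(r"Unknown job (job_\d+_\d+) being deleted")


def summarize(lines):
    """Count KillJobAction receipts and unknown-job deletions per job ID."""
    kills, unknown = Counter(), Counter()
    for line in lines:
        match = KILL_RE.search(line)
        if match:
            kills[match.group(1)] += 1
        match = UNKNOWN_RE.search(line)
        if match:
            unknown[match.group(1)] += 1
    return kills, unknown


def unknown_kills(lines):
    """Return the sorted job IDs that were killed and also reported as unknown."""
    kills, unknown = summarize(lines)
    return sorted(set(kills) & set(unknown))


# Hypothetical sample built from four of the log lines quoted in the thread.
SAMPLE = [
    "2012-01-31 00:35:37,640 INFO org.apache.hadoop.mapred.TaskTracker: "
    "Received 'KillJobAction' for job: job_201201301055_0381",
    "2012-01-31 00:35:37,640 WARN org.apache.hadoop.mapred.TaskTracker: "
    "Unknown job job_201201301055_0381 being deleted.",
    "2012-01-31 00:36:22,697 INFO org.apache.hadoop.mapred.TaskTracker: "
    "Received 'KillJobAction' for job: job_201201301055_0383",
    "2012-01-31 00:36:22,698 WARN org.apache.hadoop.mapred.TaskTracker: "
    "Unknown job job_201201301055_0383 being deleted.",
]

print(unknown_kills(SAMPLE))
```

On a real node, the same functions could be fed the lines of the TaskTracker log file instead of SAMPLE. Comparing the resulting job IDs and timestamps against the "Connection reset by peer" entries might help show whether the two symptoms coincide, though the sketch itself does not establish any causal link.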