Date: Wed, 12 Jun 2013 12:47:48 +0530
Subject: Task Tracker going down on hive cluster
From: Ravi Shetye
To: user@hive.apache.org, user@hadoop.apache.org

In the last 4-5 days the task tracker on one of my slave machines has gone down a couple of times. It had been working fine for the past 4-5 months.

The cluster configuration is a 4-machine cluster on AWS:
1 m2.xlarge master
3 m2.xlarge slaves

The cluster is dedicated to running Hive queries, with the data residing on S3.

The slave on which the task tracker went down had the following log:

*******************************************************************
2013-06-11 00:26:30,968 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.190.***.***:60659, bytes: 38, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005693_0, duration: 279198
2013-06-11 00:26:30,971 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.191.**.***:37605, bytes: 38, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005700_0, duration: 193135
2013-06-11 00:26:30,971 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.190.***.***:60630, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005700_0, duration: 192011
2013-06-11 00:26:30,972 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.190.***.***:60656, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005693_0, duration: 178209
2013-06-11 00:26:30,973 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.8.***.**:45321, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005694_0, duration: 186452
2013-06-11 00:26:30,973 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.190.***.***:60659, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005694_0, duration: 157360
2013-06-11 00:26:30,974 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.8.***.**:45321, bytes: 38, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005700_0, duration: 157555
2013-06-11 00:26:30,991 INFO org.apache.hadoop.mapred.JvmManager: JVM Not killed jvm_201306071409_0151_m_-435659475 but just removed
2013-06-11 00:26:30,991 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201306071409_0151_m_-435659475 exited with exit code 0. Number of tasks it ran: 0
2013-06-11 00:26:30,991 ERROR org.apache.hadoop.mapred.JvmManager: Caught Throwable in JVMRunner. Aborting TaskTracker.
org.apache.hadoop.fs.FSError: java.io.IOException: Broken pipe
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:200)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:49)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:220)
    at sun.nio.cs.StreamEncoder.implClose(StreamEncoder.java:315)
    at sun.nio.cs.StreamEncoder.close(StreamEncoder.java:148)
    at java.io.OutputStreamWriter.close(OutputStreamWriter.java:233)
    at java.io.BufferedWriter.close(BufferedWriter.java:265)
    at java.io.PrintWriter.close(PrintWriter.java:312)
    at org.apache.hadoop.mapred.TaskController.writeCommand(TaskController.java:231)
    at org.apache.hadoop.mapred.DefaultTaskController.launchTask(DefaultTaskController.java:126)
    at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:497)
    at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:471)
Caused by: java.io.IOException: Broken pipe
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:297)
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:198)
    ... 13 more
2013-06-11 00:26:31,007 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_201306071409_0151_m_-495709221
2013-06-11 00:26:31,008 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.190.***.***:60656, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005694_0, duration: 222430
2013-06-11 00:26:31,008 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.190.***.***:60653, bytes: 38, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005693_0, duration: 154027
2013-06-11 00:26:31,008 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.190.***.***:60659, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005700_0, duration: 132067
2013-06-11 00:26:31,326 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_201306071409_0151_m_-495709221 spawned.
2013-06-11 00:26:31,328 INFO org.apache.hadoop.mapred.TaskController: Writing commands to /mnt/app/hadoop-tmp/ttprivate/taskTracker/piyushv/jobcache/job_201306071409_0151/attempt_201306071409_0151_m_005717_0/taskjvm.sh
2013-06-11 00:26:31,331 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.190.***.***:60656, bytes: 38, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005700_0, duration: 437236
2013-06-11 00:26:31,332 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down TaskTracker at ip-10-191-**-***/10.191.**.***
************************************************************/

--
RAVI SHETYE