Date: Wed, 6 Mar 2013 12:08:21 -0800
Subject: Hadoop cluster hangs on big hive job
From: Daning Wang
To: user@hive.apache.org

We have a 5-node cluster (Hadoop 1.0.4). It has hung a couple of times while running big Hive jobs (hive-0.8.1). Basically all the nodes are dead; from the TaskTracker's log, it looks like it went into some kind of endless loop.

All the log entries look like the ones below when the problem happens.

Any idea how to debug the issue?

Thanks in advance.

2013-03-05 15:13:19,526 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000012_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:19,552 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000028_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:20,858 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000036_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:21,141 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000016_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:21,486 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000019_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:21,692 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000039_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:22,448 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000032_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:22,643 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000000_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:22,840 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000024_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:24,628 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000008_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:24,723 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000039_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:25,336 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000004_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:25,539 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000043_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:25,545 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000012_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:25,569 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000028_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:25,855 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000024_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:26,876 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000036_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:27,159 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000016_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:27,505 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000019_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:28,464 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000032_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:28,553 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000043_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:28,561 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000012_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:28,659 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000000_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:30,519 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000019_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:30,644 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000008_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:30,741 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000039_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:31,369 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000004_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:31,675 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000000_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:31,875 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000024_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:32,372 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000028_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
2013-03-05 15:13:32,893 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000036_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
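
As a first debugging step, it may help to confirm from the TaskTracker log which reduce attempts are truly stuck in the copy phase, rather than just slow. The following is only a rough sketch, not part of Hadoop or Hive: the function name `find_stalled_attempts` and the `min_reports` threshold are made up for illustration, and the regular expression assumes the status lines look exactly like the entries in this message.

```python
import re
from collections import defaultdict

# Matches reduce-copy status lines in the format shown in this message.
# Assumption: other Hadoop versions may format these lines differently.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO "
    r"org\.apache\.hadoop\.mapred\.TaskTracker: "
    r"(?P<attempt>attempt_\d+_\d+_r_\d+_\d+) "
    r"(?P<progress>[\d.]+)% reduce > copy "
    r"\((?P<done>\d+) of (?P<total>\d+) at (?P<rate>[\d.]+) MB/s\)"
)


def find_stalled_attempts(lines, min_reports=2):
    """Return {attempt_id: (copied, total, report_count)} for attempts
    whose copied-output counter never changes and whose reported copy
    rate is always 0.00 MB/s across at least `min_reports` status lines.
    """
    reports_by_attempt = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a reduce-copy status line
        reports_by_attempt[match.group("attempt")].append(
            (int(match.group("done")),
             int(match.group("total")),
             float(match.group("rate")))
        )

    stalled = {}
    for attempt, reports in reports_by_attempt.items():
        copied_values = {done for done, _, _ in reports}
        always_idle = all(rate == 0.0 for _, _, rate in reports)
        if len(reports) >= min_reports and len(copied_values) == 1 and always_idle:
            done, total, _ = reports[0]
            stalled[attempt] = (done, total, len(reports))
    return stalled
```

Running this over the TaskTracker log from each node (for example, `find_stalled_attempts(open(path))` with `path` pointing at that node's log file) would show whether every reducer is frozen at the same copied count, as the excerpt above suggests, or only some of them.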