Date: Fri, 8 Mar 2013 17:31:18 +0100
Subject: Re: Hadoop cluster hangs on big hive job
From: Håvard Wahl Kongsgård
To: user@hadoop.apache.org

Dude, I'm not going to read all your log files, but try running this as a
normal MapReduce job. It could be memory related, something wrong with some
of the zip files, a wrong config, etc.

-Håvard

On Thu, Mar 7, 2013 at 8:53 PM, Daning Wang wrote:
> We have a Hive query processing zipped CSV files. The query was scanning
> 10 days of data (partitioned by date), with around 130G of data per day.
> The problem is not consistent: if you run it again, it might go through.
> But the problem has never happened on smaller jobs (like processing only
> one day's data).
>
> We don't have a space issue.
>
> I have attached the log file from when the problem was happening.
> It is stuck like the following (just search for "19706 of 49964"):
>
> 2013-03-05 15:13:51,587 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000019_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:51,811 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000039_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:52,551 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000032_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:52,760 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000000_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:52,946 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000024_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:54,742 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000008_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>
> Thanks,
>
> Daning
>
>
> On Thu, Mar 7, 2013 at 12:21 AM, Håvard Wahl Kongsgård wrote:
>>
>> hadoop logs?
>>
>> On 6 March 2013 21:04, "Daning Wang" wrote:
>>>
>>> We have a 5-node cluster (Hadoop 1.0.4). It hung a couple of times while
>>> running big jobs. Basically all the nodes are dead; from the tasktracker's
>>> log, it looks like it went into some kind of loop forever.
>>>
>>> All the log entries looked like this when the problem happened.
>>>
>>> Any idea how to debug the issue?
>>>
>>> Thanks in advance.
>>>
>>>
>>> 2013-03-05 15:13:19,526 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000012_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:19,552 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000028_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:20,858 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000036_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:21,141 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000016_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:21,486 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000019_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:21,692 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000039_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:22,448 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000032_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:22,643 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000000_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:22,840 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000024_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:24,628 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000008_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:24,723 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000039_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:25,336 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000004_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:25,539 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000043_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:25,545 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000012_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:25,569 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000028_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:25,855 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000024_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:26,876 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000036_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:27,159 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000016_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:27,505 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000019_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:28,464 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000032_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:28,553 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000043_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:28,561 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000012_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:28,659 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000000_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:30,519 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000019_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:30,644 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000008_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:30,741 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000039_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:31,369 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000004_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:31,675 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000000_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:31,875 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000024_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:32,372 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000028_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:32,893 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000036_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
>>>
>

-- 
Håvard Wahl Kongsgård
Data Scientist
Faculty of Medicine & Department of Mathematical Sciences
NTNU
http://havard.security-review.net/
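
For anyone who would rather not read such logs by eye: the pattern described in the thread (many reduce attempts reporting the same "19706 of 49964" copy count at 0.00 MB/s) can be picked out mechanically. The following is only a rough sketch, not a Hadoop tool. It assumes each progress entry sits on a single line in the raw TaskTracker log, in the format quoted above; the function name `find_stalled` and the `min_reports` threshold are made up for illustration.

```python
import re
from collections import defaultdict

# Matches the reduce-copy progress lines quoted in this thread.
# Assumption: in the raw log each entry is on one line (the mail client wrapped them).
PROGRESS = re.compile(
    r"INFO org\.apache\.hadoop\.mapred\.TaskTracker: "
    r"(?P<attempt>attempt_\S+) [\d.]+% reduce > copy "
    r"\((?P<done>\d+) of (?P<total>\d+) at (?P<rate>[\d.]+) MB/s\)"
)


def find_stalled(lines, min_reports=2):
    """Return {attempt_id: copied_count} for attempts whose copy count never
    changed and whose rate was always 0.00 across at least min_reports reports."""
    reports = defaultdict(list)
    for line in lines:
        match = PROGRESS.search(line)
        if match:
            reports[match.group("attempt")].append(
                (int(match.group("done")), float(match.group("rate")))
            )

    stalled = {}
    for attempt, seen in reports.items():
        counts = {done for done, _ in seen}
        all_idle = all(rate == 0.0 for _, rate in seen)
        if len(seen) >= min_reports and len(counts) == 1 and all_idle:
            stalled[attempt] = seen[0][0]
    return stalled


# Three entries taken from the log excerpt in this thread.
SAMPLE = [
    "2013-03-05 15:13:19,526 INFO org.apache.hadoop.mapred.TaskTracker: "
    "attempt_201302270947_0010_r_000012_0 0.131468% reduce > copy "
    "(19706 of 49964 at 0.00 MB/s) >",
    "2013-03-05 15:13:19,552 INFO org.apache.hadoop.mapred.TaskTracker: "
    "attempt_201302270947_0010_r_000028_0 0.131468% reduce > copy "
    "(19706 of 49964 at 0.00 MB/s) >",
    "2013-03-05 15:13:25,545 INFO org.apache.hadoop.mapred.TaskTracker: "
    "attempt_201302270947_0010_r_000012_0 0.131468% reduce > copy "
    "(19706 of 49964 at 0.00 MB/s) >",
]

if __name__ == "__main__":
    for attempt, done in sorted(find_stalled(SAMPLE).items()):
        print(f"{attempt} stuck at {done} copied map outputs")
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `find_stalled(open(path))` with `path` pointing at your own TaskTracker log. This only confirms that the reducers are stalled in the copy phase; it says nothing about the cause.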