Date: Wed, 6 Mar 2013 13:17:57 -0800
Subject: Re: Hadoop cluster hangs on big hive job
From: Daning Wang
To: user@hive.apache.org

Thanks Chalcy! But the Hadoop cluster should not hang in any case. Is that a bug?

On Wed, Mar 6, 2013 at 12:33 PM, Chalcy Raja <Chalcy.Raja@careerbuilder.com> wrote:

> You could try breaking up the hive query to return smaller datasets. I
> have noticed this behavior when the hive query has 'in' in the where clause.
>
> Thanks,
> Chalcy
>
> From: Daning Wang [mailto:daning@netseer.com]
> Sent: Wednesday, March 06, 2013 3:08 PM
> To: user@hive.apache.org
> Subject: Hadoop cluster hangs on big hive job
>
> We have a 5-node cluster (Hadoop 1.0.4). It hung a couple of times while
> running big Hive jobs (hive-0.8.1). Basically all the nodes are dead; from
> the TaskTracker's log it looks like it went into some kind of loop forever.
>
> All the log entries looked like this when the problem happened.
>
> Any idea how to debug the issue?
>
> Thanks in advance.
>
> 2013-03-05 15:13:19,526 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000012_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:19,552 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000028_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:20,858 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000036_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:21,141 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000016_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:21,486 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000019_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:21,692 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000039_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:22,448 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000032_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:22,643 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000000_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:22,840 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000024_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:24,628 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000008_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:24,723 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000039_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:25,336 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000004_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:25,539 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000043_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:25,545 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000012_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:25,569 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000028_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:25,855 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000024_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:26,876 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000036_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:27,159 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000016_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:27,505 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000019_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:28,464 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000032_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:28,553 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000043_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:28,561 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000012_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:28,659 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000000_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:30,519 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000019_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:30,644 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000008_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:30,741 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000039_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:31,369 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000004_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:31,675 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000000_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:31,875 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000024_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:32,372 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000028_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >
> 2013-03-05 15:13:32,893 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302270947_0010_r_000036_0 0.131468% reduce > copy (19706 of 49964 at 0.00 MB/s) >

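As a possible first step for the "how to debug" question, the quoted TaskTracker status lines can be checked mechanically: every reduce attempt reports the same copy counter (19706 of 49964) at 0.00 MB/s, which suggests the shuffle/copy phase has stopped making progress. The sketch below is only an illustration, not part of the original thread. The regular expression is inferred from the log lines quoted in this message and may not match other Hadoop versions, and the names `find_stalled_attempts`, `min_reports`, and `SAMPLE_LINES` are hypothetical.

```python
import re
from collections import defaultdict

# Pattern inferred from the TaskTracker lines quoted in this thread
# (Hadoop 1.0.4); treat it as an assumption for other versions.
STATUS_RE = re.compile(
    r"org\.apache\.hadoop\.mapred\.TaskTracker: "
    r"(?P<attempt>attempt_\S+) (?P<pct>[\d.]+)% reduce > copy "
    r"\((?P<done>\d+) of (?P<total>\d+) at (?P<rate>[\d.]+) MB/s\)"
)


def find_stalled_attempts(lines, min_reports=2):
    """Return {attempt_id: (done, total, reports)} for reduce attempts whose
    copy counter never changed across at least `min_reports` status lines
    and whose copy phase is not yet complete."""
    seen = defaultdict(list)
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            seen[match.group("attempt")].append(
                (int(match.group("done")), int(match.group("total")))
            )

    stalled = {}
    for attempt, reports in seen.items():
        # A single distinct (done, total) pair means no progress was reported.
        if len(reports) >= min_reports and len(set(reports)) == 1:
            done, total = reports[0]
            if done < total:
                stalled[attempt] = (done, total, len(reports))
    return stalled


# Three lines copied from the log quoted in this message.
SAMPLE_LINES = [
    "2013-03-05 15:13:19,526 INFO org.apache.hadoop.mapred.TaskTracker: "
    "attempt_201302270947_0010_r_000012_0 0.131468% reduce > copy "
    "(19706 of 49964 at 0.00 MB/s) >",
    "2013-03-05 15:13:19,552 INFO org.apache.hadoop.mapred.TaskTracker: "
    "attempt_201302270947_0010_r_000028_0 0.131468% reduce > copy "
    "(19706 of 49964 at 0.00 MB/s) >",
    "2013-03-05 15:13:25,545 INFO org.apache.hadoop.mapred.TaskTracker: "
    "attempt_201302270947_0010_r_000012_0 0.131468% reduce > copy "
    "(19706 of 49964 at 0.00 MB/s) >",
]

if __name__ == "__main__":
    for attempt, (done, total, reports) in find_stalled_attempts(SAMPLE_LINES).items():
        print(f"{attempt}: stuck at {done} of {total} over {reports} reports")
```

Run against a full TaskTracker log (for example `find_stalled_attempts(open(path))` with your own log path), a result that lists every reduce attempt would point at the copy phase rather than at any single task. It does not by itself say whether the cause is a bug or a configuration limit.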