From: bharath vissapragada
Date: Fri, 24 Aug 2012 23:50:59 +0530
Subject: Re: Long running Join Query - Reduce task fails due to failing to report status
To: user@hive.apache.org
Reply-To: user@hive.apache.org

My two cents,

Try checking whether the input to that reducer is skewed compared to the other reducers. This sometimes happens in joins, where a few reducers receive a large amount of input data and keep running forever.
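
One way to look for such skew is to count rows per join key and flag keys whose count is far above the typical one. The sketch below is only an illustration: the function name `find_skewed_keys`, the `factor` threshold, and the sample data are all hypothetical, and in practice the per-key counts would more likely come from a GROUP BY on the join column in Hive than from a Python list.

```python
from collections import Counter
from statistics import median


def find_skewed_keys(join_keys, factor=10.0):
    """Return {key: row_count} for keys whose row count exceeds
    `factor` times the median row count per key.

    `factor` is an arbitrary illustrative threshold, not a Hive setting.
    """
    counts = Counter(join_keys)
    if not counts:
        return {}
    typical = median(counts.values())
    return {key: n for key, n in counts.items() if n > factor * typical}


# Hypothetical sample of join-key values: key "a" dominates the input.
sample = ["a"] * 1000 + ["b"] * 12 + ["c"] * 9 + ["d"] * 11
print(find_skewed_keys(sample))
```

If a handful of keys stand out like this, the reducers that receive them will get far more rows than the rest.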


On Fri, Aug 24, 2012 at 11:41 PM, Bertrand Dechoux <dechouxb@gmail.com> wrote:
It is not clear from your post: is your job always failing during the same step? Or only sometimes? Or only once?
Since it's a Hive query, I would modify it to find the root cause.

First, create temporary "files" holding the results of the first three M/R jobs.
Then run the fourth M/R job on them and try filtering the data to see whether the failure is related to the volume or the format.

Regards

Bertrand


On Fri, Aug 24, 2012 at 7:44 PM, Igor Tatarinov <igor@decide.com> wrote:
Why don't you try splitting the big query into smaller ones?


On Fri, Aug 24, 2012 at 10:20 AM, Tim Havens <timhavens@gmail.com> wrote:

Just curious whether you've tried using Hive's EXPLAIN command to see what IT thinks of your query.


On Fri, Aug 24, 2012 at 9:36 AM, Himanish Kushary <himanish@gmail.com> wrote:
Hi,

We have a complex query that involves several left outer joins, resulting in 8 M/R jobs in Hive. During execution of one of the stages (after three M/R jobs have run), the M/R job fails because a few reduce tasks fail due to inactivity.

Most of the reduce tasks go through fine (within 3 mins), but the last one gets stuck for a long time (> 1 hour) and, after several attempts, finally gets killed due to "failed to report status for 600 seconds. Killing!"

What may be causing this issue? Would hive.script.auto.progress help in this case? As we are not able to get much information from the log files, how may we approach resolving this? Will tweaking any specific M/R parameters help?

The task attempt log shows several lines like this before exiting:

2012-08-23 19:17:23,848 INFO ExecReducer: ExecReducer: processing 219000000 rows: used memory = 408582240
2012-08-23 19:17:30,189 INFO ExecReducer: ExecReducer: processing 220000000 rows: used memory = 346110400
2012-08-23 19:17:37,510 INFO ExecReducer: ExecReducer: processing 221000000 rows: used memory = 583913576
2012-08-23 19:17:44,829 INFO ExecReducer: ExecReducer: processing 222000000 rows: used memory = 513071504
2012-08-23 19:17:47,923 INFO org.apache.hadoop.mapred.FileInputFormat: Total input paths to process : 1
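
As a rough check on the excerpt above, the following sketch parses the ExecReducer lines and estimates how many rows per second the reducer was processing while it was still logging. The regular expression and the helper name `rows_per_second` are assumptions made for this illustration; only the four log lines themselves come from the post.

```python
import re
from datetime import datetime

# The four ExecReducer lines quoted in the post.
LOG_LINES = [
    "2012-08-23 19:17:23,848 INFO ExecReducer: ExecReducer: processing 219000000 rows: used memory = 408582240",
    "2012-08-23 19:17:30,189 INFO ExecReducer: ExecReducer: processing 220000000 rows: used memory = 346110400",
    "2012-08-23 19:17:37,510 INFO ExecReducer: ExecReducer: processing 221000000 rows: used memory = 583913576",
    "2012-08-23 19:17:44,829 INFO ExecReducer: ExecReducer: processing 222000000 rows: used memory = 513071504",
]

# Assumed line layout: "<timestamp> INFO ExecReducer: ExecReducer: processing <N> rows ..."
PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO ExecReducer: "
    r"ExecReducer: processing (\d+) rows"
)


def rows_per_second(lines):
    """Average processing rate between the first and last matching log line."""
    samples = []
    for line in lines:
        match = PATTERN.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            samples.append((stamp, int(match.group(2))))
    if len(samples) < 2:
        raise ValueError("need at least two ExecReducer lines")
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    return (r1 - r0) / (t1 - t0).total_seconds()


print(f"{rows_per_second(LOG_LINES):,.0f} rows/s")
```

Comparing this rate with the timestamps of later log entries may help judge whether the task was merely slow or had stopped making progress.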
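Here are the reduce task counters:

Map-Reduce Framework
  Combine input records                             0
  Combine output records                            0
  Reduce input groups                     222,480,335
  Reduce shuffle bytes                  7,726,141,897
  Reduce input records                    222,480,335
  Reduce output records                             0
  Spilled Records                         355,827,191
  CPU time spent (ms)                       2,152,160
  Physical memory (bytes) snapshot      1,182,490,624
  Virtual memory (bytes) snapshot       1,694,531,584
  Total committed heap usage (bytes)      990,052,352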

The tasktracker log gives a thread dump at that time but no exception.

2012-08-23 20:05:49,319 INFO org.apache.hadoop.mapred.TaskTracker: Process Thread Dump: lost task
69 active threads

---------------------------
Thanks & Regards
Himanish



--
"The whole world is you. Yet you keep thinking there is something else." - Xuefeng Yicun 822-902 A.D.

Tim R. Havens
Google Phone: 573.454.1232
ICQ: 495992798
ICBM: 37°51'34.79"N 90°35'24.35"W
ham radio callsign: NW0W




--
Bertrand Dechoux



--
Bharath Vissapragada,
4th Year undergraduate,
IIIT Hyderabad.
w: http://researchweb.iiit.ac.in/~bharath.v