Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of krishnanjrao@gmail.com
 designates 209.85.223.180 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CALSJUsSpRfO1TUX9-i0d1do=y0X8C9qn0CRcrt3FXH_a=wQzLw@mail.gmail.com>
References: 
 <CAPEqew+69CKv_t6uai29CZgGkYKLsV6ExK2nHufEkUpgt_TH-w@mail.gmail.com>
 <CALSJUsSpRfO1TUX9-i0d1do=y0X8C9qn0CRcrt3FXH_a=wQzLw@mail.gmail.com>
From: Krishna Rao <krishnanjrao@gmail.com>
Date: Thu, 27 Mar 2014 09:59:44 +0000
Message-ID: 
 <CAPEqew+L0e=nen0HO7b0rigQXUiQzk-D_gHxqSgjWpNbCfV6oA@mail.gmail.com>
Subject: Re: Job froze for hours because of an unresponsive disk on one of the
 task trackers
To: user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=047d7bdc157a3ee1aa04f593a6ca

--047d7bdc157a3ee1aa04f593a6ca
Content-Type: text/plain; charset=ISO-8859-1

I noticed that, but none of the tasks ended up being resubmitted! And all 3
of those tasks failed on the same node. All we know is that the disk on that
node became unresponsive.


On 27 March 2014 09:33, Dieter De Witte <drdwitte@gmail.com> wrote:

> The IDs of the tasks are different, so the node got killed after failing
> on 3 different(!) reduce tasks. Reduce task 48 will probably have been
> resubmitted to another node.
>
>
> 2014-03-27 10:22 GMT+01:00 Krishna Rao <krishnanjrao@gmail.com>:
>
>> Hi,
>>
>> We have a daily Hive script that usually takes a few hours to run. The
>> other day I noticed that one of the jobs was taking in excess of a few
>> hours. Digging into it, I saw that there were 3 attempts to launch a task
>> on a single node:
>>
>> Task Id                           Start Time  Finish Time  Error
>> task_201312241250_46714_r_000048                           Error launching task
>> task_201312241250_46714_r_000049                           Error launching task
>> task_201312241250_46714_r_000050                           Error launching task
>>
>> I later found out that this node had a dodgy/unresponsive disk (still
>> being tested right now).
>>
>> We've seen tasks fail in the past, but they were resubmitted to another
>> node and succeeded. So, shouldn't this task have been kicked off on
>> another node after the first failure? Is there any configuration setting
>> I could be missing?
>>
>> We're using CDH4.4.0.
>>
>> Cheers,
>>
>> Krishna
>>
>
>

--047d7bdc157a3ee1aa04f593a6ca--