Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <CAF0o4SypFofRjW3Qd-F+ov-zWQHbtad8y8ub4Y9b_FTAubhZdQ@mail.gmail.com>
References: 
 <CAF0o4SzXeC=9oVy6=1Asi9-_c5yAhZ-EeA+fv7MfUZQ7EwVtPg@mail.gmail.com>
	<CACmJb3xx7w-FdSzNOTX03P80cuvsifQX1sMDZhnBvVGqB7W6Ew@mail.gmail.com>
	<CAF0o4SypFofRjW3Qd-F+ov-zWQHbtad8y8ub4Y9b_FTAubhZdQ@mail.gmail.com>
Date: Wed, 11 Nov 2015 23:53:47 +0530
Message-ID: 
 <CANOghCV63v-=8MzYbR-oOeTYWPMbSHr0FDhFdB7maAb2yNnR9Q@mail.gmail.com>
Subject: Re: Re-execution of map task
From: Varun Saxena <vsaxena.varun@gmail.com>
To: user@hadoop.apache.org
Cc: Namikaze Minato <lloydsensei@gmail.com>
Content-Type: multipart/alternative; boundary=089e01182608695916052447edde

--089e01182608695916052447edde
Content-Type: text/plain; charset=UTF-8

Hi Sergey,

This indicates that one or more of your NodeManagers may have gone down.
The RM reports this to the AM in the allocate response.
If a map task ran on such a node, its output is considered unusable even
though the map task was previously marked as succeeded, because the map
output is stored on that node and reducers have to fetch it from there.
Such a map attempt is then KILLED and a new attempt is launched.
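
If it helps to see which nodes are being marked unusable and how often,
here is a minimal sketch that scans an AM log for the JobImpl message you
quoted and groups the killed attempts by node. The function name and the
log file path are my own placeholders, and the pattern is only based on
the single log line in this thread, so treat it as an assumption rather
than a complete parser.

```python
import re
import sys
from collections import defaultdict

# Pattern for the JobImpl message quoted in this thread, e.g.
#   "TaskAttempt killed because it ran on unusable node 10.0.0.5:30050.
#    AttemptId:attempt_1447029285980_0001_m_000043_1"
# \s also matches newlines, so a wrapped message is still recognised.
UNUSABLE = re.compile(
    r"TaskAttempt killed because it ran on unusable node\s+(\S+?)\.\s*"
    r"AttemptId:\s*(attempt_\w+)"
)


def kills_per_node(log_text):
    """Map each unusable node (host:port) to the attempt IDs killed on it."""
    per_node = defaultdict(list)
    for node, attempt in UNUSABLE.findall(log_text):
        per_node[node].append(attempt)
    return dict(per_node)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: pass the path of a downloaded AM log file.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        result = kills_per_node(fh.read())
    # Print the nodes with the most killed attempts first.
    for node, attempts in sorted(result.items(), key=lambda kv: -len(kv[1])):
        print(f"{node}\t{len(attempts)} killed attempt(s)")
```

Nodes that show up repeatedly are the ones whose NodeManager logs are
worth checking around the corresponding timestamps.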

Regards,
Varun Saxena.

On Wed, Nov 11, 2015 at 11:44 PM, Sergey <sergun@gmail.com> wrote:

> Hi,
>
> Yes, there are several "failed" map attempts, because of the 600 sec
> time-out.
>
> I also found a lot of messages like this in the log:
>
> 2015-11-09 22:00:35,882 INFO [AsyncDispatcher event handler]
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed
> because it ran on unusable node 10.0.0.5:30050.
> AttemptId:attempt_1447029285980_0001_m_000043_1
>
> Different nodes were marked unusable very often.
>
> Do you know of a possible reason? Maybe changing some time-out
> params in communication between nodes could help?
>
> As I already said, I work in the cloud on Azure HDInsight.
>
> 2015-11-11 20:33 GMT+03:00 Namikaze Minato <lloydsensei@gmail.com>:
>
>> Hi.
>>
>> Do you also have "failed" map attempts?
>> Killed map attempts won't help us understand why your job is failing.
>>
>> Regards,
>> LLoyd
>>
>>
>> On 11 November 2015 at 16:37, Sergey <sergun@gmail.com> wrote:
>> >
>> > Hi experts!
>> >
>> > I see strange behaviour of Hadoop during execution of my tasks.
>> > It re-runs a task attempt which has completed with SUCCEEDED status
>> > (see the log below about attempt_1447029285980_0001_m_000012_0).
>> >
>> > I don't know why, but this task repeats in attempt numbers 0,1,2,3,4
>> > and then 2000.
>> >
>> > The same story with some other tasks.
>> > I also see on screen after execution of a task that sometimes map
>> > progress is decreasing...
>> >
>> > I don't use preemption or speculative execution, and I don't see any
>> > exceptions or time-outs in the yarn log
>> > (except the last line "Container killed on request. Exit code is 143").
>> >
>> > How can I find the reason?
>> >
>> > I use version 2.6.0 in Azure cloud (HDInsight)
>> >
>> >
>> > 2015-11-09 19:57:45,584 INFO [IPC Server handler 17 on 53153]
>> > org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
>> > attempt_1447029285980_0001_m_000012_0 is : 1.0
>> > 2015-11-09 19:57:45,592 INFO [IPC Server handler 12 on 53153]
>> > org.apache.hadoop.mapred.TaskAttemptListenerImpl: Done acknowledgement from
>> > attempt_1447029285980_0001_m_000012_0
>> > 2015-11-09 19:57:45,592 INFO [AsyncDispatcher event handler]
>> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
>> > attempt_1447029285980_0001_m_000012_0 TaskAttempt Transitioned from RUNNING
>> > to SUCCESS_CONTAINER_CLEANUP
>> > 2015-11-09 19:57:45,593 INFO [ContainerLauncher #4]
>> > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl:
>> > Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container
>> > container_e04_1447029285980_0001_01_002951 taskAttempt
>> > attempt_1447029285980_0001_m_000012_0
>> > 2015-11-09 19:57:45,593 INFO [ContainerLauncher #4]
>> > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING
>> > attempt_1447029285980_0001_m_000012_0
>> > 2015-11-09 19:57:45,593 INFO [ContainerLauncher #4]
>> > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy:
>> > Opening proxy : 10.0.0.8:30050
>> > 2015-11-09 19:57:45,906 INFO [AsyncDispatcher event handler]
>> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
>> > attempt_1447029285980_0001_m_000012_0 TaskAttempt Transitioned from
>> > SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
>> > 2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler]
>> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with
>> > attempt attempt_1447029285980_0001_m_000012_0
>> > 2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler]
>> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl:
>> > task_1447029285980_0001_m_000012 Task Transitioned from RUNNING to SUCCEEDED
>> > 2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler]
>> > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 4
>> > 2015-11-09 19:57:46,553 INFO [RMCommunicator Allocator]
>> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before
>> > Scheduling: PendingReds:0 ScheduledMaps:35 ScheduledReds:1 AssignedMaps:8
>> > AssignedReds:0 CompletedMaps:4 CompletedReds:0 ContAlloc:16 ContRel:0
>> > HostLocal:0 RackLocal:16
>> > 2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator]
>> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received
>> > completed container container_e04_1447029285980_0001_01_002951
>> > 2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator]
>> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated
>> > containers 1
>> > 2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator]
>> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to
>> > reduce
>> > 2015-11-09 19:57:48,575 INFO [AsyncDispatcher event handler]
>> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics
>> > report from attempt_1447029285980_0001_m_000012_0: Container killed by the
>> > ApplicationMaster.
>> > Container killed on request. Exit code is 143
>> > Container exited with a non-zero exit code 143
>>
>
>

--089e01182608695916052447edde--