Return-Path: X-Original-To: apmail-hadoop-common-user-archive@www.apache.org Delivered-To: apmail-hadoop-common-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 212B918441 for ; Wed, 11 Nov 2015 18:23:58 +0000 (UTC) Received: (qmail 75744 invoked by uid 500); 11 Nov 2015 18:23:53 -0000 Delivered-To: apmail-hadoop-common-user-archive@hadoop.apache.org Received: (qmail 75626 invoked by uid 500); 11 Nov 2015 18:23:53 -0000 Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hadoop.apache.org Delivered-To: mailing list user@hadoop.apache.org Received: (qmail 75615 invoked by uid 99); 11 Nov 2015 18:23:53 -0000 Received: from Unknown (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 11 Nov 2015 18:23:53 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id B2567C0481 for ; Wed, 11 Nov 2015 18:23:52 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 2.899 X-Spam-Level: ** X-Spam-Status: No, score=2.899 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=3, RCVD_IN_MSPIKE_H2=-0.001, SPF_PASS=-0.001, WEIRD_PORT=0.001] autolearn=disabled Authentication-Results: spamd4-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-eu-west.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id Q4AI7rLxdtC6 for ; Wed, 11 Nov 2015 18:23:50 +0000 (UTC) Received: from mail-ig0-f172.google.com (mail-ig0-f172.google.com [209.85.213.172]) by mx1-eu-west.apache.org (ASF Mail Server at mx1-eu-west.apache.org) with ESMTPS id 68CC320B89 for ; Wed, 11 Nov 2015 18:23:49 +0000 (UTC) Received: by igbhv6 with SMTP id hv6so26407295igb.0 for ; Wed, 11 Nov 2015 10:23:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=k2kd4K3BQYFkqt4fyzw6LW/PQkP/5Mvf3qDqH2Xdg+w=; b=Q6QrWCPiJloJiH18Jd7bO6Qrl53tNnUb2VS/H+DGBn+U4EaHYX1ODN6bGcFwtNFS8t lzLfIja6sKWmfsCleOZgWwJqn770SmSTcORcgQQS67UbnOyrGcn074R4BQ8rae7fayX2 Hn6WqSbudaF7h7cuEFpLv3LuGyrRRwfMORSoQbhsa7KjiXmtA7HInF6MKQFIqyqFVQdk Y6FBfj86MQmi0MlmgrTT/mpMdOatu7xg6obrdSWiSxV+U0ZyXg7dchwsc8wSanLsLEeI laznFKWV6RnxQ/OtzCozfp9QiZXFYGm5uUOt88lwU78kRonEOoZLXad21v799lPf0/ua m6LA== MIME-Version: 1.0 X-Received: by 10.50.43.197 with SMTP id y5mr31903472igl.45.1447266227933; Wed, 11 Nov 2015 10:23:47 -0800 (PST) Received: by 10.64.1.139 with HTTP; Wed, 11 Nov 2015 10:23:47 -0800 (PST) In-Reply-To: References: Date: Wed, 11 Nov 2015 23:53:47 +0530 Message-ID: Subject: Re: Re-execution of map task From: Varun Saxena To: user@hadoop.apache.org Cc: Namikaze Minato Content-Type: multipart/alternative; boundary=089e01182608695916052447edde --089e01182608695916052447edde Content-Type: text/plain; charset=UTF-8 Hi Sergey, This indicates that one or more of your Node Managers' may have gone down. RM indicates this to AM on allocate response. If a map task ran on such a node, its output is considered unusable even though the map task has been marked as success previously. Such a map task is then KILLED and a new attempt is launched. Regards, Varun Saxena. On Wed, Nov 11, 2015 at 11:44 PM, Sergey wrote: > Hi, > > yes, there are several "failed" map, because of 600 sec time-out. > > I also found a lot messages like this in the log: > > 2015-11-09 22:00:35,882 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed > because it ran on unusable node 10.0.0.5:30050. > AttemptId:attempt_1447029285980_0001_m_000043_1 > > Different nodes got unusable status very often. > > Do you know something about possible reason? Maybe changing some time-out > params in communication between nodes could help? > > As I already said I work in the cloud on Azure HDInsight. > > > > > > > 2015-11-11 20:33 GMT+03:00 Namikaze Minato : > >> Hi. >> >> Do you also have "failed" map attempts? >> Killed map attempts won't help us understand why your job is failing. >> >> Regards, >> LLoyd >> >> >> On 11 November 2015 at 16:37, Sergey wrote: >> > >> > Hi experts! >> > >> > I see strange behaviour of Hadoop while execution of my tasks. >> > It re-runs task attempt which has completed with SUCCEEDED status >> > (see the log below about attempt_1447029285980_0001_m_000012_0). >> > >> > I don't know why but this tasks repeats in attempts numbers 0,1,2,3,4 >> and >> > than 2000. >> > >> > The same story with some other tasks.. >> > A also see on screen after execution of task that some times map >> progress is >> > decreasing... >> > >> > I don't use preemption, speculative execution and don't see any >> exceptions, >> > time-outs in yarn log >> > (except last line "Container killed on request. Exit code is 143"). >> > >> > How to catch the reason? >> > >> > I use version 2.6.0 in Azure cloud (HDInsight) >> > >> > >> > 2015-11-09 19:57:45,584 INFO [IPC Server handler 17 on 53153] >> > org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of >> TaskAttempt >> > attempt_1447029285980_0001_m_000012_0 is : 1.0 >> > 2015-11-09 19:57:45,592 INFO [IPC Server handler 12 on 53153] >> > org.apache.hadoop.mapred.TaskAttemptListenerImpl: Done acknowledgement >> from >> > attempt_1447029285980_0001_m_000012_0 >> > 2015-11-09 19:57:45,592 INFO [AsyncDispatcher event handler] >> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: >> > attempt_1447029285980_0001_m_000012_0 TaskAttempt Transitioned from >> RUNNING >> > to SUCCESS_CONTAINER_CLEANUP >> > 2015-11-09 19:57:45,593 INFO [ContainerLauncher #4] >> > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: >> > Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container >> > container_e04_1447029285980_0001_01_002951 taskAttempt >> > attempt_1447029285980_0001_m_000012_0 >> > 2015-11-09 19:57:45,593 INFO [ContainerLauncher #4] >> > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: >> KILLING >> > attempt_1447029285980_0001_m_000012_0 >> > 2015-11-09 19:57:45,593 INFO [ContainerLauncher #4] >> > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: >> > Opening proxy : 10.0.0.8:30050 >> > 2015-11-09 19:57:45,906 INFO [AsyncDispatcher event handler] >> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: >> > attempt_1447029285980_0001_m_000012_0 TaskAttempt Transitioned from >> > SUCCESS_CONTAINER_CLEANUP to SUCCEEDED >> > 2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler] >> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded >> with >> > attempt attempt_1447029285980_0001_m_000012_0 >> > 2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler] >> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: >> > task_1447029285980_0001_m_000012 Task Transitioned from RUNNING to >> SUCCEEDED >> > 2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler] >> > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed >> Tasks: 4 >> > 2015-11-09 19:57:46,553 INFO [RMCommunicator Allocator] >> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before >> > Scheduling: PendingReds:0 ScheduledMaps:35 ScheduledReds:1 >> AssignedMaps:8 >> > AssignedReds:0 CompletedMaps:4 CompletedReds:0 ContAlloc:16 ContRel:0 >> > HostLocal:0 RackLocal:16 >> > 2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator] >> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received >> > completed container container_e04_1447029285980_0001_01_002951 >> > 2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator] >> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got >> allocated >> > containers 1 >> > 2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator] >> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to >> > reduce >> > 2015-11-09 19:57:48,575 INFO [AsyncDispatcher event handler] >> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics >> > report from attempt_1447029285980_0001_m_000012_0: Container killed by >> the >> > ApplicationMaster. >> > Container killed on request. Exit code is 143 >> > Container exited with a non-zero exit code 143 >> > > --089e01182608695916052447edde Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
Hi Sergey,

This indicates that one or m= ore of your Node Managers' may have gone down. RM indicates this to AM = on allocate response.
If a map task ran on such a node, its outpu= t is considered unusable even though the map task has been marked as succes= s previously.
Such a map task is then KILLED and a new attempt is= launched.

Regards,
Varun Saxena.
<= /div>

On Wed, Nov = 11, 2015 at 11:44 PM, Sergey <sergun@gmail.com> wrote:
Hi,

yes, th= ere are several "failed" map, because of 600 sec time-out.
<= div>
I also found a lot messages like this in the log:
<= div>

2015-11-09 22:00:35,882 INFO [AsyncDispatcher event= handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt = killed because it ran on unusable node 10.0.0.5:30050. AttemptId:attempt_1447029285980_0001_m_= 000043_1

Different nodes got unusable status= very often.

Do you know something about possible = reason? Maybe changing some time-out params in communication between nodes = could help?

As I already said I work in the cloud = on Azure HDInsight.



=


2015-11-11 20:33 GMT+03:00 Namikaze Minato = <lloydsensei@= gmail.com>:
Hi.

Do you also have "failed" map attempts?
Killed map attempts won't help us understand why your job is failing.
Regards,
LLoyd


On 11 November 2015 at 16:37, Sergey <sergun@gmail.com> wrote:
>
> Hi experts!
>
> I see strange behaviour of Hadoop while execution of my tasks.
> It re-runs task attempt which has completed with SUCCEEDED status
> (see the log below about attempt_1447029285980_0001_m_000012_0).
>
> I don't know why but this tasks repeats in attempts numbers 0,1,2,= 3,4 and
> than 2000.
>
> The same story with some other tasks..
> A also see on screen after execution of task that some times map progr= ess is
> decreasing...
>
> I don't use preemption, speculative execution and don't see an= y exceptions,
> time-outs in yarn log
> (except last line "Container killed on request. Exit code is 143&= quot;).
>
> How to catch the reason?
>
> I use version 2.6.0 in Azure cloud (HDInsight)
>
>
> 2015-11-09 19:57:45,584 INFO [IPC Server handler 17 on 53153]
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAtte= mpt
> attempt_1447029285980_0001_m_000012_0 is : 1.0
> 2015-11-09 19:57:45,592 INFO [IPC Server handler 12 on 53153]
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Done acknowledgement= from
> attempt_1447029285980_0001_m_000012_0
> 2015-11-09 19:57:45,592 INFO [AsyncDispatcher event handler]
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
> attempt_1447029285980_0001_m_000012_0 TaskAttempt Transitioned from RU= NNING
> to SUCCESS_CONTAINER_CLEANUP
> 2015-11-09 19:57:45,593 INFO [ContainerLauncher #4]
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl:
> Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container=
> container_e04_1447029285980_0001_01_002951 taskAttempt
> attempt_1447029285980_0001_m_000012_0
> 2015-11-09 19:57:45,593 INFO [ContainerLauncher #4]
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KIL= LING
> attempt_1447029285980_0001_m_000012_0
> 2015-11-09 19:57:45,593 INFO [ContainerLauncher #4]
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProx= y:
> Opening proxy : 10.0.0.8:30050
> 2015-11-09 19:57:45,906 INFO [AsyncDispatcher event handler]
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
> attempt_1447029285980_0001_m_000012_0 TaskAttempt Transitioned from > SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
> 2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler]
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded w= ith
> attempt attempt_1447029285980_0001_m_000012_0
> 2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler]
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl:
> task_1447029285980_0001_m_000012 Task Transitioned from RUNNING to SUC= CEEDED
> 2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler]
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tas= ks: 4
> 2015-11-09 19:57:46,553 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before
> Scheduling: PendingReds:0 ScheduledMaps:35 ScheduledReds:1 AssignedMap= s:8
> AssignedReds:0 CompletedMaps:4 CompletedReds:0 ContAlloc:16 ContRel:0<= br> > HostLocal:0 RackLocal:16
> 2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received > completed container container_e04_1447029285980_0001_01_002951
> 2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got alloca= ted
> containers 1
> 2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned t= o
> reduce
> 2015-11-09 19:57:48,575 INFO [AsyncDispatcher event handler]
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnosti= cs
> report from attempt_1447029285980_0001_m_000012_0: Container killed by= the
> ApplicationMaster.
> Container killed on request. Exit code is 143
> Container exited with a non-zero exit code 143


--089e01182608695916052447edde--