hadoop-user mailing list archives

From Varun Saxena <vsaxena.va...@gmail.com>
Subject Re: Re-execution of map task
Date Wed, 11 Nov 2015 19:57:59 GMT
The reason is that the RM (ResourceManager) considers the NodeManager (10.0.0.5:30050 in
your case) to be in an unhealthy state (either no heartbeat has arrived from
the NodeManager for some time, or a heartbeat has arrived out of order, which
might indicate a restart) and reports this to the ApplicationMaster.
That is what produces log lines like "TaskAttempt killed because it ran on
unusable node".
When an unusable node is reported, killing an already successful map task is
the expected behavior, because the map output is stored on that node's local
disk and can no longer be fetched by reducers.
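
For reference, the RM-side liveness settings that control when a node is
declared lost live in yarn-site.xml. A minimal sketch with the stock 2.6.0
defaults (shown only so you know which knobs exist, not as recommended
values; raising the expiry interval can mask real failures):

```xml
<!-- yarn-site.xml (sketch): when does the RM give up on a NodeManager? -->
<property>
  <!-- How long the RM waits without a NM heartbeat before marking the
       node LOST (default: 10 minutes). -->
  <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
  <value>600000</value>
</property>
<property>
  <!-- How often each NM heartbeats to the RM (default: 1 second). -->
  <name>yarn.resourcemanager.nodemanagers.heartbeat-interval-ms</name>
  <value>1000</value>
</property>
```

If heartbeats are simply delayed by cloud networking rather than the NM
actually dying, these two values determine how tolerant the RM is.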

Can you check whether this NodeManager went down?
Or can you see any unusual entries in the active ResourceManager's logs for
this node (10.0.0.5:30050)? Something like deactivating node, unhealthy, etc.
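
To make that check concrete, here is a small sketch of what to grep for. The
sample lines below only approximate the messages RMNodeImpl emits when it
drops a node, and the log location is an assumption (it varies by
distribution; on a live cluster you can also run `yarn node -list -all` to
see UNHEALTHY/LOST nodes directly):

```shell
# Write a sample RM log excerpt approximating RMNodeImpl's node-deactivation
# messages (the real file lives under the RM's log directory, e.g.
# /var/log/hadoop-yarn/ on many distributions -- path is an assumption):
cat > /tmp/rm-sample.log <<'EOF'
2015-11-09 22:00:30,001 INFO ...resourcemanager.rmnode.RMNodeImpl: Deactivating Node 10.0.0.5:30050 as it is now LOST
2015-11-09 22:00:30,002 INFO ...resourcemanager.rmnode.RMNodeImpl: 10.0.0.5:30050 Node Transitioned from RUNNING to LOST
EOF

# Grep the RM log for the usual deactivation keywords:
grep -iE 'deactivating|unhealthy|lost|expired' /tmp/rm-sample.log
```

Any hit around the timestamps of the killed attempts tells you why the RM
declared the node unusable.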

Regards,
Varun Saxena.


On Thu, Nov 12, 2015 at 12:32 AM, Sergey <sergun@gmail.com> wrote:

> Hi Varun,
> thank you!
>
> But the main question is the reason...
>
> I did check the log files (via yarn logs, and the local logs on the nodes)
> and I don't see any reason why Hadoop would think the nodes went unusable.
>
>
> Regards,
> Sergey.
>
>
>
>
> 2015-11-11 21:23 GMT+03:00 Varun Saxena <vsaxena.varun@gmail.com>:
>
>> Hi Sergey,
>>
>> This indicates that one or more of your NodeManagers may have gone
>> down. The RM reports this to the AM in the allocate response.
>> If a map task ran on such a node, its output is considered unusable even
>> though the map task had previously been marked as a success.
>> Such a map task is then KILLED and a new attempt is launched.
>>
>> Regards,
>> Varun Saxena.
>>
>> On Wed, Nov 11, 2015 at 11:44 PM, Sergey <sergun@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> yes, there are several "failed" maps, caused by the 600-second time-out.
>>>
>>> I also found a lot of messages like this in the log:
>>>
>>> 2015-11-09 22:00:35,882 INFO [AsyncDispatcher event handler]
>>> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed
>>> because it ran on unusable node 10.0.0.5:30050.
>>> AttemptId:attempt_1447029285980_0001_m_000043_1
>>>
>>> Different nodes get marked unusable very often.
>>>
>>> Do you know a possible reason? Maybe changing some time-out parameters
>>> for communication between the nodes could help?
>>>
>>> As I already said, I am working in the cloud on Azure HDInsight.
>>>
>>>
>>>
>>>
>>>
>>>
>>> 2015-11-11 20:33 GMT+03:00 Namikaze Minato <lloydsensei@gmail.com>:
>>>
>>>> Hi.
>>>>
>>>> Do you also have "failed" map attempts?
>>>> Killed map attempts won't help us understand why your job is failing.
>>>>
>>>> Regards,
>>>> LLoyd
>>>>
>>>>
>>>> On 11 November 2015 at 16:37, Sergey <sergun@gmail.com> wrote:
>>>> >
>>>> > Hi experts!
>>>> >
>>>> > I see strange behaviour of Hadoop while executing my tasks.
>>>> > It re-runs a task attempt which has already completed with SUCCEEDED
>>>> > status (see the log below about attempt_1447029285980_0001_m_000012_0).
>>>> >
>>>> > I don't know why, but this task repeats in attempt numbers 0, 1, 2, 3, 4
>>>> > and then 2000.
>>>> >
>>>> > The same story with some other tasks.
>>>> > I also see on screen, after a task has executed, that sometimes the map
>>>> > progress is decreasing...
>>>> >
>>>> > I don't use preemption or speculative execution, and I don't see any
>>>> > exceptions or time-outs in the yarn log
>>>> > (except the last line "Container killed on request. Exit code is 143").
>>>> >
>>>> > How can I find the reason?
>>>> >
>>>> > I use version 2.6.0 in the Azure cloud (HDInsight)
>>>> >
>>>> >
>>>> > 2015-11-09 19:57:45,584 INFO [IPC Server handler 17 on 53153]
>>>> > org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
>>>> > attempt_1447029285980_0001_m_000012_0 is : 1.0
>>>> > 2015-11-09 19:57:45,592 INFO [IPC Server handler 12 on 53153]
>>>> > org.apache.hadoop.mapred.TaskAttemptListenerImpl: Done acknowledgement
>>>> > from attempt_1447029285980_0001_m_000012_0
>>>> > 2015-11-09 19:57:45,592 INFO [AsyncDispatcher event handler]
>>>> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
>>>> > attempt_1447029285980_0001_m_000012_0 TaskAttempt Transitioned from
>>>> > RUNNING to SUCCESS_CONTAINER_CLEANUP
>>>> > 2015-11-09 19:57:45,593 INFO [ContainerLauncher #4]
>>>> > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl:
>>>> > Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container
>>>> > container_e04_1447029285980_0001_01_002951 taskAttempt
>>>> > attempt_1447029285980_0001_m_000012_0
>>>> > 2015-11-09 19:57:45,593 INFO [ContainerLauncher #4]
>>>> > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl:
>>>> > KILLING attempt_1447029285980_0001_m_000012_0
>>>> > 2015-11-09 19:57:45,593 INFO [ContainerLauncher #4]
>>>> > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy:
>>>> > Opening proxy : 10.0.0.8:30050
>>>> > 2015-11-09 19:57:45,906 INFO [AsyncDispatcher event handler]
>>>> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
>>>> > attempt_1447029285980_0001_m_000012_0 TaskAttempt Transitioned from
>>>> > SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
>>>> > 2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler]
>>>> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded
>>>> > with attempt attempt_1447029285980_0001_m_000012_0
>>>> > 2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler]
>>>> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl:
>>>> > task_1447029285980_0001_m_000012 Task Transitioned from RUNNING to
>>>> > SUCCEEDED
>>>> > 2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler]
>>>> > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed
>>>> > Tasks: 4
>>>> > 2015-11-09 19:57:46,553 INFO [RMCommunicator Allocator]
>>>> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before
>>>> > Scheduling: PendingReds:0 ScheduledMaps:35 ScheduledReds:1
>>>> > AssignedMaps:8 AssignedReds:0 CompletedMaps:4 CompletedReds:0
>>>> > ContAlloc:16 ContRel:0 HostLocal:0 RackLocal:16
>>>> > 2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator]
>>>> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received
>>>> > completed container container_e04_1447029285980_0001_01_002951
>>>> > 2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator]
>>>> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got
>>>> > allocated containers 1
>>>> > 2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator]
>>>> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned
>>>> > to reduce
>>>> > 2015-11-09 19:57:48,575 INFO [AsyncDispatcher event handler]
>>>> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
>>>> > Diagnostics report from attempt_1447029285980_0001_m_000012_0:
>>>> > Container killed by the ApplicationMaster.
>>>> > Container killed on request. Exit code is 143
>>>> > Container exited with a non-zero exit code 143
>>>>
>>>
>>>
>>
>
