hadoop-yarn-issues mailing list archives

From "Silnov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4728) MapReduce job doesn't make any progress for a very long time after one node becomes unusable.
Date Sat, 27 Feb 2016 07:03:18 GMT

    [ https://issues.apache.org/jira/browse/YARN-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170406#comment-15170406 ]

Silnov commented on YARN-4728:
------------------------------

Varun Saxena, thanks for your response!
I have checked MAPREDUCE-6513. The scenario is similar to the one you described.
I'll learn something from it :)

> MapReduce job doesn't make any progress for a very long time after one node becomes unusable.
> -------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4728
>                 URL: https://issues.apache.org/jira/browse/YARN-4728
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler, nodemanager, resourcemanager
>    Affects Versions: 2.6.0
>         Environment: hadoop 2.6.0, yarn
>            Reporter: Silnov
>            Priority: Critical
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I have some nodes running Hadoop 2.6.0.
> The cluster's configuration is largely left at the defaults.
> I run jobs on the cluster every day (especially jobs that process a lot of data).
> Sometimes a job stays at the same progress value for a very long time, so I have to kill it manually and re-submit it to the cluster. This worked before (the re-submitted job would run to completion), but something went wrong today.
> After I re-submitted the same job three times, every run deadlocked: the progress stopped changing for a long time, and each run stalled at a different value (e.g. 33.01%, 45.8%, 73.21%).
> I checked the Hadoop web UI and found 98 map tasks suspended while the running reduce tasks had consumed all the available memory. I stopped YARN, added the configuration below to yarn-site.xml, and restarted YARN.
> <property>
>   <name>yarn.app.mapreduce.am.job.reduce.rampup.limit</name>
>   <value>0.1</value>
> </property>
> <property>
>   <name>yarn.app.mapreduce.am.job.reduce.preemption.limit</name>
>   <value>1.0</value>
> </property>
> (wanting YARN to preempt the reduce tasks' resources to run the pending map tasks)
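> (Note: despite the yarn.* prefix, the yarn.app.mapreduce.am.* properties are MapReduce ApplicationMaster settings that take effect through the submitted job's configuration, so they are usually set in the client-side mapred-site.xml or passed per job rather than applied by restarting YARN. A minimal per-job sketch, assuming the job's driver uses ToolRunner so that -D options are parsed; my-job.jar and MyDriver are placeholder names:
>     hadoop jar my-job.jar MyDriver \
>         -Dyarn.app.mapreduce.am.job.reduce.rampup.limit=0.1 \
>         -Dyarn.app.mapreduce.am.job.reduce.preemption.limit=1.0 \
>         <input> <output>)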
> After restarting YARN, I submitted the job with the property mapreduce.job.reduce.slowstart.completedmaps=1.
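> (mapreduce.job.reduce.slowstart.completedmaps is the fraction of map tasks that must complete before reduces are scheduled; the default is 0.05, and 1 delays every reduce until all maps have finished. A minimal sketch of setting it as a client-side default instead, assuming mapred-site.xml:
> <property>
>   <name>mapreduce.job.reduce.slowstart.completedmaps</name>
>   <value>1.0</value>
> </property>
> Note that slowstart only shapes the initial ramp-up; it cannot reclaim resources from reduces that are already running when lost map outputs must be recomputed.)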
> But the same result happened again: my job stayed at the same progress value for a very long time.
> I checked the Hadoop web UI again and found that the suspended map tasks had been re-created as new attempts, with the note from the previous attempt: "TaskAttempt killed because it ran on unusable node node02:21349".
> Then I checked the ResourceManager's log and found these useful messages:
> ******Deactivating Node node02:21349 as it is now LOST.
> ******node02:21349 Node Transitioned from RUNNING to LOST.
> I think this may happen because the network across my cluster is poor, which causes the RM to not receive the NM's heartbeats in time.
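> (If nodes are being marked LOST because heartbeats arrive late, the RM-side expiry window can be widened. A minimal yarn-site.xml sketch, doubling the 10-minute default of yarn.nm.liveness-monitor.expiry-interval-ms:
> <property>
>   <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
>   <value>1200000</value>
> </property>
> This only makes spurious node loss less likely on a slow network; it does not un-stick a job that has already deadlocked.)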
> But I wonder why the YARN framework cannot preempt the running reduce tasks' resources to run the pending map tasks. (This leaves the job stuck at the same progress value for a very long time. :( )
> Can anyone help?
> Thank you very much!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
