hadoop-yarn-issues mailing list archives

From "zhihai xu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4728) MapReduce job doesn't make any progress for a very very long time after one Node become unusable.
Date Wed, 24 Feb 2016 07:12:18 GMT

    [ https://issues.apache.org/jira/browse/YARN-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15160305#comment-15160305 ]

zhihai xu commented on YARN-4728:

Thanks for reporting this issue [~Silnov]! 
It looks like this issue is caused by long timeouts at two levels. It is similar to YARN-3944, YARN-4414, YARN-3238 and YARN-3554. You may work around it by lowering the configuration values "ipc.client.connect.max.retries.on.timeouts" (default is 45), "ipc.client.connect.timeout" (default is 20000 ms) and "yarn.client.nodemanager-connect.max-wait-ms" (default is 900,000 ms).
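The defaults roughly multiply out to the observed stall: 45 retries x 20,000 ms per attempt = 900,000 ms (15 minutes), which matches the default of yarn.client.nodemanager-connect.max-wait-ms. A sketch of the overrides follows; the values shown are illustrative choices, not recommended settings, and the IPC properties belong in core-site.xml while the nodemanager-connect property belongs in yarn-site.xml:

```xml
<!-- core-site.xml: shorten the Hadoop IPC connect retry behaviour
     (illustrative values, tune for your cluster) -->
<property>
  <name>ipc.client.connect.max.retries.on.timeouts</name>
  <value>5</value>      <!-- default: 45 -->
</property>
<property>
  <name>ipc.client.connect.timeout</name>
  <value>10000</value>  <!-- in ms, default: 20000 -->
</property>

<!-- yarn-site.xml: cap how long a client waits on an unreachable NodeManager -->
<property>
  <name>yarn.client.nodemanager-connect.max-wait-ms</name>
  <value>60000</value>  <!-- in ms, default: 900000 -->
</property>
```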

> MapReduce job doesn't make any progress for a very very long time after one Node become unusable.
> -------------------------------------------------------------------------------------------------
>                 Key: YARN-4728
>                 URL: https://issues.apache.org/jira/browse/YARN-4728
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler, nodemanager, resourcemanager
>    Affects Versions: 2.6.0
>         Environment: hadoop 2.6.0
> yarn
>            Reporter: Silnov
>            Priority: Critical
>   Original Estimate: 24h
>  Remaining Estimate: 24h
> I have some nodes running hadoop 2.6.0.
> The cluster's configuration remains largely default.
> I run jobs on the cluster (especially jobs processing a lot of data) every day.
> Sometimes a job stays at the same progress value for a very very long time, so I have
to kill it manually and re-submit it to the cluster. This used to work (the re-submitted
job would run to the end), but something went wrong today.
> After I re-submitted the same job 3 times, each run deadlocked (the progress value stopped
changing for a long time, and each run stalled at a different value, e.g. 33.01%, 45.8%, 73.21%).
> I checked the Hadoop web UI and found 98 map tasks suspended while the running reduce tasks
had consumed all the available memory. I stopped YARN, added the configuration below
to yarn-site.xml, and then restarted YARN.
> <property>
>   <name>yarn.app.mapreduce.am.job.reduce.rampup.limit</name>
>   <value>0.1</value>
> </property>
> <property>
>   <name>yarn.app.mapreduce.am.job.reduce.preemption.limit</name>
>   <value>1.0</value>
> </property>
> (intending YARN to preempt the reduce tasks' resources to run the suspended map tasks)
> After restarting YARN, I submitted the job with the property mapreduce.job.reduce.slowstart.completedmaps=1,
> but the same thing happened again!! (my job stayed at the same progress value for a very very
long time)
> I checked the Hadoop web UI again and found that the suspended map tasks had been re-created,
each with the note: "TaskAttempt killed because it ran on unusable node node02:21349".
> Then I check the resourcemanager's log and find some useful messages below:
> ******Deactivating Node node02:21349 as it is now LOST.
> ******node02:21349 Node Transitioned from RUNNING to LOST.
> I think this may have happened because the network across my cluster is not good, which caused
the RM not to receive the NM's heartbeat in time.
> But I wonder why the YARN framework can't preempt the running reduce tasks' resources
to run the suspended map tasks? (This causes the job to stay at the same progress value for a very
very long time :( )
> Can anyone help?
> Thank you very much!

This message was sent by Atlassian JIRA
