hadoop-yarn-issues mailing list archives

From "Silnov (JIRA)" <j...@apache.org>
Subject [jira] [Created] (YARN-4728) MapReduce job doesn't make any progress for a very long time after one node becomes unusable.
Date Wed, 24 Feb 2016 04:05:18 GMT
Silnov created YARN-4728:
----------------------------

             Summary: MapReduce job doesn't make any progress for a very long time after
one node becomes unusable.
                 Key: YARN-4728
                 URL: https://issues.apache.org/jira/browse/YARN-4728
             Project: Hadoop YARN
          Issue Type: Bug
          Components: capacityscheduler, nodemanager, resourcemanager
    Affects Versions: 2.6.0
         Environment: hadoop 2.6.0
yarn
            Reporter: Silnov
            Priority: Critical


I have some nodes running Hadoop 2.6.0.
The cluster's configuration is largely the default.
I run jobs on the cluster every day, including some that process a lot of data.
Sometimes a job would stay at the same progress for a very long time, so I had to kill
it manually and re-submit it to the cluster. That used to work (the re-submitted job
would run to completion), but something went wrong today.
After I re-submitted the same job 3 times, it deadlocked each time (the progress stopped
changing for a long time, and each attempt stalled at a different value, e.g. 33.01%, 45.8%, 73.21%).

I checked the Hadoop web UI and found 98 map tasks pending while the running reduce
tasks had consumed all the available memory. I stopped YARN, added the configuration
below to yarn-site.xml, and then restarted YARN.
<property>
  <name>yarn.app.mapreduce.am.job.reduce.rampup.limit</name>
  <value>0.1</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.job.reduce.preemption.limit</name>
  <value>1.0</value>
</property>
(I wanted YARN to preempt the reduce tasks' resources so the pending map tasks could run.)
After restarting YARN, I submitted the job with mapreduce.job.reduce.slowstart.completedmaps=1,
but the same thing happened again!! (My job stayed at the same progress value for a very
long time.)
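For reference, mapreduce.job.reduce.slowstart.completedmaps is the fraction of maps that must complete before reduces are launched, so a value of 1.0 should keep every reduce from starting until all maps have finished. A minimal sketch of that threshold check (the function name and signature are mine for illustration, not Hadoop's API):

```python
def reduces_may_start(completed_maps, total_maps, slowstart=1.0):
    """Return True once the completed-map fraction reaches the
    mapreduce.job.reduce.slowstart.completedmaps threshold."""
    if total_maps == 0:
        return True  # nothing to wait for
    return completed_maps / total_maps >= slowstart

# With slowstart=1.0, reduces must wait for every map:
print(reduces_may_start(97, 98, slowstart=1.0))  # False: one map still pending
print(reduces_may_start(98, 98, slowstart=1.0))  # True
```

This is why a slowstart of 1 was expected to avoid the deadlock: no reduce should hold memory while any map is still waiting.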

I checked the Hadoop web UI again and found that the pending map tasks had been
re-created with the note: "TaskAttempt killed because it ran on unusable node node02:21349".
Then I checked the ResourceManager's log and found some useful messages:
******Deactivating Node node02:21349 as it is now LOST.
******node02:21349 Node Transitioned from RUNNING to LOST.

I think this may have happened because the network across my cluster is poor, so the
RM did not receive the NM's heartbeats in time.
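If missed heartbeats are the trigger, one knob worth checking is the RM-side NodeManager liveness expiry: by default the RM waits 10 minutes (600000 ms) without a heartbeat before marking a node LOST. A hedged yarn-site.xml example (the doubled value here is only an illustration for a flaky network, not a recommendation):

```xml
<!-- The RM marks a NodeManager LOST after this many ms without a
     heartbeat (default 600000 = 10 min). -->
<property>
  <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
  <value>1200000</value>
</property>
```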

But I wonder: why can't the YARN framework preempt the running reduce tasks' resources
to run the pending map tasks? (This is what leaves the job stuck at the same progress
value for a very long time. :( )
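My expectation, as a simplified sketch (this is my own toy model of the decision, not the actual MRAppMaster/RMContainerAllocator code), is that when maps are starved and there is no headroom left, the AM should preempt enough reduces to free memory for them:

```python
import math

def reduces_to_preempt(pending_maps, headroom_mb, map_mb,
                       running_reduces, reduce_mb):
    """Toy model: if maps are pending but the headroom cannot fit even
    one map container, preempt enough reduces to cover the shortfall."""
    if pending_maps == 0 or headroom_mb >= map_mb:
        return 0  # maps can already be scheduled, nothing to preempt
    needed_mb = pending_maps * map_mb - headroom_mb
    n = math.ceil(needed_mb / reduce_mb)
    return min(n, running_reduces)

# 98 pending maps of 1024 MB each, zero headroom, 20 reduces of 2048 MB:
print(reduces_to_preempt(98, 0, 1024, 20, 2048))  # 20: preempt all reduces
```

In my situation this kind of logic should have killed the reduces holding all the memory, but the stuck jobs suggest it never kicked in.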


Can anyone help?
Thank you very much!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
