hadoop-hdfs-user mailing list archives

From Matt Cheah <mch...@palantir.com>
Subject YARN queues become unusable and jobs are stuck in ACCEPTED state
Date Thu, 28 Apr 2016 19:03:35 GMT
Hi,

I've been sporadically seeing an issue when using Hadoop YARN. I'm using Hadoop 2.5.0, CDH5.3.3.

When the stack is configured to use the Fair Scheduler, after the cluster has been up and
running jobs for some period of time, I notice that a job I submit gets stuck in the ACCEPTED
state, even though both the cluster and the queue I'm submitting to have sufficient resources
available to spawn an application master container. Furthermore, all jobs submitted to that
queue get stuck in the ACCEPTED state. I can unblock job submission by going into the allocation
XML file, renaming the queue, and submitting jobs to the renamed queue instead. However, only
the queue's name has changed; all of its other settings are preserved.
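
To make the workaround concrete, here is a minimal Python sketch of renaming a queue in a
Fair Scheduler allocation file while leaving its other settings untouched. The function name
rename_queue, the queue names, and the weight and minResources values are all hypothetical
and only illustrate the shape of the change; they are not taken from the actual cluster.

```python
import xml.etree.ElementTree as ET


def rename_queue(allocations_xml: str, old_name: str, new_name: str) -> str:
    """Return the allocation XML with every queue named old_name renamed to new_name.

    Only the name attribute is changed; child elements such as weight and
    minResources are left as they are.
    """
    root = ET.fromstring(allocations_xml)
    renamed = 0
    # iter() also visits nested queues, not just top-level ones.
    for queue in root.iter("queue"):
        if queue.get("name") == old_name:
            queue.set("name", new_name)
            renamed += 1
    if renamed == 0:
        raise ValueError(f"no queue named {old_name!r} in allocation file")
    return ET.tostring(root, encoding="unicode")


# Hypothetical allocation file contents, for illustration only.
EXAMPLE_ALLOCATIONS = """<allocations>
  <queue name="analytics">
    <weight>2.0</weight>
    <minResources>4096 mb,4 vcores</minResources>
  </queue>
</allocations>"""

if __name__ == "__main__":
    print(rename_queue(EXAMPLE_ALLOCATIONS, "analytics", "analytics_v2"))
```

Jobs then have to be submitted to the new queue name (analytics_v2 in this made-up example).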

It is clearly untenable for me to have to change the queues I'm using from time to time. This
appears to happen irrespective of the queue's settings, e.g. its weight or its minimum
resource share. The events leading up to an occurrence are unpredictable, and I have no
concrete way to reproduce the issue. The logs don't show anything interesting either; the
Resource Manager just states that it schedules an attempt for the application submitted to
the bad queue, but the attempt's application master is never allocated a container anywhere.
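
Since the logs are not much help, one way to spot an affected queue is to ask the Resource
Manager which applications are sitting in ACCEPTED and group them by queue. The sketch below
assumes the Resource Manager REST endpoint /ws/v1/cluster/apps with its states query parameter
is reachable; the function names, the example URL, and the sample payload are hypothetical,
and this is only an illustration, not something that was run against the cluster described
here.

```python
import json
from collections import defaultdict
from urllib.request import urlopen


def accepted_apps_by_queue(payload: dict) -> dict:
    """Group the IDs of ACCEPTED applications by queue name.

    payload is the parsed JSON body of a cluster apps response, which has the
    shape {"apps": {"app": [...]}} or {"apps": null} when nothing matches.
    """
    apps = (payload.get("apps") or {}).get("app") or []
    grouped = defaultdict(list)
    for app in apps:
        if app.get("state") == "ACCEPTED":
            grouped[app.get("queue", "unknown")].append(app["id"])
    return dict(grouped)


def fetch_accepted(rm_url: str) -> dict:
    """Query the Resource Manager (e.g. a hypothetical http://rm-host:8088)."""
    url = f"{rm_url}/ws/v1/cluster/apps?states=ACCEPTED"
    with urlopen(url, timeout=10) as resp:
        return accepted_apps_by_queue(json.load(resp))


# Made-up payload showing the expected shape of the response.
SAMPLE = {"apps": {"app": [
    {"id": "application_1_0001", "queue": "root.analytics", "state": "ACCEPTED"},
    {"id": "application_1_0002", "queue": "root.analytics", "state": "ACCEPTED"},
    {"id": "application_1_0003", "queue": "root.etl", "state": "RUNNING"},
]}}

if __name__ == "__main__":
    print(accepted_apps_by_queue(SAMPLE))
```

A queue whose applications stay in this list across repeated polls, while other queues keep
draining, would match the behavior described above.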

I have looked through the YARN bug tracker and couldn't find any similar issues. I've also used
jstack to inspect the Resource Manager process, but nothing is obviously wrong there. I was
wondering if anyone has encountered a similar issue before. I apologize that the description
is vague, but it's the best way I can describe it.

Thanks,

-Matt Cheah

