hadoop-yarn-dev mailing list archives

From Prabhu Joseph <prabhujose.ga...@gmail.com>
Subject Re: Spark Job on YARN Hogging the entire Cluster resource
Date Wed, 24 Feb 2016 22:47:50 GMT
You are right, Hamel. It should get 10 TB / 2, and in hadoop-2.7.0 it
works as expected. But in hadoop-2.5.1 it gets only 10 TB / 230, with the
same configuration used in both versions.
So I think a JIRA fixed the issue some time after hadoop-2.5.1.
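
To make the two figures concrete, here is a minimal sketch of the arithmetic, assuming equal-weight queues, no minResources, maxResources that never bind, and 10 TB taken as 10 * 1024 GB. The helper `equal_share` is a hypothetical name for illustration, not a YARN API.

```python
# Simplified model: equal-weight queues, no minResources, maxResources never bind.
# Cluster figures (10 TB, 3000 cores, 230 queues) are taken from the thread.
CLUSTER_MEMORY_GB = 10 * 1024  # assumption: 10 TB expressed as 10240 GB
CLUSTER_VCORES = 3000
TOTAL_QUEUES = 230


def equal_share(memory_gb, vcores, num_queues):
    """Split the cluster evenly across num_queues equal-weight queues."""
    return memory_gb / num_queues, vcores / num_queues


# Steady fair share counts every configured queue, active or not.
steady_mem, steady_cores = equal_share(CLUSTER_MEMORY_GB, CLUSTER_VCORES, TOTAL_QUEUES)

# Instantaneous fair share counts only the queues with running apps (A and B).
inst_mem, inst_cores = equal_share(CLUSTER_MEMORY_GB, CLUSTER_VCORES, 2)

print(f"steady:        {steady_mem:.1f} GB, {steady_cores:.1f} vcores")
print(f"instantaneous: {inst_mem:.1f} GB, {inst_cores:.1f} vcores")
```

The steady figure comes out at roughly 45 GB and 13 cores, which matches what queue B was observed to get on hadoop-2.5.1.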

On Thu, Feb 25, 2016 at 1:28 AM, Hamel Kothari <hamelkothari@gmail.com>
wrote:

> The instantaneous fair share is what Queue B should get according to the
> code (and my experience). Assuming your queues are all equal it would be
> 10TB/2.
> I can't help much more unless I can see your config files and ideally also
> the YARN Scheduler UI to get an idea of what your queues/actual resource
> usage is like. Logs from each of your Spark applications would also be
> useful. Basically the more info the better.
> On Wed, Feb 24, 2016 at 2:52 PM Prabhu Joseph <prabhujose.gates@gmail.com>
> wrote:
>> Hi Hamel,
>>     Thanks for looking into the issue. What I do not understand is
>> which share the second queue gets after preemption when the first queue
>> holds the entire cluster resource without releasing it: is it the
>> instantaneous fair share or the steady fair share?
>>      Queues A and B exist (among 230 queues in total), and the total
>> cluster resource is 10TB, 3000 cores. If a job is submitted into queue A,
>> it gets 10TB, 3000 cores and does not release any resource. If a second job
>> is then submitted into queue B, preemption will definitely happen, but what
>> share will queue B get after preemption? *Is it <10TB,3000> / 2
>> or <10TB,3000> / 230?*
>> We find that after preemption queue B gets only <10TB,3000> / 230, because
>> the first job is holding the resource. If the first job releases the
>> resource, the second queue gets <10TB,3000> / 2 based on higher priority
>> and reservation.
>> The question is how much preemption tries to reclaim from queue A if it
>> holds the entire resource without releasing it. I am not able to share the
>> actual configuration, but an answer to this question will help us.
>> Thanks,
>> Prabhu Joseph
>> On Wed, Feb 24, 2016 at 10:03 PM, Hamel Kothari <hamelkothari@gmail.com>
>> wrote:
>>> If all queues are identical, this behavior should not be happening.
>>> Preemption as designed in fair scheduler (IIRC) takes place based on the
>>> instantaneous fair share, not the steady state fair share. The fair
>>> scheduler docs
>>> <https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html>
>>> aren't super helpful on this but it does say in the Monitoring section that
>>> preemption won't take place if you're less than your instantaneous fair
>>> share (which might imply that it would occur if you were over your inst.
>>> fair share and someone had requested resources). The code for
>>> FairScheduler.resToPreempt
>>> <http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-yarn-server-resourcemanager/2.7.1/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#FairScheduler.resToPreempt%28org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue%2Clong%29>
>>> also seems to use getFairShare() rather than getSteadyFairShare() for
>>> preemption so that would imply that it is using instantaneous fair share
>>> rather than steady state.
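
The effect of that choice can be sketched in one resource dimension. This is a simplified, hypothetical model of the fair-share branch only, not the actual FairScheduler code: the function name, the single memory dimension, the demand of 8000 GB, and the timeout values are all assumptions for illustration.

```python
def res_to_preempt(fair_share, demand, usage, starved_ms, timeout_ms):
    """Amount a starved queue may reclaim, in one dimension (memory in GB).

    Hypothetical simplification: the min-share branch and the
    multi-resource comparison of the real scheduler are left out.
    """
    if starved_ms <= timeout_ms:
        return 0.0  # not starved long enough to trigger preemption
    target = min(fair_share, demand)  # never reclaim more than the queue wants
    return max(0.0, target - usage)   # only the shortfall below the target


# Queue B holds nothing and wants 8000 GB (assumed demand).
# With the instantaneous share (10 TB / 2) as the target:
print(res_to_preempt(5120.0, 8000.0, 0.0, starved_ms=60_000, timeout_ms=30_000))
# With the steady share (10 TB / 230) as the target:
print(res_to_preempt(44.5, 8000.0, 0.0, starved_ms=60_000, timeout_ms=30_000))
```

Under these assumptions the target share is the only thing that differs between the two calls, so which getter the scheduler uses decides whether queue B can reclaim half the cluster or about 45 GB.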
>>> Could you share your YARN site/fair-scheduler and Spark configurations?
>>> Could you also share the YARN Scheduler UI (specifically the top of the
>>> RM page, which shows how many resources are in use)?
>>> Since it's not likely due to steady state fair share, some other
>>> possible reasons why this might be taking place (this is not remotely
>>> conclusive but with no information this is what comes to mind):
>>> - You're not reaching
>>> yarn.scheduler.fair.preemption.cluster-utilization-threshold. Perhaps
>>> due to core/memory ratio inconsistency with the cluster.
>>> - Your second job doesn't have a sufficient level of parallelism to
>>> request more executors than what it is receiving (perhaps there are fewer
>>> than 13 tasks at any point in time) and you don't have
>>> spark.dynamicAllocation.minExecutors set?
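
The first possibility in the list above can be checked by hand. The FairScheduler documentation describes utilization as the maximum ratio of usage to capacity among all resources, with a default threshold of 0.8. A minimal sketch of that check follows; the function name and the sample usage figures are assumptions for illustration, not values from this cluster.

```python
def should_attempt_preemption(used_mb, total_mb, used_vcores, total_vcores,
                              threshold=0.8):
    """Return True when cluster utilization exceeds the preemption threshold.

    Utilization is modelled as the larger of the memory and vcore ratios.
    """
    utilization = max(used_mb / total_mb, used_vcores / total_vcores)
    return utilization > threshold


TOTAL_MB = 10 * 1024 * 1024  # assumption: 10 TB expressed in MB
TOTAL_VCORES = 3000

# Queue A holds the whole cluster: utilization is 1.0, so preemption can run.
print(should_attempt_preemption(TOTAL_MB, TOTAL_MB, TOTAL_VCORES, TOTAL_VCORES))

# Hypothetical case: 70% of memory and 50% of cores in use, below the threshold.
print(should_attempt_preemption(0.7 * TOTAL_MB, TOTAL_MB, 1500, TOTAL_VCORES))
```

Comparing the numbers at the top of the RM page against the configured threshold is a quick way to rule this cause in or out.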
>>> -Hamel
>>> On Tue, Feb 23, 2016 at 8:20 PM Prabhu Joseph <
>>> prabhujose.gates@gmail.com> wrote:
>>>> Hi All,
>>>>  We have a YARN cluster with 352 nodes (10TB, 3000 cores) running the Fair
>>>> Scheduler, with 230 queues under the root queue.
>>>>     Each queue is configured with maxResources equal to the total cluster
>>>> resource. When a Spark job is submitted into queue A, it is given
>>>> 10TB, 3000 cores according to the instantaneous fair share, and it holds
>>>> the entire resource without releasing it. Some time later, when another job
>>>> is submitted into another queue B, it gets the fair share of 45GB and 13
>>>> cores, i.e. (10TB, 3000 cores) / 230, through preemption. If more jobs are
>>>> then submitted into queue B, all the jobs in B have to share the 45GB and
>>>> 13 cores, whereas the job in queue A holds the entire cluster resource,
>>>> affecting the other jobs.
>>>>      This kind of issue often happens when a Spark job is submitted first
>>>> and holds the entire cluster resource. What is the best way to fix this
>>>> issue? Can we make preemption work on the instantaneous fair share
>>>> instead of the steady fair share, and would that help?
>>>> Note:
>>>> 1. We do not want to give extra weight to any particular queue, because
>>>> all 230 queues are critical.
>>>> 2. Changing the queues into nested queues does not solve the issue.
>>>> 3. Setting a lower maxResources on each queue would stop the first job
>>>> from taking the entire cluster resource, but configuring an optimal
>>>> maxResources for 230 queues is difficult, and the first job then cannot
>>>> use the entire cluster resource even when the cluster is idle.
>>>> 4. We do not want to handle it in the Spark ApplicationMaster, because we
>>>> would then need to check every other new YARN application type with
>>>> similar behavior. We want YARN to control this behavior by killing the
>>>> resources that the first job has held for a long period.
>>>> Thanks,
>>>> Prabhu Joseph
