spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: [YARN] Questions about YARN's queues and Spark's FAIR scheduler
Date Thu, 16 Jun 2016 15:26:08 GMT
Hi Jacek,

Your point

" it could use FIFO or FAIR task scheduling. My question is when would I
need to use FAIR?... "

Good question, and here are my two cents on it.

FAIR scheduling (in the realm of YARN as resource manager) is a method of
assigning resources to Spark jobs such that all jobs get, on average, *an
equal share of resources over time*. When there is a single job running
within a YARN cluster, that job uses the entire cluster.



Now when other Spark jobs are submitted, task slots that free up are
assigned to the new jobs, so that each job gets roughly the same amount of
core time. I think in FAIR mode YARN effectively keeps a queue of jobs and
shares slots among them (being FAIR).
This lets short jobs finish in reasonable time while not starving long
jobs. It is also a reasonable way to share a cluster between a number of
users. Finally, FAIR sharing can also work with job priorities. The
priorities are used as weights to determine the fraction of total compute
time that each job should get. I have never tried this myself.


HTH



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 16 June 2016 at 16:11, Jacek Laskowski <jacek@japila.pl> wrote:

> Hi,
>
> Thanks for your prompt answer.
>
> You said "resource scheduling is handed over to YARN" so it's only about
> vcores and memory, right? Once Spark has the resources (be it as a custom
> queue in YARN's Capacity Scheduler or default), it could use FIFO or FAIR
> task scheduling. My question is when would I need to use FAIR? Is this
> about TaskSetManagers (that represent Stages) to let more "parallel" stages
> be computed? Why would I need to go for FAIR ever?
>
>
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> On Thu, Jun 16, 2016 at 4:08 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> Hi,
>>
>> If YARN is chosen as the Spark resource scheduler then resource
>> scheduling is handed over to YARN. In YARN, the ResourceManager is a resource
>> scheduler. It optimizes for cluster resource utilization to keep all
>> resources in use all the time. It assumes the responsibility to negotiate a
>> specified container in which to start the ApplicationMaster and then
>> launches the ApplicationMaster. A Container represents a collection of
>> physical resources such as allocated memory (RAM) and CPU cores.
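>>
>> As a rough sketch (values illustrative, not from this thread), the
>> container size YARN negotiates for each executor is driven by settings
>> like these:
>>
>> import org.apache.spark.SparkConf
>>
>> val conf = new SparkConf()
>>   .set("spark.executor.memory", "2g")  // heap inside each container
>>   .set("spark.executor.cores", "2")    // vcores per container
>>   // off-heap headroom YARN adds on top of the heap (Spark 1.x property)
>>   .set("spark.yarn.executor.memoryOverhead", "384")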
>>
>> So back to your point: in YARN mode, i.e. --master yarn, if there are
>> resources available then YARN will kick off another container. You can see
>> that in the ResourceManager and NodeManager logs.
>>
>> You also mentioned
>>
>> You can also spark-submit a Spark application using FAIR scheduler
>> (the default is FIFO) using -c spark.scheduler.mode=FAIR.
>>
>> In FAIR mode, there's also a notion of queue-like (Schedulable) pools.
>> They can also control the resource shares assigned to Spark
>> jobs/applications. You could use sc.setLocalProperty to control which
>> pool to use.
>>
>> The notion of pools is nothing new. Most threaded architectures use
>> pools. However, I am not sure how many users actually go ahead and create
>> pools; in real life I don't think many bother. I am looking at this from a
>> practical point of view as opposed to what the Scala code is saying, which
>> I believe you are alluding to. I guess -c is short for --conf. Yes, you
>> can do that via the following:
>>
>> ${SPARK_HOME}/bin/spark-submit \
>>                  ................... \
>>                 --conf "spark.scheduler.mode=FAIR" \
>>
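>> As a minimal sketch of the pool side (the pool name "production" is made
>> up; pools with weights and minShare would be defined in a
>> fairscheduler.xml pointed to by spark.scheduler.allocation.file):
>>
>> // Jobs submitted from this thread go to the "production" pool.
>> sc.setLocalProperty("spark.scheduler.pool", "production")
>> sc.parallelize(1 to 1000).count()
>> // Clear the property to fall back to the default pool on this thread.
>> sc.setLocalProperty("spark.scheduler.pool", null)
>>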
>> You can even run in local mode with the FAIR scheduler. Setting these
>> parameters does not affect how the actual job runs if the resources are
>> not available; in other words the default FIFO behaviour will apply. I am
>> not convinced of the validity of some of these parameters in real life.
>>
>> Like most things, the proof of the pudding is in the eating: these
>> theoretical points have to be established through experiment. For example,
>> in local mode the default scheduler is FIFO, which seems reasonable.
>> However, I can instruct Spark to run with the FAIR scheduler although it
>> has no bearing in real life:
>>
>> ${SPARK_HOME}/bin/spark-submit \
>>                 --packages com.databricks:spark-csv_2.11:1.3.0 \
>>                 --driver-memory 2G \
>>                 --num-executors 1 \
>>                 --executor-memory 2G \
>>                 --master local \
>>                 --executor-cores 2 \
>>
>>                 --conf "spark.scheduler.mode=FAIR" \
>>                 --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
>>                 --jars /home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar \
>>                 --class "${FILE_NAME}" \
>>                 --conf "spark.ui.port=${SP}" \
>>                 --conf "spark.driver.port=54631" \
>>                 --conf "spark.fileserver.port=54731" \
>>                 --conf "spark.blockManager.port=54832" \
>>                 --conf "spark.kryoserializer.buffer.max=512" \
>>                 ${JAR_FILE}
>>
>>
>>
>> and you can see that in the GUI under the Environment tab.
>>
>> In local mode I can submit as many spark-submit jobs as I wish. The
>> constraint would be the resources within the host box. Each JVM runs
>> independently of the others.
>>
>>
>>
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 16 June 2016 at 12:37, Jacek Laskowski <jacek@japila.pl> wrote:
>>
>>> Hi,
>>>
>>> I'm trying to get my head around the different parts of Spark on YARN
>>> architecture with YARN's schedulers and queues as well as Spark's own
>>> schedulers - FAIR and FIFO.
>>>
>>> I'd appreciate if you could read how I see things and correct me where
>>> I'm wrong. Thanks!
>>>
>>> The default scheduler in YARN is Capacity Scheduler [1]. It comes with
>>> the notion of queues. When you spark-submit a Spark application with
>>> --master yarn, you can specify --queue to pick the scheduling queue, and
>>> its **only** effect is to offer the right share of CPUs and memory to the
>>> application. There could be more resources in the cluster, but that
>>> particular queue has only that exact share of vcores and memory.
>>>
>>> In other words, Spark does not know about any other resources but the
>>> ones available in the queue.
>>>
>>> Is this correct?
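>>>
>>> (For concreteness, a sketch of selecting the queue; the queue name is
>>> made up, and spark.yarn.queue is the conf equivalent of --queue:)
>>>
>>> import org.apache.spark.{SparkConf, SparkContext}
>>>
>>> val conf = new SparkConf().set("spark.yarn.queue", "analytics")
>>> val sc = new SparkContext(conf)
>>> // All of this application's containers are now negotiated within
>>> // whatever vcores and memory the "analytics" queue is entitled to.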
>>>
>>> You can also spark-submit a Spark application using FAIR scheduler
>>> (the default is FIFO) using -c spark.scheduler.mode=FAIR.
>>>
>>> In FAIR mode, there's also a notion of queue-like (Schedulable) pools.
>>> They can also control the resource shares assigned to Spark
>>> jobs/applications. You could use sc.setLocalProperty to control which
>>> pool to use.
>>>
>>> Is this correct?
>>>
>>> If both are yes, why would I want to go as far as using queues and
>>> FAIR scheduling mode with pools? What are the benefits? Is this for
>>> multi-tenant environments? Do you have any use cases that would fit
>>> better with FAIR scheduling mode? What about YARN's queues with Spark
>>> on YARN?
>>>
>>> Share as much as you can since the topic bothers me so much (and
>>> without your support I won't be able to recover from this painful
>>> mental state :))
>>>
>>> Thanks for reading so far! Appreciate any help.
>>>
>>> [1]
>>> https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> ----
>>> https://medium.com/@jaceklaskowski/
>>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>>> Follow me at https://twitter.com/jaceklaskowski
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: user-help@spark.apache.org
>>>
>>>
>>
>
