hadoop-yarn-dev mailing list archives

From Craig Ingram <cinple....@gmail.com>
Subject Launching containers asynchronously
Date Fri, 13 Oct 2017 18:30:49 GMT
I was recently doing some research into Spark on YARN's startup time and
observed slow, synchronous allocation of containers/executors. I am testing
on a 4-node bare-metal cluster with 48 cores and 128 GB of memory per node.
YARN was only allocating about 3 containers per second. Moreover, when
starting 3 Spark applications at the same time, each requesting 44
containers, the first application would receive all 44 requested containers
before the next application started getting any, and so on.

From looking at the code, it appears this is by design. There is an
undocumented configuration property that enables asynchronous allocation
of containers. I'm sure I'm missing something, but why is this not the
default? Is there a bug or race condition in this code path? In my testing
so far it has worked correctly and is significantly faster.

Here's the config:
`yarn.scheduler.capacity.schedule-asynchronously.enable`
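
For reference, here is a minimal sketch of how that property would be set in
capacity-scheduler.xml; the property name is the one from the scheduler code,
and the rest is standard Hadoop config boilerplate:

```xml
<!-- /etc/hadoop/conf/capacity-scheduler.xml -->
<property>
  <name>yarn.scheduler.capacity.schedule-asynchronously.enable</name>
  <value>true</value>
</property>
```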

Any help understanding this would be appreciated.

Thanks,
Craig


If you're curious about the performance difference with this setting, here
are the results:

The following tool was used for the benchmarks:
https://github.com/SparkTC/spark-bench

# async scheduler research
The goal of this test is to determine whether running Spark on YARN with
async scheduling of containers reduces the time required for an application
to receive all of its requested resources. The setting should also reduce
the overall runtime of short-lived applications/stages or notebook
paragraphs, and it could prove crucial to achieving good performance when
sharing resources on a cluster with dynalloc (Spark dynamic allocation)
enabled; a sketch of the dynalloc settings follows.
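
A minimal sketch of what dynalloc means here, using the standard Spark
dynamic allocation properties; the values shown are illustrative, not the
exact ones from these runs:

```properties
# spark-defaults.conf -- illustrative values, not the exact test config
spark.dynamicAllocation.enabled        true
spark.shuffle.service.enabled          true
spark.dynamicAllocation.minExecutors   0
spark.dynamicAllocation.maxExecutors   100
```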
## Test Setup
Update /etc/hadoop/conf/capacity-scheduler.xml (directly or through Ambari)
between runs, toggling:
`yarn.scheduler.capacity.schedule-asynchronously.enable=true|false`
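
A hedged note on applying the toggle: my assumption is that this flag is
only read when the scheduler initializes, so a ResourceManager restart is
the safe way to apply it; `yarn rmadmin -refreshQueues` alone may not pick
it up.

```sh
# Restart the ResourceManager so the CapacityScheduler re-reads the flag.
# Assumption: the async-scheduling flag is read at scheduler init time,
# so refreshing queues alone may not apply it.
sudo systemctl restart hadoop-yarn-resourcemanager  # service name varies by distro
```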

The spark-bench conf files request executor counts of (an illustrative
submit command follows the list):
* 2
* 20
* 50
* 100
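
For illustration, the largest request would look roughly like the
spark-submit below; spark-bench generates the actual submissions from its
conf files, and the class and jar names here are hypothetical:

```sh
# Hypothetical equivalent of the 100-executor request; the class and
# jar names are made up for illustration.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 100 \
  --executor-cores 1 \
  --executor-memory 1g \
  --class org.example.TimedSleep \
  timedsleep.jar
```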
The apps are submitted to the default queue on each cluster, which caps at
48 cores on the dynalloc cluster and 72 cores on baremetal. The default
queue was expanded for the last two baremetal tests so it could potentially
take advantage of all 144 cores.
## Test Environments
### dynalloc
4 VMs in Fyre (1 master, 3 workers)
8 CPUs / 16 GB per node
CPU model: QEMU Virtual CPU version 2.5+
### baremetal
4 baremetal instances in Fyre (1 master, 3 workers)
48 CPUs / 128 GB per node
CPU model: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

## Using spark-bench with the timedsleep workload, schedule-asynchronously.enable=false (sync)
### dynalloc
conf | avg (s) | stddev (s)
--- | --- | ---
spark-on-yarn-schedule-async0.time | 23.814900 | 1.110725
spark-on-yarn-schedule-async1.time | 29.770250 | 0.830528
spark-on-yarn-schedule-async2.time | 44.486600 | 0.593516
spark-on-yarn-schedule-async3.time | 44.337700 | 0.490139
### baremetal - 2 queues splitting cluster 72 cores each
conf | avg (s) | stddev (s)
--- | --- | ---
spark-on-yarn-schedule-async0.time | 14.827000 | 0.292290
spark-on-yarn-schedule-async1.time | 19.613150 | 0.155421
spark-on-yarn-schedule-async2.time | 30.768400 | 0.083400
spark-on-yarn-schedule-async3.time | 40.931850 | 0.092160
### baremetal - 1 queue to rule them all - 144 cores
conf | avg (s) | stddev (s)
--- | --- | ---
spark-on-yarn-schedule-async0.time | 14.833050 | 0.334061
spark-on-yarn-schedule-async1.time | 19.575000 | 0.212836
spark-on-yarn-schedule-async2.time | 30.765350 | 0.111035
spark-on-yarn-schedule-async3.time | 41.763300 | 0.182700

## Using spark-bench with the timedsleep workload, schedule-asynchronously.enable=true (async)
### dynalloc
conf | avg (s) | stddev (s)
--- | --- | ---
spark-on-yarn-schedule-async0.time | 22.575150 | 0.574296
spark-on-yarn-schedule-async1.time | 26.904150 | 1.244602
spark-on-yarn-schedule-async2.time | 44.721800 | 0.655388
spark-on-yarn-schedule-async3.time | 44.570000 | 0.514540
#### 2nd run
conf | avg (s) | stddev (s)
--- | --- | ---
spark-on-yarn-schedule-async0.time | 22.441200 | 0.715875
spark-on-yarn-schedule-async1.time | 26.683400 | 0.583762
spark-on-yarn-schedule-async2.time | 44.227250 | 0.512568
spark-on-yarn-schedule-async3.time | 44.238750 | 0.329712
### baremetal - 2 queues splitting cluster 72 cores each
conf | avg (s) | stddev (s)
--- | --- | ---
spark-on-yarn-schedule-async0.time | 12.902350 | 0.125505
spark-on-yarn-schedule-async1.time | 13.830600 | 0.169598
spark-on-yarn-schedule-async2.time | 16.738050 | 0.265091
spark-on-yarn-schedule-async3.time | 40.654500 | 0.111417
### baremetal - 1 queue to rule them all - 144 cores
conf | avg (s) | stddev (s)
--- | --- | ---
spark-on-yarn-schedule-async0.time | 12.987150 | 0.118169
spark-on-yarn-schedule-async1.time | 13.837150 | 0.145871
spark-on-yarn-schedule-async2.time | 16.816300 | 0.253437
spark-on-yarn-schedule-async3.time | 23.113450 | 0.320744
