drill-dev mailing list archives

From Paul Rogers <prog...@mapr.com>
Subject Re: Discuss about Drill's schedule policy
Date Sun, 20 Aug 2017 20:06:37 GMT
Hi Weijie,

Great analysis. Let’s look at a few more data points.

Drill has no central scheduler (this is a feature: it makes the cluster much easier to manage
and has no single point of failure. It was probably the easiest possible solution while Drill
was being built.) Instead of central control, Drill is based on the assumption of symmetry:
all Drillbits are identical. So, each Foreman, acting independently, should try to schedule
its load in a way that evenly distributes work across nodes in the cluster. If all Drillbits
do the same, then load should be balanced; there should be no “hot spots.”
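As a concrete illustration, the symmetry assumption reduces each Foreman's job to simple round-robin placement. The sketch below is illustrative Java, not Drill's actual `SimpleParallelizer` code; the class and method names are invented:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of symmetric round-robin assignment: each Foreman,
// knowing only the static list of Drillbits, spreads minor fragments
// evenly. Not Drill's real API; names are hypothetical.
public class RoundRobinAssigner {
    private final List<String> drillbits;
    private int next = 0;

    public RoundRobinAssigner(List<String> drillbits) {
        this.drillbits = drillbits;
    }

    // Assign each of n minor fragments to the next Drillbit in turn.
    public List<String> assignFragments(int n) {
        List<String> assignment = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            assignment.add(drillbits.get(next));
            next = (next + 1) % drillbits.size();
        }
        return assignment;
    }
}
```

If every Foreman does this independently, and the work units are similar, the aggregate load comes out roughly even without any coordination.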

Note, for this to work, Drill should either own the cluster, or any other workload on the
cluster should also be evenly distributed.

Drill makes another simplification: that the cluster has infinite resources (or, equivalently,
that the admin sized the cluster for peak load.) That is, as Sudheesh puts it, “Drill is
optimistic.” Therefore, Drill usually runs with no throttling mechanism to limit overall
cluster load. In real clusters, of course, resources are limited and either a large query
load, or a few large queries, can saturate some or all of the available resources.

Drill has a feature, seldom used, to throttle queries based purely on number. These ZK-based
queues can allow, say, 5 queries to run (each of which is assumed to be evenly distributed.)
In actual fact, the ZK-based queues recognize that typical workloads have many small queries
and a few large ones, and so Drill offers the “small query” and “large query” queues.

OK, so that’s where we are today. I think I’m not stepping too far out of line to observe
that the above model is just a bit naive. It does not take into consideration the available
cores, memory or disk I/Os. It does not consider the fact that memory has a hard upper limit
and must be managed. Drill creates one thread for each minor fragment, with the number of
minor fragments per major fragment limited by the number of cores. But each query can contain
dozens or more major fragments, resulting in far, far more threads per query than a node has
cores. That is, Drill’s current scheduling model does
not consider that, above a certain level, adding more threads makes the system slower because
of thrashing.
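To put rough numbers on the thread explosion: per-node parallelism per major fragment defaults to about 70% of the core count (the exact factor is a configurable planner option), so a query with ten major fragments can spawn on the order of 160 threads on a 24-core node. A back-of-the-envelope sketch, with assumed numbers:

```java
// Back-of-the-envelope illustration of per-node thread count. The 70%
// width factor reflects Drill's default planner setting as I understand
// it; the core and fragment counts are assumptions for the example.
public class ThreadCount {
    public static void main(String[] args) {
        int cores = 24;
        int widthPerMajorFragment = (int) (cores * 0.7); // 16 minor fragments
        int majorFragments = 10;                         // a moderately complex query
        int threadsOnNode = widthPerMajorFragment * majorFragments;
        System.out.println(threadsOnNode + " threads on a " + cores + "-core node");
        // prints: 160 threads on a 24-core node, roughly 7x oversubscription
    }
}
```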

You propose a closed-loop, reactive control system (schedule load based on observed load on
each Drillbit.) However, control-system theory tells us that such a system is subject to oscillation.
All Foremen observe that a node X is loaded so none send it work. Node X later finishes its
work and becomes underloaded. All Foremen now prefer node X and it swings back to being overloaded.
In fact, Impala tried a closed-loop, reactive design and there is some evidence in their documentation
that they hit these very problems.
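The oscillation is easy to reproduce in a toy model: if every Foreman reads the same load snapshot and independently picks the least-loaded node, they all pick the same one. The simulation below is purely illustrative and assumes 100 Foremen each submitting one unit of work:

```java
import java.util.Arrays;

// Toy simulation (not Drill code) of the oscillation described above:
// many Foremen react to the same stale load snapshot, all pick the
// least-loaded node, and together swing it from coolest to hottest.
public class OscillationDemo {
    public static void main(String[] args) {
        int[] load = {80, 80, 10};   // node 2 looks nearly idle to everyone
        int foremen = 100;           // assumed number of concurrent Foremen

        // Every Foreman independently picks the least-loaded node
        // from the same snapshot...
        int target = 0;
        for (int i = 1; i < load.length; i++) {
            if (load[i] < load[target]) target = i;
        }

        // ...and all of them send it one unit of work at once.
        load[target] += foremen;
        System.out.println(Arrays.toString(load)); // prints [80, 80, 110]
    }
}
```

On the next snapshot node 2 is the hottest, so everyone avoids it, it drains, and the cycle repeats.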

So, what else could we do? As we’ve wrestled with these issues, we’ve come to the understanding
that we need an open-loop, predictive solution. That is a fancy name for what YARN or Mesos
does: keep track of available resources, reserve them for a task, and monitor the task so
that it stays within the resource allocation. Predict load via allocation rather than reacting
to actual load.
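An open-loop admission controller can be sketched in a few lines: reserve predicted resources at submit time, queue the query when the reservation does not fit, and release on completion. The names here are hypothetical, a sketch of the idea rather than a proposed Drill API:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative sketch of open-loop, predictive scheduling: admit work
// based on reservations against known capacity, not on observed load.
// All names are invented for this example.
public class ReservationScheduler {
    private final long totalMemory;
    private long reserved = 0;
    private final Queue<Long> waiting = new ArrayDeque<>();

    public ReservationScheduler(long totalMemory) {
        this.totalMemory = totalMemory;
    }

    // Admit the query if its predicted memory fits; otherwise queue it.
    public synchronized boolean submit(long predictedMemory) {
        if (reserved + predictedMemory <= totalMemory) {
            reserved += predictedMemory;
            return true;               // query may run now
        }
        waiting.add(predictedMemory);  // must wait for resources
        return false;
    }

    // When a query finishes, release its reservation and admit waiters.
    public synchronized void release(long memory) {
        reserved -= memory;
        while (!waiting.isEmpty() && reserved + waiting.peek() <= totalMemory) {
            reserved += waiting.poll();
        }
    }
}
```

The key property is that the scheduler never over-commits: load is bounded by construction, so there is nothing to oscillate around.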

In Drill, that might mean a scheduler which looks at all incoming queries and assigns cluster
resources to each; queueing the query if necessary until resources become available. It also
means that queries must live within their resource allocation. (The planner can help by predicting
the likely needed resources. Then, at run time, spill-to-disk and other mechanisms allow queries
to honor the resource limits.)

The scheduler-based design is nothing new: it seems to be what Impala settled on, it is what
YARN does for batch jobs, and it is a common pattern in other query engines.

Back to the RPC issue. With proper scheduling, we limit load on each Drillbit so that RPC
(and ZK heartbeats) can operate correctly. That is, rather than overloading a node, then attempting
to recover, we wish instead to manage the load to prevent the overload in the first place.

A coming pull request will take a first, small, step: it will allocate memory to queries based
on the limit set by the ZK-based queues. The next step is to figure out how to limit the number
of threads per query. (As noted above, a single large query can overwhelm the cluster if,
say, it tries to do 100 subqueries with many sorts, joins, etc.) We welcome suggestions and
pointers to how others have solved the problem.
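For the memory step, the arithmetic is straightforward: divide the node’s direct memory, less a safety margin, by the queue’s concurrency limit. The numbers and the 70% safety fraction below are assumptions for illustration only:

```java
// Illustrative arithmetic for per-query memory allocation driven by the
// ZK queue limit. The 32 GB figure, the 0.7 safety fraction, and the
// queue limit of 5 are all assumed values, not Drill defaults.
public class QueueMemoryBudget {
    public static void main(String[] args) {
        long directMemoryBytes = 32L * 1024 * 1024 * 1024; // 32 GB direct memory
        double safetyFraction = 0.7;  // leave headroom for non-query uses
        int queueLimit = 5;           // ZK queue admits at most 5 queries

        long perQuery = (long) (directMemoryBytes * safetyFraction) / queueLimit;
        System.out.println(perQuery / (1024 * 1024) + " MB per query");
    }
}
```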

We also keep tossing around the idea of introducing that central scheduler. But, that is quite
a bit of work, and we’ve heard that users seem to struggle with maintaining the YARN
and Impala schedulers, so we’re somewhat hesitant to move away from a purely symmetrical
configuration. Suggestions in this area are very welcome.

For now, try turning on the ZK queues to limit concurrent queries and prevent overload. Ensure
your cluster is sized for your workload. Ensure other work on the cluster is also symmetrical
and does not compete with Drill for resources.
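For reference, a sketch of the relevant system options; please verify the names against `sys.options` on your version before applying:

```sql
-- Enable the ZK-based queues and cap concurrent queries.
ALTER SYSTEM SET `exec.queue.enabled` = true;
ALTER SYSTEM SET `exec.queue.small` = 10;           -- concurrent "small" queries
ALTER SYSTEM SET `exec.queue.large` = 2;            -- concurrent "large" queries
ALTER SYSTEM SET `exec.queue.threshold` = 30000000; -- cost cutoff: small vs. large
ALTER SYSTEM SET `exec.queue.timeout_millis` = 300000; -- max wait before failing
```

The threshold and timeout values above are examples; tune them to your workload.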

And, please continue to share your experiences!


- Paul

> On Aug 20, 2017, at 5:39 AM, weijie tong <tongweijie178@gmail.com> wrote:
> Hi all:
>  Drill's current schedule policy seems a little simple. The
> SimpleParallelizer assigns endpoints in round robin model which ignores the
> system's load and other factors. In critical scenarios, some drillbits are
> suffering frequent full GCs, which block their control RPC.
> Current assignment will not exclude these drillbits from the next coming
> queries' assignment, so the problem will get worse.
>  I propose to add a zk path to hold bad drillbits. The Foreman will recognize
> bad drillbits by waiting timeout (timeout of control response from
> intermediate fragments), then update the bad drillbits path. Next coming
> queries will exclude these drillbits from the assignment list.
>  What do you think about it, or any suggestions? If it sounds OK, I will file a
> JIRA and contribute.
