drill-issues mailing list archives

From "weijie.tong (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (DRILL-5975) Resource utilization
Date Sun, 19 Nov 2017 11:17:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16258448#comment-16258448
] 

weijie.tong edited comment on DRILL-5975 at 11/19/17 11:16 AM:
---------------------------------------------------------------

Yes, we have already used the queue option for a long time. It is good at preventing the cluster from being overloaded, but too coarse-grained to serve as a scheduler. I have noticed [DRILL-5716|https://issues.apache.org/jira/browse/DRILL-5716]. It is a good design for memory allocation, but I think it will do little to prevent a Drillbit from being assigned too much work under the current architecture.

We need a scheduler to do the fragment-level scheduling work; call it the first-level schedule. The YARN-like model schedules at the Drillbit node level; call it the second-level schedule. The first-level schedule can work on top of the second-level schedule, for example when we deploy Drill on YARN.

I propose this design after investigating the Flink and Spark projects. I prefer Flink's [design|https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks]. I will look into Presto, as you advise. I have also discussed this with [~jni] privately; we share common opinions on some points.

I think it is necessary to introduce a RecordBatchManager role to break the current cascaded RPC dependence between two MajorFragments at the data exchange stage. If memory is sufficient, or the upper MajorFragment's computation is fast enough, the written RecordBatches will be pushed to the consumers quickly with no need to go to disk, giving the same performance as the current implementation. It will also make MinorFragment-level scheduling possible.
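To make the memory-first, disk-second buffering concrete, here is a minimal sketch of how such a RecordBatchManager buffer could behave. All names are illustrative (this is not Drill's actual API), the "disk" is simulated by a second in-memory list, and batches are plain byte arrays standing in for RecordBatches:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: batches stay in memory while the consumer keeps up,
// and spill to "disk" (simulated by a list) once the memory budget is exceeded.
public class RecordBatchBufferSketch {
    private final long memoryBudgetBytes;
    private long memoryUsedBytes = 0;
    private final Deque<byte[]> inMemory = new ArrayDeque<>();
    private final List<byte[]> spilled = new ArrayList<>(); // stand-in for disk files

    public RecordBatchBufferSketch(long memoryBudgetBytes) {
        this.memoryBudgetBytes = memoryBudgetBytes;
    }

    /** Called by a local MinorFragment sender for each outgoing batch. */
    public void add(byte[] batch) {
        // Once anything has spilled, keep spilling so FIFO order is preserved.
        if (spilled.isEmpty() && memoryUsedBytes + batch.length <= memoryBudgetBytes) {
            inMemory.addLast(batch);
            memoryUsedBytes += batch.length;
        } else {
            spilled.add(batch); // slow path: the consumer fell behind
        }
    }

    /** Called when the downstream consumer is ready; returns null when drained. */
    public byte[] poll() {
        byte[] batch = inMemory.pollFirst();
        if (batch != null) {
            memoryUsedBytes -= batch.length;
            return batch;
        }
        return spilled.isEmpty() ? null : spilled.remove(0);
    }

    public int spilledCount() {
        return spilled.size();
    }
}
```

The fast path never touches the spill list, which is what preserves the current pipelined performance when memory is plentiful or the consumer is fast.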

Welcome to discuss.




> Resource utilization
> --------------------
>
>                 Key: DRILL-5975
>                 URL: https://issues.apache.org/jira/browse/DRILL-5975
>             Project: Apache Drill
>          Issue Type: New Feature
>    Affects Versions: 2.0.0
>            Reporter: weijie.tong
>            Assignee: weijie.tong
>
> h1. Motivation
> Now the resource utilization ratio of Drill's cluster is not good; most of the cluster resource is wasted. We cannot afford many concurrent queries. Once the system accepts more queries, even at a moderate CPU load, a query that was originally very quick becomes slower and slower.
> The reason is that Drill does not supply a scheduler. It just assumes all nodes have enough computation resource. Once a query comes, it schedules the related fragments to random nodes without caring about each node's load, so some nodes suffer extra CPU context switches to satisfy the incoming query. The deeper cause is that the runtime minor fragments construct a runtime tree whose nodes are spread across different Drillbits. The runtime tree is a memory pipeline: all the nodes stay alive for the whole lifecycle of a query, sending data to upper nodes successively, even though some nodes could run quickly and quit immediately. What's more, the runtime tree is constructed before actual execution, so the scheduling unit in Drill becomes the whole set of runtime-tree nodes.
> h1. Design
> It will be hard to schedule the runtime-tree nodes as a whole, so I try to solve this by breaking the cascade of runtime nodes. The graph below describes the initial design. !https://raw.githubusercontent.com/wiki/weijietong/drill/images/design.png! [graph link|https://raw.githubusercontent.com/wiki/weijietong/drill/images/design.png]
> Every Drillbit instance will have a RecordBatchManager, which accepts all the RecordBatches written by the senders of the local MinorFragments. The RecordBatchManager holds the RecordBatches in memory first, then in disk storage. Once the first RecordBatch of a query's MinorFragment sender arrives, it notifies the FragmentScheduler. The FragmentScheduler is instantiated by the Foreman and holds the whole PlanFragment execution graph. It allocates a new corresponding FragmentExecutor to consume the generated RecordBatches. The allocated FragmentExecutor then notifies the corresponding FragmentManager that it is ready to receive the data. The FragmentManager then sends out the RecordBatches one by one to the corresponding FragmentExecutor's receiver, as the current Sender does, throttling the data stream.
> What we can gain from this design is:
> a. A computation leaf node does not need to wait on its consumer's speed before ending its life and releasing its resources.
> b. The data-sending logic is isolated from the computation nodes and shared by the different FragmentManagers.
> c. We can schedule the MajorFragments according to each Drillbit's actual resource capacity at runtime.
> d. Drill's pipelined data-processing characteristic is retained.
> h1. Plan
> This will be a large PR, so I plan to divide it into some small ones:
> a. implement the RecordBatchManager.
> b. implement a simple random FragmentScheduler and the whole event flow.
> c. implement a primitive FragmentScheduler, which may reference the Sparrow project.
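The scheduling handshake described in the Design section (first batch notifies the scheduler, the scheduler allocates an executor, the executor tells the manager it is ready, and the manager streams buffered batches) can be sketched as a tiny synchronous simulation. All class and method names here are illustrative placeholders, not Drill's real classes; the throttled RPC stream is simplified to a direct method call and an event log:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of the proposed event flow between FragmentManager,
// FragmentScheduler, and FragmentExecutor; names and wiring are illustrative.
public class ScheduleFlowSketch {
    static final StringBuilder log = new StringBuilder();

    /** Foreman-side scheduler; here it "schedules" trivially and starts an executor. */
    static class FragmentScheduler {
        void onFirstBatch(String fragmentId, FragmentManager manager) {
            log.append("schedule:").append(fragmentId).append(";");
            new FragmentExecutor().notifyReady(manager); // executor says "I am ready"
        }
    }

    static class FragmentExecutor {
        void notifyReady(FragmentManager manager) {
            manager.startSending(this);
        }
        void receive(String batch) {
            log.append("recv:").append(batch).append(";");
        }
    }

    /** Buffers batches; the first one triggers scheduling, later ones stream through. */
    static class FragmentManager {
        final Queue<String> buffered = new ArrayDeque<>();
        final FragmentScheduler scheduler;
        final String fragmentId;
        FragmentExecutor consumer;
        boolean notified = false;

        FragmentManager(String fragmentId, FragmentScheduler scheduler) {
            this.fragmentId = fragmentId;
            this.scheduler = scheduler;
        }

        void addBatch(String batch) {
            buffered.add(batch);
            if (!notified) {
                notified = true;
                scheduler.onFirstBatch(fragmentId, this); // first batch triggers scheduling
            }
            drain();
        }

        void startSending(FragmentExecutor executor) {
            this.consumer = executor;
            drain();
        }

        private void drain() { // throttled send loop, simplified to synchronous calls
            while (consumer != null && !buffered.isEmpty()) {
                consumer.receive(buffered.poll());
            }
        }
    }

    /** Runs the handshake once for one fragment and returns the event log. */
    static String run() {
        log.setLength(0);
        FragmentManager manager = new FragmentManager("mf-1", new FragmentScheduler());
        manager.addBatch("b1");
        manager.addBatch("b2");
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The point of the sketch is the ordering: no executor exists until the first batch appears, which is what lets the scheduler place work according to runtime load instead of committing the whole runtime tree up front.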



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
