drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5716) Queue-based memory assignment for buffering operators
Date Wed, 30 Aug 2017 22:27:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148149#comment-16148149 ]

ASF GitHub Bot commented on DRILL-5716:
---------------------------------------

GitHub user paul-rogers opened a pull request:

    https://github.com/apache/drill/pull/928

    Drill 5716: Queue-driven memory allocation

    Please see [DRILL-5716](https://issues.apache.org/jira/browse/DRILL-5716) for an overview.
    
    This PR provides five commits that implement queue-driven memory allocation built on the
ZooKeeper (ZK) based queues. The key idea is that the system has a fixed amount of memory.
If all queries needed the same amount of memory and we wanted to run n queries concurrently,
each would get the total memory divided by n.
    
    This raises two questions:
    
    1. Queries don't need the same amount of memory, so how do we handle this?
    2. How do we know how many queries, n, are to run within Drill?
    
    The solution is to leverage an existing feature: the Zookeeper (ZK) based queueing mechanism.
This mechanism limits (throttles) the number of active queries. If we set the limit to n,
then we know that Drill won't run more than n queries, and we can apply the above math.
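
    For example (illustrative numbers): with 96 GB of direct memory and the queue limit set
to n = 8, each admitted query would be allotted 96 / 8 = 12 GB.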
    
    But not all queries are the same size. The ZK-based throttling mechanism addresses this
by allowing *two* queues: a "large query" queue and a "small query" queue. The idea is that
the cluster might run a single large ETL query while, at the same time, running five or ten
small interactive queries. The numbers are configurable via system options. The existing mechanism
uses query cost to decide whether a query is "small" or "large": the user sets a threshold in
terms of cost, and queries split into the two queues accordingly.
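
    A minimal sketch of this cost-based split (the class and names here are illustrative,
not the PR's actual code):

```java
// Hypothetical sketch of the cost-based queue split; names are illustrative.
public class QueueSelector {
  private final double costThreshold; // exec.queue.threshold

  public QueueSelector(double costThreshold) {
    this.costThreshold = costThreshold;
  }

  /** Route a planned query to the small or large queue based on planner cost. */
  public String selectQueue(double queryCost) {
    return queryCost < costThreshold ? "small" : "large";
  }
}
```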
    
    ### Resource Manager Layer
    
    The first commit introduces a new resource management layer with a number of concepts:
    
    * A pluggable, global resource manager configured at boot time. The code offers three
versions: the null manager (no throttling), the ZK-based one discussed above, and a test-only
version, used in unit tests, that relies on a simple in-process queue.
    * A per-query resource allocator, created by the resource manager to work with a single
query through the query lifecycle.
    
    Since the default manager does no queueing, queues themselves are internal to the resource
manager. The model assumes:
    
    * A query queue that manages the Drillbit's view into the distributed queue state.
    * A queue "lease" that represents a grant of a single slot within a queue to run a query.
    
    The design handles the current resource managers, and allows creating custom schedulers
as needed for specific distributions or applications.
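
    A simplified Java sketch of these abstractions; the actual interfaces in the PR may
differ in names and signatures:

```java
// Hypothetical shapes of the resource-management abstractions described
// above; the PR's actual interface names and signatures may differ.
interface ResourceManager {
  /** Create a per-query allocator that follows one query's lifecycle. */
  QueryResourceAllocator newQueryAllocator();
}

interface QueryResourceAllocator {
  /** Block until a queue slot is granted (a no-op for the null manager). */
  QueueLease admit() throws InterruptedException;

  /** Plan memory for the query's buffering operators once a slot is held. */
  void planMemory(QueueLease lease);
}

interface QueueLease {
  /** Memory granted to the query by this queue slot. */
  long queryMemory();

  /** Return the slot to the queue when the query completes. */
  void release();
}
```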
    
    Prior to this PR, the Foreman worked with the ZK-based queues directly, inline in its own
code. With the above framework in place, this PR refactors the Foreman to extract the ZK-based
queue code into a resource manager implementation. Functionality is identical; only the location
of the code changes, allowing a cleaner design.
    
    One interesting design issue is worth pointing out. The user can enable and disable ZK-queues
at run time. However, the resource manager is global. To make this work, a “dynamic” resource
manager wraps the ZK-based implementation. The dynamic version periodically checks system
options to see if the ZK-based version is enabled. The dynamic version calls to either the
ZK-based version or the default (non-throttled) version depending on what it finds in the
system options.
    
    When a switch occurs, queries already queued will stay queued. Only new queries use the
new setting. This provides a graceful transition to/from throttled mode.
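
    The dynamic wrapper can be sketched as a simple delegation pattern, continuing the
hypothetical interfaces above (SystemOptions is a stand-in for Drill's option accessor, not
the PR's actual API):

```java
// Hypothetical sketch of the "dynamic" wrapper: on each new query it consults
// the system option and delegates to the throttled or unthrottled manager.
interface SystemOptions {
  boolean getBoolean(String name);
}

class DynamicResourceManager implements ResourceManager {
  private final ResourceManager zkManager;      // ZK-based, throttled
  private final ResourceManager defaultManager; // no throttling
  private final SystemOptions options;

  DynamicResourceManager(ResourceManager zkManager,
                         ResourceManager defaultManager,
                         SystemOptions options) {
    this.zkManager = zkManager;
    this.defaultManager = defaultManager;
    this.options = options;
  }

  @Override
  public QueryResourceAllocator newQueryAllocator() {
    // Queries already queued stay with their original manager; only new
    // queries see the current setting.
    ResourceManager active =
        options.getBoolean("exec.queue.enable") ? zkManager : defaultManager;
    return active.newQueryAllocator();
  }
}
```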
    
    ### Specifics of Memory Planning
    
    The core of this change is a new memory planner. Here's how it works.
    
    * The user enables queues using the existing `exec.queue.enable` session option.
    * The user decides the number of "small" vs. "large" queries to run concurrently by setting
`exec.queue.small` and `exec.queue.large` respectively. (The original defaults for these settings
were far too high; this PR changes them to much lower values.)
    * The user decides how to split memory across the two queues by specifying the `exec.queue.memory_ratio`
value. The default is 10, which means that large queries get 10 units of memory while small
queries get 1 unit.
    * The user decides on the planner cost threshold that separates "small" from "large" queries
using `exec.queue.threshold`.
    * The user decides on the maximum queue wait time by setting `exec.queue.timeout_millis`.
(The default is five minutes.)
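
    For example, these options can be set with standard `ALTER SYSTEM` statements; here is a
JDBC sketch (the connection URL and the specific values are illustrative):

```java
// Illustrative example of setting the queue options over JDBC; the ZK
// address and all values are placeholders to adjust for a real cluster.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ConfigureQueues {
  public static void main(String[] args) throws Exception {
    try (Connection conn =
             DriverManager.getConnection("jdbc:drill:zk=localhost:2181");
         Statement stmt = conn.createStatement()) {
      stmt.execute("ALTER SYSTEM SET `exec.queue.enable` = true");
      stmt.execute("ALTER SYSTEM SET `exec.queue.small` = 10");
      stmt.execute("ALTER SYSTEM SET `exec.queue.large` = 2");
      stmt.execute("ALTER SYSTEM SET `exec.queue.memory_ratio` = 10");
      stmt.execute("ALTER SYSTEM SET `exec.queue.threshold` = 30000000");
      stmt.execute("ALTER SYSTEM SET `exec.queue.timeout_millis` = 300000");
    }
  }
}
```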
    
    The memory planner then does the following:
    
    * Determine the number of “memory units” as (the number of concurrent small queries)
+ (the number of concurrent large queries × the memory ratio).
    * Suppose this is a small query. Compute actual memory as system memory / number of memory
units.
    * Traverse the plan. Find the buffering operators grouped by node. (Buffering operators,
in this release, are sort and hash agg.)
    * Suppose we find that, on node X, we have two sorts and a hash agg. This is three operators
on that node.
    * Determine the number of minor fragments running on that node. Suppose there are 10.
    * Divide the per-node memory across the operators on each node and each minor fragment.
Here, we divide our total memory by 3 * 10 = 30 to get the per-operator memory amount.
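
    Putting the example numbers together (a sketch with illustrative values, not the PR's
actual planner code):

```java
// Worked example of the memory math above, using illustrative values.
public class MemoryMathExample {
  public static void main(String[] args) {
    double systemMemoryGb = 90; // total direct memory available for queries
    int smallSlots = 10;        // exec.queue.small
    int largeSlots = 2;         // exec.queue.large
    int memoryRatio = 10;       // exec.queue.memory_ratio

    // Memory units: small queries count 1 unit, large queries count `ratio` units.
    int memoryUnits = smallSlots + largeSlots * memoryRatio; // 10 + 20 = 30

    // A small query gets one unit of memory.
    double queryMemoryGb = systemMemoryGb / memoryUnits; // 3 GB

    // On node X: 2 sorts + 1 hash agg = 3 buffering operators,
    // and 10 minor fragments run on that node.
    int bufferingOps = 3;
    int minorFragments = 10;
    double perOperatorGb = queryMemoryGb / (bufferingOps * minorFragments);
    System.out.printf("Per-operator memory: %.3f GB%n", perOperatorGb); // 0.100 GB
  }
}
```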
    
    The above is more complex than the existing memory allocation utilities. The extra behavior
handles a special case seen in several production environments: a query that runs only in the
root fragment on a single node, so that there is only a single minor fragment. The existing
code divides memory by the possible number of fragments even when the actual number is 1; the
new code handles this case gracefully and gives all memory to the single fragment.
    
    One refinement that is still needed is to specify a “reserve” (of, say, 10-20%) to
account for operators that decline to limit their memory usage.
    
    ### Planner and Foreman Changes
    
    Memory planning now depends on the resource manager. The default manager assigns memory
using the existing `MemoryAllocationUtilities`. The ZK-based version uses the math outlined
above.
    
    As a result, the third commit reworks the parallelizers to shift memory planning into
a separate stage managed by the resource manager. This, in turn, means deferring creation
of the JSON form of the query plan until after memory allocation is done.
    
    We assume that some future resource manager may learn of available memory only after a
queue lease is granted. So we defer memory planning until after a queue slot is available,
but before launching fragments.
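
    In outline, the resulting per-query flow (again using the hypothetical interfaces from
the sketches above, not the PR's actual Foreman code):

```java
// Hypothetical outline of the reworked ordering: admit to the queue first,
// plan memory once the grant is known, then serialize and launch.
class ForemanSketch {
  private final ResourceManager rm;

  ForemanSketch(ResourceManager rm) {
    this.rm = rm;
  }

  void runQuery() throws InterruptedException {
    QueryResourceAllocator allocator = rm.newQueryAllocator();
    QueueLease lease = allocator.admit();   // 1. wait for a queue slot
    try {
      allocator.planMemory(lease);          // 2. assign memory from the grant
      String planJson = serializePlan();    // 3. JSON reflects final memory
      launchFragments(planJson);            // 4. start execution
    } finally {
      lease.release();
    }
  }

  private String serializePlan() { return "{}"; }   // stand-in
  private void launchFragments(String planJson) { } // stand-in
}
```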
    
    ### Web UI Enhancements
    
    Early testing revealed that a major limitation of the existing ZK-based queues is that
they are opaque: the user can't tell what is happening. To address this, two UI changes were
made:
    
    * If ZK-based queues are enabled, information about them now appears on the main Web UI
page for the Drillbit.
    * The main query profile page shows, for each query, the query cost (used to select the
"small" vs. "large" queue) and the name of the selected queue: "small" or "large."
    
    In addition, system/session options are now sorted by name to make it easier to find the
queue-related options.
    
    The above required changes to the query profile definition Protobuf (in commit 2) as well
as changes in the various "models" and "templates" used by the Web UI. Note that the new information
also automatically appears in the corresponding JSON REST API calls.
    
    ### Code Cleanup
    
    Commit 5 has a large number of code cleanup items found while doing the above work. These
do not affect functionality.
    
    ### Limitations and Future Improvements
    
    The change in this PR is far from perfect, but it is the best we can do given the limitations
of the existing ZK-based queues. In particular:
    
    * There is no way to determine the total queue length: the number of queries waiting in
the queue.
    * There is no good way to cancel a query stuck in the queue.
    * JDBC/ODBC clients won’t know that their queries are queued, or be able to cancel queued
queries.
    * The two-queue model is highly simplistic; real applications need a variety of queues
and the means to categorize queries into queues.
    * The ZK-queues allow no prioritization.
    * The queueing mechanism does not limit CPU use, only memory use.
    * The planner cost number used to split queries into queues is not very accurate or intuitive.
    * No provision is made for the fact that some queries need little memory while others need
a huge amount; instead, all queries are simply “small” or “large.”
    
    All that said, the present PR is a (small) step in the right direction and lays the groundwork
for additional, future efforts.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/paul-rogers/drill DRILL-5716

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/928.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #928
    
----
commit db69cbea57d1460a25463d09784ff850b2030f5f
Author: Paul Rogers <progers@maprtech.com>
Date:   2017-08-30T21:32:17Z

    DRILL-5716: Queue-driven memory allocation
    
    Commit 1: Core resource management and query queue abstractions.

commit 5dd7e83db122867368b23f0a878c540e212e3a5c
Author: Paul Rogers <progers@maprtech.com>
Date:   2017-08-30T21:33:07Z

    Commit 2: RPC changes
    
    Adds queue information to the Protobuf layer.

commit ba43570023d18b3c76256c0cc1727e16dbcd5d1f
Author: Paul Rogers <progers@maprtech.com>
Date:   2017-08-30T21:36:23Z

    Commit 3: Foreman and Planner changes
    
    Abstracts memory management out to the new resource management layer.
    This means deferring generating the physical plan JSON to later in the
    process after memory planning.

commit d924527ebdeda34ef9bd575e30b77e753654d75b
Author: Paul Rogers <progers@maprtech.com>
Date:   2017-08-30T21:38:17Z

    Commit 4: Web UI changes
    
    Adds queue information to the main Web UI page and to the per-query
    profile page.

commit edacd1f4623aab8acf19d6e9ebfb9d2627241ab3
Author: Paul Rogers <progers@maprtech.com>
Date:   2017-08-30T21:38:48Z

    Commit 5: Other changes
    
    Code cleanup. Also sorts the list of options displayed in the Web UI.

----


> Queue-based memory assignment for buffering operators
> -----------------------------------------------------
>
>                 Key: DRILL-5716
>                 URL: https://issues.apache.org/jira/browse/DRILL-5716
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.12.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>
> Apache Drill already has a queueing feature based on ZK semaphores. We did a bit of testing
to show that the feature does, in fact, work. We propose to enhance the feature with some
light revisions to make it work with the "managed" external sort and the newly-added spilling
feature for the hash agg operator. The key requirement is to build on what we have for now;
we may want to tackle a larger project to create a more complete solution later.
> Existing functionality:
> * Two ZK-based queues called the “small” and “large” query queues.
> * A threshold, call it T, given as a query cost, to determine the queue into which a
query will go.
> * Admit levels for the two queues: call them Qs and Ql.
> Basically, when a query comes in:
> * Plan the query as usual.
> * Obtain the final query cost from the planner, call this C.
> * If C < T, the query goes into the small queue; otherwise it goes into the large queue.
> * Suppose the query lands in the small queue. Drill asks ZK if the query can run.
> * ZK checks whether Qs queries are already running. If so, the query waits; otherwise it
runs.
> The proposed changes include:
> * Refactor the code to provide a queueing API that supports a variety of queuing mechanisms.
> * Provide three: the null queue (default), an in-process queue (for testing) and the
ZK queues.
> * Modify the query profile web UI to show two new bits of information about queues:
> - The queue to which the query was sent.
> - The total planning cost.
> * Modify the query profile web UI to show two memory assignment numbers:
> - Total memory allocated to the query
> - Memory per sort or hash-agg operator
> Then, add to the queue mechanism the ability to do memory assignment:
> * Provide a weight, W: every small query gets 1 unit, every large query gets W units.
> * Use the queue admit levels to determine total units: U = Qs + W * Ql.
> * Obtain total direct memory from the system; call this M.
> * Subtract a reserve R (a percentage of M) for overhead.
> * Do the math to get the per-query memory:
> * For the small queue: (M - R) / U
> * For the large queue: (M - R) / U * W
> * Use this memory amount as the “memory per query” number in the existing sort/hash-agg
memory assignment (instead of the fixed 2 GB).
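> For example (illustrative numbers): with M = 100 GB, a 10% reserve (so M - R = 90 GB),
Qs = 10, Ql = 2, and W = 10: U = 10 + 10 * 2 = 30, so each small query gets 90 / 30 = 3 GB
and each large query gets 3 * 10 = 30 GB.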
> The result will be a nice incremental addition to what we already have, and should make
it a bit easier for people to actually use the feature (because they can see the planning numbers
and the queues used, allowing them to tune the system effectively).
> The API used for the above features also allows third parties to add a more robust
admission control feature as needed, perhaps tying into an existing queueing mechanism of
their choice.



