drill-commits mailing list archives

From tshi...@apache.org
Subject [02/31] drill git commit: coordinate with Bridget's draft of Perf Tuning
Date Mon, 18 May 2015 23:36:25 GMT
coordinate with Bridget's draft of Perf Tuning


Project: http://git-wip-us.apache.org/repos/asf/drill/repo
Commit: http://git-wip-us.apache.org/repos/asf/drill/commit/64910ec6
Tree: http://git-wip-us.apache.org/repos/asf/drill/tree/64910ec6
Diff: http://git-wip-us.apache.org/repos/asf/drill/diff/64910ec6

Branch: refs/heads/gh-pages
Commit: 64910ec60b5446533da460be9896fa74cef87cd0
Parents: 6c87ff0
Author: Kristine Hahn <khahn@maprtech.com>
Authored: Sun May 17 11:23:57 2015 -0700
Committer: Kristine Hahn <khahn@maprtech.com>
Committed: Sun May 17 11:23:57 2015 -0700

----------------------------------------------------------------------
 ...guring-a-multitenant-cluster-introduction.md |  4 +-
 .../050-configuring-multitenant-resources.md    |  6 ++-
 .../060-configuring-a-shared-drillbit.md        | 32 ++++++++--------
 .../010-configuration-options-introduction.md   |  8 ++--
 .../030-planning-and-exececution-options.md     | 40 +-------------------
 5 files changed, 29 insertions(+), 61 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/drill/blob/64910ec6/_docs/configure-drill/030-configuring-a-multitenant-cluster-introduction.md
----------------------------------------------------------------------
diff --git a/_docs/configure-drill/030-configuring-a-multitenant-cluster-introduction.md b/_docs/configure-drill/030-configuring-a-multitenant-cluster-introduction.md
index 80edfc8..c6a272f 100644
--- a/_docs/configure-drill/030-configuring-a-multitenant-cluster-introduction.md
+++ b/_docs/configure-drill/030-configuring-a-multitenant-cluster-introduction.md
@@ -15,6 +15,6 @@ You need to plan and configure the following resources for use with Drill and ot
 
 * [Memory]({{site.baseurl}}/docs/configuring-multitenant-resources)  
 * [CPU]({{site.baseurl}}/docs/configuring-multitenant-resources#how-to-manage-drill-cpu-resources)
 
-* Disk  
+* [Disk]({{site.baseurl}}/docs/configuring-multitenant-resources#how-to-manage-drill-disk-resources)

 
-When users share a Drillbit, [configure queues]({{site.baseurl}}/docs/configuring-resources-for-a-shared-drillbit#configuring-query-queuing)
and [parallelization]({{site.baseurl}}/docs/configuring-resources-for-a-shared-drillbit#configuring-parallelization)
in addition to memory.
\ No newline at end of file
+When users share a Drillbit, [configure queues]({{site.baseurl}}/docs/configuring-resources-for-a-shared-drillbit#configuring-query-queuing)
and [parallelization]({{site.baseurl}}/docs/configuring-resources-for-a-shared-drillbit#configuring-parallelization)
in addition to memory. 
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/drill/blob/64910ec6/_docs/configure-drill/050-configuring-multitenant-resources.md
----------------------------------------------------------------------
diff --git a/_docs/configure-drill/050-configuring-multitenant-resources.md b/_docs/configure-drill/050-configuring-multitenant-resources.md
index fcdab5c..c54f430 100644
--- a/_docs/configure-drill/050-configuring-multitenant-resources.md
+++ b/_docs/configure-drill/050-configuring-multitenant-resources.md
@@ -33,4 +33,8 @@ Configure NodeManager and ResourceManager to reconfigure the total memory requir
 Modify MapReduce memory to suit your application needs. Remaining memory is typically given
to YARN applications. 
 
 ## How to Manage Drill CPU Resources
-Currently, you do not manage CPU resources within Drill. [Use Linux `cgroups`](http://en.wikipedia.org/wiki/Cgroups) to manage the CPU resources.
\ No newline at end of file
+Currently, you do not manage CPU resources within Drill. [Use Linux `cgroups`](http://en.wikipedia.org/wiki/Cgroups) to manage the CPU resources.
+
+## How to Manage Disk Resources
+
+The `planner.add_producer_consumer` system option enables or disables a secondary reading thread that works out of band of the rest of the scanning fragment to prefetch data from disk. If you interact with a certain type of storage medium that is slow or does not prefetch much data, this option tells Drill to add a producer consumer reading thread to the operation. Drill can then assign one thread that focuses on a single reading fragment. If Drill is using memory, you can disable this option to get better performance. If Drill is using disk space, you should enable this option and set a reasonable queue size for the `planner.producer_consumer_queue_size` option. For more information about these options, see the section,  "Performance Tuning".
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/drill/blob/64910ec6/_docs/configure-drill/060-configuring-a-shared-drillbit.md
----------------------------------------------------------------------
diff --git a/_docs/configure-drill/060-configuring-a-shared-drillbit.md b/_docs/configure-drill/060-configuring-a-shared-drillbit.md
index 52e3db4..4e47213 100644
--- a/_docs/configure-drill/060-configuring-a-shared-drillbit.md
+++ b/_docs/configure-drill/060-configuring-a-shared-drillbit.md
@@ -11,20 +11,7 @@ Set [options in sys.options]({{site.baseurl}}/docs/configuration-options-introdu
 * exec.queue.large  
 * exec.queue.small  
 
-### Example Configuration
-
-For example, you configure the queue reserved for large queries for a 5-query maximum. You configure the queue reserved for small queries for 20 queries. Users start to run queries, and Drill receives the following query requests in this order:
-
-* Query A (blue): 1 billion records, Drill estimates 10 million rows will be processed  
-* Query B (red): 2 billion records, Drill estimates 20 million rows will be processed  
-* Query C: 1 billion records  
-* Query D: 100 records
-
-The exec.queue.threshold default is 30 million, which is the estimated rows to be processed by the query. Queries A and B are queued in the large queue. The estimated rows to be processed reaches the 30 million threshold, filling the queue to capacity. The query C request arrives and goes on the wait list, and then query D arrives. Query D is queued immediately in the small queue because of its small size, as shown in the following diagram: 
-
-![drill queuing]({{ site.baseurl }}/docs/img/queuing.png)
-
-The Drill queuing configuration in this example tends to give many users running small queries a rapid response. Users running a large query might experience some delay until an earlier-received large query returns, freeing space in the large queue to process queries that are waiting.
+For more information, see the section, "Performance Tuning".
 
 ## Configuring Parallelization
 
@@ -39,7 +26,22 @@ To configure parallelization, configure the following options in the `sys.option
 * `planner.width.max.per.query`  
   Same as max per node but applies to the query as executed by the entire cluster.
 
-Configure the `planner.width.max.per.node` to achieve fine grained, absolute control over parallelization. 
+### planner.width.max_per_node
+Configure the `planner.width.max.per.node` to achieve fine grained, absolute control over parallelization. In this context *width* refers to fanout or distribution potential: the ability to run a query in parallel across the cores on a node and the nodes on a cluster. A physical plan consists of intermediate operations, known as query &quot;fragments,&quot; that run concurrently, yielding opportunities for parallelism above and below each exchange operator in the plan. An exchange operator represents a breakpoint in the execution flow where processing can be distributed. For example, a single-process scan of a file may flow into an exchange operator, followed by a multi-process aggregation fragment.
+
+The maximum width per node defines the maximum degree of parallelism for any fragment of a query, but the setting applies at the level of a single node in the cluster. The *default* maximum degree of parallelism per node is calculated as follows, with the theoretical maximum automatically scaled back (and rounded down) so that only 70% of the actual available capacity is taken into account: number of active drillbits (typically one per node) * number of cores per node * 0.7
+
+For example, on a single-node test system with 2 cores and hyper-threading enabled: 1 * 4 * 0.7 = 3
+
+When you modify the default setting, you can supply any meaningful number. The system does not automatically scale down your setting.
+
+### planner.width.max_per_query
+
+The max_per_query value also sets the maximum degree of parallelism for any given stage of a query, but the setting applies to the query as executed by the whole cluster (multiple nodes). In effect, the actual maximum width per query is the *minimum of two values*: min((number of nodes * width.max_per_node), width.max_per_query)
+
+For example, on a 4-node cluster where `width.max_per_node` is set to 6 and `width.max_per_query` is set to 30: min((4 * 6), 30) = 24
+
+In this case, the effective maximum width per query is 24, not 30.
 
 <!-- ??For example, setting the `planner.width.max.per.query` to 60 will not accelerate
Drill operations because overlapping does not occur when executing 60 queries at the same
time.??
 

http://git-wip-us.apache.org/repos/asf/drill/blob/64910ec6/_docs/configure-drill/configuration-options/010-configuration-options-introduction.md
----------------------------------------------------------------------
diff --git a/_docs/configure-drill/configuration-options/010-configuration-options-introduction.md b/_docs/configure-drill/configuration-options/010-configuration-options-introduction.md
index ee2ff9e..bdd19f3 100644
--- a/_docs/configure-drill/configuration-options/010-configuration-options-introduction.md
+++ b/_docs/configure-drill/configuration-options/010-configuration-options-introduction.md
@@ -22,8 +22,8 @@ The sys.options table lists the following options that you can set as a system o
 | exec.java_compiler                             | DEFAULT          | Switches between DEFAULT,
JDK, and JANINO mode for the current session. Uses Janino by default for generated source
code of less than exec.java_compiler_janino_maxsize; otherwise, switches to the JDK compiler.
                                                                                         
                                                     |
 | exec.java_compiler_debug                       | TRUE             | Toggles the output
of debug-level compiler error messages in runtime generated code.                        
                                                                                         
                                                                                         
                                                                |
 | exec.java_compiler_janino_maxsize              | 262144           | See the exec.java_compiler
option comment. Accepts inputs of type LONG.                                             
                                                                                         
                                                                                         
                                                        |
-| exec.max_hash_table_size                       | 1073741824       | Ending size for hash
tables. Range: 0 - 1073741824. For internal use.                                         
                                                                                         
                                                                                         
                                                              |
-| exec.min_hash_table_size                       | 65536            | Starting size for hash
tables. Increase according to available memory to improve performance. Range: 0 - 1073741824.
For internal use.                                                                        
                                                                                         
                                                        |
+| exec.max_hash_table_size                       | 1073741824       | Ending size for hash
tables. Range: 0 - 1073741824.                                                           
                                                                                         
                                                                                         
                                                              |
+| exec.min_hash_table_size                       | 65536            | Starting size for hash
tables. Increase according to available memory to improve performance. Increasing for very
large aggregations or joins when you have large amounts of memory for Drill to use. Range:
0 - 1073741824.                                                                          
                                                          |
 | exec.queue.enable                              | FALSE            | Changes the state of
query queues to control the number of queries that run simultaneously.                   
                                                                                         
                                                                                         
                                                              |
 | exec.queue.large                               | 10               | Sets the number of
large queries that can run concurrently in the cluster. Range: 0-1000                    
                                                                                         
                                                                                         
                                                                |
 | exec.queue.small                               | 100              | Sets the number of
small queries that can run concurrently in the cluster. Range: 0-1001                    
                                                                                         
                                                                                         
                                                                |
@@ -65,13 +65,13 @@ The sys.options table lists the following options that you can set as a system o
 | planner.partitioner_sender_max_threads         | 8                | Upper limit of threads
for outbound queuing.                                                                    
                                                                                         
                                                                                         
                                                            |
 | planner.partitioner_sender_set_threads         | -1               | Overwrites the number
of threads used to send out batches of records. Set to -1 to disable. Typically not changed.
                                                                                         
                                                                                         
                                                          |
 | planner.partitioner_sender_threads_factor      | 2                | A heuristic param to
use to influence final number of threads. The higher the value the fewer the number of threads.
                                                                                         
                                                                                         
                                                        |
-| planner.producer_consumer_queue_size           | 10               | How much data to prefetch
from disk in record batches out-of-band of query execution.                              
                                                                                         
                                                                                         
                                                         |
+| planner.producer_consumer_queue_size           | 10               | How much data to prefetch
from disk in record batches out-of-band of query execution. The larger the queue size, the
greater the amount of memory that the queue and overall query execution consumes.        
                                                                                         
                                                        |
 | planner.slice_target                           | 100000           | The number of records
manipulated within a fragment before Drill parallelizes operations.                      
                                                                                         
                                                                                         
                                                             |
 | planner.width.max_per_node                     | 3                | Maximum number of threads
that can run in parallel for a query on a node. A slice is an individual thread. This number
indicates the maximum number of slices per query for the query’s major fragment on a node.
                                                                                         
                                                     |
 | planner.width.max_per_query                    | 1000             | Same as max per node
but applies to the query as executed by the entire cluster. For example, this value might
be the number of active Drillbits, or a higher number to return results faster.          
                                                                                         
                                                              |
 | store.format                                   | parquet          | Output format for data
written to tables with the CREATE TABLE AS (CTAS) command. Allowed values are parquet, json,
or text. Allowed values: 0, -1, 1000000                                                  
                                                                                         
                                                         |
 | store.json.all_text_mode                       | FALSE            | Drill reads all data
from the JSON files as VARCHAR. Prevents schema change errors.                           
                                                                                         
                                                                                         
                                                              |
-| store.json.extended_types                      | FALSE            | Turns on special JSON
structures that Drill serializes for storing more type information than the [four basic JSON
types] (http://docs.mongodb.org/manual/reference/mongodb-extended-json/).                
                                                                                         
                                                          |
+| store.json.extended_types                      | FALSE            | Turns on special JSON
structures that Drill serializes for storing more type information than the [four basic JSON
types](http://docs.mongodb.org/manual/reference/mongodb-extended-json/).                 
                                                                                         
                                                          |
 | store.json.read_numbers_as_double              | FALSE            | Reads numbers with
or without a decimal point as DOUBLE. Prevents schema change errors.                     
                                                                                         
                                                                                         
                                                                |
 | store.mongo.all_text_mode                      | FALSE            | Similar to store.json.all_text_mode
for MongoDB.                                                                             
                                                                                         
                                                                                         
                                               |
 | store.mongo.read_numbers_as_double             | FALSE            | Similar to store.json.read_numbers_as_double.
                                                                                         
                                                                                         
                                                                                         
                                     |

http://git-wip-us.apache.org/repos/asf/drill/blob/64910ec6/_docs/configure-drill/configuration-options/030-planning-and-exececution-options.md
----------------------------------------------------------------------
diff --git a/_docs/configure-drill/configuration-options/030-planning-and-exececution-options.md b/_docs/configure-drill/configuration-options/030-planning-and-exececution-options.md
index 2608538..d1c9a30 100644
--- a/_docs/configure-drill/configuration-options/030-planning-and-exececution-options.md
+++ b/_docs/configure-drill/configuration-options/030-planning-and-exececution-options.md
@@ -14,42 +14,4 @@ Use the ALTER SYSTEM or ALTER SESSION commands to set options. Typically,
 you set the options at the session level unless you want the setting to
 persist across all sessions.
 
-The summary of system options lists default values. The following descriptions provide more detail on some of these options:
-
-### exec.min_hash_table_size
-
-The default starting size for hash tables. Increasing this size is useful for very large aggregations or joins when you have large amounts of memory for Drill to use. Drill can spend a lot of time resizing the hash table as it finds new data. If you have large data sets, you can increase this hash table size to increase performance.
-
-### planner.add_producer_consumer
-
-This option enables or disables a secondary reading thread that works out of band of the rest of the scanning fragment to prefetch data from disk. If you interact with a certain type of storage medium that is slow or does not prefetch much data, this option tells Drill to add a producer consumer reading thread to the operation. Drill can then assign one thread that focuses on a single reading fragment. If Drill is using memory, you can disable this option to get better performance. If Drill is using disk space, you should enable this option and set a reasonable queue size for the planner.producer_consumer_queue_size option.
-
-### planner.broadcast_threshold
-
-Threshold, in terms of a number of rows, that determines whether a broadcast join is chosen for a query. Regardless of the setting of the broadcast_join option (enabled or disabled), a broadcast join is not chosen unless the right side of the join is estimated to contain fewer rows than this threshold. The intent of this option is to avoid broadcasting too many rows for join purposes. Broadcasting involves sending data across nodes and is a network-intensive operation. (The &quot;right side&quot; of the join, which may itself be a join or simply a table, is determined by cost-based optimizations and heuristics during physical planning.)
-
-### planner.enable_broadcast_join, planner.enable_hashagg, planner.enable_hashjoin, planner.enable_mergejoin, planner.enable_multiphase_agg, planner.enable_streamagg
-
-These options enable or disable specific aggregation and join operators for queries. These operators are all enabled by default and in general should not be disabled.</p><p>Hash aggregation and hash join are hash-based operations. Streaming aggregation and merge join are sort-based operations. Both hash-based and sort-based operations consume memory; however, currently, hash-based operations do not spill to disk as needed, but the sort-based operations do. If large hash operations do not fit in memory on your system, you may need to disable these operations. Queries will continue to run, using alternative plans.
-
-### planner.producer_consumer_queue_size
-
-Determines how much data to prefetch from disk (in record batches) out of band of query execution. The larger the queue size, the greater the amount of memory that the queue and overall query execution consumes.
-
-### planner.width.max_per_node
-
-In this context *width* refers to fanout or distribution potential: the ability to run a query in parallel across the cores on a node and the nodes on a cluster. A physical plan consists of intermediate operations, known as query &quot;fragments,&quot; that run concurrently, yielding opportunities for parallelism above and below each exchange operator in the plan. An exchange operator represents a breakpoint in the execution flow where processing can be distributed. For example, a single-process scan of a file may flow into an exchange operator, followed by a multi-process aggregation fragment.
-
-The maximum width per node defines the maximum degree of parallelism for any fragment of a query, but the setting applies at the level of a single node in the cluster. The *default* maximum degree of parallelism per node is calculated as follows, with the theoretical maximum automatically scaled back (and rounded down) so that only 70% of the actual available capacity is taken into account: number of active drillbits (typically one per node) * number of cores per node * 0.7
-
-For example, on a single-node test system with 2 cores and hyper-threading enabled: 1 * 4 * 0.7 = 3
-
-When you modify the default setting, you can supply any meaningful number. The system does not automatically scale down your setting.
-
-### planner.width.max_per_query
-
-The max_per_query value also sets the maximum degree of parallelism for any given stage of a query, but the setting applies to the query as executed by the whole cluster (multiple nodes). In effect, the actual maximum width per query is the *minimum of two values*: min((number of nodes * width.max_per_node), width.max_per_query)
-
-For example, on a 4-node cluster where `width.max_per_node` is set to 6 and `width.max_per_query` is set to 30: min((4 * 6), 30) = 24
-
-In this case, the effective maximum width per query is 24, not 30.
\ No newline at end of file
+The [summary of system options]({{site.baseurl}}/docs/configuration-options-introduction) lists default values and a short description of the planning and execution options. The planning option names have a planning preface. Execution options have an exe preface. For more information, see the section, "Performance Turning".The following descriptions provide more detail on some of these options:
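
Note on the width formula in this commit: the per-query rule described in 060-configuring-a-shared-drillbit.md, min((number of nodes * width.max_per_node), width.max_per_query), can be checked with a small sketch. This is only an illustration of the arithmetic as the docs state it, not Drill's planner code; the function and parameter names are hypothetical.

```python
def effective_max_width_per_query(num_nodes, max_per_node, max_per_query):
    """Effective cap on parallelism for one query, per the rule in the docs.

    Hypothetical helper: the names are illustrative and are not Drill APIs.
    """
    # The cluster cannot exceed nodes * per-node width, and the per-query
    # option caps that product from above.
    return min(num_nodes * max_per_node, max_per_query)


# The 4-node example from the docs: per-node width 6, per-query width 30.
print(effective_max_width_per_query(4, 6, 30))  # 24
```

With a lower per-query setting (for example 20 on the same cluster), the per-query option becomes the binding limit instead of the node count.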

