drill-commits mailing list archives

From: krish...@apache.org
Subject: drill git commit: 1.4 update
Date: Wed, 16 Dec 2015 01:59:01 GMT
Repository: drill
Updated Branches:
  refs/heads/gh-pages c61c47fe6 -> 7aa38e042


1.4 update

case/cast example per vicki

DRILL-3949


Project: http://git-wip-us.apache.org/repos/asf/drill/repo
Commit: http://git-wip-us.apache.org/repos/asf/drill/commit/7aa38e04
Tree: http://git-wip-us.apache.org/repos/asf/drill/tree/7aa38e04
Diff: http://git-wip-us.apache.org/repos/asf/drill/diff/7aa38e04

Branch: refs/heads/gh-pages
Commit: 7aa38e042150e7ef294792dff3b7dc79f4aa6906
Parents: c61c47f
Author: Kris Hahn <krishahn@apache.org>
Authored: Tue Dec 15 13:25:53 2015 -0800
Committer: Kris Hahn <krishahn@apache.org>
Committed: Tue Dec 15 17:56:07 2015 -0800

----------------------------------------------------------------------
 .../010-configuration-options-introduction.md       | 10 +++++++---
 .../020-storage-plugin-registration.md              |  4 ++--
 .../060-text-files-csv-tsv-psv.md                   | 16 +++++++++++++++-
 3 files changed, 24 insertions(+), 6 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/drill/blob/7aa38e04/_docs/configure-drill/configuration-options/010-configuration-options-introduction.md
----------------------------------------------------------------------
diff --git a/_docs/configure-drill/configuration-options/010-configuration-options-introduction.md b/_docs/configure-drill/configuration-options/010-configuration-options-introduction.md
index 6aa9017..8282740 100644
--- a/_docs/configure-drill/configuration-options/010-configuration-options-introduction.md
+++ b/_docs/configure-drill/configuration-options/010-configuration-options-introduction.md
@@ -16,8 +16,9 @@ The sys.options table lists the following options that you can set as a system o
 
 | Name                                           | Default          | Comments |
 |------------------------------------------------|------------------|----------|
-| drill.exec.functions.cast_empty_string_to_null | FALSE            | Not supported in this release. |
+| drill.exec.functions.cast_empty_string_to_null | FALSE            | In a text file, treat empty fields as NULL values instead of empty string. |
 | drill.exec.storage.file.partition.column.label | dir              | The column label for directory levels in results of queries of files in a directory. Accepts a string input. |
+| exec.enable_union_type                         | false            | Enable support for Avro union type. |
 | exec.errors.verbose                            | FALSE            | Toggles verbose output of executable error messages |
 | exec.java_compiler                             | DEFAULT          | Switches between DEFAULT, JDK, and JANINO mode for the current session. Uses Janino by default for generated source code of less than exec.java_compiler_janino_maxsize; otherwise, switches to the JDK compiler. |
 | exec.java_compiler_debug                       | TRUE             | Toggles the output of debug-level compiler error messages in runtime generated code. |
@@ -59,9 +60,9 @@ The sys.options table lists the following options that you can set as a system o
 | planner.memory.enable_memory_estimation        | FALSE            | Toggles the state of memory estimation and re-planning of the query. When enabled, Drill conservatively estimates memory requirements and typically excludes these operators from the plan and negatively impacts performance. |
 | planner.memory.hash_agg_table_factor           | 1.1              | A heuristic value for influencing the size of the hash aggregation table. |
 | planner.memory.hash_join_table_factor          | 1.1              | A heuristic value for influencing the size of the hash aggregation table. |
-| planner.memory_limit                           | 268435456 bytes  | Defines the maximum amount of direct memory allocated to a query for planning. When multiple queries run concurrently, each query is allocated the amount of memory set by this parameter.Increase the value of this parameter and rerun the query if partition pruning failed due to insufficient memory. |
 | planner.memory.max_query_memory_per_node       | 2147483648 bytes | Sets the maximum estimate of memory for a query per node in bytes. If the estimate is too low, Drill re-plans the query without memory-constrained operators. |
 | planner.memory.non_blocking_operators_memory   | 64               | Extra query memory per node for non-blocking operators. This option is currently used only for memory estimation. Range: 0-2048 MB |
+| planner.memory_limit                           | 268435456 bytes  | Defines the maximum amount of direct memory allocated to a query for planning. When multiple queries run concurrently, each query is allocated the amount of memory set by this parameter.Increase the value of this parameter and rerun the query if partition pruning failed due to insufficient memory. |
 | planner.nestedloopjoin_factor                  | 100              | A heuristic value for influencing the nested loop join. |
 | planner.partitioner_sender_max_threads         | 8                | Upper limit of threads for outbound queuing. |
 | planner.partitioner_sender_set_threads         | -1               | Overwrites the number of threads used to send out batches of records. Set to -1 to disable. Typically not changed. |
@@ -70,16 +71,19 @@ The sys.options table lists the following options that you can set as a system o
 | planner.slice_target                           | 100000           | The number of records manipulated within a fragment before Drill parallelizes operations. |
 | planner.width.max_per_node                     | 3                | Maximum number of threads that can run in parallel for a query on a node. A slice is an individual thread. This number indicates the maximum number of slices per query for the query’s major fragment on a node. |
 | planner.width.max_per_query                    | 1000             | Same as max per node but applies to the query as executed by the entire cluster. For example, this value might be the number of active Drillbits, or a higher number to return results faster. |
+| security.admin.user_groups                     | n/a              | Unsupported as of 1.4. A comma-separated list of administrator groups for Web Console security. |
+| security.admin.users                           | <a name>         | Unsupported as of 1.4. A comma-separated list of user names who you want to give administrator privileges. |
 | store.format                                   | parquet          | Output format for data written to tables with the CREATE TABLE AS (CTAS) command. Allowed values are parquet, json, psv, csv, or tsv. |
 | store.hive.optimize_scan_with_native_readers   | FALSE            | Optimize reads of Parquet-backed external tables from Hive by using Drill native readers instead of the Hive Serde interface. (Drill 1.2 and later) |
 | store.json.all_text_mode                       | FALSE            | Drill reads all data from the JSON files as VARCHAR. Prevents schema change errors. |
-| store.json.extended_types                      | FALSE            | Turns on special JSON structures that Drill serializes for storing more type information than the [four basic JSON types](http://docs.mongodb.org/manual/reference/mongodb-extended-json/). |
+| store.json.extended_types                      | FALSE            | Turns on special JSON structures that Drill serializes for storing more type information than the four basic JSON types. |
 | store.json.read_numbers_as_double              | FALSE            | Reads numbers with or without a decimal point as DOUBLE. Prevents schema change errors. |
 | store.mongo.all_text_mode                      | FALSE            | Similar to store.json.all_text_mode for MongoDB. |
 | store.mongo.read_numbers_as_double             | FALSE            | Similar to store.json.read_numbers_as_double. |
 | store.parquet.block-size                       | 536870912        | Sets the size of a Parquet row group to the number of bytes less than or equal to the block size of MFS, HDFS, or the file system. |
 | store.parquet.compression                      | snappy           | Compression type for storing Parquet output. Allowed values: snappy, gzip, none |
 | store.parquet.enable_dictionary_encoding       | FALSE            | For internal use. Do not change. |
+| store.parquet.dictionary.page-size             | 1048576          |  |
 | store.parquet.use_new_reader                   | FALSE            | Not supported in this release. |
 | store.partition.hash_distribute                | FALSE            | Uses a hash algorithm to distribute data on partition keys in a CTAS partitioning operation. An alpha option--for experimental use at this stage. Do not use in production systems. |
 | store.text.estimated_row_size_bytes            | 100              | Estimate of the row size in a delimited text file, such as csv. The closer to actual, the better the query plan. Used for all csv files in the system/session where the value is set. Impacts the decision to plan a broadcast join or not. |

http://git-wip-us.apache.org/repos/asf/drill/blob/7aa38e04/_docs/connect-a-data-source/020-storage-plugin-registration.md
----------------------------------------------------------------------
diff --git a/_docs/connect-a-data-source/020-storage-plugin-registration.md b/_docs/connect-a-data-source/020-storage-plugin-registration.md
index 9dec247..fd4b0ea 100644
--- a/_docs/connect-a-data-source/020-storage-plugin-registration.md
+++ b/_docs/connect-a-data-source/020-storage-plugin-registration.md
@@ -30,9 +30,9 @@ To register a new storage plugin configuration, enter a storage name, click **CR
 
 ## Storage Plugin Configuration Persistance
 
-Drill saves storage plugin configurations in a temporary directory (embedded mode) or in ZooKeeper (distributed mode). For example, on Mac OS X, Drill uses `/tmp/drill/sys.storage_plugins` to store storage plugin configurations. The temporary directory clears when you quit the Drill shell. To save your storage plugin configurations from one session to the next, set the following option in the `drill-override.conf` file if you are running Drill in embedded mode.
+Drill saves storage plugin configurations in a temporary directory (embedded mode) or in ZooKeeper (distributed mode). For example, on Mac OS X, Drill uses `/tmp/drill/sys.storage_plugins` to store storage plugin configurations. The temporary directory clears when you reboot. Copy storage plugin configurations to a secure location to save them when you run Drill in embedded mode.
 
-`drill.exec.sys.store.provider.local.path = "/mypath"`
+<!-- `drill.exec.sys.store.provider.local.path = "/mypath"` -->
 
 <!-- Enabling authorization to protect this data through the Web Console and REST API does not include protection for the data in the tmp directory or in ZooKeeper. 
 

http://git-wip-us.apache.org/repos/asf/drill/blob/7aa38e04/_docs/data-sources-and-file-formats/060-text-files-csv-tsv-psv.md
----------------------------------------------------------------------
diff --git a/_docs/data-sources-and-file-formats/060-text-files-csv-tsv-psv.md b/_docs/data-sources-and-file-formats/060-text-files-csv-tsv-psv.md
index ccdfc54..0fdb165 100644
--- a/_docs/data-sources-and-file-formats/060-text-files-csv-tsv-psv.md
+++ b/_docs/data-sources-and-file-formats/060-text-files-csv-tsv-psv.md
@@ -27,7 +27,21 @@ If your text file have headers, you can enable extractHeader and select particul
 
 ### Cast data
 
-You can also improve performance by casting the VARCHAR data to INT, FLOAT, DATETIME, and so on when you read the data from a text file. Drill performs better reading fixed-width than reading VARCHAR data. 
+You can also improve performance by casting the VARCHAR data in a text file to INT, FLOAT, DATETIME, and so on when you read the data. Drill performs better reading fixed-width than reading VARCHAR data. 
+
+Text files that include empty strings might produce unacceptable results. Common ways to deal with empty strings are:
+
+* Set the drill.exec.functions.cast_empty_string_to_null SESSION/SYSTEM option to true. 
+* Use a case statement to cast empty strings to values you want. For example, create a Parquet table named test from a CSV file named test.csv, and cast empty strings in the CSV to null in any column in which the empty string appears:  
+
+          CREATE TABLE test AS SELECT
+            case when COLUMNS[0] = '' then CAST(NULL AS INTEGER) else CAST(COLUMNS[0] AS INTEGER) end AS c1,
+            case when COLUMNS[1] = '' then CAST(NULL AS VARCHAR(20)) else CAST(COLUMNS[1] AS VARCHAR(20)) end AS c2,
+            case when COLUMNS[2] = '' then CAST(NULL AS DOUBLE) else CAST(COLUMNS[2] AS DOUBLE) end AS c3,
+            case when COLUMNS[3] = '' then CAST(NULL AS DATE) else CAST(COLUMNS[3] AS DATE) end AS c4,
+            case when COLUMNS[4] = '' then CAST(NULL AS VARCHAR(20)) else CAST(COLUMNS[4] AS VARCHAR(20)) end AS c5
+          FROM `test.csv`; 
+
 
 ### Use a Distributed File System
 Using a distributed file system, such as HDFS, instead of a local file system to query the files also improves performance because currently Drill does not split files on block splits.
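
As a quick follow-up sketch (illustrative, not part of this commit, and assuming the CTAS above ran in the same workspace that holds `test.csv`), you can query the new Parquet table to confirm that empty CSV fields were stored as NULL values:

    -- rows whose first CSV column was empty should show NULL in c1
    SELECT c1, c2, c3, c4, c5 FROM test WHERE c1 IS NULL;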

