DECIMAL_V2 Query Option

JAVA_TOOL_OPTIONS="-Dsentry.allow.uri.db.policyfile=true"

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/ae2f8d03/docs/build/html/topics/impala_char.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_char.html b/docs/build/html/topics/impala_char.html
index e0b4cb9..62ab8ef 100644
--- a/docs/build/html/topics/impala_char.html
+++ b/docs/build/html/topics/impala_char.html
@@ -240,7 +240,7 @@ select concat('[',a,']') as a, concat('[',b,']') as b, concat('[',c,']') as c fr
 Kudu considerations:
- Currently, the data types DECIMAL, TIMESTAMP, CHAR, VARCHAR,
+ Currently, the data types DECIMAL, CHAR, VARCHAR,
 ARRAY, MAP, and STRUCT cannot be used with Kudu tables.

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/ae2f8d03/docs/build/html/topics/impala_components.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_components.html b/docs/build/html/topics/impala_components.html
index d3d210d..c7245c3 100644
--- a/docs/build/html/topics/impala_components.html
+++ b/docs/build/html/topics/impala_components.html
@@ -53,6 +53,12 @@
+ In Impala 2.9 and higher, you can control which hosts act as query coordinators
+ and which act as query executors, to improve scalability for highly concurrent workloads on large clusters.
+ See Scalability Considerations for Impala for details.
+
+
+
 Related information: Modifying Impala Startup Options, Starting Impala, Setting the Idle Query and Idle Session Timeouts for impalad, Ports Used by Impala, Using Impala through a Proxy for High Availability

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/ae2f8d03/docs/build/html/topics/impala_compute_stats.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_compute_stats.html b/docs/build/html/topics/impala_compute_stats.html
index fcba3d6..a20c0b2 100644
--- a/docs/build/html/topics/impala_compute_stats.html
+++ b/docs/build/html/topics/impala_compute_stats.html
@@ -543,7 +543,7 @@ show table stats item_partitioned;
 Kudu tables. Therefore, you do not need to re-run the operation when you see -1 in the # Rows column of the output from SHOW TABLE STATS. That column always shows -1 for
- all Kudu tables.
+ all Kudu tables.

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/ae2f8d03/docs/build/html/topics/impala_conditional_functions.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_conditional_functions.html b/docs/build/html/topics/impala_conditional_functions.html
index 713946b..7490c1a 100644
--- a/docs/build/html/topics/impala_conditional_functions.html
+++ b/docs/build/html/topics/impala_conditional_functions.html
@@ -488,6 +488,60 @@ END

+select x, nvl2(x, 999, 0) from nvl2_demo;
++------+---------------------------+
+| x    | if(x is not null, 999, 0) |
++------+---------------------------+
+| NULL | 0                         |
+| 1    | 999                       |
+| NULL | 0                         |
+| 2    | 999                       |
++------+---------------------------+
+
+select s, nvl2(s, 'is not null', 'is null') from nvl2_demo;
++------+---------------------------------------------+
+| s    | if(s is not null, 'is not null', 'is null') |
++------+---------------------------------------------+
+| NULL | is null                                     |
+| one  | is not null                                 |
+| NULL | is null                                     |
+| two  | is not null                                 |
++------+---------------------------------------------+
+
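The NVL2 output above shows Impala rewriting the call as `if(x is not null, ...)`. As a rough illustration only, the same selection rule can be modeled in Python, with `None` standing in for SQL NULL. The helper name `nvl2` and the two sample columns are assumptions that mirror the `nvl2_demo` rows shown above; this is a sketch of the semantics, not Impala code.

```python
def nvl2(value, if_not_null, if_null):
    """Model of NVL2: return the second argument when value is non-NULL, else the third."""
    # None plays the role of SQL NULL; 0 and '' count as non-NULL, as they do in SQL.
    return if_not_null if value is not None else if_null


# Hypothetical stand-ins for the x and s columns of nvl2_demo.
x_column = [None, 1, None, 2]
s_column = [None, "one", None, "two"]

x_result = [nvl2(x, 999, 0) for x in x_column]
s_result = [nvl2(s, "is not null", "is null") for s in s_column]

print(x_result)
print(s_result)
```

Unlike a SQL engine, this Python model evaluates both branch arguments before choosing one, which makes no difference for the literal values used here.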

 CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
 [PARTITIONED BY (col_name[, ...])]
+ [SORT BY ([column [, column ...]])]
 [COMMENT 'table_comment']
 [WITH SERDEPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
 [
@@ -130,6 +132,7 @@ file_format:
 CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
 LIKE PARQUET 'hdfs_path_of_parquet_file'
+ [SORT BY ([column [, column ...]])]
 [COMMENT 'table_comment']
 [PARTITIONED BY (col_name data_type [COMMENT 'col_comment'], ...)]
 [WITH SERDEPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
@@ -346,6 +349,83 @@ AS
+ Sorted tables (SORT BY clause):
+
+
+
+ The optional SORT BY clause lets you specify zero or more columns
+ that are sorted in the data files created by each Impala INSERT or
+ CREATE TABLE AS SELECT operation. Creating data files that are
+ sorted is most useful for Parquet tables, where the metadata stored inside each file includes
+ the minimum and maximum values for each column in the file. (The statistics apply to each row group
+ within the file; for simplicity, Impala writes a single row group in each file.) Grouping
+ data values together in relatively narrow ranges within each data file makes it possible
+ for Impala to quickly skip over data files that do not contain value ranges indicated in
+ the WHERE clause of a query, and can improve the effectiveness
+ of Parquet encoding and compression.
+
+
+
+ This clause is not applicable for Kudu tables or HBase tables. Although it works
+ for other HDFS file formats besides Parquet, the more efficient layout is most
+ evident with Parquet tables, because each Parquet data file includes statistics
+ about the data values in that file.
+
+
+
+ The SORT BY columns cannot include any partition key columns
+ for a partitioned table, because those column values are not represented in
+ the underlying data files.
+
+
+
+ Because data files can arrive in Impala tables by mechanisms that do not respect
+ the SORT BY clause, such as LOAD DATA or ETL
+ tools that create HDFS files, Impala does not guarantee or rely on the data being
+ sorted. The sorting aspect is only used to create a more efficient layout for
+ Parquet files generated by Impala, which helps to optimize the processing of
+ those Parquet files during Impala queries. During an INSERT
+ or CREATE TABLE AS SELECT operation, the sorting occurs
+ when the SORT BY clause applies to the destination table
+ for the data, regardless of whether the source table has a SORT BY
+ clause.
+
+
+
+ For example, when creating a table intended to contain census data, you might define
+ sort columns such as last name and state. If a data file in this table contains a
+ narrow range of last names, for example from Smith to Smythe,
+ Impala can quickly detect that this data file contains no matches for a WHERE
+ clause such as WHERE last_name = 'Jones' and avoid reading the entire file.
+
+
+CREATE TABLE census_data (last_name STRING, first_name STRING, state STRING, address STRING)
+ SORT BY (last_name, state)
+ STORED AS PARQUET;
+
+
+
+ Likewise, if an existing table contains data without any sort order, you can reorganize
+ the data in a more efficient way by using INSERT or
+ CREATE TABLE AS SELECT to copy that data into a new table with a
+ SORT BY clause:
+
+
+CREATE TABLE sorted_census_data
+ SORT BY (last_name, state)
+ STORED AS PARQUET
+ AS SELECT last_name, first_name, state, address
+ FROM unsorted_census_data;
+
+
+
+ The metadata for the SORT BY clause is stored in the TBLPROPERTIES
+ fields for the table. Other SQL engines that can interoperate with Impala tables, such as Hive
+ and Spark SQL, do not recognize this property when inserting into a table that has a SORT BY
+ clause.
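The SORT BY text above argues that narrow per-file value ranges let a reader skip whole data files using the minimum and maximum statistics. The following Python sketch illustrates that reasoning under stated assumptions: the `FileStats` class, the file names, and the value ranges are all hypothetical, and the check is a simplified model of min/max pruning for an equality predicate, not Impala's or Parquet's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file min/max statistics for the last_name column."""
    name: str
    min_value: str
    max_value: str


def may_contain(stats: FileStats, value: str) -> bool:
    # A file can hold a match only if the value lies inside its [min, max] range.
    # Python compares str by code point, which matches byte order for ASCII names.
    return stats.min_value <= value <= stats.max_value


def files_to_scan(files: list[FileStats], value: str) -> list[str]:
    # Keep only the files whose statistics cannot rule out the predicate value.
    return [f.name for f in files if may_contain(f, value)]


# Layout written with a sort on last_name: each file covers a narrow range.
sorted_layout = [
    FileStats("part-0", "Adams", "Jones"),
    FileStats("part-1", "Kelly", "Smith"),
    FileStats("part-2", "Smith", "Smythe"),
]

# Layout written without sorting: every file spans nearly the whole alphabet.
unsorted_layout = [
    FileStats("part-0", "Adams", "Young"),
    FileStats("part-1", "Baker", "Zhang"),
    FileStats("part-2", "Allen", "Wood"),
]

sorted_scan = files_to_scan(sorted_layout, "Jones")
unsorted_scan = files_to_scan(unsorted_layout, "Jones")
print(sorted_scan)
print(unsorted_scan)
```

For the predicate value 'Jones', the sorted layout leaves one candidate file, while the unsorted layout cannot rule out any file. A file that survives the range check may still hold no matching row, so the statistics only permit skipping and never confirm a match.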
+
+
+
 Kudu considerations:

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/ae2f8d03/docs/build/html/topics/impala_datetime_functions.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_datetime_functions.html b/docs/build/html/topics/impala_datetime_functions.html
index 222ae8c..1649c0a 100644
--- a/docs/build/html/topics/impala_datetime_functions.html
+++ b/docs/build/html/topics/impala_datetime_functions.html
@@ -169,7 +169,7 @@ select now(), current_timestamp();
 | 2016-05-19 16:10:14.237849000 | 2016-05-19 16:10:14.237849000 |
 +-------------------------------+-------------------------------+
-select current_timestamp() as right_now,
+select current_timestamp() as right_now,
 current_timestamp() + interval 3 hours as in_three_hours;
 +-------------------------------+-------------------------------+
 | right_now                     | in_three_hours                |
@@ -391,7 +391,7 @@ select date_sub(cast('2016-05-31' as timestamp), interval 1 months) as 'april_31
 Examples:
- The following example shows how comparing a "late" value with
+ The following example shows how comparing a "late" value with
 an "earlier" value produces a positive number. In this case, the result is (365 * 5) + 1, because one of the intervening years is a leap year.
@@ -713,9 +713,10 @@ select now() as right_now, days_sub(now(), 31) as 31_days_ago;
 Purpose: Returns one of the numeric date or time fields from a TIMESTAMP value.
- Unit argument: The unit string can be one of year,
- month, day, hour, minute,
- second, or millisecond. This argument value is case-insensitive.
+ Unit argument: The unit string can be one of epoch,
+ year, month, day, hour,
+ minute, second, or millisecond.
+ This argument value is case-insensitive.
 In Impala 2.0 and higher, you can use special syntax rather than a regular function call, for
@@ -754,8 +755,8 @@ select now() as right_now,
 +-------------------------------+-----------+------------+
 select now() as right_now,
- extract(day from now()) as this_day,
- extract(hour from now()) as this_hour;
+ extract(day from now()) as this_day,
+ extract(hour from now()) as this_hour;
 +-------------------------------+----------+-----------+
 | right_now                     | this_day | this_hour |
 +-------------------------------+----------+-----------+
@@ -1696,6 +1697,14 @@ with t1 as (select trunc(now(), 'dd') as today)
 Return type: timestamp
+ Kudu considerations:
+
+
+ The nanosecond portion of an Impala TIMESTAMP value
+ is rounded to the nearest microsecond when that value is stored in a
+ Kudu table.
+
+
 Examples:
@@ -1731,6 +1740,14 @@ select now() as right_now, nanoseconds_add(now(), 1e9) as 1_second_later;
 Return type: timestamp
+
+ Kudu considerations:
+
+
+ The nanosecond portion of an Impala TIMESTAMP value
+ is rounded to the nearest microsecond when that value is stored in a
+ Kudu table.
+
 select now() as right_now, nanoseconds_sub(now(), 1) as 1_nanosecond_earlier;
 +-------------------------------+-------------------------------+

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/ae2f8d03/docs/build/html/topics/impala_decimal.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_decimal.html b/docs/build/html/topics/impala_decimal.html
index 8cec53e..9604c5f 100644
--- a/docs/build/html/topics/impala_decimal.html
+++ b/docs/build/html/topics/impala_decimal.html
@@ -807,7 +807,7 @@ SELECT CAST(1000.5 AS DECIMAL);
 Kudu considerations:
- Currently, the data types DECIMAL, TIMESTAMP, CHAR, VARCHAR,
+ Currently, the data types DECIMAL, CHAR, VARCHAR,
 ARRAY, MAP, and STRUCT cannot be used with Kudu tables.

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/ae2f8d03/docs/build/html/topics/impala_decimal_v2.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_decimal_v2.html b/docs/build/html/topics/impala_decimal_v2.html
new file mode 100644
index 0000000..4f1b5ea
--- /dev/null
+++ b/docs/build/html/topics/impala_decimal_v2.html
@@ -0,0 +1,32 @@
+
+DECIMAL_V2 Query Option
+
+ DECIMAL_V2 Query Option
+
+
+
+
+
+
+ A query option that changes behavior related to the DECIMAL
+ data type.
+
+
+ Important:
+
+ This query option is currently unsupported.
+ Its precise behavior is currently undefined and might change
+ in the future.
+
+
+
+
+ Type: Boolean; recognized values are 1 and 0, or true and false;
+ any other value interpreted as false
+
+
+ Default: false (shown as 0 in output of SET statement)
+
+
+Parent topic: Query Options for the SET Statement
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/ae2f8d03/docs/build/html/topics/impala_default_join_distribution_mode.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_default_join_distribution_mode.html b/docs/build/html/topics/impala_default_join_distribution_mode.html
new file mode 100644
index 0000000..c866519
--- /dev/null
+++ b/docs/build/html/topics/impala_default_join_distribution_mode.html
@@ -0,0 +1,113 @@
+
+DEFAULT_JOIN_DISTRIBUTION_MODE Query Option
+
+ DEFAULT_JOIN_DISTRIBUTION_MODE Query Option
+
+
+
+
+
+
+
+ This option determines the join distribution that Impala uses when any of the tables
+ involved in a join query is missing statistics.
+
+
+
+ Impala optimizes join queries based on the presence of table statistics,
+ which are produced by the Impala COMPUTE STATS statement.
+ By default, when a table involved in the join query does not have statistics,
+ Impala uses the "broadcast" technique that transmits the entire contents
+ of the table to all executor nodes participating in the query. If one table
+ involved in a join has statistics and the other does not, the table without
+ statistics is broadcast. If both tables are missing statistics, the table
+ that is referenced second in the join order is broadcast. This behavior
+ is appropriate when the table involved is relatively small, but can lead to
+ excessive network, memory, and CPU overhead if the table being broadcast is
+ large.
+
+
+
+ Because Impala queries frequently involve very large tables, and suboptimal
+ joins for such tables could result in spilling or out-of-memory errors,
+ the setting DEFAULT_JOIN_DISTRIBUTION_MODE=SHUFFLE lets you
+ override the default behavior. The shuffle join mechanism divides the corresponding rows
+ of each table involved in a join query using a hashing algorithm, and transmits
+ subsets of the rows to other nodes for processing. Typically, this kind of join is
+ more efficient for joins between large tables of similar size.
+
+
+
+ The setting DEFAULT_JOIN_DISTRIBUTION_MODE=SHUFFLE is
+ recommended when setting up and deploying new clusters, because it is less likely
+ to result in serious consequences such as spilling or out-of-memory errors if
+ the query plan is based on incomplete information. This setting is not the default,
+ to avoid changing the performance characteristics of join queries for clusters that
+ are already tuned for their existing workloads.
+
+
+
+ Type: integer
+
+
+ The allowed values are BROADCAST (equivalent to 0)
+ or SHUFFLE (equivalent to 1).
+
+
+
+ Examples:
+
+
+ The following examples demonstrate appropriate scenarios for each
+ setting of this query option.
+
+
+
+-- Create a billion-row table.
+create table big_table stored as parquet
+ as select * from huge_table limit 1e9;
+
+-- For a big table with no statistics, the
+-- shuffle join mechanism is appropriate.
+set default_join_distribution_mode=shuffle;
+
+...join queries involving the big table...
+
+
+
+-- Create a hundred-row table.
+create table tiny_table stored as parquet
+ as select * from huge_table limit 100;
+
+-- For a tiny table with no statistics, the
+-- broadcast join mechanism is appropriate.
+set default_join_distribution_mode=broadcast;
+
+...join queries involving the tiny table...
+
+
+
+compute stats tiny_table;
+compute stats big_table;
+
+-- Once the stats are computed, the query option has
+-- no effect on join queries involving these tables.
+-- Impala can determine the absolute and relative sizes
+-- of each side of the join query by examining the
+-- row size, cardinality, and so on of each table.
+
+...join queries involving both of these tables...
+
+
+
+ Related information:
+
+
+ COMPUTE STATS Statement,
+ Joins in Impala SELECT Statements,
+ Performance Considerations for Join Queries
+
+
+
+Parent topic: Query Options for the SET Statement
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/ae2f8d03/docs/build/html/topics/impala_describe.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_describe.html b/docs/build/html/topics/impala_describe.html
index 963ef6e..0c20071 100644
--- a/docs/build/html/topics/impala_describe.html
+++ b/docs/build/html/topics/impala_describe.html
@@ -745,7 +745,7 @@ Returned 27 row(s) in 0.17s
- The following example shows DESCRIBE output for a simple Kudu table, with
+ The following example shows DESCRIBE output for a simple Kudu table, with
 a single-column primary key and all column attributes left with their default values:

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/ae2f8d03/docs/build/html/topics/impala_double.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_double.html b/docs/build/html/topics/impala_double.html
index b87994c..a1b87fb 100644
--- a/docs/build/html/topics/impala_double.html
+++ b/docs/build/html/topics/impala_double.html
@@ -59,6 +59,17 @@
 The data type REAL is an alias for DOUBLE.
+
+
+ Impala does not evaluate NaN (not a number) as equal to any other numeric values,
+ including other NaN values. For example, the following statement, which evaluates equality
+ between two NaN values, returns false:
+
+
+
+SELECT CAST('nan' AS DOUBLE)=CAST('nan' AS DOUBLE);
+
+
 Examples:

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/ae2f8d03/docs/build/html/topics/impala_explain.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_explain.html b/docs/build/html/topics/impala_explain.html
index 473a94d..0de916d 100644
--- a/docs/build/html/topics/impala_explain.html
+++ b/docs/build/html/topics/impala_explain.html
@@ -248,7 +248,7 @@ EXPLAIN_LEVEL set to extended
 against HDFS-based tables.
-
+
 To see which predicates Impala can "push down" to Kudu for efficient evaluation, without transmitting unnecessary rows back to Impala, look for the kudu predicates item in
@@ -260,22 +260,27 @@ EXPLAIN_LEVEL set to extended
 and non-primary key column Y, you can see that some operators in the WHERE clause are evaluated immediately by Kudu and others are evaluated later by Impala:
+
+
 EXPLAIN SELECT x,y from kudu_table WHERE
- x = 1 AND x NOT IN (2,3) AND y = 1
- AND x IS NOT NULL AND x > 0;
+ x = 1 AND y NOT IN (2,3) AND z = 1
+ AND a IS NOT NULL AND b > 0 AND length(s) > 5;
 +----------------
 | Explain String
 +----------------
 ...
-| 00:SCAN KUDU [jrussell.hash_only]
-| predicates: x IS NOT NULL, x NOT IN (2, 3)
-| kudu predicates: x = 1, x > 0, y = 1
+| 00:SCAN KUDU [kudu_table]
+| predicates: y NOT IN (2, 3), length(s) > 5
+| kudu predicates: a IS NOT NULL, b > 0, x = 1, z = 1
- Only binary predicates and IN predicates containing
- literal values that exactly match the types in the Kudu table, and do not
+
+
+ Only binary predicates, IS NULL and IS NOT NULL
+ (in Impala 2.9 and higher), and IN predicates
+ containing literal values that exactly match the types in the Kudu table, and do not
 require any casting, can be pushed to Kudu.
-
+
 Related information:

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/ae2f8d03/docs/build/html/topics/impala_explain_plan.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_explain_plan.html b/docs/build/html/topics/impala_explain_plan.html
index bcd0855..e749869 100644
--- a/docs/build/html/topics/impala_explain_plan.html
+++ b/docs/build/html/topics/impala_explain_plan.html
@@ -111,8 +111,8 @@
 The amount of detail displayed in the EXPLAIN output is controlled by the EXPLAIN_LEVEL query option. You typically
- increase this setting from normal to verbose (or from 0
- to 1) when doublechecking the presence of table and column statistics during performance
+ increase this setting from standard to extended (or from 1
+ to 2) when doublechecking the presence of table and column statistics during performance
 tuning, or when estimating query resource usage in conjunction with the resource management features.