From: jbapple@apache.org
To: commits@impala.incubator.apache.org
Date: Thu, 17 Nov 2016 23:12:23 -0000
Subject: [45/51] [partial] incubator-impala git commit: IMPALA-3398: Add docs to main Impala branch.

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_analytic_functions.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_analytic_functions.xml b/docs/topics/impala_analytic_functions.xml
new file mode 100644
index 0000000..0a31363
--- /dev/null
+++ b/docs/topics/impala_analytic_functions.xml
@@ -0,0 +1,1739 @@
+
+ Impala Analytic Functions
+
+ Analytic Functions
+

+ Analytic functions (also known as window functions) are a special category of built-in functions. Like
 aggregate functions, they examine the contents of multiple input rows to compute each output value. However,
 rather than being limited to one result value per GROUP BY group, they operate on
 windows where the input rows are ordered and grouped using flexible conditions expressed through
 an OVER() clause.

+ +

+ + + +

+ Some functions, such as LAG() and RANK(), can only be used in this analytic + context. Some aggregate functions do double duty: when you call the aggregation functions such as + MAX(), SUM(), AVG(), and so on with an + OVER() clause, they produce an output value for each row, based on computations across other + rows in the window. +

+ +

+ Although analytic functions often compute the same value you would see from an aggregate function in a + GROUP BY query, the analytic functions produce a value for each row in the result set rather + than a single value for each group. This flexibility lets you include additional columns in the + SELECT list, offering more opportunities for organizing and filtering the result set. +

+ +

+ Analytic function calls are only allowed in the SELECT list and in the outermost
 ORDER BY clause of the query. During query processing, analytic functions are evaluated
 after other query stages such as joins, WHERE, and GROUP BY.
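Because the functions are evaluated after the WHERE clause, filtering on an analytic result requires an inline view. A minimal sketch, assuming a hypothetical table t1 with a column x:

```sql
-- Allowed: analytic call in the SELECT list.
SELECT x, RANK() OVER (ORDER BY x) AS r FROM t1;

-- Not allowed: RANK() directly in WHERE. Wrap the analytic call in an
-- inline view, then filter on its result in the outer query.
SELECT x, r
FROM (SELECT x, RANK() OVER (ORDER BY x) AS r FROM t1) v
WHERE r <= 3;
```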

+ + + + + + + + + +

+ The rows that are part of each partition are analyzed by computations across an ordered or unordered set of
 rows. For example, COUNT() and SUM() might be applied to all the rows in
 the partition, in which case the order of analysis does not matter. The ORDER BY clause
 might be used inside the OVER() clause to define the ordering that applies to functions
 such as LAG() and FIRST_VALUE().
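A sketch of the distinction, assuming a hypothetical table t1 with columns grp and x:

```sql
-- Order-insensitive: the per-partition total is the same regardless of
-- how the rows within each partition are arranged.
SELECT grp, x, SUM(x) OVER (PARTITION BY grp) AS grp_total FROM t1;

-- Order-sensitive: LAG() needs an ORDER BY inside OVER() to determine
-- which row counts as the "previous" one.
SELECT grp, x, LAG(x) OVER (PARTITION BY grp ORDER BY x) AS prev_x FROM t1;
```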

+ + + + + +

+ Analytic functions are frequently used in fields such as finance and science to provide trend, outlier, and + bucketed analysis for large data sets. You might also see the term window functions in database + literature, referring to the sequence of rows (the window) that the function call applies to, + particularly when the OVER clause includes a ROWS or RANGE + keyword. +

+ +

+ The following sections describe the analytic query clauses and the pure analytic functions provided by + Impala. For usage information about aggregate functions in an analytic context, see + . +

+ +

+ + + + + + OVER Clause + + + +

+ The OVER clause is required for calls to pure analytic functions such as + LEAD(), RANK(), and FIRST_VALUE(). When you include an + OVER clause with calls to aggregate functions such as MAX(), + COUNT(), or SUM(), they operate as analytic functions. +

+ +

+ +function(args) OVER([partition_by_clause] [order_by_clause [window_clause]]) + +partition_by_clause ::= PARTITION BY expr [, expr ...] +order_by_clause ::= ORDER BY expr [ASC | DESC] [NULLS FIRST | NULLS LAST] [, expr [ASC | DESC] [NULLS FIRST | NULLS LAST] ...] +window_clause: See Window Clause + + +

+ PARTITION BY clause: +

+ +

+ The PARTITION BY clause acts much like the GROUP BY clause in the + outermost block of a query. It divides the rows into groups containing identical values in one or more + columns. These logical groups are known as partitions. Throughout the discussion of analytic + functions, partitions refers to the groups produced by the PARTITION BY clause, not + to partitioned tables. However, note the following limitation that applies specifically to analytic function + calls involving partitioned tables. +

+ +

+ +

+ The sequence of results from an analytic function resets for each new partition in the result set.
 That is, the set of preceding or following rows considered by the analytic function always comes from a
 single partition. Any MAX(), SUM(), ROW_NUMBER(), and so
 on applies to each partition independently. Omit the PARTITION BY clause to apply the
 analytic operation to all the rows in the table.

+ +

+ ORDER BY clause: +

+ +

+ The ORDER BY clause works much like the ORDER BY clause in the outermost + block of a query. It defines the order in which rows are evaluated for the entire input set, or for each + group produced by a PARTITION BY clause. You can order by one or multiple expressions, and + for each expression optionally choose ascending or descending order and whether nulls come first or last in + the sort order. Because this ORDER BY clause only defines the order in which rows are + evaluated, if you want the results to be output in a specific order, also include an ORDER + BY clause in the outer block of the query. +

+ +

+ When the ORDER BY clause is omitted, the analytic function applies to all items in the + group produced by the PARTITION BY clause. When the ORDER BY clause is + included, the analysis can apply to all or a subset of the items in the group, depending on the optional + window clause. +

+ +

+ The order in which the rows are analyzed is only defined for those columns specified in ORDER + BY clauses. +

+ +

+ One difference between the analytic and outer uses of the ORDER BY clause: inside the
 OVER clause, ORDER BY 1 or another integer value is interpreted as a
 constant sort value (effectively a no-op) rather than referring to column 1.
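A sketch of this difference, assuming a hypothetical table t1 with a column x:

```sql
-- In the outer query, ORDER BY 1 sorts by the first SELECT-list column.
SELECT x FROM t1 ORDER BY 1;

-- Inside OVER(), ORDER BY 1 is a constant: every row ties, so the row
-- numbering is effectively arbitrary.
SELECT x, ROW_NUMBER() OVER (ORDER BY 1) AS rn FROM t1;

-- To order the window by the first column, name it explicitly.
SELECT x, ROW_NUMBER() OVER (ORDER BY x) AS rn FROM t1;
```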

+ +

+ Window clause: +

+ +

+ The window clause is only allowed in combination with an ORDER BY clause. If the + ORDER BY clause is specified but the window clause is not, the default window is + RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. See + for full details. +
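As a sketch of that default, the following two calls are equivalent (hypothetical table t1, column x):

```sql
-- Implicit default window when ORDER BY is present.
SELECT x, SUM(x) OVER (ORDER BY x) AS running_total FROM t1;

-- The same query with the default window spelled out.
SELECT x, SUM(x) OVER (ORDER BY x
    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM t1;
```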

+ +

+ +

+ Because HBase tables are optimized for single-row lookups rather than full scans, analytic functions using + the OVER() clause are not recommended for HBase tables. Although such queries work, their + performance is lower than on comparable tables using HDFS data files. +

+ +

+ +

+ Analytic functions are very efficient for Parquet tables. The data that is examined during evaluation of + the OVER() clause comes from a specified set of columns, and the values for each column + are arranged sequentially within each data file. +

+ +

+ +

+ Analytic functions are convenient to use with text tables for exploratory business intelligence. When the + volume of data is substantial, prefer to use Parquet tables for performance-critical analytic queries. +

+ +

+ +

+ +

+ The following example shows how to synthesize a numeric sequence corresponding to all the rows in a table. + The new table has the same columns as the old one, plus an additional column ID containing + the integers 1, 2, 3, and so on, corresponding to the order of a TIMESTAMP column in the + original table. +

+ + + +CREATE TABLE events_with_id AS + SELECT + row_number() OVER (ORDER BY date_and_time) AS id, + c1, c2, c3, c4 + FROM events; + + +

+ The following example shows how to determine the number of rows containing each value for a column. Unlike + a corresponding GROUP BY query, this one can analyze a single column and still return all + values (not just the distinct ones) from the other columns. +

+
+
+SELECT x, y, z,
+  count(x) OVER (PARTITION BY x) AS how_many_x
+FROM t1;
+

+ +

+ You cannot directly combine the DISTINCT operator with analytic function calls. You can + put the analytic function call in a WITH clause or an inline view, and apply the + DISTINCT operator to its result set. +

+ +WITH t1 AS (SELECT x, sum(x) OVER (PARTITION BY x) AS total FROM t1) + SELECT DISTINCT x, total FROM t1; + + +
+ +
+ + + + Window Clause + + + +

+ Certain analytic functions accept an optional window clause, which makes the function analyze + only certain rows around the current row rather than all rows in the partition. For example, you can + get a moving average by specifying some number of preceding and following rows, or a running count or + running total by specifying all rows up to the current position. This clause can result in different + analytic results for rows within the same partition. +

+ +

+ The window clause is supported with the AVG(), COUNT(),
 FIRST_VALUE(), LAST_VALUE(), and SUM() functions.
 For MAX() and MIN(), the window clause is only allowed if the start bound is
 UNBOUNDED PRECEDING.

+ +

+ +ROWS BETWEEN [ { m | UNBOUNDED } PRECEDING | CURRENT ROW] [ AND [CURRENT ROW | { UNBOUNDED | n } FOLLOWING] ] +RANGE BETWEEN [ {m | UNBOUNDED } PRECEDING | CURRENT ROW] [ AND [CURRENT ROW | { UNBOUNDED | n } FOLLOWING] ] + +

+ ROWS BETWEEN defines the size of the window in terms of the indexes of the rows in the
 result set. The size of the window is predictable, based on the clauses and on the position of the
 current row within the result set.

+ +

+ RANGE BETWEEN does not currently support numeric arguments to define a variable-size + sliding window. + +

+ + + +

+ Currently, Impala supports only some combinations of arguments to the RANGE clause: +

+ +
    +
  • RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (the default when ORDER BY is specified and the window clause is omitted)
  • RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
  • RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
+ +

+ When RANGE is used, CURRENT ROW includes not just the current row but all + rows that are tied with the current row based on the ORDER BY expressions. +
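A sketch of the tie behavior, assuming a hypothetical table scores with columns team and pts:

```sql
-- With RANGE, all rows tied on pts fall inside CURRENT ROW, so the
-- running total jumps by the whole tied group at once. With ROWS, the
-- frame advances one physical row at a time.
SELECT team, pts,
  SUM(pts) OVER (ORDER BY pts
    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_sum,
  SUM(pts) OVER (ORDER BY pts
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_sum
FROM scores;
```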

+ +

+ +

+ +

+ The following examples show financial data for a fictional stock symbol JDR. The closing + price moves up and down each day. +

+ +create table stock_ticker (stock_symbol string, closing_price decimal(8,2), closing_date timestamp); +...load some data... +select * from stock_ticker order by stock_symbol, closing_date ++--------------+---------------+---------------------+ +| stock_symbol | closing_price | closing_date | ++--------------+---------------+---------------------+ +| JDR | 12.86 | 2014-10-02 00:00:00 | +| JDR | 12.89 | 2014-10-03 00:00:00 | +| JDR | 12.94 | 2014-10-04 00:00:00 | +| JDR | 12.55 | 2014-10-05 00:00:00 | +| JDR | 14.03 | 2014-10-06 00:00:00 | +| JDR | 14.75 | 2014-10-07 00:00:00 | +| JDR | 13.98 | 2014-10-08 00:00:00 | ++--------------+---------------+---------------------+ + + +

+ The queries use analytic functions with window clauses to compute moving averages of the closing price. For + example, ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING produces an average of the value from a + 3-day span, producing a different value for each row. The first row, which has no preceding row, only gets + averaged with the row following it. If the table contained more than one stock symbol, the + PARTITION BY clause would limit the window for the moving average to only consider the + prices for a single stock. +

+ +select stock_symbol, closing_date, closing_price, + avg(closing_price) over (partition by stock_symbol order by closing_date + rows between 1 preceding and 1 following) as moving_average + from stock_ticker; ++--------------+---------------------+---------------+----------------+ +| stock_symbol | closing_date | closing_price | moving_average | ++--------------+---------------------+---------------+----------------+ +| JDR | 2014-10-02 00:00:00 | 12.86 | 12.87 | +| JDR | 2014-10-03 00:00:00 | 12.89 | 12.89 | +| JDR | 2014-10-04 00:00:00 | 12.94 | 12.79 | +| JDR | 2014-10-05 00:00:00 | 12.55 | 13.17 | +| JDR | 2014-10-06 00:00:00 | 14.03 | 13.77 | +| JDR | 2014-10-07 00:00:00 | 14.75 | 14.25 | +| JDR | 2014-10-08 00:00:00 | 13.98 | 14.36 | ++--------------+---------------------+---------------+----------------+ + + +

+ The clause ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW produces a cumulative moving + average, from the earliest data up to the value for each day. +

+ +select stock_symbol, closing_date, closing_price, + avg(closing_price) over (partition by stock_symbol order by closing_date + rows between unbounded preceding and current row) as moving_average + from stock_ticker; ++--------------+---------------------+---------------+----------------+ +| stock_symbol | closing_date | closing_price | moving_average | ++--------------+---------------------+---------------+----------------+ +| JDR | 2014-10-02 00:00:00 | 12.86 | 12.86 | +| JDR | 2014-10-03 00:00:00 | 12.89 | 12.87 | +| JDR | 2014-10-04 00:00:00 | 12.94 | 12.89 | +| JDR | 2014-10-05 00:00:00 | 12.55 | 12.81 | +| JDR | 2014-10-06 00:00:00 | 14.03 | 13.05 | +| JDR | 2014-10-07 00:00:00 | 14.75 | 13.33 | +| JDR | 2014-10-08 00:00:00 | 13.98 | 13.42 | ++--------------+---------------------+---------------+----------------+ + + + + +
+ +
+ + + + AVG Function - Analytic Context + + + +

+ You can include an OVER clause with a call to this function to use it as an analytic + function. See for details and examples. +

+ +
+ +
+ + + + COUNT Function - Analytic Context + + + +

+ You can include an OVER clause with a call to this function to use it as an analytic + function. See for details and examples. +

+ +
+ +
+ + + + CUME_DIST Function (<keyword keyref="impala23"/> or higher only) + + + +

+ Returns the cumulative distribution of a value. The value for each row in the result set is greater than 0 + and less than or equal to 1. +

+ +

+ +CUME_DIST (expr) + OVER ([partition_by_clause] order_by_clause) + + +

+ The ORDER BY clause is required. The PARTITION BY clause is optional. The + window clause is not allowed. +

+ +

+ +

+ Within each partition of the result set, the CUME_DIST() value represents an ascending + sequence that ends at 1. Each value represents the proportion of rows in the partition whose values are + less than or equal to the value in the current row. +

+ +

+ If the sequence of input values contains ties, the CUME_DIST() results are identical for the + tied values. +

+ +

+ Impala only supports the CUME_DIST() function in an analytic context, not as a regular + aggregate function. +

+ +

+ +

+ This example uses a table with 9 rows. The CUME_DIST()
 function evaluates the entire table because there is no PARTITION BY clause,
 with the rows ordered by the weight of the animal.
 The sequence of values shows that 1/9 of the values are less than or equal to the lightest
 animal (mouse), 2/9 of the values are less than or equal to the second-lightest animal,
 and so on up to the heaviest animal (elephant), where 9/9 of the rows are less than or
 equal to its weight.

+ +create table animals (name string, kind string, kilos decimal(9,3)); +insert into animals values + ('Elephant', 'Mammal', 4000), ('Giraffe', 'Mammal', 1200), ('Mouse', 'Mammal', 0.020), + ('Condor', 'Bird', 15), ('Horse', 'Mammal', 500), ('Owl', 'Bird', 2.5), + ('Ostrich', 'Bird', 145), ('Polar bear', 'Mammal', 700), ('Housecat', 'Mammal', 5); + +select name, cume_dist() over (order by kilos) from animals; ++------------+-----------------------+ +| name | cume_dist() OVER(...) | ++------------+-----------------------+ +| Elephant | 1 | +| Giraffe | 0.8888888888888888 | +| Polar bear | 0.7777777777777778 | +| Horse | 0.6666666666666666 | +| Ostrich | 0.5555555555555556 | +| Condor | 0.4444444444444444 | +| Housecat | 0.3333333333333333 | +| Owl | 0.2222222222222222 | +| Mouse | 0.1111111111111111 | ++------------+-----------------------+ + + +

+ Using a PARTITION BY clause produces a separate sequence for each partition + group, in this case one for mammals and one for birds. Because there are 3 birds and 6 mammals, + the sequence illustrates how 1/3 of the Bird rows have a kilos value that is less than or equal to + the lightest bird, 1/6 of the Mammal rows have a kilos value that is less than or equal to + the lightest mammal, and so on until both the heaviest bird and heaviest mammal have a CUME_DIST() + value of 1. +

+ +select name, kind, cume_dist() over (partition by kind order by kilos) from animals ++------------+--------+-----------------------+ +| name | kind | cume_dist() OVER(...) | ++------------+--------+-----------------------+ +| Ostrich | Bird | 1 | +| Condor | Bird | 0.6666666666666666 | +| Owl | Bird | 0.3333333333333333 | +| Elephant | Mammal | 1 | +| Giraffe | Mammal | 0.8333333333333334 | +| Polar bear | Mammal | 0.6666666666666666 | +| Horse | Mammal | 0.5 | +| Housecat | Mammal | 0.3333333333333333 | +| Mouse | Mammal | 0.1666666666666667 | ++------------+--------+-----------------------+ + + +

+ We can reverse the ordering within each partition group by using an ORDER BY ... DESC + clause within the OVER() clause. Now the lightest (smallest value of kilos) + animal of each kind has a CUME_DIST() value of 1. +

+ +select name, kind, cume_dist() over (partition by kind order by kilos desc) from animals ++------------+--------+-----------------------+ +| name | kind | cume_dist() OVER(...) | ++------------+--------+-----------------------+ +| Owl | Bird | 1 | +| Condor | Bird | 0.6666666666666666 | +| Ostrich | Bird | 0.3333333333333333 | +| Mouse | Mammal | 1 | +| Housecat | Mammal | 0.8333333333333334 | +| Horse | Mammal | 0.6666666666666666 | +| Polar bear | Mammal | 0.5 | +| Giraffe | Mammal | 0.3333333333333333 | +| Elephant | Mammal | 0.1666666666666667 | ++------------+--------+-----------------------+ + + +

+ The following example manufactures some rows with identical values in the kilos column, + to demonstrate how the results look in case of tie values. For simplicity, it only shows the CUME_DIST() + sequence for the Bird rows. Now with 3 rows all with a value of 15, all of those rows have the same + CUME_DIST() value. 4/5 of the rows have a value for kilos that is less than or + equal to 15. +

+insert into animals values ('California Condor', 'Bird', 15), ('Andean Condor', 'Bird', 15);
+
+select name, kind, cume_dist() over (order by kilos) from animals where kind = 'Bird';
++-------------------+------+-----------------------+
+| name              | kind | cume_dist() OVER(...) |
++-------------------+------+-----------------------+
+| Ostrich           | Bird | 1                     |
+| Condor            | Bird | 0.8                   |
+| California Condor | Bird | 0.8                   |
+| Andean Condor     | Bird | 0.8                   |
+| Owl               | Bird | 0.2                   |
++-------------------+------+-----------------------+
+

+ The following example shows how to use an ORDER BY clause in the outer block + to order the result set in case of ties. Here, all the Bird rows are together, then in descending order + by the result of the CUME_DIST() function, and all tied CUME_DIST() + values are ordered by the animal name. +

+ +select name, kind, cume_dist() over (partition by kind order by kilos) as ordering + from animals +where + kind = 'Bird' +order by kind, ordering desc, name; ++-------------------+------+----------+ +| name | kind | ordering | ++-------------------+------+----------+ +| Ostrich | Bird | 1 | +| Andean Condor | Bird | 0.8 | +| California Condor | Bird | 0.8 | +| Condor | Bird | 0.8 | +| Owl | Bird | 0.2 | ++-------------------+------+----------+ + + +
+ +
+ + + + DENSE_RANK Function + + + +

+ Returns an ascending sequence of integers, starting with 1. The output sequence produces duplicate integers
 for duplicate values of the ORDER BY expressions. After generating duplicate output values
 for the tied input values, the function continues the sequence with the next higher integer.
 Therefore, the sequence contains duplicates but no gaps when the input contains duplicates. Starts the
 sequence over for each group produced by the PARTITION BY clause.

+ +

+ +DENSE_RANK() OVER([partition_by_clause] order_by_clause) + +

+ The PARTITION BY clause is optional. The ORDER BY clause is required. The + window clause is not allowed. +

+ + + +

+ +

+ Often used for top-N and bottom-N queries. For example, it could produce a top 10 report including + all the items with the 10 highest values, even if several items tied for 1st place. +
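The top-N pattern typically wraps the analytic call in an inline view, because an analytic function cannot appear directly in a WHERE clause. A minimal sketch, assuming a hypothetical table items with columns item and val:

```sql
-- Report the items with the 10 highest values, keeping all ties.
SELECT item, val
FROM (SELECT item, val,
        DENSE_RANK() OVER (ORDER BY val DESC) AS placement
      FROM items) ranked
WHERE placement <= 10;
```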

+ +

+ Similar to ROW_NUMBER and RANK. These functions differ in how they treat + duplicate combinations of values. +

+ +

+ +

+ +

+ The following example demonstrates how the DENSE_RANK() function identifies where each + value places in the result set, producing the same result for duplicate values, but with a strict + sequence from 1 to the number of groups. For example, when results are ordered by the X + column, both 1 values are tied for first; both 2 values are tied for + second; and so on. +

+ +select x, dense_rank() over(order by x) as rank, property from int_t; ++----+------+----------+ +| x | rank | property | ++----+------+----------+ +| 1 | 1 | square | +| 1 | 1 | odd | +| 2 | 2 | even | +| 2 | 2 | prime | +| 3 | 3 | prime | +| 3 | 3 | odd | +| 4 | 4 | even | +| 4 | 4 | square | +| 5 | 5 | odd | +| 5 | 5 | prime | +| 6 | 6 | even | +| 6 | 6 | perfect | +| 7 | 7 | lucky | +| 7 | 7 | lucky | +| 7 | 7 | lucky | +| 7 | 7 | odd | +| 7 | 7 | prime | +| 8 | 8 | even | +| 9 | 9 | square | +| 9 | 9 | odd | +| 10 | 10 | round | +| 10 | 10 | even | ++----+------+----------+ + + +

+ The following examples show how the DENSE_RANK() function is affected by the + PARTITION property within the ORDER BY clause. +

+ +

+ Partitioning by the PROPERTY column groups all the even, odd, and so on values together, + and DENSE_RANK() returns the place of each value within the group, producing several + ascending sequences. +

+ +select x, dense_rank() over(partition by property order by x) as rank, property from int_t; ++----+------+----------+ +| x | rank | property | ++----+------+----------+ +| 2 | 1 | even | +| 4 | 2 | even | +| 6 | 3 | even | +| 8 | 4 | even | +| 10 | 5 | even | +| 7 | 1 | lucky | +| 7 | 1 | lucky | +| 7 | 1 | lucky | +| 1 | 1 | odd | +| 3 | 2 | odd | +| 5 | 3 | odd | +| 7 | 4 | odd | +| 9 | 5 | odd | +| 6 | 1 | perfect | +| 2 | 1 | prime | +| 3 | 2 | prime | +| 5 | 3 | prime | +| 7 | 4 | prime | +| 10 | 1 | round | +| 1 | 1 | square | +| 4 | 2 | square | +| 9 | 3 | square | ++----+------+----------+ + + +

+ Partitioning by the X column groups all the duplicate numbers together and returns the
 place of each value within the group; because most values occur only 1 or 2 times,
 DENSE_RANK() typically designates each X value as either first or second within its
 group.

+ +select x, dense_rank() over(partition by x order by property) as rank, property from int_t; ++----+------+----------+ +| x | rank | property | ++----+------+----------+ +| 1 | 1 | odd | +| 1 | 2 | square | +| 2 | 1 | even | +| 2 | 2 | prime | +| 3 | 1 | odd | +| 3 | 2 | prime | +| 4 | 1 | even | +| 4 | 2 | square | +| 5 | 1 | odd | +| 5 | 2 | prime | +| 6 | 1 | even | +| 6 | 2 | perfect | +| 7 | 1 | lucky | +| 7 | 1 | lucky | +| 7 | 1 | lucky | +| 7 | 2 | odd | +| 7 | 3 | prime | +| 8 | 1 | even | +| 9 | 1 | odd | +| 9 | 2 | square | +| 10 | 1 | even | +| 10 | 2 | round | ++----+------+----------+ + + +

+ The following example shows how DENSE_RANK() produces a continuous sequence while still + allowing for ties. In this case, Croesus and Midas both have the second largest fortune, while Crassus has + the third largest. (In , you see a similar query with the + RANK() function that shows that while Crassus has the third largest fortune, he is the + fourth richest person.) +

+ +select dense_rank() over (order by net_worth desc) as placement, name, net_worth from wealth order by placement, name; ++-----------+---------+---------------+ +| placement | name | net_worth | ++-----------+---------+---------------+ +| 1 | Solomon | 2000000000.00 | +| 2 | Croesus | 1000000000.00 | +| 2 | Midas | 1000000000.00 | +| 3 | Crassus | 500000000.00 | +| 4 | Scrooge | 80000000.00 | ++-----------+---------+---------------+ + + +

+ +


+ +
+ +
+ + + + FIRST_VALUE Function + + + +

+ Returns the expression value from the first row in the window. The return value is NULL if + the input expression is NULL. +

+ +

+ +FIRST_VALUE(expr) OVER([partition_by_clause] order_by_clause [window_clause]) + +

+ The PARTITION BY clause is optional. The ORDER BY clause is required. The + window clause is optional. +

+ +

+ +

+ If any duplicate values occur in the tuples evaluated by the ORDER BY clause, the result + of this function is not deterministic. Consider adding additional ORDER BY columns to + ensure consistent ordering. +

+ +

+ +

+ +

+ The following example shows a table with a wide variety of country-appropriate greetings. For consistency, + we want to standardize on a single greeting for each country. The FIRST_VALUE() function + helps to produce a mail merge report where every person from the same country is addressed with the same + greeting. +

+ +select name, country, greeting from mail_merge ++---------+---------+--------------+ +| name | country | greeting | ++---------+---------+--------------+ +| Pete | USA | Hello | +| John | USA | Hi | +| Boris | Germany | Guten tag | +| Michael | Germany | Guten morgen | +| Bjorn | Sweden | Hej | +| Mats | Sweden | Tja | ++---------+---------+--------------+ + +select country, name, + first_value(greeting) + over (partition by country order by name, greeting) as greeting + from mail_merge; ++---------+---------+-----------+ +| country | name | greeting | ++---------+---------+-----------+ +| Germany | Boris | Guten tag | +| Germany | Michael | Guten tag | +| Sweden | Bjorn | Hej | +| Sweden | Mats | Hej | +| USA | John | Hi | +| USA | Pete | Hi | ++---------+---------+-----------+ + + +

+ Changing the order in which the names are evaluated changes which greeting is applied to each group. +

+ +select country, name, + first_value(greeting) + over (partition by country order by name desc, greeting) as greeting + from mail_merge; ++---------+---------+--------------+ +| country | name | greeting | ++---------+---------+--------------+ +| Germany | Michael | Guten morgen | +| Germany | Boris | Guten morgen | +| Sweden | Mats | Tja | +| Sweden | Bjorn | Tja | +| USA | Pete | Hello | +| USA | John | Hello | ++---------+---------+--------------+ + + +

+ +

+ +

+ +
+ +
+ + + + LAG Function + + + +

+ This function returns the value of an expression using column values from a preceding row. You specify an + integer offset, which designates a row position some number of rows previous to the current row. Any column + references in the expression argument refer to column values from that prior row. Typically, the table + contains a time sequence or numeric sequence column that clearly distinguishes the ordering of the rows. +

+ +

+ +LAG (expr [, offset] [, default]) + OVER ([partition_by_clause] order_by_clause) + +

+ The ORDER BY clause is required. The PARTITION BY clause is optional. The + window clause is not allowed. +

+ +

+ +

+ Sometimes used as an alternative to doing a self-join.
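For comparison, a sketch of the self-join that LAG() replaces, against the stock_ticker table from the window clause examples (the date_sub() call assumes consecutive daily rows):

```sql
-- Self-join alternative: match each row with the row one day earlier.
-- LAG() avoids the extra scan and the brittle date arithmetic.
SELECT t.stock_symbol, t.closing_date, t.closing_price,
       prev.closing_price AS yesterday_closing
FROM stock_ticker t
LEFT JOIN stock_ticker prev
  ON prev.stock_symbol = t.stock_symbol
  AND prev.closing_date = date_sub(t.closing_date, 1)
ORDER BY t.closing_date;
```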

+ +

+ +

+ +

+ The following example uses the same stock data created in the window clause examples. For each day, the
 query prints the closing price alongside the previous day's closing price. The first row for each stock
 symbol has no previous row, so that LAG() value is NULL.

+ +select stock_symbol, closing_date, closing_price, + lag(closing_price,1) over (partition by stock_symbol order by closing_date) as "yesterday closing" + from stock_ticker + order by closing_date; ++--------------+---------------------+---------------+-------------------+ +| stock_symbol | closing_date | closing_price | yesterday closing | ++--------------+---------------------+---------------+-------------------+ +| JDR | 2014-09-13 00:00:00 | 12.86 | NULL | +| JDR | 2014-09-14 00:00:00 | 12.89 | 12.86 | +| JDR | 2014-09-15 00:00:00 | 12.94 | 12.89 | +| JDR | 2014-09-16 00:00:00 | 12.55 | 12.94 | +| JDR | 2014-09-17 00:00:00 | 14.03 | 12.55 | +| JDR | 2014-09-18 00:00:00 | 14.75 | 14.03 | +| JDR | 2014-09-19 00:00:00 | 13.98 | 14.75 | ++--------------+---------------------+---------------+-------------------+ + + +

+ The following example does an arithmetic operation between the current row and a value from the previous + row, to produce a delta value for each day. This example also demonstrates how ORDER BY + works independently in the different parts of the query. The ORDER BY closing_date in the + OVER clause makes the query analyze the rows in chronological order. Then the outer query + block uses ORDER BY closing_date DESC to present the results with the most recent date + first. +

+ +select stock_symbol, closing_date, closing_price, + cast( + closing_price - lag(closing_price,1) over + (partition by stock_symbol order by closing_date) + as decimal(8,2) + ) + as "change from yesterday" + from stock_ticker + order by closing_date desc; ++--------------+---------------------+---------------+-----------------------+ +| stock_symbol | closing_date | closing_price | change from yesterday | ++--------------+---------------------+---------------+-----------------------+ +| JDR | 2014-09-19 00:00:00 | 13.98 | -0.76 | +| JDR | 2014-09-18 00:00:00 | 14.75 | 0.72 | +| JDR | 2014-09-17 00:00:00 | 14.03 | 1.47 | +| JDR | 2014-09-16 00:00:00 | 12.55 | -0.38 | +| JDR | 2014-09-15 00:00:00 | 12.94 | 0.04 | +| JDR | 2014-09-14 00:00:00 | 12.89 | 0.03 | +| JDR | 2014-09-13 00:00:00 | 12.86 | NULL | ++--------------+---------------------+---------------+-----------------------+ + + +

+ +

+ This function is the converse of . +

+ +
+ +
+ + + + LAST_VALUE Function + + + +

+ Returns the expression value from the last row in the window. This same value is repeated for all result + rows for the group. The return value is NULL if the input expression is + NULL. +

+ +

+ +LAST_VALUE(expr) OVER([partition_by_clause] order_by_clause [window_clause]) + +

+ The PARTITION BY clause is optional. The ORDER BY clause is required. The + window clause is optional. +

+ +

+ +

+ If any duplicate values occur in the tuples evaluated by the ORDER BY clause, the result + of this function is not deterministic. Consider adding additional ORDER BY columns to + ensure consistent ordering. +

+ +

+ +

+ +

+ The following example uses the same MAIL_MERGE table as in the example for + . Because the default window when ORDER + BY is used is BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, the query requires the + UNBOUNDED FOLLOWING to look ahead to subsequent rows and find the last value for each + country. +

+ +select country, name, + last_value(greeting) over ( + partition by country order by name, greeting + rows between unbounded preceding and unbounded following + ) as greeting + from mail_merge ++---------+---------+--------------+ +| country | name | greeting | ++---------+---------+--------------+ +| Germany | Boris | Guten morgen | +| Germany | Michael | Guten morgen | +| Sweden | Bjorn | Tja | +| Sweden | Mats | Tja | +| USA | John | Hello | +| USA | Pete | Hello | ++---------+---------+--------------+ + + +

+ +

+ +

+ +
+ +
+ + + + LEAD Function + + + +

+ This function returns the value of an expression using column values from a following row. You specify an + integer offset, which designates a row position some number of rows after the current row. Any column + references in the expression argument refer to column values from that later row. Typically, the table + contains a time sequence or numeric sequence column that clearly distinguishes the ordering of the rows. +

+ +

+ +LEAD (expr [, offset] [, default]) + OVER ([partition_by_clause] order_by_clause) + +

+ The ORDER BY clause is required. The PARTITION BY clause is optional. The + window clause is not allowed. +

+ +

+ +

+ Sometimes used as an alternative to doing a self-join. +

+ +

+ +

+ +

+ The following example uses the same stock data created in . The query analyzes + the closing price for a stock symbol, and for each day evaluates whether the closing price for the following day + is higher or lower. +

+ +select stock_symbol, closing_date, closing_price, + case + (lead(closing_price,1) + over (partition by stock_symbol order by closing_date) + - closing_price) > 0 + when true then "higher" + when false then "flat or lower" + end as "trending" +from stock_ticker + order by closing_date; ++--------------+---------------------+---------------+---------------+ +| stock_symbol | closing_date | closing_price | trending | ++--------------+---------------------+---------------+---------------+ +| JDR | 2014-09-13 00:00:00 | 12.86 | higher | +| JDR | 2014-09-14 00:00:00 | 12.89 | higher | +| JDR | 2014-09-15 00:00:00 | 12.94 | flat or lower | +| JDR | 2014-09-16 00:00:00 | 12.55 | higher | +| JDR | 2014-09-17 00:00:00 | 14.03 | higher | +| JDR | 2014-09-18 00:00:00 | 14.75 | flat or lower | +| JDR | 2014-09-19 00:00:00 | 13.98 | NULL | ++--------------+---------------------+---------------+---------------+ + + +

+ +

+ This function is the converse of . +

+ +
+ +
+ + + + MAX Function - Analytic Context + + + +

+ You can include an OVER clause with a call to this function to use it as an analytic + function. See for details and examples. +

+ +
+ +
+ + + + MIN Function - Analytic Context + + + +

+ You can include an OVER clause with a call to this function to use it as an analytic + function. See for details and examples. +

+ +
+ +
+ + + + NTH_VALUE Function + + + +

+ +

+ +

+ + + + + + + + NTILE Function (<keyword keyref="impala23"/> or higher only) + + + +

+ Returns the bucket number associated with each row, between 1 and the value of an expression. For + example, creating 100 buckets puts the lowest 1% of values in the first bucket, while creating 10 buckets + puts the lowest 10% of values in the first bucket. Each partition can have a different number of buckets. + +

+ +

+ +NTILE (expr [, offset ...]) + OVER ([partition_by_clause] order_by_clause) +

+ The ORDER BY clause is required. The PARTITION BY clause is optional. The + window clause is not allowed. +

+ +

+ +

+ The ntile name is derived from the practice of dividing result sets into fourths (quartiles), tenths + (deciles), and so on. The NTILE() function generalizes this practice, dividing the result set into an + arbitrary number of buckets. +

+ +

+ The number of buckets must be a positive integer. +

+ +

+ The number of items in each bucket is identical or almost so, varying by at most 1. If the number of items + does not divide evenly between the buckets, the remaining N items are distributed one each among the first N + buckets. +

+ +

+ If the number of buckets N is greater than the number of input rows in the partition, then the first N + buckets each contain one item, and the remaining buckets are empty. +
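The bucket-distribution rules above can be sketched in a few lines of Python. This is an illustrative sketch only, not Impala's implementation: with `r` rows and `n` buckets, each bucket receives `r // n` rows, and the first `r % n` buckets receive one extra row each.

```python
# Illustrative sketch (not Impala's implementation) of how NTILE(n)
# distributes rows among buckets: each bucket gets rows // buckets items,
# and the first (rows % buckets) buckets each get one extra item.
def ntile_bucket_sizes(rows, buckets):
    base, extra = divmod(rows, buckets)
    return [base + 1 if i < extra else base for i in range(buckets)]

# 9 rows into 4 buckets: the one leftover row goes to the first bucket,
# matching the ANIMALS example below.
print(ntile_bucket_sizes(9, 4))   # [3, 2, 2, 2]

# More buckets than rows: the first 3 buckets hold one item each,
# the remaining buckets are empty.
print(ntile_bucket_sizes(3, 5))   # [1, 1, 1, 0, 0]
```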

+ +

+ +

+ The following example divides a group of animals into 4 buckets based on their weight. The + ORDER BY ... DESC clause in the OVER() clause means that the heaviest 25% + are in the first group, and the lightest 25% are in the fourth group. (The ORDER BY in the + outermost part of the query shows how you can order the final result set independently from the order in + which the rows are evaluated by the OVER() clause.) Because there are 9 rows in the group, + divided into 4 buckets, the first bucket receives the extra item. +

+ +create table animals (name string, kind string, kilos decimal(9,3)); + +insert into animals values + ('Elephant', 'Mammal', 4000), ('Giraffe', 'Mammal', 1200), ('Mouse', 'Mammal', 0.020), + ('Condor', 'Bird', 15), ('Horse', 'Mammal', 500), ('Owl', 'Bird', 2.5), + ('Ostrich', 'Bird', 145), ('Polar bear', 'Mammal', 700), ('Housecat', 'Mammal', 5); + +select name, ntile(4) over (order by kilos desc) as quarter + from animals +order by quarter desc; ++------------+---------+ +| name | quarter | ++------------+---------+ +| Owl | 4 | +| Mouse | 4 | +| Condor | 3 | +| Housecat | 3 | +| Horse | 2 | +| Ostrich | 2 | +| Elephant | 1 | +| Giraffe | 1 | +| Polar bear | 1 | ++------------+---------+ + + +

+ The following examples show how the PARTITION BY clause works for the + NTILE() function. Here, we divide each kind of animal (mammal or bird) into 2 buckets, + the heavier half and the lighter half. +

+ +select name, kind, ntile(2) over (partition by kind order by kilos desc) as half + from animals +order by kind; ++------------+--------+------+ +| name | kind | half | ++------------+--------+------+ +| Ostrich | Bird | 1 | +| Condor | Bird | 1 | +| Owl | Bird | 2 | +| Elephant | Mammal | 1 | +| Giraffe | Mammal | 1 | +| Polar bear | Mammal | 1 | +| Horse | Mammal | 2 | +| Housecat | Mammal | 2 | +| Mouse | Mammal | 2 | ++------------+--------+------+ + + +

+ Again, the result set can be ordered independently + from the analytic evaluation. This next example lists all the animals heaviest to lightest, + showing that elephant and giraffe are in the top half of mammals by weight, while + housecat and mouse are in the bottom half. +

+ +select name, kind, ntile(2) over (partition by kind order by kilos desc) as half + from animals +order by kilos desc; ++------------+--------+------+ +| name | kind | half | ++------------+--------+------+ +| Elephant | Mammal | 1 | +| Giraffe | Mammal | 1 | +| Polar bear | Mammal | 1 | +| Horse | Mammal | 2 | +| Ostrich | Bird | 1 | +| Condor | Bird | 1 | +| Housecat | Mammal | 2 | +| Owl | Bird | 2 | +| Mouse | Mammal | 2 | ++------------+--------+------+ + + +
+ +
+ + + + PERCENT_RANK Function (<keyword keyref="impala23"/> or higher only) + + + +

+ +PERCENT_RANK (expr) + OVER ([partition_by_clause] order_by_clause) + + +

+ Calculates the rank, expressed as a percentage, of each row within a group of rows. + If rank is the value for that same row from the RANK() function (from 1 to the total number of rows in the partition group), + then the PERCENT_RANK() value is calculated as (rank - 1) / (rows_in_group - 1). + If there is only a single item in the partition group, its PERCENT_RANK() value is 0. +
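The formula above can be worked through with a small Python sketch. This is illustrative only, not Impala's implementation; it applies RANK() semantics (tied values share the lowest rank) and then computes (rank - 1) / (rows_in_group - 1) for one partition group.

```python
# Illustrative sketch (not Impala's implementation) of PERCENT_RANK()
# within a single partition group. Tied values share the lowest rank,
# so tied rows all get the same (lowest) PERCENT_RANK() value.
def percent_rank(values):
    n = len(values)
    ordered = sorted(values)
    ranks = {}
    for i, v in enumerate(ordered):
        ranks.setdefault(v, i + 1)    # ties keep the first (lowest) rank
    if n == 1:
        return {ordered[0]: 0.0}      # single-item group: always 0
    return {v: (r - 1) / (n - 1) for v, r in ranks.items()}

# Mammal weights from the ANIMALS example: six rows, ranks 1..6,
# giving the sequence 0, 0.2, 0.4, 0.6, 0.8, 1.
print(percent_rank([0.020, 5, 500, 700, 1200, 4000]))
```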

+ +

+ The ORDER BY clause is required. The PARTITION BY clause is optional. The + window clause is not allowed. +

+ +

+ +

+ This function is similar to the RANK() and CUME_DIST() functions: it returns an ascending sequence representing the position of each + row within the rows of the same partition group. The actual numeric sequence is calculated differently, + and the handling of duplicate (tied) values is different. +

+ +

+ The return values range from 0 to 1 inclusive. + The first row in each partition group always has the value 0. + A NULL value is considered the lowest possible value. + In the case of duplicate input values, all the corresponding rows in the result set + have an identical value: the lowest PERCENT_RANK() value of those + tied rows. (In contrast to CUME_DIST(), where all tied rows have + the highest CUME_DIST() value.) +

+ +

+ +

+ The following example uses the same ANIMALS table as the examples for CUME_DIST() + and NTILE(), with a few additional rows to illustrate the results where some values are + NULL or there is only a single row in a partition group. +

+ +insert into animals values ('Komodo dragon', 'Reptile', 70); +insert into animals values ('Unicorn', 'Mythical', NULL); +insert into animals values ('Fire-breathing dragon', 'Mythical', NULL); + + +

+ As with CUME_DIST(), there is an ascending sequence for each kind of animal. + For example, the Birds and Mammals rows each have a PERCENT_RANK() sequence + that ranges from 0 to 1. + The Reptile row has a PERCENT_RANK() of 0 because that partition group contains only a single item. + Both Mythical animals have a PERCENT_RANK() of 0 because + a NULL is considered the lowest value within its partition group. +

+ +select name, kind, percent_rank() over (partition by kind order by kilos) from animals; ++-----------------------+----------+--------------------------+ +| name | kind | percent_rank() OVER(...) | ++-----------------------+----------+--------------------------+ +| Mouse | Mammal | 0 | +| Housecat | Mammal | 0.2 | +| Horse | Mammal | 0.4 | +| Polar bear | Mammal | 0.6 | +| Giraffe | Mammal | 0.8 | +| Elephant | Mammal | 1 | +| Komodo dragon | Reptile | 0 | +| Owl | Bird | 0 | +| California Condor | Bird | 0.25 | +| Andean Condor | Bird | 0.25 | +| Condor | Bird | 0.25 | +| Ostrich | Bird | 1 | +| Fire-breathing dragon | Mythical | 0 | +| Unicorn | Mythical | 0 | ++-----------------------+----------+--------------------------+ + +
+ +
+ + + + RANK Function + + + +

+ Returns an ascending sequence of integers, starting with 1. The output sequence produces duplicate integers + for duplicate values of the ORDER BY expressions. After generating duplicate output values + for the tied input values, the function increments the sequence by the number of tied values. + Therefore, the sequence contains both duplicates and gaps when the input contains duplicates. Starts the + sequence over for each group produced by the PARTITION BY clause. +
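The duplicates-and-gaps behavior can be sketched as follows. This is an illustrative Python sketch, not Impala's implementation; it assumes the input is already sorted by the ORDER BY expressions.

```python
# Illustrative sketch (not Impala's implementation): RANK() assigns
# tied values the same rank, then skips ahead by the number of ties,
# so the sequence contains gaps when the input contains duplicates.
def rank_sequence(sorted_values):
    ranks = []
    for i, v in enumerate(sorted_values):
        if i > 0 and v == sorted_values[i - 1]:
            ranks.append(ranks[-1])   # tie: repeat the shared rank
        else:
            ranks.append(i + 1)       # jump past any earlier ties
    return ranks

# Matches the INT_T example below: both 1s tied for first,
# both 2s tied for third, and so on.
print(rank_sequence([1, 1, 2, 2, 3, 3]))  # [1, 1, 3, 3, 5, 5]
```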

+ +

+ +RANK() OVER([partition_by_clause] order_by_clause) + +

+ The PARTITION BY clause is optional. The ORDER BY clause is required. The + window clause is not allowed. +

+ +

+ + + +

+ Often used for top-N and bottom-N queries. For example, it could produce a top 10 report including + several items that were tied for 10th place. +

+ +

+ Similar to ROW_NUMBER and DENSE_RANK. These functions differ in how they + treat duplicate combinations of values. +

+ +

+ +

+ +

+ The following example demonstrates how the RANK() function identifies where each value + places in the result set, producing the same result for duplicate values, and skipping values in the + sequence to account for the number of duplicates. For example, when results are ordered by the + X column, both 1 values are tied for first; both 2 + values are tied for third; and so on. +

+ +select x, rank() over(order by x) as rank, property from int_t; ++----+------+----------+ +| x | rank | property | ++----+------+----------+ +| 1 | 1 | square | +| 1 | 1 | odd | +| 2 | 3 | even | +| 2 | 3 | prime | +| 3 | 5 | prime | +| 3 | 5 | odd | +| 4 | 7 | even | +| 4 | 7 | square | +| 5 | 9 | odd | +| 5 | 9 | prime | +| 6 | 11 | even | +| 6 | 11 | perfect | +| 7 | 13 | lucky | +| 7 | 13 | lucky | +| 7 | 13 | lucky | +| 7 | 13 | odd | +| 7 | 13 | prime | +| 8 | 18 | even | +| 9 | 19 | square | +| 9 | 19 | odd | +| 10 | 21 | round | +| 10 | 21 | even | ++----+------+----------+ + + +

+ The following examples show how the RANK() function is affected by the + PARTITION BY clause within the OVER() clause. +

+ +

+ Partitioning by the PROPERTY column groups all the even, odd, and so on values together, + and RANK() returns the place of each value within the group, producing several ascending + sequences. +

+ +select x, rank() over(partition by property order by x) as rank, property from int_t; ++----+------+----------+ +| x | rank | property | ++----+------+----------+ +| 2 | 1 | even | +| 4 | 2 | even | +| 6 | 3 | even | +| 8 | 4 | even | +| 10 | 5 | even | +| 7 | 1 | lucky | +| 7 | 1 | lucky | +| 7 | 1 | lucky | +| 1 | 1 | odd | +| 3 | 2 | odd | +| 5 | 3 | odd | +| 7 | 4 | odd | +| 9 | 5 | odd | +| 6 | 1 | perfect | +| 2 | 1 | prime | +| 3 | 2 | prime | +| 5 | 3 | prime | +| 7 | 4 | prime | +| 10 | 1 | round | +| 1 | 1 | square | +| 4 | 2 | square | +| 9 | 3 | square | ++----+------+----------+ + + +

+ Partitioning by the X column groups all the duplicate numbers together, and RANK() + returns the place of each row within its group, based on the ordering of the PROPERTY column. + Most X values occur only 1 or 2 times, so RANK() designates them as either first or second within their + group; the value 7, which occurs five times, produces both ties and a gap within its + group. +

+ +select x, rank() over(partition by x order by property) as rank, property from int_t; ++----+------+----------+ +| x | rank | property | ++----+------+----------+ +| 1 | 1 | odd | +| 1 | 2 | square | +| 2 | 1 | even | +| 2 | 2 | prime | +| 3 | 1 | odd | +| 3 | 2 | prime | +| 4 | 1 | even | +| 4 | 2 | square | +| 5 | 1 | odd | +| 5 | 2 | prime | +| 6 | 1 | even | +| 6 | 2 | perfect | +| 7 | 1 | lucky | +| 7 | 1 | lucky | +| 7 | 1 | lucky | +| 7 | 4 | odd | +| 7 | 5 | prime | +| 8 | 1 | even | +| 9 | 1 | odd | +| 9 | 2 | square | +| 10 | 1 | even | +| 10 | 2 | round | ++----+------+----------+ + + +

+ The following example shows how a magazine might prepare a list of history's wealthiest people. Croesus and + Midas are tied for second, then Crassus is fourth. +

+ +select rank() over (order by net_worth desc) as rank, name, net_worth from wealth order by rank, name; ++------+---------+---------------+ +| rank | name | net_worth | ++------+---------+---------------+ +| 1 | Solomon | 2000000000.00 | +| 2 | Croesus | 1000000000.00 | +| 2 | Midas | 1000000000.00 | +| 4 | Crassus | 500000000.00 | +| 5 | Scrooge | 80000000.00 | ++------+---------+---------------+ + + +

+ +

+ , + +

+ +
+ +
+ + + + ROW_NUMBER Function + + + +

+ Returns an ascending sequence of integers, starting with 1. Starts the sequence over for each group + produced by the PARTITION BY clause. The output sequence includes different values for + duplicate input values. Therefore, the sequence never contains any duplicates or gaps, regardless of + duplicate input values. +

+ +

+ +ROW_NUMBER() OVER([partition_by_clause] order_by_clause) + +

+ The ORDER BY clause is required. The PARTITION BY clause is optional. The + window clause is not allowed. +

+ +

+ +

+ Often used for top-N and bottom-N queries where the input values are known to be unique, or precisely N + rows are needed regardless of duplicate values. +

+ +

+ Because its result value is different for each row in the result set (when used without a PARTITION + BY clause), ROW_NUMBER() can be used to synthesize unique numeric ID values, for + example for result sets involving unique values or tuples. +

+ +

+ Similar to RANK and DENSE_RANK. These functions differ in how they treat + duplicate combinations of values. +
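The difference in how the three ranking functions treat duplicates can be sketched side by side. This is an illustrative Python sketch, not Impala's implementation; it assumes the input is already sorted by the ORDER BY expressions.

```python
# Illustrative sketch (not Impala's implementation) contrasting the
# three ranking functions on sorted input containing duplicates:
#   ROW_NUMBER(): always 1, 2, 3, ...        (no duplicates, no gaps)
#   RANK():       ties share a rank, gaps    (1, 1, 3, ...)
#   DENSE_RANK(): ties share a rank, no gaps (1, 1, 2, ...)
def ranking(sorted_values):
    row_number, rank, dense_rank = [], [], []
    for i, v in enumerate(sorted_values):
        row_number.append(i + 1)
        if i > 0 and v == sorted_values[i - 1]:
            rank.append(rank[-1])              # tie: same rank, gap later
            dense_rank.append(dense_rank[-1])  # tie: same rank, no gap
        else:
            rank.append(i + 1)
            dense_rank.append((dense_rank[-1] + 1) if dense_rank else 1)
    return row_number, rank, dense_rank

print(ranking([1, 1, 2, 3, 3]))
# ([1, 2, 3, 4, 5], [1, 1, 3, 4, 4], [1, 1, 2, 3, 3])
```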

+ +

+ +

+ +

+ The following example demonstrates how ROW_NUMBER() produces a continuous numeric + sequence, even though some values of X are repeated. +

+ +select x, row_number() over(order by x, property) as row_number, property from int_t; ++----+------------+----------+ +| x | row_number | property | ++----+------------+----------+ +| 1 | 1 | odd | +| 1 | 2 | square | +| 2 | 3 | even | +| 2 | 4 | prime | +| 3 | 5 | odd | +| 3 | 6 | prime | +| 4 | 7 | even | +| 4 | 8 | square | +| 5 | 9 | odd | +| 5 | 10 | prime | +| 6 | 11 | even | +| 6 | 12 | perfect | +| 7 | 13 | lucky | +| 7 | 14 | lucky | +| 7 | 15 | lucky | +| 7 | 16 | odd | +| 7 | 17 | prime | +| 8 | 18 | even | +| 9 | 19 | odd | +| 9 | 20 | square | +| 10 | 21 | even | +| 10 | 22 | round | ++----+------------+----------+ + + +

+ The following example shows how a financial institution might assign customer IDs to some of history's + wealthiest figures. Although two of the people have identical net worth figures, unique IDs are required + for this purpose. ROW_NUMBER() produces a sequence of five different values for the five + input rows. +

+ +select row_number() over (order by net_worth desc) as account_id, name, net_worth + from wealth order by account_id, name; ++------------+---------+---------------+ +| account_id | name | net_worth | ++------------+---------+---------------+ +| 1 | Solomon | 2000000000.00 | +| 2 | Croesus | 1000000000.00 | +| 3 | Midas | 1000000000.00 | +| 4 | Crassus | 500000000.00 | +| 5 | Scrooge | 80000000.00 | ++------------+---------+---------------+ + + +

+ +

+ , +

+ +
+ +
+ + + + SUM Function - Analytic Context + + + +

+ You can include an OVER clause with a call to this function to use it as an analytic + function. See for details and examples. +

+ +
+ +
+ + http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_appx_count_distinct.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_appx_count_distinct.xml b/docs/topics/impala_appx_count_distinct.xml new file mode 100644 index 0000000..04078db --- /dev/null +++ b/docs/topics/impala_appx_count_distinct.xml @@ -0,0 +1,81 @@ + + + + + APPX_COUNT_DISTINCT Query Option (<keyword keyref="impala20"/> or higher only) + APPX_COUNT_DISTINCT + + + + + + + + + + + + +

+ APPX_COUNT_DISTINCT query option + Allows multiple COUNT(DISTINCT) operations within a single query, by internally rewriting + each COUNT(DISTINCT) to use the NDV() function. The resulting count is + approximate rather than precise. +

+ +

+ +

+ +

+ +

+ The following examples show how the APPX_COUNT_DISTINCT query option lets you work around the restriction + that a query can only evaluate COUNT(DISTINCT col_name) for a single + column. By default, you can count the distinct values of one column or another, but not both in a single + query: +

+ +[localhost:21000] > select count(distinct x) from int_t; ++-------------------+ +| count(distinct x) | ++-------------------+ +| 10 | ++-------------------+ +[localhost:21000] > select count(distinct property) from int_t; ++--------------------------+ +| count(distinct property) | ++--------------------------+ +| 7 | ++--------------------------+ +[localhost:21000] > select count(distinct x), count(distinct property) from int_t; +ERROR: AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters +as count(DISTINCT x); deviating function: count(DISTINCT property) + + +

+ When you enable the APPX_COUNT_DISTINCT query option, the query with multiple + COUNT(DISTINCT) operations succeeds. The reason this behavior requires a query option is that each + COUNT(DISTINCT) is rewritten internally to use the NDV() function instead, + which provides an approximate result rather than a precise count. +

+ +[localhost:21000] > set APPX_COUNT_DISTINCT=true; +[localhost:21000] > select count(distinct x), count(distinct property) from int_t; ++-------------------+--------------------------+ +| count(distinct x) | count(distinct property) | ++-------------------+--------------------------+ +| 10 | 7 | ++-------------------+--------------------------+ + + +

+ +

+ , + , + +

+ +
+
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_appx_median.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_appx_median.xml b/docs/topics/impala_appx_median.xml new file mode 100644 index 0000000..b97f6e3 --- /dev/null +++ b/docs/topics/impala_appx_median.xml @@ -0,0 +1,124 @@ + + + + + APPX_MEDIAN Function + APPX_MEDIAN + + + + + + + + + + + + + + +

+ appx_median() function + An aggregate function that returns a value that is approximately the median (midpoint) of a set + of input values. +

+ +

+ +APPX_MEDIAN([DISTINCT | ALL] expression) + + +

+ This function works with any input type, because the only requirement is that the type supports less-than and + greater-than comparison operators. +

+ +

+ +

+ Because the return value represents the estimated midpoint, it might not reflect the precise midpoint value, + especially if the cardinality of the input values is very high. If the cardinality is low (up to + approximately 20,000), the result is more accurate because the sampling considers all or almost all of the + different values. +
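The general idea behind a sampling-based approximate median can be sketched in Python. This is illustrative only: Impala's APPX_MEDIAN() uses its own internal algorithm, and the sample size and selection below are arbitrary assumptions. The sketch shows why the estimate is always one of the input values, and why accuracy improves when the sample covers all or nearly all distinct values.

```python
import random

# Illustrative sketch only; NOT Impala's APPX_MEDIAN() algorithm.
# Estimate the median from a fixed-size random sample: the result is
# always an actual input value, and for small inputs (where the sample
# covers everything) it is the exact lower-middle element.
def approx_median(values, sample_size=20000, seed=42):
    rng = random.Random(seed)
    sample = values if len(values) <= sample_size else rng.sample(values, sample_size)
    ordered = sorted(sample)
    return ordered[(len(ordered) - 1) // 2]   # lower-middle element

# A million uniformly distributed values: the estimate lands near the
# true midpoint of 500,000, and is always a member of the input.
data = list(range(1, 1_000_001))
est = approx_median(data)
print(est in data)   # True
```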

+ +

+ +

+ The return value is always the same as one of the input values, not an in-between value produced by + averaging. +

+ + + +

+ +

+ +

+ +

+ The following example uses a table of a million random floating-point numbers ranging up to approximately + 50,000. The average is approximately 25,000. Because of the random distribution, we would expect the median + to be close to this same number. Computing the precise median is a more intensive operation than computing + the average, because it requires keeping track of every distinct value and how many times each occurs. The + APPX_MEDIAN() function uses a sampling algorithm to return an approximate result, which in + this case is close to the expected value. To make sure that the value is not substantially out of range due + to a skewed distribution, subsequent queries confirm that there are approximately 500,000 values higher than + the APPX_MEDIAN() value, and approximately 500,000 values lower than the + APPX_MEDIAN() value. +

+ +[localhost:21000] > select min(x), max(x), avg(x) from million_numbers; ++-------------------+-------------------+-------------------+ +| min(x) | max(x) | avg(x) | ++-------------------+-------------------+-------------------+ +| 4.725693727250069 | 49994.56852674231 | 24945.38563793553 | ++-------------------+-------------------+-------------------+ +[localhost:21000] > select appx_median(x) from million_numbers; ++----------------+ +| appx_median(x) | ++----------------+ +| 24721.6 | ++----------------+ +[localhost:21000] > select count(x) as higher from million_numbers where x > (select appx_median(x) from million_numbers); ++--------+ +| higher | ++--------+ +| 502013 | ++--------+ +[localhost:21000] > select count(x) as lower from million_numbers where x < (select appx_median(x) from million_numbers); ++--------+ +| lower | ++--------+ +| 497987 | ++--------+ + + +

+ The following example computes the approximate median using a subset of the values from the table, and then + confirms that the result is a reasonable estimate for the midpoint. +

+ +[localhost:21000] > select appx_median(x) from million_numbers where x between 1000 and 5000; ++-------------------+ +| appx_median(x) | ++-------------------+ +| 3013.107787358159 | ++-------------------+ +[localhost:21000] > select count(x) as higher from million_numbers where x between 1000 and 5000 and x > 3013.107787358159; ++--------+ +| higher | ++--------+ +| 37692 | ++--------+ +[localhost:21000] > select count(x) as lower from million_numbers where x between 1000 and 5000 and x < 3013.107787358159; ++-------+ +| lower | ++-------+ +| 37089 | ++-------+ + +
+