From commits-return-9563-archive-asf-public=cust-asf.ponee.io@carbondata.apache.org Sat Mar 3 13:43:56 2018 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by mx-eu-01.ponee.io (Postfix) with SMTP id 17D9F1807B5 for ; Sat, 3 Mar 2018 13:43:50 +0100 (CET) Received: (qmail 21467 invoked by uid 500); 3 Mar 2018 12:43:50 -0000 Mailing-List: contact commits-help@carbondata.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@carbondata.apache.org Delivered-To: mailing list commits@carbondata.apache.org Received: (qmail 21341 invoked by uid 99); 3 Mar 2018 12:43:50 -0000 Received: from git1-us-west.apache.org (HELO git1-us-west.apache.org) (140.211.11.23) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 03 Mar 2018 12:43:50 +0000 Received: by git1-us-west.apache.org (ASF Mail Server at git1-us-west.apache.org, from userid 33) id 18E2BF4E9F; Sat, 3 Mar 2018 12:43:48 +0000 (UTC) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit From: ravipesala@apache.org To: commits@carbondata.apache.org Date: Sat, 03 Mar 2018 12:44:11 -0000 Message-Id: <15375547dd1f43589d902d931bc41614@git.apache.org> In-Reply-To: References: X-Mailer: ASF-Git Admin Mailer Subject: [24/25] carbondata git commit: [CARBONDATA-2098] Optimize pre-aggregate documentation [CARBONDATA-2098] Optimize pre-aggregate documentation optimize pre-aggregate documentation move to separate file add more examples This closes #2022 Project: http://git-wip-us.apache.org/repos/asf/carbondata/repo Commit: http://git-wip-us.apache.org/repos/asf/carbondata/commit/ec893412 Tree: http://git-wip-us.apache.org/repos/asf/carbondata/tree/ec893412 Diff: http://git-wip-us.apache.org/repos/asf/carbondata/diff/ec893412 Branch: refs/heads/branch-1.3 Commit: ec893412bc40b72a642a0173f5f5b3fdcbba0877 Parents: a816e0c Author: sraghunandan Authored: Fri Mar 2 17:02:39 2018 +0530 Committer: ravipesala Committed: Sat Mar 3 18:11:51 2018 +0530 ---------------------------------------------------------------------- docs/data-management-on-carbondata.md | 242 -------------- docs/preaggregate-guide.md | 313 +++++++++++++++++++ .../examples/PreAggregateTableExample.scala | 50 ++- 3 files changed, 362 insertions(+), 243 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/carbondata/blob/ec893412/docs/data-management-on-carbondata.md ---------------------------------------------------------------------- diff --git a/docs/data-management-on-carbondata.md b/docs/data-management-on-carbondata.md index ea80d41..2aa4a49 100644 --- a/docs/data-management-on-carbondata.md +++ b/docs/data-management-on-carbondata.md @@ -26,7 +26,6 @@ This tutorial is going to introduce all commands and data operations on CarbonDa * [UPDATE AND DELETE](#update-and-delete) * [COMPACTION](#compaction) * [PARTITION](#partition) -* [PRE-AGGREGATE TABLES](#pre-aggregate-tables) * [BUCKETING](#bucketing) * [SEGMENT MANAGEMENT](#segment-management) @@ -859,247 +858,6 @@ This tutorial is going to introduce all commands and data operations on CarbonDa * The partitioned column can be excluded from SORT_COLUMNS, this will let other columns to do the efficient sorting. * When writing SQL on a partition table, try to use filters on the partition column. - -## PRE-AGGREGATE TABLES - CarbonData supports pre aggregating of data so that OLAP kind of queries can fetch data - much faster. Aggregate tables are created as datamaps so that the handling is as efficient as - other indexing support. Users can create as many aggregate tables they require as datamaps to - improve their query performance, provided the storage requirements and loading speeds are - acceptable. - - For main table called **sales** which is defined as - - ``` - CREATE TABLE sales ( - order_time timestamp, - user_id STRING, - sex STRING, - country STRING, - quantity INT, - price BIGINT) - STORED BY 'carbondata' - ``` - - user can create pre-aggregate tables using the DDL - - ``` - CREATE DATAMAP agg_sales - ON TABLE sales - USING "preaggregate" - AS - SELECT country, sex, sum(quantity), avg(price) - FROM sales - GROUP BY country, sex - ``` - -

Functions supported in pre-aggregate tables

- -| Function | Rollup supported | -|-----------|----------------| -| SUM | Yes | -| AVG | Yes | -| MAX | Yes | -| MIN | Yes | -| COUNT | Yes | - - -##### How pre-aggregate tables are selected -For the main table **sales** and pre-aggregate table **agg_sales** created above, queries of the -kind -``` -SELECT country, sex, sum(quantity), avg(price) from sales GROUP BY country, sex - -SELECT sex, sum(quantity) from sales GROUP BY sex - -SELECT sum(price), country from sales GROUP BY country -``` - -will be transformed by Query Planner to fetch data from pre-aggregate table **agg_sales** - -But queries of kind -``` -SELECT user_id, country, sex, sum(quantity), avg(price) from sales GROUP BY user_id, country, sex - -SELECT sex, avg(quantity) from sales GROUP BY sex - -SELECT country, max(price) from sales GROUP BY country -``` - -will fetch the data from the main table **sales** - -##### Loading data to pre-aggregate tables -For existing table with loaded data, data load to pre-aggregate table will be triggered by the -CREATE DATAMAP statement when user creates the pre-aggregate table. -For incremental loads after aggregates tables are created, loading data to main table triggers -the load to pre-aggregate tables once main table loading is complete. These loads are automic -meaning that data on main table and aggregate tables are only visible to the user after all tables -are loaded - -##### Querying data from pre-aggregate tables -Pre-aggregate tables cannot be queries directly. Queries are to be made on main table. Internally -carbondata will check associated pre-aggregate tables with the main table, and if the -pre-aggregate tables satisfy the query condition, the plan is transformed automatically to use -pre-aggregate table to fetch the data. - -##### Compacting pre-aggregate tables -Compaction command (ALTER TABLE COMPACT) need to be run separately on each pre-aggregate table. -Running Compaction command on main table will **not automatically** compact the pre-aggregate -tables. Compaction is an optional operation for pre-aggregate table. If compaction is performed on -main table but not performed on pre-aggregate table, all queries still can benefit from -pre-aggregate tables. To further improve performance on pre-aggregate tables, compaction can be -triggered on pre-aggregate tables directly, it will merge the segments inside pre-aggregate table. - -##### Update/Delete Operations on pre-aggregate tables -This functionality is not supported. - - NOTE (RESTRICTION): - Update/Delete operations are not supported on main table which has pre-aggregate tables - created on it. All the pre-aggregate tables will have to be dropped before update/delete - operations can be performed on the main table. Pre-aggregate tables can be rebuilt manually - after update/delete operations are completed - -##### Delete Segment Operations on pre-aggregate tables -This functionality is not supported. - - NOTE (RESTRICTION): - Delete Segment operations are not supported on main table which has pre-aggregate tables - created on it. All the pre-aggregate tables will have to be dropped before delete segment - operations can be performed on the main table. Pre-aggregate tables can be rebuilt manually - after delete segment operations are completed - -##### Alter Table Operations on pre-aggregate tables -This functionality is not supported. - - NOTE (RESTRICTION): - Adding new column in new table does not have any affect on pre-aggregate tables. However if - dropping or renaming a column has impact in pre-aggregate table, such operations will be - rejected and error will be thrown. All the pre-aggregate tables will have to be dropped - before alter operations can be performed on the main table. Pre-aggregate tables can be rebuilt - manually after alter table operations are completed - -### Supporting timeseries data (Alpha feature in 1.3.0) -CarbonData has built-in understanding of time hierarchy and levels: year, month, day, hour, minute, second. -Timeseries pre-aggregate tables can be created for the hierarchy and CarbonData can do automatic -roll-up for the queries on these hierarchies. - - ``` - CREATE DATAMAP agg_year - ON TABLE sales - USING "timeseries" - DMPROPERTIES ( - 'event_time'='order_time', - 'year_granualrity'='1', - ) AS - SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price), - avg(price) FROM sales GROUP BY order_time, country, sex - - CREATE DATAMAP agg_month - ON TABLE sales - USING "timeseries" - DMPROPERTIES ( - 'event_time'='order_time', - 'month_granualrity'='1', - ) AS - SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price), - avg(price) FROM sales GROUP BY order_time, country, sex - - CREATE DATAMAP agg_day - ON TABLE sales - USING "timeseries" - DMPROPERTIES ( - 'event_time'='order_time', - 'day_granualrity'='1', - ) AS - SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price), - avg(price) FROM sales GROUP BY order_time, country, sex - - CREATE DATAMAP agg_sales_hour - ON TABLE sales - USING "timeseries" - DMPROPERTIES ( - 'event_time'='order_time', - 'hour_granualrity'='1', - ) AS - SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price), - avg(price) FROM sales GROUP BY order_time, country, sex - - CREATE DATAMAP agg_minute - ON TABLE sales - USING "timeseries" - DMPROPERTIES ( - 'event_time'='order_time', - 'minute_granualrity'='1', - ) AS - SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price), - avg(price) FROM sales GROUP BY order_time, country, sex - - CREATE DATAMAP agg_minute - ON TABLE sales - USING "timeseries" - DMPROPERTIES ( - 'event_time'='order_time', - 'minute_granualrity'='1', - ) AS - SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price), - avg(price) FROM sales GROUP BY order_time, country, sex - ``` - For Querying data and automatically roll-up to the desired aggregation level, CarbonData supports - UDF as - ``` - timeseries(timeseries column name, 'aggregation level') - ``` - Examples - ``` - SELECT - timeseries(order_time, 'hour'), - sum(quantity) - FROM sales - GROUP BY timeseries(order_time, 'hour') - ``` - - It is **not necessary** to create pre-aggregate tables for each granularity unless required for - query. CarbonData can roll-up the data and fetch it. - - For Example: For main table **sales**, If timeseries pre-aggregate tables were created as - - ``` - CREATE DATAMAP agg_day - ON TABLE sales - USING "timeseries" - DMPROPERTIES ( - 'event_time'='order_time', - 'day_granualrity'='1', - ) AS - SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price), - avg(price) FROM sales GROUP BY order_time, country, sex - - CREATE DATAMAP agg_sales_hour - ON TABLE sales - USING "timeseries" - DMPROPERTIES ( - 'event_time'='order_time', - 'hour_granualrity'='1', - ) AS - SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price), - avg(price) FROM sales GROUP BY order_time, country, sex - ``` - - Queries like below will be rolled-up and fetched from pre-aggregate tables - ``` - Select timeseries(order_time, 'month'), sum(quantity) from sales group by timeseries(order_time, - 'month') - - Select timeseries(order_time, 'year'), sum(quantity) from sales group by timeseries(order_time, - 'year') - ``` - - NOTE (RESTRICTION): - * Only 1 is supported for granularity value of timeseries pre-aggregate table. Other granularity value are not supported. - * Only one granularity can be defined on creating one timeseries pre-aggregate table. Other granularity are created separately. - * Pre-aggregate tables for the desired levels needs to be created one after the other - * Pre-aggregate tables are created for each level needs to be dropped separately - - ## BUCKETING Bucketing feature can be used to distribute/organize the table/partition data into multiple files such http://git-wip-us.apache.org/repos/asf/carbondata/blob/ec893412/docs/preaggregate-guide.md ---------------------------------------------------------------------- diff --git a/docs/preaggregate-guide.md b/docs/preaggregate-guide.md new file mode 100644 index 0000000..411433a --- /dev/null +++ b/docs/preaggregate-guide.md @@ -0,0 +1,313 @@ +# CarbonData Pre-aggregate tables + +## Quick example +Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME + +Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars +```shell +mvn clean package -DskipTests -Pspark-2.2 +``` + +Start spark-shell in new terminal, type :paste, then copy and run the following code. +```scala + import java.io.File + import org.apache.spark.sql.{CarbonEnv, SparkSession} + import org.apache.spark.sql.CarbonSession._ + import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery} + import org.apache.carbondata.core.util.path.CarbonStorePath + + val warehouse = new File("./warehouse").getCanonicalPath + val metastore = new File("./metastore").getCanonicalPath + + val spark = SparkSession + .builder() + .master("local") + .appName("preAggregateExample") + .config("spark.sql.warehouse.dir", warehouse) + .getOrCreateCarbonSession(warehouse, metastore) + + spark.sparkContext.setLogLevel("ERROR") + + // drop table if exists previously + spark.sql(s"DROP TABLE IF EXISTS sales") + // Create target carbon table and populate with initial data + spark.sql( + s""" + | CREATE TABLE sales ( + | user_id string, + | country string, + | quantity int, + | price bigint) + | STORED BY 'carbondata'""".stripMargin) + + spark.sql( + s""" + | CREATE DATAMAP agg_sales + | ON TABLE sales + | USING "preaggregate" + | AS + | SELECT country, sum(quantity), avg(price) + | FROM sales + | GROUP BY country""".stripMargin) + + import spark.implicits._ + import org.apache.spark.sql.SaveMode + import scala.util.Random + + val r = new Random() + val df = spark.sparkContext.parallelize(1 to 10) + .map(x => ("ID." + r.nextInt(100000), "country" + x % 8, x % 50, x % 60)) + .toDF("user_id", "country", "quantity", "price") + + // Create table with pre-aggregate table + df.write.format("carbondata") + .option("tableName", "sales") + .option("compress", "true") + .mode(SaveMode.Append).save() + + spark.sql( + s""" + |SELECT country, sum(quantity), avg(price) + | from sales GROUP BY country""".stripMargin).show + + spark.stop +``` + +##PRE-AGGREGATE TABLES + Carbondata supports pre aggregating of data so that OLAP kind of queries can fetch data + much faster.Aggregate tables are created as datamaps so that the handling is as efficient as + other indexing support.Users can create as many aggregate tables they require as datamaps to + improve their query performance,provided the storage requirements and loading speeds are + acceptable. + + For main table called **sales** which is defined as + + ``` + CREATE TABLE sales ( + order_time timestamp, + user_id string, + sex string, + country string, + quantity int, + price bigint) + STORED BY 'carbondata' + ``` + + user can create pre-aggregate tables using the DDL + + ``` + CREATE DATAMAP agg_sales + ON TABLE sales + USING "preaggregate" + AS + SELECT country, sex, sum(quantity), avg(price) + FROM sales + GROUP BY country, sex + ``` + + + +

Functions supported in pre-aggregate tables

+ +| Function | Rollup supported | +|-----------|----------------| +| SUM | Yes | +| AVG | Yes | +| MAX | Yes | +| MIN | Yes | +| COUNT | Yes | + + +##### How pre-aggregate tables are selected +For the main table **sales** and pre-aggregate table **agg_sales** created above, queries of the +kind +``` +SELECT country, sex, sum(quantity), avg(price) from sales GROUP BY country, sex + +SELECT sex, sum(quantity) from sales GROUP BY sex + +SELECT sum(price), country from sales GROUP BY country +``` + +will be transformed by Query Planner to fetch data from pre-aggregate table **agg_sales** + +But queries of kind +``` +SELECT user_id, country, sex, sum(quantity), avg(price) from sales GROUP BY user_id, country, sex + +SELECT sex, avg(quantity) from sales GROUP BY sex + +SELECT country, max(price) from sales GROUP BY country +``` + +will fetch the data from the main table **sales** + +##### Loading data to pre-aggregate tables +For existing table with loaded data, data load to pre-aggregate table will be triggered by the +CREATE DATAMAP statement when user creates the pre-aggregate table. +For incremental loads after aggregates tables are created, loading data to main table triggers +the load to pre-aggregate tables once main table loading is complete. These loads are automic +meaning that data on main table and aggregate tables are only visible to the user after all tables +are loaded + +##### Querying data from pre-aggregate tables +Pre-aggregate tables cannot be queries directly. Queries are to be made on main table. Internally +carbondata will check associated pre-aggregate tables with the main table, and if the +pre-aggregate tables satisfy the query condition, the plan is transformed automatically to use +pre-aggregate table to fetch the data. + +##### Compacting pre-aggregate tables +Compaction command (ALTER TABLE COMPACT) need to be run separately on each pre-aggregate table. +Running Compaction command on main table will **not automatically** compact the pre-aggregate +tables.Compaction is an optional operation for pre-aggregate table. If compaction is performed on +main table but not performed on pre-aggregate table, all queries still can benefit from +pre-aggregate tables. To further improve performance on pre-aggregate tables, compaction can be +triggered on pre-aggregate tables directly, it will merge the segments inside pre-aggregate table. + +##### Update/Delete Operations on pre-aggregate tables +This functionality is not supported. + + NOTE (RESTRICTION): + Update/Delete operations are not supported on main table which has pre-aggregate tables + created on it. All the pre-aggregate tables will have to be dropped before update/delete + operations can be performed on the main table. Pre-aggregate tables can be rebuilt manually + after update/delete operations are completed + +##### Delete Segment Operations on pre-aggregate tables +This functionality is not supported. + + NOTE (RESTRICTION): + Delete Segment operations are not supported on main table which has pre-aggregate tables + created on it. All the pre-aggregate tables will have to be dropped before update/delete + operations can be performed on the main table. Pre-aggregate tables can be rebuilt manually + after delete segment operations are completed + +##### Alter Table Operations on pre-aggregate tables +This functionality is not supported. + + NOTE (RESTRICTION): + Adding new column in new table does not have any affect on pre-aggregate tables. However if + dropping or renaming a column has impact in pre-aggregate table, such operations will be + rejected and error will be thrown. All the pre-aggregate tables will have to be dropped + before Alter Operations can be performed on the main table. Pre-aggregate tables can be rebuilt + manually after Alter Table operations are completed + +### Supporting timeseries data (Alpha feature in 1.3.0) +Carbondata has built-in understanding of time hierarchy and levels: year, month, day, hour, minute. +Multiple pre-aggregate tables can be created for the hierarchy and Carbondata can do automatic +roll-up for the queries on these hierarchies. + + ``` + CREATE DATAMAP agg_year + ON TABLE sales + USING "timeseries" + DMPROPERTIES ( + 'event_time'='order_time', + 'year_granualrity'='1', + ) AS + SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price), + avg(price) FROM sales GROUP BY order_time, country, sex + + CREATE DATAMAP agg_month + ON TABLE sales + USING "timeseries" + DMPROPERTIES ( + 'event_time'='order_time', + 'month_granualrity'='1', + ) AS + SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price), + avg(price) FROM sales GROUP BY order_time, country, sex + + CREATE DATAMAP agg_day + ON TABLE sales + USING "timeseries" + DMPROPERTIES ( + 'event_time'='order_time', + 'day_granualrity'='1', + ) AS + SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price), + avg(price) FROM sales GROUP BY order_time, country, sex + + CREATE DATAMAP agg_sales_hour + ON TABLE sales + USING "timeseries" + DMPROPERTIES ( + 'event_time'='order_time', + 'hour_granualrity'='1', + ) AS + SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price), + avg(price) FROM sales GROUP BY order_time, country, sex + + CREATE DATAMAP agg_minute + ON TABLE sales + USING "timeseries" + DMPROPERTIES ( + 'event_time'='order_time', + 'minute_granualrity'='1', + ) AS + SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price), + avg(price) FROM sales GROUP BY order_time, country, sex + + CREATE DATAMAP agg_minute + ON TABLE sales + USING "timeseries" + DMPROPERTIES ( + 'event_time'='order_time', + 'minute_granualrity'='1', + ) AS + SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price), + avg(price) FROM sales GROUP BY order_time, country, sex + ``` + + For Querying data and automatically roll-up to the desired aggregation level,Carbondata supports + UDF as + ``` + timeseries(timeseries column name, 'aggregation level') + ``` + ``` + Select timeseries(order_time, 'hour'), sum(quantity) from sales group by timeseries(order_time, + 'hour') + ``` + + It is **not necessary** to create pre-aggregate tables for each granularity unless required for + query. Carbondata can roll-up the data and fetch it. + + For Example: For main table **sales** , If pre-aggregate tables were created as + + ``` + CREATE DATAMAP agg_day + ON TABLE sales + USING "timeseries" + DMPROPERTIES ( + 'event_time'='order_time', + 'day_granualrity'='1', + ) AS + SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price), + avg(price) FROM sales GROUP BY order_time, country, sex + + CREATE DATAMAP agg_sales_hour + ON TABLE sales + USING "timeseries" + DMPROPERTIES ( + 'event_time'='order_time', + 'hour_granualrity'='1', + ) AS + SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price), + avg(price) FROM sales GROUP BY order_time, country, sex + ``` + + Queries like below will be rolled-up and fetched from pre-aggregate tables + ``` + Select timeseries(order_time, 'month'), sum(quantity) from sales group by timeseries(order_time, + 'month') + + Select timeseries(order_time, 'year'), sum(quantity) from sales group by timeseries(order_time, + 'year') + ``` + + NOTE (RESTRICTION): + * Only value of 1 is supported for hierarchy levels. Other hierarchy levels are not supported. + Other hierarchy levels are not supported + * pre-aggregate tables for the desired levels needs to be created one after the other + * pre-aggregate tables created for each level needs to be dropped separately + http://git-wip-us.apache.org/repos/asf/carbondata/blob/ec893412/examples/spark2/src/main/scala/org/apache/carbondata/examples/PreAggregateTableExample.scala ---------------------------------------------------------------------- diff --git a/examples/spark2/src/main/scala/org/apache/carbondata/examples/PreAggregateTableExample.scala b/examples/spark2/src/main/scala/org/apache/carbondata/examples/PreAggregateTableExample.scala index d27eefb..ace3dcc 100644 --- a/examples/spark2/src/main/scala/org/apache/carbondata/examples/PreAggregateTableExample.scala +++ b/examples/spark2/src/main/scala/org/apache/carbondata/examples/PreAggregateTableExample.scala @@ -38,6 +38,7 @@ object PreAggregateTableExample { // 1. simple usage for Pre-aggregate tables creation and query spark.sql("DROP TABLE IF EXISTS mainTable") + spark.sql("DROP TABLE IF EXISTS mainTable_other") spark.sql(""" | CREATE TABLE mainTable | (id Int, @@ -47,10 +48,23 @@ object PreAggregateTableExample { | STORED BY 'org.apache.carbondata.format' """.stripMargin) + spark.sql(""" + | CREATE TABLE mainTable_other + | (id Int, + | name String, + | city String, + | age Int) + | STORED BY 'org.apache.carbondata.format' + """.stripMargin) + spark.sql(s""" LOAD DATA LOCAL INPATH '$testData' into table mainTable """) + spark.sql(s""" + LOAD DATA LOCAL INPATH '$testData' into table mainTable_other + """) + spark.sql( s"""create datamap preagg_sum on table mainTable using 'preaggregate' as | select id,sum(age) from mainTable group by id""" @@ -59,14 +73,17 @@ object PreAggregateTableExample { s"""create datamap preagg_avg on table mainTable using 'preaggregate' as | select id,avg(age) from mainTable group by id""" .stripMargin) + spark.sql( - s"""create datamap preagg_count on table mainTable using 'preaggregate' as + s"""create datamap preagg_count_age on table mainTable using 'preaggregate' as | select id,count(age) from mainTable group by id""" .stripMargin) + spark.sql( s"""create datamap preagg_min on table mainTable using 'preaggregate' as | select id,min(age) from mainTable group by id""" .stripMargin) + spark.sql( s"""create datamap preagg_max on table mainTable using 'preaggregate' as | select id,max(age) from mainTable group by id""" @@ -74,10 +91,40 @@ object PreAggregateTableExample { spark.sql( s""" + | create datamap preagg_case on table mainTable using 'preaggregate' as + | select name,sum(case when age=35 then id else 0 end) from mainTable group by name + | """.stripMargin) + + spark.sql( + s"""create datamap preagg_count on table maintable using 'preaggregate' as + | select name, count(*) from maintable group by name""".stripMargin) + + spark.sql( + s""" | SELECT id,max(age) | FROM mainTable group by id """.stripMargin).show() + spark.sql( + s""" + | select name, count(*) from + | mainTable group by name + """.stripMargin).show() + + spark.sql( + s""" + | select name as NewName, + | sum(case when age=35 then id else 0 end) as sum + | from mainTable group by name order by name + """.stripMargin).show() + + spark.sql( + s""" + | select t1.name,t1.city from mainTable_other t1 join + | (select name as newnewname,sum(age) as sum + | from mainTable group by name )t2 on t1.name=t2.newnewname + """.stripMargin).show() + // 2.compare the performance : with pre-aggregate VS main table // build test data, if set the data is larger than 100M, it will take 10+ mins. @@ -160,6 +207,7 @@ object PreAggregateTableExample { // scalastyle:on spark.sql("DROP TABLE IF EXISTS mainTable") + spark.sql("DROP TABLE IF EXISTS mainTable_other") spark.sql("DROP TABLE IF EXISTS personTable") spark.sql("DROP TABLE IF EXISTS personTableWithoutAgg")