hudi-commits mailing list archives

From bhavanisu...@apache.org
Subject [incubator-hudi] branch asf-site updated: [HUDI-589] Follow on fixes to querying_data page
Date Mon, 02 Mar 2020 22:05:02 GMT
This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 7e82fe2  [HUDI-589] Follow on fixes to querying_data page
7e82fe2 is described below

commit 7e82fe2f1f1137ae946039b80ad3246abc3af7a3
Author: Vinoth Chandar <vchandar@confluent.io>
AuthorDate: Mon Mar 2 13:43:39 2020 -0800

    [HUDI-589] Follow on fixes to querying_data page
---
 docs/_docs/2_3_querying_data.md | 90 ++++++++++++++++++++---------------------
 1 file changed, 45 insertions(+), 45 deletions(-)

diff --git a/docs/_docs/2_3_querying_data.md b/docs/_docs/2_3_querying_data.md
index 1a2ae08..c4ab865 100644
--- a/docs/_docs/2_3_querying_data.md
+++ b/docs/_docs/2_3_querying_data.md
@@ -9,10 +9,10 @@ last_modified_at: 2019-12-30T15:59:57-04:00
 
 Conceptually, Hudi stores data physically once on DFS, while providing 3 different ways of querying, as explained [before](/docs/concepts.html#query-types). 
 Once the table is synced to the Hive metastore, it provides external Hive tables backed by Hudi's custom inputformats. Once the proper hudi
-bundle has been provided, the table can be queried by popular query engines like Hive, Spark SQL, Spark datasource and Presto.
+bundle has been installed, the table can be queried by popular query engines like Hive, Spark SQL, Spark Datasource API and Presto.
 
 Specifically, following Hive tables are registered based off [table name](/docs/configurations.html#TABLE_NAME_OPT_KEY) 
-and [table type](/docs/configurations.html#TABLE_TYPE_OPT_KEY) passed during write.   
+and [table type](/docs/configurations.html#TABLE_TYPE_OPT_KEY) configs passed during write.
  
 
 If `table name = hudi_trips` and `table type = COPY_ON_WRITE`, then we get: 
  - `hudi_trips` supports snapshot query and incremental query on the table backed by `HoodieParquetInputFormat`, exposing purely columnar data.
@@ -20,37 +20,39 @@ If `table name = hudi_trips` and `table type = COPY_ON_WRITE`, then we get:
 
 If `table name = hudi_trips` and `table type = MERGE_ON_READ`, then we get:
  - `hudi_trips_rt` supports snapshot query and incremental query (providing near-real time data) on the table  backed by `HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
- - `hudi_trips_ro` supports read optimized query on the table backed by `HoodieParquetInputFormat`, exposing purely columnar data.
- 
+ - `hudi_trips_ro` supports read optimized query on the table backed by `HoodieParquetInputFormat`, exposing purely columnar data stored in base files.
 
-As discussed in the concepts section, the one key primitive needed for [incrementally processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
+As discussed in the concepts section, the one key capability needed for [incrementally processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
 is obtaining a change stream/log from a table. Hudi tables can be queried incrementally, which means you can get ALL and ONLY the updated & new rows 
 since a specified instant time. This, together with upserts, is particularly useful for building data pipelines where 1 or more source Hudi tables are incrementally queried (streams/facts),
 joined with other tables (tables/dimensions), to [write out deltas](/docs/writing_data.html) to a target Hudi table. Incremental queries are realized by querying one of the tables above, 
 with special configurations that indicates to query planning that only incremental data needs to be fetched out of the table. 
 
 
-## SUPPORT MATRIX
+## Support Matrix
+
+Following tables show whether a given query is supported on specific query engine.
 
-### COPY_ON_WRITE tables
+### Copy-On-Write tables
   
-||Snapshot|Incremental|Read Optimized|
-||--------|-----------|--------------|
-|**Hive**|Y|Y|N/A|
-|**Spark SQL**|Y|Y|N/A|
-|**Spark datasource**|Y|Y|N/A|
-|**Presto**|Y|N|N/A|
+|Query Engine|Snapshot Queries|Incremental Queries|
+|------------|--------|-----------|
+|**Hive**|Y|Y|
+|**Spark SQL**|Y|Y|
+|**Spark Datasource**|Y|Y|
+|**Presto**|Y|N|
+
+Note that `Read Optimized` queries are not applicable for COPY_ON_WRITE tables.
 
-### MERGE_ON_READ tables
+### Merge-On-Read tables
 
-||Snapshot|Incremental|Read Optimized|
-||--------|-----------|--------------|
+|Query Engine|Snapshot Queries|Incremental Queries|Read Optimized Queries|
+|------------|--------|-----------|--------------|
 |**Hive**|Y|Y|Y|
 |**Spark SQL**|Y|Y|Y|
-|**Spark datasource**|N|N|Y|
+|**Spark Datasource**|N|N|Y|
 |**Presto**|N|N|Y|
 
-
 In sections, below we will discuss specific setup to access different query types from different query engines. 
 
 ## Hive
@@ -103,13 +105,11 @@ would ensure Map Reduce execution is chosen for a Hive query, which combines par
 separated) and calls InputFormat.listStatus() only once with all those partitions.
 
 ## Spark SQL
-Supports all query types across both Hudi table types, relying on the custom Hudi input formats again like Hive. 
-Typically notebook users and spark-shell users leverage spark sql for querying Hudi tables. Please add hudi-spark-bundle 
-as described above via --jars or --packages.
+Once the Hudi tables have been registered to the Hive metastore, it can be queried using the Spark-Hive integration. It supports all query types across both Hudi table types, 
+relying on the custom Hudi input formats again like Hive. Typically notebook users and spark-shell users leverage spark sql for querying Hudi tables. Please add hudi-spark-bundle as described above via --jars or --packages.
  
-### Snapshot query {#spark-snapshot-query}
-By default, Spark SQL will try to use its own parquet reader instead of Hive SerDe when reading from Hive metastore parquet tables. 
-However, for MERGE_ON_READ tables which has both parquet and avro data, this default setting needs to be turned off using set `spark.sql.hive.convertMetastoreParquet=false`. 
+By default, Spark SQL will try to use its own parquet reader instead of Hive SerDe when reading from Hive metastore parquet tables. However, for MERGE_ON_READ tables which has 
+both parquet and avro data, this default setting needs to be turned off using set `spark.sql.hive.convertMetastoreParquet=false`. 
 This will force Spark to fallback to using the Hive Serde to read the data (planning/executions is still Spark). 
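
As an illustration, a minimal sketch of applying this setting and then issuing a snapshot query; it assumes a Hive-enabled `SparkSession` named `spark` (as in the other snippets on this page) and the `hudi_trips_rt` table registered earlier, so adapt names to the actual environment:

```java
// Assumes a Hive-enabled SparkSession named `spark`, and that hive sync has
// registered the MERGE_ON_READ table as hudi_trips_rt (names assumed for illustration).
spark.sql("set spark.sql.hive.convertMetastoreParquet=false"); // fall back to the Hive SerDe
spark.sql("select count(*) from hudi_trips_rt").show();        // snapshot query over merged base + log data
```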
 
 ```java
@@ -126,23 +126,30 @@ If using spark's built in support, additionally a path filter needs to be pushed
 spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter], classOf[org.apache.hadoop.fs.PathFilter]);
 ```
 
-### Incremental querying {#spark-incr-query}
-Incremental queries work like hive incremental queries. The `hudi-spark` module offers the DataSource API, a more elegant way to query data from Hudi table and process it via Spark.
-A sample incremental query, that will obtain all records written since `beginInstantTime`, looks like below.
+## Spark Datasource
+
+The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how standard 
+datasources work (e.g: `spark.read.parquet`). Both snapshot querying and incremental querying are supported here. Typically spark jobs require adding `--jars <path to jar>/hudi-spark-bundle_2.11-<hudi version>.jar` to classpath of drivers 
+and executors. Alternatively, hudi-spark-bundle can also fetched via the `--packages` options (e.g: `--packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`).
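
For example, a snapshot query through the datasource could be sketched as below; the base path and the partition-depth globs are assumptions and should be adapted to the actual table:

```java
// Assumes a SparkSession named `spark` with the hudi-spark-bundle on the classpath.
// The trailing globs are assumed to match the table's partition depth (3 levels here).
String basePath = "/tmp/hudi_trips";                 // assumed table base path
Dataset<Row> snapshotDF = spark.read()
    .format("org.apache.hudi")
    .load(basePath + "/*/*/*/*");                    // snapshot query on a COPY_ON_WRITE table
snapshotDF.createOrReplaceTempView("hudi_trips_snapshot");
spark.sql("select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0").show();
```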
+
+
+### Incremental query {#spark-incr-query}
+Of special interest to spark pipelines, is Hudi's ability to support incremental queries, like below. A sample incremental query, that will obtain all records written since `beginInstantTime`, looks like below.
+Thanks to Hudi's support for record level change streams, these incremental pipelines often offer 10x efficiency over batch counterparts, by only processing the changed records.
+The following snippet shows how to obtain all records changed after `beginInstantTime` and run some SQL on them.
 
 ```java
  Dataset<Row> hudiIncQueryDF = spark.read()
      .format("org.apache.hudi")
-     .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY(),
-             DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL())
-     .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(),
-            <beginInstantTime>)
+     .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY(), DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL())
+     .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(), <beginInstantTime>)
      .load(tablePath); // For incremental query, pass in the root/base path of table
      
 hudiIncQueryDF.createOrReplaceTempView("hudi_trips_incremental")
 spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from  hudi_trips_incremental
where fare > 20.0").show()
 ```
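
Building on the snippet above (reusing `spark` and `tablePath`), a point-in-time variant can be sketched by also bounding the window with an end instant; `DataSourceReadOptions.END_INSTANTTIME_OPT_KEY()` is assumed here as that upper-bound option, and the instant times are placeholders:

```java
// A sketch of a windowed incremental query over (beginInstantTime, endInstantTime];
// END_INSTANTTIME_OPT_KEY is assumed as the upper-bound option.
Dataset<Row> hudiPointInTimeDF = spark.read()
    .format("org.apache.hudi")
    .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY(), DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL())
    .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(), <beginInstantTime>)
    .option(DataSourceReadOptions.END_INSTANTTIME_OPT_KEY(), <endInstantTime>)
    .load(tablePath);

hudiPointInTimeDF.createOrReplaceTempView("hudi_trips_point_in_time");
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_point_in_time where fare > 20.0").show()
```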
 
+For examples, refer to [Setup spark-shell in quickstart](/docs/quick-start-guide.html#setup-spark-shell).

 Please refer to [configurations](/docs/configurations.html#spark-datasource) section, to view all datasource options.
 
 Additionally, `HoodieReadClient` offers the following functionality using Hudi's implicit indexing.
@@ -150,27 +157,20 @@ Additionally, `HoodieReadClient` offers the following functionality using Hudi's
 | **API** | **Description** |
 |-------|--------|
 | read(keys) | Read out the data corresponding to the keys as a DataFrame, using Hudi's own index for faster lookup |
-| filterExists() | Filter out already existing records from the provided RDD[HoodieRecord]. Useful for de-duplication |
+| filterExists() | Filter out already existing records from the provided `RDD[HoodieRecord]`. Useful for de-duplication |
 | checkExists(keys) | Check if the provided keys exist in a Hudi table |
 
-## Spark datasource
-
-Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how standard datasources work (e.g: `spark.read.parquet`). 
-Both snapshot querying and incremental querying are supported here. Typically spark jobs require adding `--jars <path to jar>/hudi-spark-bundle_2.11:0.5.1-incubating`
-to classpath of drivers and executors. When using spark shell instead of `--jars`, `--packages` can also be used to fetch the hudi-spark-bundle like this: `--packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`
-For examples, refer to [Setup spark-shell in quickstart](/docs/quick-start-guide.html#setup-spark-shell).
-
 ## Presto
 
-Presto is a popular query engine, providing interactive query performance. Presto currently supports snapshot queries on
-COPY_ON_WRITE and read optimized queries on MERGE_ON_READ Hudi tables. This requires the `hudi-presto-bundle` jar
-to be placed into `<presto_install>/plugin/hive-hadoop2/`, across the installation.
+Presto is a popular query engine, providing interactive query performance. Presto currently supports snapshot queries on COPY_ON_WRITE and read optimized queries 
+on MERGE_ON_READ Hudi tables. This requires the `hudi-presto-bundle` jar to be placed into `<presto_install>/plugin/hive-hadoop2/`, across the installation.
+
+## Impala (Not Officially Released)
 
-## Impala(Not Officially Released)
+### Snapshot Query
 
-### Read optimized table
+Impala is able to query Hudi Copy-on-write table as an [EXTERNAL TABLE](https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_tables.html#external_tables) on HDFS.  
 
-Impala is able to query Hudi read optimized table as an [EXTERNAL TABLE](https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_tables.html#external_tables) on HDFS.  
 To create a Hudi read optimized table on Impala:
 ```
 CREATE EXTERNAL TABLE database.table_name

