Date: Sun, 23 Feb 2020 07:44:12 -0000
From: GitBox
To: commits@hudi.apache.org
Reply-To: dev@hudi.apache.org
Subject: [GitHub] [incubator-hudi] bhasudha commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

bhasudha commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r382972929

##########
File path: docs/_docs/2_3_querying_data.md
##########

@@ -84,55 +102,53 @@ using the hive session property for incremental queries: `set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines partitions (comma
 separated) and calls InputFormat.listStatus() only once with all those partitions.
 
-## Spark
+## Spark datasource
 
-Spark provides much easier deployment & management of Hudi jars and bundles into jobs/notebooks. At a high level, there are two ways to access Hudi tables in Spark.
+Hudi COPY_ON_WRITE tables can be queried via the Spark datasource, similar to how standard datasources work (e.g. `spark.read.parquet`).
+Both snapshot queries and incremental queries are supported. Spark jobs typically require adding `--jars <path to jar>/hudi-spark-bundle_2.11:0.5.1-incubating`
+to the classpath of drivers and executors. Refer to [building Hudi](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source) for build instructions.
+When using the spark shell, `--packages` can be used instead of `--jars` to fetch the hudi-spark-bundle, like this: `--packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`.
+For a sample setup, refer to [Setup spark-shell in quickstart](/docs/quick-start-guide.html#setup-spark-shell).
 
- - **Hudi DataSource** : Supports Read Optimized, Incremental Pulls similar to how standard datasources (e.g: `spark.read.parquet`) work.
- - **Read as Hive tables** : Supports all three query types, including the snapshot queries, relying on the custom Hudi input formats again like Hive.
-
- In general, your spark job needs a dependency to `hudi-spark` or `hudi-spark-bundle_2.*-x.y.z.jar` needs to be on the class path of driver & executors (hint: use `--jars` argument)
 
+## Spark SQL
+Supports all query types across both Hudi table types, relying on the custom Hudi input formats again, like Hive.
+Typically, notebook and spark-shell users leverage Spark SQL for querying Hudi tables.
+Please add hudi-spark-bundle
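As context for the new "Spark datasource" section above, the snapshot and incremental reads it describes could look like the following minimal Scala sketch (an illustration, not part of the PR: the base path, partition glob, and begin instant are hypothetical; the `hoodie.datasource.*` option keys are the ones exposed by the Hudi Spark datasource):

    // Run e.g. in a spark-shell launched with:
    //   --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating
    val basePath = "hdfs:///tmp/hudi_trips_cow"   // hypothetical table location

    // Snapshot query: latest view of a COPY_ON_WRITE table; the glob should
    // match the table's partition depth.
    val snapshotDF = spark.read
      .format("org.apache.hudi")
      .load(basePath + "/*/*")

    // Incremental query: only records committed after the given instant time.
    val incrementalDF = spark.read
      .format("org.apache.hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", "20200222000000")
      .load(basePath)

    snapshotDF.createOrReplaceTempView("hudi_snapshot")
    spark.sql("select count(*) from hudi_snapshot").show()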
Review comment:
   Also need help here to add context on how Spark SQL integrates with Spark and Hive. Thanks!
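On the integration question above: Spark SQL resolves Hudi tables that hive sync has registered in the Hive metastore, and it falls back to Hudi's custom Hive input formats once Spark's native parquet reader is disabled for metastore tables. A minimal Scala sketch of that setup (an illustration, not part of the PR; `hudi_trips` is a hypothetical table name created by hive sync):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hudi-spark-sql-sketch")
      // Without this, Spark reads metastore parquet tables with its native
      // reader and bypasses Hudi's input formats.
      .config("spark.sql.hive.convertMetastoreParquet", "false")
      .enableHiveSupport()   // table names resolve through the Hive metastore
      .getOrCreate()

    spark.sql("select count(*) from hudi_trips").show()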