hudi-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From le...@apache.org
Subject [incubator-hudi] branch asf-site updated: [HUDI-634] Cut 0.5.2 documentation and write release note (#1390)
Date Thu, 12 Mar 2020 07:22:01 GMT
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 4d3d9bd  [HUDI-634] Cut 0.5.2 documentation and write release note (#1390)
4d3d9bd is described below

commit 4d3d9bd124410489cca1c2e764d1ba05f1a3474b
Author: vinoyang <yanghua1127@gmail.com>
AuthorDate: Thu Mar 12 15:21:49 2020 +0800

    [HUDI-634] Cut 0.5.2 documentation and write release note (#1390)
---
 docs/_config.yml                             |   29 +
 docs/_data/navigation.yml                    |   70 +-
 docs/_docs/0.5.2/0_1_s3_filesystem.cn.md     |   83 ++
 docs/_docs/0.5.2/0_1_s3_filesystem.md        |   82 ++
 docs/_docs/0.5.2/0_2_gcs_filesystem.cn.md    |   63 ++
 docs/_docs/0.5.2/0_2_gcs_filesystem.md       |   62 ++
 docs/_docs/0.5.2/0_3_migration_guide.cn.md   |   74 ++
 docs/_docs/0.5.2/0_3_migration_guide.md      |   72 ++
 docs/_docs/0.5.2/0_4_docker_demo.cn.md       | 1154 ++++++++++++++++++++++++
 docs/_docs/0.5.2/0_4_docker_demo.md          | 1232 ++++++++++++++++++++++++++
 docs/_docs/0.5.2/1_1_quick_start_guide.cn.md |  162 ++++
 docs/_docs/0.5.2/1_1_quick_start_guide.md    |  220 +++++
 docs/_docs/0.5.2/1_2_structure.md            |   22 +
 docs/_docs/0.5.2/1_3_use_cases.cn.md         |   69 ++
 docs/_docs/0.5.2/1_3_use_cases.md            |   68 ++
 docs/_docs/0.5.2/1_4_powered_by.cn.md        |   59 ++
 docs/_docs/0.5.2/1_4_powered_by.md           |   70 ++
 docs/_docs/0.5.2/1_5_comparison.cn.md        |   50 ++
 docs/_docs/0.5.2/1_5_comparison.md           |   58 ++
 docs/_docs/0.5.2/2_1_concepts.cn.md          |  156 ++++
 docs/_docs/0.5.2/2_1_concepts.md             |  173 ++++
 docs/_docs/0.5.2/2_2_writing_data.cn.md      |  224 +++++
 docs/_docs/0.5.2/2_2_writing_data.md         |  253 ++++++
 docs/_docs/0.5.2/2_3_querying_data.cn.md     |  178 ++++
 docs/_docs/0.5.2/2_3_querying_data.md        |  201 +++++
 docs/_docs/0.5.2/2_4_configurations.cn.md    |  474 ++++++++++
 docs/_docs/0.5.2/2_4_configurations.md       |  437 +++++++++
 docs/_docs/0.5.2/2_5_performance.cn.md       |   64 ++
 docs/_docs/0.5.2/2_5_performance.md          |   66 ++
 docs/_docs/0.5.2/2_6_deployment.cn.md        |  435 +++++++++
 docs/_docs/0.5.2/2_6_deployment.md           |  598 +++++++++++++
 docs/_docs/0.5.2/3_1_privacy.cn.md           |   25 +
 docs/_docs/0.5.2/3_1_privacy.md              |   24 +
 docs/_docs/0.5.2/3_2_docs_versions.cn.md     |   21 +
 docs/_docs/0.5.2/3_2_docs_versions.md        |   19 +
 docs/_includes/nav_list                      |    8 +
 docs/_includes/quick_link.html               |    2 +
 docs/_pages/index.md                         |    2 +-
 docs/_pages/releases.cn.md                   |   33 +-
 docs/_pages/releases.md                      |   29 +
 40 files changed, 7117 insertions(+), 4 deletions(-)

diff --git a/docs/_config.yml b/docs/_config.yml
index db8bba4..df8658d 100644
--- a/docs/_config.yml
+++ b/docs/_config.yml
@@ -14,6 +14,9 @@ previous_docs:
   - version: Latest
     en: /docs/quick-start-guide.html
     cn: /cn/docs/quick-start-guide.html
+  - version: 0.5.2
+    en: /docs/0.5.2-quick-start-guide.html
+    cn: /cn/docs/0.5.2-quick-start-guide.html
   - version: 0.5.1
     en: /docs/0.5.1-quick-start-guide.html
     cn: /cn/docs/0.5.1-quick-start-guide.html
@@ -144,6 +147,32 @@ cn_author:
       icon: "fa fa-navicon"
       url: "/security"
 
+0.5.2_author:
+  name             : "Quick Links"
+  bio              : "Hudi *ingests* & *manages* storage of large analytical datasets over DFS."
+  links:
+    - label: "Documentation"
+      icon: "fa fa-book"
+      url: "/docs/0.5.2-quick-start-guide"
+    - label: "Technical Wiki"
+      icon: "fa fa-wikipedia-w"
+      url: "https://cwiki.apache.org/confluence/display/HUDI"
+    - label: "Contribution Guide"
+      icon: "fa fa-thumbs-o-up"
+      url: "/contributing"
+    - label: "Join on Slack"
+      icon: "fa fa-slack"
+      url: "https://join.slack.com/t/apache-hudi/shared_invite/enQtODYyNDAxNzc5MTg2LTE5OTBlYmVhYjM0N2ZhOTJjOWM4YzBmMWU2MjZjMGE4NDc5ZDFiOGQ2N2VkYTVkNzU3ZDQ4OTI1NmFmYWQ0NzE"
+    - label: "Fork on GitHub"
+      icon: "fa fa-github"
+      url: "https://github.com/apache/incubator-hudi"
+    - label: "Report Issues"
+      icon: "fa fa-navicon"
+      url: "https://issues.apache.org/jira/projects/HUDI/summary"
+    - label: "Report Security Issues"
+      icon: "fa fa-navicon"
+      url: "/security"
+
 # Layout Defaults
 defaults:
   # _posts
diff --git a/docs/_data/navigation.yml b/docs/_data/navigation.yml
index 184b882..915af56 100644
--- a/docs/_data/navigation.yml
+++ b/docs/_data/navigation.yml
@@ -229,4 +229,72 @@ cn_docs:
       - title: "文档版本"
         url: /cn/docs/0.5.1-docs-versions.html
       - title: "版权信息"
-        url: /cn/docs/0.5.1-privacy.html
\ No newline at end of file
+        url: /cn/docs/0.5.1-privacy.html
+
+0.5.2_docs:
+  - title: Getting Started
+    children:
+      - title: "Quick Start"
+        url: /docs/0.5.2-quick-start-guide.html
+      - title: "Use Cases"
+        url: /docs/0.5.2-use_cases.html
+      - title: "Talks & Powered By"
+        url: /docs/0.5.2-powered_by.html
+      - title: "Comparison"
+        url: /docs/0.5.2-comparison.html
+      - title: "Docker Demo"
+        url: /docs/0.5.2-docker_demo.html
+  - title: Documentation
+    children:
+      - title: "Concepts"
+        url: /docs/0.5.2-concepts.html
+      - title: "Writing Data"
+        url: /docs/0.5.2-writing_data.html
+      - title: "Querying Data"
+        url: /docs/0.5.2-querying_data.html
+      - title: "Configuration"
+        url: /docs/0.5.2-configurations.html
+      - title: "Performance"
+        url: /docs/0.5.2-performance.html
+      - title: "Deployment"
+        url: /docs/0.5.2-deployment.html
+  - title: INFO
+    children:
+      - title: "Docs Versions"
+        url: /docs/0.5.2-docs-versions.html
+      - title: "Privacy Policy"
+        url: /docs/0.5.2-privacy.html
+
+0.5.2_cn_docs:
+  - title: 入门指南
+    children:
+      - title: "快速开始"
+        url: /cn/docs/0.5.2-quick-start-guide.html
+      - title: "使用案例"
+        url: /cn/docs/0.5.2-use_cases.html
+      - title: "演讲 & hudi 用户"
+        url: /cn/docs/0.5.2-powered_by.html
+      - title: "对比"
+        url: /cn/docs/0.5.2-comparison.html
+      - title: "Docker 示例"
+        url: /cn/docs/0.5.2-docker_demo.html
+  - title: 帮助文档
+    children:
+      - title: "概念"
+        url: /cn/docs/0.5.2-concepts.html
+      - title: "写入数据"
+        url: /cn/docs/0.5.2-writing_data.html
+      - title: "查询数据"
+        url: /cn/docs/0.5.2-querying_data.html
+      - title: "配置"
+        url: /cn/docs/0.5.2-configurations.html
+      - title: "性能"
+        url: /cn/docs/0.5.2-performance.html
+      - title: "管理"
+        url: /cn/docs/0.5.2-deployment.html
+  - title: 其他信息
+    children:
+      - title: "文档版本"
+        url: /cn/docs/0.5.2-docs-versions.html
+      - title: "版权信息"
+        url: /cn/docs/0.5.2-privacy.html
\ No newline at end of file
diff --git a/docs/_docs/0.5.2/0_1_s3_filesystem.cn.md b/docs/_docs/0.5.2/0_1_s3_filesystem.cn.md
new file mode 100644
index 0000000..1d9a7bb
--- /dev/null
+++ b/docs/_docs/0.5.2/0_1_s3_filesystem.cn.md
@@ -0,0 +1,83 @@
+---
+version: 0.5.2
+title: S3 Filesystem
+keywords: hudi, hive, aws, s3, spark, presto
+permalink: /cn/docs/0.5.2-s3_hoodie.html
+summary: In this page, we go over how to configure Hudi with S3 filesystem.
+last_modified_at: 2019-12-30T15:59:57-04:00
+language: cn
+---
+In this page, we explain how to get your Hudi spark job to store into AWS S3.
+
+## AWS configs
+
+There are two configurations required for Hudi-S3 compatibility:
+
+- Adding AWS Credentials for Hudi
+- Adding required Jars to classpath
+
+### AWS Credentials
+
+Simplest way to use Hudi with S3, is to configure your `SparkSession` or `SparkContext` with S3 credentials. Hudi will automatically pick this up and talk to S3.
+
+Alternatively, add the required configs in your core-site.xml from where Hudi can fetch them. Replace the `fs.defaultFS` with your S3 bucket name and Hudi should be able to read/write from the bucket.
+
+```xml
+  <property>
+      <name>fs.defaultFS</name>
+      <value>s3://ysharma</value>
+  </property>
+
+  <property>
+      <name>fs.s3.impl</name>
+      <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
+  </property>
+
+  <property>
+      <name>fs.s3.awsAccessKeyId</name>
+      <value>AWS_KEY</value>
+  </property>
+
+  <property>
+       <name>fs.s3.awsSecretAccessKey</name>
+       <value>AWS_SECRET</value>
+  </property>
+
+  <property>
+       <name>fs.s3n.awsAccessKeyId</name>
+       <value>AWS_KEY</value>
+  </property>
+
+  <property>
+       <name>fs.s3n.awsSecretAccessKey</name>
+       <value>AWS_SECRET</value>
+  </property>
+```
+
+
+Utilities such as hudi-cli or deltastreamer tool, can pick up s3 creds via environmental variable prefixed with `HOODIE_ENV_`. For e.g below is a bash snippet to setup
+such variables and then have cli be able to work on datasets stored in s3
+
+```java
+export HOODIE_ENV_fs_DOT_s3a_DOT_access_DOT_key=$accessKey
+export HOODIE_ENV_fs_DOT_s3a_DOT_secret_DOT_key=$secretKey
+export HOODIE_ENV_fs_DOT_s3_DOT_awsAccessKeyId=$accessKey
+export HOODIE_ENV_fs_DOT_s3_DOT_awsSecretAccessKey=$secretKey
+export HOODIE_ENV_fs_DOT_s3n_DOT_awsAccessKeyId=$accessKey
+export HOODIE_ENV_fs_DOT_s3n_DOT_awsSecretAccessKey=$secretKey
+export HOODIE_ENV_fs_DOT_s3n_DOT_impl=org.apache.hadoop.fs.s3a.S3AFileSystem
+```
+
+
+
+### AWS Libs
+
+AWS hadoop libraries to add to our classpath
+
+ - com.amazonaws:aws-java-sdk:1.10.34
+ - org.apache.hadoop:hadoop-aws:2.7.3
+
+AWS glue data libraries are needed if AWS glue data is used
+
+ - com.amazonaws.glue:aws-glue-datacatalog-hive2-client:1.11.0
+ - com.amazonaws:aws-java-sdk-glue:1.11.475
diff --git a/docs/_docs/0.5.2/0_1_s3_filesystem.md b/docs/_docs/0.5.2/0_1_s3_filesystem.md
new file mode 100644
index 0000000..f6d8faa
--- /dev/null
+++ b/docs/_docs/0.5.2/0_1_s3_filesystem.md
@@ -0,0 +1,82 @@
+---
+version: 0.5.2
+title: S3 Filesystem
+keywords: hudi, hive, aws, s3, spark, presto
+permalink: /docs/0.5.2-s3_hoodie.html
+summary: In this page, we go over how to configure Hudi with S3 filesystem.
+last_modified_at: 2019-12-30T15:59:57-04:00
+---
+In this page, we explain how to get your Hudi spark job to store into AWS S3.
+
+## AWS configs
+
+There are two configurations required for Hudi-S3 compatibility:
+
+- Adding AWS Credentials for Hudi
+- Adding required Jars to classpath
+
+### AWS Credentials
+
+Simplest way to use Hudi with S3, is to configure your `SparkSession` or `SparkContext` with S3 credentials. Hudi will automatically pick this up and talk to S3.
+
+Alternatively, add the required configs in your core-site.xml from where Hudi can fetch them. Replace the `fs.defaultFS` with your S3 bucket name and Hudi should be able to read/write from the bucket.
+
+```xml
+  <property>
+      <name>fs.defaultFS</name>
+      <value>s3://ysharma</value>
+  </property>
+
+  <property>
+      <name>fs.s3.impl</name>
+      <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
+  </property>
+
+  <property>
+      <name>fs.s3.awsAccessKeyId</name>
+      <value>AWS_KEY</value>
+  </property>
+
+  <property>
+       <name>fs.s3.awsSecretAccessKey</name>
+       <value>AWS_SECRET</value>
+  </property>
+
+  <property>
+       <name>fs.s3n.awsAccessKeyId</name>
+       <value>AWS_KEY</value>
+  </property>
+
+  <property>
+       <name>fs.s3n.awsSecretAccessKey</name>
+       <value>AWS_SECRET</value>
+  </property>
+```
+
+
+Utilities such as hudi-cli or deltastreamer tool, can pick up s3 creds via environmental variable prefixed with `HOODIE_ENV_`. For e.g below is a bash snippet to setup
+such variables and then have cli be able to work on datasets stored in s3
+
+```java
+export HOODIE_ENV_fs_DOT_s3a_DOT_access_DOT_key=$accessKey
+export HOODIE_ENV_fs_DOT_s3a_DOT_secret_DOT_key=$secretKey
+export HOODIE_ENV_fs_DOT_s3_DOT_awsAccessKeyId=$accessKey
+export HOODIE_ENV_fs_DOT_s3_DOT_awsSecretAccessKey=$secretKey
+export HOODIE_ENV_fs_DOT_s3n_DOT_awsAccessKeyId=$accessKey
+export HOODIE_ENV_fs_DOT_s3n_DOT_awsSecretAccessKey=$secretKey
+export HOODIE_ENV_fs_DOT_s3n_DOT_impl=org.apache.hadoop.fs.s3a.S3AFileSystem
+```
+
+
+
+### AWS Libs
+
+AWS hadoop libraries to add to our classpath
+
+ - com.amazonaws:aws-java-sdk:1.10.34
+ - org.apache.hadoop:hadoop-aws:2.7.3
+
+AWS glue data libraries are needed if AWS glue data is used
+
+ - com.amazonaws.glue:aws-glue-datacatalog-hive2-client:1.11.0
+ - com.amazonaws:aws-java-sdk-glue:1.11.475
diff --git a/docs/_docs/0.5.2/0_2_gcs_filesystem.cn.md b/docs/_docs/0.5.2/0_2_gcs_filesystem.cn.md
new file mode 100644
index 0000000..dd77443
--- /dev/null
+++ b/docs/_docs/0.5.2/0_2_gcs_filesystem.cn.md
@@ -0,0 +1,63 @@
+---
+version: 0.5.2
+title: GCS Filesystem
+keywords: hudi, hive, google cloud, storage, spark, presto
+permalink: /cn/docs/0.5.2-gcs_hoodie.html
+summary: In this page, we go over how to configure hudi with Google Cloud Storage.
+last_modified_at: 2019-12-30T15:59:57-04:00
+language: cn
+---
+For Hudi storage on GCS, **regional** buckets provide an DFS API with strong consistency.
+
+## GCS Configs
+
+There are two configurations required for Hudi GCS compatibility:
+
+- Adding GCS Credentials for Hudi
+- Adding required jars to classpath
+
+### GCS Credentials
+
+Add the required configs in your core-site.xml from where Hudi can fetch them. Replace the `fs.defaultFS` with your GCS bucket name and Hudi should be able to read/write from the bucket.
+
+```xml
+  <property>
+    <name>fs.defaultFS</name>
+    <value>gs://hudi-bucket</value>
+  </property>
+
+  <property>
+    <name>fs.gs.impl</name>
+    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
+    <description>The FileSystem for gs: (GCS) uris.</description>
+  </property>
+
+  <property>
+    <name>fs.AbstractFileSystem.gs.impl</name>
+    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
+    <description>The AbstractFileSystem for gs: (GCS) uris.</description>
+  </property>
+
+  <property>
+    <name>fs.gs.project.id</name>
+    <value>GCS_PROJECT_ID</value>
+  </property>
+  <property>
+    <name>google.cloud.auth.service.account.enable</name>
+    <value>true</value>
+  </property>
+  <property>
+    <name>google.cloud.auth.service.account.email</name>
+    <value>GCS_SERVICE_ACCOUNT_EMAIL</value>
+  </property>
+  <property>
+    <name>google.cloud.auth.service.account.keyfile</name>
+    <value>GCS_SERVICE_ACCOUNT_KEYFILE</value>
+  </property>
+```
+
+### GCS Libs
+
+GCS hadoop libraries to add to our classpath
+
+- com.google.cloud.bigdataoss:gcs-connector:1.6.0-hadoop2
diff --git a/docs/_docs/0.5.2/0_2_gcs_filesystem.md b/docs/_docs/0.5.2/0_2_gcs_filesystem.md
new file mode 100644
index 0000000..b72652e
--- /dev/null
+++ b/docs/_docs/0.5.2/0_2_gcs_filesystem.md
@@ -0,0 +1,62 @@
+---
+version: 0.5.2
+title: GCS Filesystem
+keywords: hudi, hive, google cloud, storage, spark, presto
+permalink: /docs/0.5.2-gcs_hoodie.html
+summary: In this page, we go over how to configure hudi with Google Cloud Storage.
+last_modified_at: 2019-12-30T15:59:57-04:00
+---
+For Hudi storage on GCS, **regional** buckets provide an DFS API with strong consistency.
+
+## GCS Configs
+
+There are two configurations required for Hudi GCS compatibility:
+
+- Adding GCS Credentials for Hudi
+- Adding required jars to classpath
+
+### GCS Credentials
+
+Add the required configs in your core-site.xml from where Hudi can fetch them. Replace the `fs.defaultFS` with your GCS bucket name and Hudi should be able to read/write from the bucket.
+
+```xml
+  <property>
+    <name>fs.defaultFS</name>
+    <value>gs://hudi-bucket</value>
+  </property>
+
+  <property>
+    <name>fs.gs.impl</name>
+    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
+    <description>The FileSystem for gs: (GCS) uris.</description>
+  </property>
+
+  <property>
+    <name>fs.AbstractFileSystem.gs.impl</name>
+    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
+    <description>The AbstractFileSystem for gs: (GCS) uris.</description>
+  </property>
+
+  <property>
+    <name>fs.gs.project.id</name>
+    <value>GCS_PROJECT_ID</value>
+  </property>
+  <property>
+    <name>google.cloud.auth.service.account.enable</name>
+    <value>true</value>
+  </property>
+  <property>
+    <name>google.cloud.auth.service.account.email</name>
+    <value>GCS_SERVICE_ACCOUNT_EMAIL</value>
+  </property>
+  <property>
+    <name>google.cloud.auth.service.account.keyfile</name>
+    <value>GCS_SERVICE_ACCOUNT_KEYFILE</value>
+  </property>
+```
+
+### GCS Libs
+
+GCS hadoop libraries to add to our classpath
+
+- com.google.cloud.bigdataoss:gcs-connector:1.6.0-hadoop2
diff --git a/docs/_docs/0.5.2/0_3_migration_guide.cn.md b/docs/_docs/0.5.2/0_3_migration_guide.cn.md
new file mode 100644
index 0000000..d716f4b
--- /dev/null
+++ b/docs/_docs/0.5.2/0_3_migration_guide.cn.md
@@ -0,0 +1,74 @@
+---
+version: 0.5.2
+title: Migration Guide
+keywords: hudi, migration, use case
+permalink: /cn/docs/0.5.2-migration_guide.html
+summary: In this page, we will discuss some available tools for migrating your existing dataset into a Hudi dataset
+last_modified_at: 2019-12-30T15:59:57-04:00
+language: cn
+---
+
+Hudi maintains metadata such as commit timeline and indexes to manage a dataset. The commit timelines helps to understand the actions happening on a dataset as well as the current state of a dataset. Indexes are used by Hudi to maintain a record key to file id mapping to efficiently locate a record. At the moment, Hudi supports writing only parquet columnar formats.
+To be able to start using Hudi for your existing dataset, you will need to migrate your existing dataset into a Hudi managed dataset. There are a couple of ways to achieve this.
+
+
+## Approaches
+
+
+### Use Hudi for new partitions alone
+
+Hudi can be used to manage an existing dataset without affecting/altering the historical data already present in the
+dataset. Hudi has been implemented to be compatible with such a mixed dataset with a caveat that either the complete
+Hive partition is Hudi managed or not. Thus the lowest granularity at which Hudi manages a dataset is a Hive
+partition. Start using the datasource API or the WriteClient to write to the dataset and make sure you start writing
+to a new partition or convert your last N partitions into Hudi instead of the entire table. Note, since the historical
+ partitions are not managed by HUDI, none of the primitives provided by HUDI work on the data in those partitions. More concretely, one cannot perform upserts or incremental pull on such older partitions not managed by the HUDI dataset.
+Take this approach if your dataset is an append only type of dataset and you do not expect to perform any updates to existing (or non Hudi managed) partitions.
+
+
+### Convert existing dataset to Hudi
+
+Import your existing dataset into a Hudi managed dataset. Since all the data is Hudi managed, none of the limitations
+ of Approach 1 apply here. Updates spanning any partitions can be applied to this dataset and Hudi will efficiently
+ make the update available to queries. Note that not only do you get to use all Hudi primitives on this dataset,
+ there are other additional advantages of doing this. Hudi automatically manages file sizes of a Hudi managed dataset
+ . You can define the desired file size when converting this dataset and Hudi will ensure it writes out files
+ adhering to the config. It will also ensure that smaller files later get corrected by routing some new inserts into
+ small files rather than writing new small ones thus maintaining the health of your cluster.
+
+There are a few options when choosing this approach.
+
+**Option 1**
+Use the HDFSParquetImporter tool. As the name suggests, this only works if your existing dataset is in parquet file format.
+This tool essentially starts a Spark Job to read the existing parquet dataset and converts it into a HUDI managed dataset by re-writing all the data.
+
+**Option 2**
+For huge datasets, this could be as simple as : 
+```java
+for partition in [list of partitions in source dataset] {
+        val inputDF = spark.read.format("any_input_format").load("partition_path")
+        inputDF.write.format("org.apache.hudi").option()....save("basePath")
+}
+```  
+
+**Option 3**
+Write your own custom logic of how to load an existing dataset into a Hudi managed one. Please read about the RDD API
+ [here](/cn/docs/0.5.2-quick-start-guide.html). Using the HDFSParquetImporter Tool. Once hudi has been built via `mvn clean install -DskipTests`, the shell can be
+fired by via `cd hudi-cli && ./hudi-cli.sh`.
+
+```java
+hudi->hdfsparquetimport
+        --upsert false
+        --srcPath /user/parquet/dataset/basepath
+        --targetPath
+        /user/hoodie/dataset/basepath
+        --tableName hoodie_table
+        --tableType COPY_ON_WRITE
+        --rowKeyField _row_key
+        --partitionPathField partitionStr
+        --parallelism 1500
+        --schemaFilePath /user/table/schema
+        --format parquet
+        --sparkMemory 6g
+        --retry 2
+```
diff --git a/docs/_docs/0.5.2/0_3_migration_guide.md b/docs/_docs/0.5.2/0_3_migration_guide.md
new file mode 100644
index 0000000..013b149
--- /dev/null
+++ b/docs/_docs/0.5.2/0_3_migration_guide.md
@@ -0,0 +1,72 @@
+---
+version: 0.5.2
+title: Migration Guide
+keywords: hudi, migration, use case
+permalink: /docs/0.5.2-migration_guide.html
+summary: In this page, we will discuss some available tools for migrating your existing table into a Hudi table
+last_modified_at: 2019-12-30T15:59:57-04:00
+---
+
+Hudi maintains metadata such as commit timeline and indexes to manage a table. The commit timelines helps to understand the actions happening on a table as well as the current state of a table. Indexes are used by Hudi to maintain a record key to file id mapping to efficiently locate a record. At the moment, Hudi supports writing only parquet columnar formats.
+To be able to start using Hudi for your existing table, you will need to migrate your existing table into a Hudi managed table. There are a couple of ways to achieve this.
+
+
+## Approaches
+
+
+### Use Hudi for new partitions alone
+
+Hudi can be used to manage an existing table without affecting/altering the historical data already present in the
+table. Hudi has been implemented to be compatible with such a mixed table with a caveat that either the complete
+Hive partition is Hudi managed or not. Thus the lowest granularity at which Hudi manages a table is a Hive
+partition. Start using the datasource API or the WriteClient to write to the table and make sure you start writing
+to a new partition or convert your last N partitions into Hudi instead of the entire table. Note, since the historical
+ partitions are not managed by HUDI, none of the primitives provided by HUDI work on the data in those partitions. More concretely, one cannot perform upserts or incremental pull on such older partitions not managed by the HUDI table.
+Take this approach if your table is an append only type of table and you do not expect to perform any updates to existing (or non Hudi managed) partitions.
+
+
+### Convert existing table to Hudi
+
+Import your existing table into a Hudi managed table. Since all the data is Hudi managed, none of the limitations
+ of Approach 1 apply here. Updates spanning any partitions can be applied to this table and Hudi will efficiently
+ make the update available to queries. Note that not only do you get to use all Hudi primitives on this table,
+ there are other additional advantages of doing this. Hudi automatically manages file sizes of a Hudi managed table
+ . You can define the desired file size when converting this table and Hudi will ensure it writes out files
+ adhering to the config. It will also ensure that smaller files later get corrected by routing some new inserts into
+ small files rather than writing new small ones thus maintaining the health of your cluster.
+
+There are a few options when choosing this approach.
+
+**Option 1**
+Use the HDFSParquetImporter tool. As the name suggests, this only works if your existing table is in parquet file format.
+This tool essentially starts a Spark Job to read the existing parquet table and converts it into a HUDI managed table by re-writing all the data.
+
+**Option 2**
+For huge tables, this could be as simple as : 
+```java
+for partition in [list of partitions in source table] {
+        val inputDF = spark.read.format("any_input_format").load("partition_path")
+        inputDF.write.format("org.apache.hudi").option()....save("basePath")
+}
+```  
+
+**Option 3**
+Write your own custom logic of how to load an existing table into a Hudi managed one. Please read about the RDD API
+ [here](/docs/0.5.2-quick-start-guide.html). Using the HDFSParquetImporter Tool. Once hudi has been built via `mvn clean install -DskipTests`, the shell can be
+fired by via `cd hudi-cli && ./hudi-cli.sh`.
+
+```java
+hudi->hdfsparquetimport
+        --upsert false
+        --srcPath /user/parquet/table/basepath
+        --targetPath /user/hoodie/table/basepath
+        --tableName hoodie_table
+        --tableType COPY_ON_WRITE
+        --rowKeyField _row_key
+        --partitionPathField partitionStr
+        --parallelism 1500
+        --schemaFilePath /user/table/schema
+        --format parquet
+        --sparkMemory 6g
+        --retry 2
+```
diff --git a/docs/_docs/0.5.2/0_4_docker_demo.cn.md b/docs/_docs/0.5.2/0_4_docker_demo.cn.md
new file mode 100644
index 0000000..c1b1340
--- /dev/null
+++ b/docs/_docs/0.5.2/0_4_docker_demo.cn.md
@@ -0,0 +1,1154 @@
+---
+version: 0.5.2
+title: Docker Demo
+keywords: hudi, docker, demo
+permalink: /cn/docs/0.5.2-docker_demo.html
+toc: true
+last_modified_at: 2019-12-30T15:59:57-04:00
+language: cn
+---
+
+## A Demo using docker containers
+
+Lets use a real world example to see how hudi works end to end. For this purpose, a self contained
+data infrastructure is brought up in a local docker cluster within your computer.
+
+The steps have been tested on a Mac laptop
+
+### Prerequisites
+
+  * Docker Setup :  For Mac, Please follow the steps as defined in [https://docs.docker.com/v17.12/docker-for-mac/install/]. For running Spark-SQL queries, please ensure atleast 6 GB and 4 CPUs are allocated to Docker (See Docker -> Preferences -> Advanced). Otherwise, spark-SQL queries could be killed because of memory issues.
+  * kafkacat : A command-line utility to publish/consume from kafka topics. Use `brew install kafkacat` to install kafkacat
+  * /etc/hosts : The demo references many services running in container by the hostname. Add the following settings to /etc/hosts
+
+
+```java
+   127.0.0.1 adhoc-1
+   127.0.0.1 adhoc-2
+   127.0.0.1 namenode
+   127.0.0.1 datanode1
+   127.0.0.1 hiveserver
+   127.0.0.1 hivemetastore
+   127.0.0.1 kafkabroker
+   127.0.0.1 sparkmaster
+   127.0.0.1 zookeeper
+```
+
+Also, this has not been tested on some environments like Docker on Windows.
+
+
+## Setting up Docker Cluster
+
+
+### Build Hudi
+
+The first step is to build hudi
+```java
+cd <HUDI_WORKSPACE>
+mvn package -DskipTests
+```
+
+### Bringing up Demo Cluster
+
+The next step is to run the docker compose script and setup configs for bringing up the cluster.
+This should pull the docker images from docker hub and setup docker cluster.
+
+```java
+cd docker
+./setup_demo.sh
+....
+....
+....
+Stopping spark-worker-1            ... done
+Stopping hiveserver                ... done
+Stopping hivemetastore             ... done
+Stopping historyserver             ... done
+.......
+......
+Creating network "hudi_demo" with the default driver
+Creating hive-metastore-postgresql ... done
+Creating namenode                  ... done
+Creating zookeeper                 ... done
+Creating kafkabroker               ... done
+Creating hivemetastore             ... done
+Creating historyserver             ... done
+Creating hiveserver                ... done
+Creating datanode1                 ... done
+Creating presto-coordinator-1      ... done
+Creating sparkmaster               ... done
+Creating presto-worker-1           ... done
+Creating adhoc-1                   ... done
+Creating adhoc-2                   ... done
+Creating spark-worker-1            ... done
+Copying spark default config and setting up configs
+Copying spark default config and setting up configs
+Copying spark default config and setting up configs
+$ docker ps
+```
+
+At this point, the docker cluster will be up and running. The demo cluster brings up the following services
+
+   * HDFS Services (NameNode, DataNode)
+   * Spark Master and Worker
+   * Hive Services (Metastore, HiveServer2 along with PostgresDB)
+   * Kafka Broker and a Zookeeper Node (Kakfa will be used as upstream source for the demo)
+   * Adhoc containers to run Hudi/Hive CLI commands
+
+## Demo
+
+Stock Tracker data will be used to showcase both different Hudi Views and the effects of Compaction.
+
+Take a look at the directory `docker/demo/data`. There are 2 batches of stock data - each at 1 minute granularity.
+The first batch contains stocker tracker data for some stock symbols during the first hour of trading window
+(9:30 a.m to 10:30 a.m). The second batch contains tracker data for next 30 mins (10:30 - 11 a.m). Hudi will
+be used to ingest these batches to a dataset which will contain the latest stock tracker data at hour level granularity.
+The batches are windowed intentionally so that the second batch contains updates to some of the rows in the first batch.
+
+### Step 1 : Publish the first batch to Kafka
+
+Upload the first batch to Kafka topic 'stock ticks' `cat docker/demo/data/batch_1.json | kafkacat -b kafkabroker -t stock_ticks -P`
+
+To check if the new topic shows up, use
+```java
+kafkacat -b kafkabroker -L -J | jq .
+{
+  "originating_broker": {
+    "id": 1001,
+    "name": "kafkabroker:9092/1001"
+  },
+  "query": {
+    "topic": "*"
+  },
+  "brokers": [
+    {
+      "id": 1001,
+      "name": "kafkabroker:9092"
+    }
+  ],
+  "topics": [
+    {
+      "topic": "stock_ticks",
+      "partitions": [
+        {
+          "partition": 0,
+          "leader": 1001,
+          "replicas": [
+            {
+              "id": 1001
+            }
+          ],
+          "isrs": [
+            {
+              "id": 1001
+            }
+          ]
+        }
+      ]
+    }
+  ]
+}
+
+```
+
+### Step 2: Incrementally ingest data from Kafka topic
+
+Hudi comes with a tool named DeltaStreamer. This tool can connect to variety of data sources (including Kafka) to
+pull changes and apply to Hudi dataset using upsert/insert primitives. Here, we will use the tool to download
+json data from kafka topic and ingest to both COW and MOR tables we initialized in the previous step. This tool
+automatically initializes the datasets in the file-system if they do not exist yet.
+
+```java
+docker exec -it adhoc-2 /bin/bash
+
+# Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_cow dataset in HDFS
+spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts  --target-base-path /user/hive/warehouse/stock_ticks_cow --target-table stock_ticks_cow --props /var/demo/config/kafka-source.properties --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
+
+
+# Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_mor dataset in HDFS
+spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts  --target-base-path /user/hive/warehouse/stock_ticks_mor --target-table stock_ticks_mor --props /var/demo/config/kafka-source.properties --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider --disable-compaction
+
+
+# As part of the setup (Look at setup_demo.sh), the configs needed for DeltaStreamer is uploaded to HDFS. The configs
+# contain mostly Kafa connectivity settings, the avro-schema to be used for ingesting along with key and partitioning fields.
+
+exit
+```
+
+You can use HDFS web-browser to look at the datasets
+`http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_cow`.
+
+You can explore the new partition folder created in the dataset along with a "deltacommit"
+file under .hoodie which signals a successful commit.
+
+There will be a similar setup when you browse the MOR dataset
+`http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_mor`
+
+
+### Step 3: Sync with Hive
+
+At this step, the datasets are available in HDFS. We need to sync with Hive to create new Hive tables and add partitions
+inorder to run Hive queries against those datasets.
+
+```java
+docker exec -it adhoc-2 /bin/bash
+
+# THis command takes in HIveServer URL and COW Hudi Dataset location in HDFS and sync the HDFS state to Hive
+/var/hoodie/ws/hudi-hive-sync/run_sync_tool.sh  --jdbc-url jdbc:hive2://hiveserver:10000 --user hive --pass hive --partitioned-by dt --base-path /user/hive/warehouse/stock_ticks_cow --database default --table stock_ticks_cow
+.....
+2018-09-24 22:22:45,568 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(112)) - Sync complete for stock_ticks_cow
+.....
+
+# Now run hive-sync for the second data-set in HDFS using Merge-On-Read (MOR storage)
+/var/hoodie/ws/hudi-hive-sync/run_sync_tool.sh  --jdbc-url jdbc:hive2://hiveserver:10000 --user hive --pass hive --partitioned-by dt --base-path /user/hive/warehouse/stock_ticks_mor --database default --table stock_ticks_mor
+...
+2018-09-24 22:23:09,171 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(112)) - Sync complete for stock_ticks_mor
+...
+2018-09-24 22:23:09,559 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(112)) - Sync complete for stock_ticks_mor_rt
+....
+exit
+```
+After executing the above command, you will notice
+
+1. A hive table named `stock_ticks_cow` created which provides Read-Optimized view for the Copy On Write dataset.
+2. Two new tables `stock_ticks_mor` and `stock_ticks_mor_rt` created for the Merge On Read dataset. The former
+provides the ReadOptimized view for the Hudi dataset and the later provides the realtime-view for the dataset.
+
+
+### Step 4 (a): Run Hive Queries
+
+Run a hive query to find the latest timestamp ingested for stock symbol 'GOOG'. You will notice that both read-optimized
+(for both COW and MOR dataset)and realtime views (for MOR dataset)give the same value "10:29 a.m" as Hudi create a
+parquet file for the first batch of data.
+
+```java
+docker exec -it adhoc-2 /bin/bash
+beeline -u jdbc:hive2://hiveserver:10000 --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat --hiveconf hive.stats.autogather=false
+# List Tables
+0: jdbc:hive2://hiveserver:10000> show tables;
++---------------------+--+
+|      tab_name       |
++---------------------+--+
+| stock_ticks_cow     |
+| stock_ticks_mor     |
+| stock_ticks_mor_rt  |
++---------------------+--+
+2 rows selected (0.801 seconds)
+0: jdbc:hive2://hiveserver:10000>
+
+
+# Look at partitions that were added
+0: jdbc:hive2://hiveserver:10000> show partitions stock_ticks_mor_rt;
++----------------+--+
+|   partition    |
++----------------+--+
+| dt=2018-08-31  |
++----------------+--+
+1 row selected (0.24 seconds)
+
+
+# COPY-ON-WRITE Queries:
+=========================
+
+
+0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
++---------+----------------------+--+
+| symbol  |         _c1          |
++---------+----------------------+--+
+| GOOG    | 2018-08-31 10:29:00  |
++---------+----------------------+--+
+
+Now, run a projection query:
+
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG';
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924221953       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924221953       | GOOG    | 2018-08-31 10:29:00  | 3391    | 1230.1899  | 1230.085  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+
+
+# Merge-On-Read Queries:
+==========================
+
+Lets run similar queries against M-O-R dataset. Lets look at both
+ReadOptimized and Realtime views supported by M-O-R dataset
+
+# Run against ReadOptimized View. Notice that the latest timestamp is 10:29
+0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG';
+WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
++---------+----------------------+--+
+| symbol  |         _c1          |
++---------+----------------------+--+
+| GOOG    | 2018-08-31 10:29:00  |
++---------+----------------------+--+
+1 row selected (6.326 seconds)
+
+
+# Run against Realtime View. Notice that the latest timestamp is again 10:29
+
+0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
+WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
++---------+----------------------+--+
+| symbol  |         _c1          |
++---------+----------------------+--+
+| GOOG    | 2018-08-31 10:29:00  |
++---------+----------------------+--+
+1 row selected (1.606 seconds)
+
+
+# Run projection query against Read Optimized and Realtime tables
+
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor where  symbol = 'GOOG';
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924222155       | GOOG    | 2018-08-31 10:29:00  | 3391    | 1230.1899  | 1230.085  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG';
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924222155       | GOOG    | 2018-08-31 10:29:00  | 3391    | 1230.1899  | 1230.085  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+
+exit
+exit
+```
+
+### Step 4 (b): Run Spark-SQL Queries
+Hudi support Spark as query processor just like Hive. Here are the same hive queries
+running in spark-sql
+
+```java
+docker exec -it adhoc-1 /bin/bash
+$SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --master local[2] --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client  --driver-memory 1G --executor-memory 3G --num-executors 1  --packages com.databricks:spark-avro_2.11:4.0.0
+...
+
+Welcome to
+      ____              __
+     / __/__  ___ _____/ /__
+    _\ \/ _ \/ _ `/ __/  '_/
+   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
+      /_/
+
+Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
+Type in expressions to have them evaluated.
+Type :help for more information.
+
+scala>
+scala> spark.sql("show tables").show(100, false)
++--------+------------------+-----------+
+|database|tableName         |isTemporary|
++--------+------------------+-----------+
+|default |stock_ticks_cow   |false      |
+|default |stock_ticks_mor   |false      |
+|default |stock_ticks_mor_rt|false      |
++--------+------------------+-----------+
+
+# Copy-On-Write Table
+
+## Run max timestamp query against COW table
+
+scala> spark.sql("select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG'").show(100, false)
+[Stage 0:>                                                          (0 + 1) / 1]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
+SLF4J: Defaulting to no-operation (NOP) logger implementation
+SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
++------+-------------------+
+|symbol|max(ts)            |
++------+-------------------+
+|GOOG  |2018-08-31 10:29:00|
++------+-------------------+
+
+## Projection Query
+
+scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG'").show(100, false)
++-------------------+------+-------------------+------+---------+--------+
+|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
++-------------------+------+-------------------+------+---------+--------+
+|20180924221953     |GOOG  |2018-08-31 09:59:00|6330  |1230.5   |1230.02 |
+|20180924221953     |GOOG  |2018-08-31 10:29:00|3391  |1230.1899|1230.085|
++-------------------+------+-------------------+------+---------+--------+
+
+# Merge-On-Read Queries:
+==========================
+
+Lets run similar queries against M-O-R dataset. Lets look at both
+ReadOptimized and Realtime views supported by M-O-R dataset
+
+# Run against ReadOptimized View. Notice that the latest timestamp is 10:29
+scala> spark.sql("select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG'").show(100, false)
++------+-------------------+
+|symbol|max(ts)            |
++------+-------------------+
+|GOOG  |2018-08-31 10:29:00|
++------+-------------------+
+
+
+# Run against Realtime View. Notice that the latest timestamp is again 10:29
+
+scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG'").show(100, false)
++------+-------------------+
+|symbol|max(ts)            |
++------+-------------------+
+|GOOG  |2018-08-31 10:29:00|
++------+-------------------+
+
+# Run projection query against Read Optimized and Realtime tables
+
+scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor where  symbol = 'GOOG'").show(100, false)
++-------------------+------+-------------------+------+---------+--------+
+|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
++-------------------+------+-------------------+------+---------+--------+
+|20180924222155     |GOOG  |2018-08-31 09:59:00|6330  |1230.5   |1230.02 |
+|20180924222155     |GOOG  |2018-08-31 10:29:00|3391  |1230.1899|1230.085|
++-------------------+------+-------------------+------+---------+--------+
+
+scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG'").show(100, false)
++-------------------+------+-------------------+------+---------+--------+
+|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
++-------------------+------+-------------------+------+---------+--------+
+|20180924222155     |GOOG  |2018-08-31 09:59:00|6330  |1230.5   |1230.02 |
+|20180924222155     |GOOG  |2018-08-31 10:29:00|3391  |1230.1899|1230.085|
++-------------------+------+-------------------+------+---------+--------+
+
+```
+
+### Step 4 (c): Run Presto Queries
+
+Here are the Presto queries for similar Hive and Spark queries. Currently, Hudi does not support Presto queries on realtime views.
+
+```java
+docker exec -it presto-worker-1 presto --server presto-coordinator-1:8090
+presto> show catalogs;
+  Catalog
+-----------
+ hive
+ jmx
+ localfile
+ system
+(4 rows)
+
+Query 20190817_134851_00000_j8rcz, FINISHED, 1 node
+Splits: 19 total, 19 done (100.00%)
+0:04 [0 rows, 0B] [0 rows/s, 0B/s]
+
+presto> use hive.default;
+USE
+presto:default> show tables;
+       Table
+--------------------
+ stock_ticks_cow
+ stock_ticks_mor
+ stock_ticks_mor_rt
+(3 rows)
+
+Query 20190822_181000_00001_segyw, FINISHED, 2 nodes
+Splits: 19 total, 19 done (100.00%)
+0:05 [3 rows, 99B] [0 rows/s, 18B/s]
+
+
+# COPY-ON-WRITE Queries:
+=========================
+
+
+presto:default> select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
+ symbol |        _col1
+--------+---------------------
+ GOOG   | 2018-08-31 10:29:00
+(1 row)
+
+Query 20190822_181011_00002_segyw, FINISHED, 1 node
+Splits: 49 total, 49 done (100.00%)
+0:12 [197 rows, 613B] [16 rows/s, 50B/s]
+
+presto:default> select "_hoodie_commit_time", symbol, ts, volume, open, close from stock_ticks_cow where symbol = 'GOOG';
+ _hoodie_commit_time | symbol |         ts          | volume |   open    |  close
+---------------------+--------+---------------------+--------+-----------+----------
+ 20190822180221      | GOOG   | 2018-08-31 09:59:00 |   6330 |    1230.5 |  1230.02
+ 20190822180221      | GOOG   | 2018-08-31 10:29:00 |   3391 | 1230.1899 | 1230.085
+(2 rows)
+
+Query 20190822_181141_00003_segyw, FINISHED, 1 node
+Splits: 17 total, 17 done (100.00%)
+0:02 [197 rows, 613B] [109 rows/s, 341B/s]
+
+
+# Merge-On-Read Queries:
+==========================
+
+Lets run similar queries against M-O-R dataset. 
+
+# Run against ReadOptimized View. Notice that the latest timestamp is 10:29
+presto:default> select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG';
+ symbol |        _col1
+--------+---------------------
+ GOOG   | 2018-08-31 10:29:00
+(1 row)
+
+Query 20190822_181158_00004_segyw, FINISHED, 1 node
+Splits: 49 total, 49 done (100.00%)
+0:02 [197 rows, 613B] [110 rows/s, 343B/s]
+
+
+presto:default>  select "_hoodie_commit_time", symbol, ts, volume, open, close  from stock_ticks_mor where  symbol = 'GOOG';
+ _hoodie_commit_time | symbol |         ts          | volume |   open    |  close
+---------------------+--------+---------------------+--------+-----------+----------
+ 20190822180250      | GOOG   | 2018-08-31 09:59:00 |   6330 |    1230.5 |  1230.02
+ 20190822180250      | GOOG   | 2018-08-31 10:29:00 |   3391 | 1230.1899 | 1230.085
+(2 rows)
+
+Query 20190822_181256_00006_segyw, FINISHED, 1 node
+Splits: 17 total, 17 done (100.00%)
+0:02 [197 rows, 613B] [92 rows/s, 286B/s]
+
+presto:default> exit
+```
+
+### Step 5: Upload second batch to Kafka and run DeltaStreamer to ingest
+
+Upload the second batch of data and ingest this batch using delta-streamer. As this batch does not bring in any new
+partitions, there is no need to run hive-sync
+
+```java
+cat docker/demo/data/batch_2.json | kafkacat -b kafkabroker -t stock_ticks -P
+
+# Within Docker container, run the ingestion command
+docker exec -it adhoc-2 /bin/bash
+
+# Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_cow dataset in HDFS
+spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts  --target-base-path /user/hive/warehouse/stock_ticks_cow --target-table stock_ticks_cow --props /var/demo/config/kafka-source.properties --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
+
+
+# Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_mor dataset in HDFS
+spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE --storage-type MERGE_ON_READ --source-class org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts  --target-base-path /user/hive/warehouse/stock_ticks_mor --target-table stock_ticks_mor --props /var/demo/config/kafka-source.properties --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider --disable-compaction
+
+exit
+```
+
+With Copy-On-Write table, the second ingestion by DeltaStreamer resulted in a new version of Parquet file getting created.
+See `http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_cow/2018/08/31`
+
+With Merge-On-Read table, the second ingestion merely appended the batch to an unmerged delta (log) file.
+Take a look at the HDFS filesystem to get an idea: `http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_mor/2018/08/31`
+
+### Step 6(a): Run Hive Queries
+
+With Copy-On-Write table, the read-optimized view immediately sees the changes as part of second batch once the batch
+got committed as each ingestion creates newer versions of parquet files.
+
+With Merge-On-Read table, the second ingestion merely appended the batch to an unmerged delta (log) file.
+This is the time, when ReadOptimized and Realtime views will provide different results. ReadOptimized view will still
+return "10:29 am" as it will only read from the Parquet file. Realtime View will do on-the-fly merge and return
+latest committed data which is "10:59 a.m".
+
+```java
+docker exec -it adhoc-2 /bin/bash
+beeline -u jdbc:hive2://hiveserver:10000 --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat --hiveconf hive.stats.autogather=false
+
+# Copy On Write Table:
+
+0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
+WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
++---------+----------------------+--+
+| symbol  |         _c1          |
++---------+----------------------+--+
+| GOOG    | 2018-08-31 10:59:00  |
++---------+----------------------+--+
+1 row selected (1.932 seconds)
+
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG';
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924221953       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924224524       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+
+As you can notice, the above queries now reflect the changes that came as part of ingesting second batch.
+
+
+# Merge On Read Table:
+
+# Read Optimized View
+0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG';
+WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
++---------+----------------------+--+
+| symbol  |         _c1          |
++---------+----------------------+--+
+| GOOG    | 2018-08-31 10:29:00  |
++---------+----------------------+--+
+1 row selected (1.6 seconds)
+
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor where  symbol = 'GOOG';
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924222155       | GOOG    | 2018-08-31 10:29:00  | 3391    | 1230.1899  | 1230.085  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+
+# Realtime View
+0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
+WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
++---------+----------------------+--+
+| symbol  |         _c1          |
++---------+----------------------+--+
+| GOOG    | 2018-08-31 10:59:00  |
++---------+----------------------+--+
+
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG';
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924224537       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+
+exit
+exit
+```
+
+### Step 6(b): Run Spark SQL Queries
+
+Running the same queries in Spark-SQL:
+
+```java
+docker exec -it adhoc-1 /bin/bash
+bash-4.4# $SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client  --driver-memory 1G --master local[2] --executor-memory 3G --num-executors 1  --packages com.databricks:spark-avro_2.11:4.0.0
+
+# Copy On Write Table:
+
+scala> spark.sql("select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG'").show(100, false)
++------+-------------------+
+|symbol|max(ts)            |
++------+-------------------+
+|GOOG  |2018-08-31 10:59:00|
++------+-------------------+
+
+scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG'").show(100, false)
+
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924221953       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924224524       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+
+As you can notice, the above queries now reflect the changes that came as part of ingesting second batch.
+
+
+# Merge On Read Table:
+
+# Read Optimized View
+scala> spark.sql("select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG'").show(100, false)
++---------+----------------------+--+
+| symbol  |         _c1          |
++---------+----------------------+--+
+| GOOG    | 2018-08-31 10:29:00  |
++---------+----------------------+--+
+1 row selected (1.6 seconds)
+
+scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor where  symbol = 'GOOG'").show(100, false)
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924222155       | GOOG    | 2018-08-31 10:29:00  | 3391    | 1230.1899  | 1230.085  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+
+# Realtime View
+scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG'").show(100, false)
++---------+----------------------+--+
+| symbol  |         _c1          |
++---------+----------------------+--+
+| GOOG    | 2018-08-31 10:59:00  |
++---------+----------------------+--+
+
+scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG'").show(100, false)
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924224537       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+
+exit
+exit
+```
+
+### Step 6(c): Run Presto Queries
+
+Running the same queries on Presto for ReadOptimized views. 
+
+
+```java
+docker exec -it presto-worker-1 presto --server presto-coordinator-1:8090
+presto> use hive.default;
+USE
+
+# Copy On Write Table:
+
+presto:default>select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
+ symbol |        _col1
+--------+---------------------
+ GOOG   | 2018-08-31 10:59:00
+(1 row)
+
+Query 20190822_181530_00007_segyw, FINISHED, 1 node
+Splits: 49 total, 49 done (100.00%)
+0:02 [197 rows, 613B] [125 rows/s, 389B/s]
+
+presto:default>select "_hoodie_commit_time", symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG';
+ _hoodie_commit_time | symbol |         ts          | volume |   open    |  close
+---------------------+--------+---------------------+--------+-----------+----------
+ 20190822180221      | GOOG   | 2018-08-31 09:59:00 |   6330 |    1230.5 |  1230.02
+ 20190822181433      | GOOG   | 2018-08-31 10:59:00 |   9021 | 1227.1993 | 1227.215
+(2 rows)
+
+Query 20190822_181545_00008_segyw, FINISHED, 1 node
+Splits: 17 total, 17 done (100.00%)
+0:02 [197 rows, 613B] [106 rows/s, 332B/s]
+
+As you can notice, the above queries now reflect the changes that came as part of ingesting second batch.
+
+
+# Merge On Read Table:
+
+# Read Optimized View
+presto:default> select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG';
+ symbol |        _col1
+--------+---------------------
+ GOOG   | 2018-08-31 10:29:00
+(1 row)
+
+Query 20190822_181602_00009_segyw, FINISHED, 1 node
+Splits: 49 total, 49 done (100.00%)
+0:01 [197 rows, 613B] [139 rows/s, 435B/s]
+
+presto:default>select "_hoodie_commit_time", symbol, ts, volume, open, close  from stock_ticks_mor where  symbol = 'GOOG';
+ _hoodie_commit_time | symbol |         ts          | volume |   open    |  close
+---------------------+--------+---------------------+--------+-----------+----------
+ 20190822180250      | GOOG   | 2018-08-31 09:59:00 |   6330 |    1230.5 |  1230.02
+ 20190822180250      | GOOG   | 2018-08-31 10:29:00 |   3391 | 1230.1899 | 1230.085
+(2 rows)
+
+Query 20190822_181615_00010_segyw, FINISHED, 1 node
+Splits: 17 total, 17 done (100.00%)
+0:01 [197 rows, 613B] [154 rows/s, 480B/s]
+
+presto:default> exit
+```
+
+
+### Step 7 : Incremental Query for COPY-ON-WRITE Table
+
+With 2 batches of data ingested, lets showcase the support for incremental queries in Hudi Copy-On-Write datasets
+
+Lets take the same projection query example
+
+```java
+docker exec -it adhoc-2 /bin/bash
+beeline -u jdbc:hive2://hiveserver:10000 --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat --hiveconf hive.stats.autogather=false
+
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG';
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924064621       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924065039       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+```
+
+As you notice from the above queries, there are 2 commits - 20180924064621 and 20180924065039 in timeline order.
+When you follow the steps, you will be getting different timestamps for commits. Substitute them
+in place of the above timestamps.
+
+To show the effects of incremental-query, let us assume that a reader has already seen the changes as part of
+ingesting first batch. Now, for the reader to see effect of the second batch, he/she has to keep the start timestamp to
+the commit time of the first batch (20180924064621) and run incremental query
+
+Hudi incremental mode provides efficient scanning for incremental queries by filtering out files that do not have any
+candidate rows using hudi-managed metadata.
+
+```java
+docker exec -it adhoc-2 /bin/bash
+beeline -u jdbc:hive2://hiveserver:10000 --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat --hiveconf hive.stats.autogather=false
+0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_cow.consume.mode=INCREMENTAL;
+No rows affected (0.009 seconds)
+0: jdbc:hive2://hiveserver:10000>  set hoodie.stock_ticks_cow.consume.max.commits=3;
+No rows affected (0.009 seconds)
+0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_cow.consume.start.timestamp=20180924064621;
+```
+
+With the above setting, file-ids that do not have any updates from the commit 20180924065039 is filtered out without scanning.
+Here is the incremental query :
+
+```java
+0: jdbc:hive2://hiveserver:10000>
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064621';
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924065039       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+1 row selected (0.83 seconds)
+0: jdbc:hive2://hiveserver:10000>
+```
+
+### Incremental Query with Spark SQL:
+```java
+docker exec -it adhoc-1 /bin/bash
+bash-4.4# $SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client  --driver-memory 1G --master local[2] --executor-memory 3G --num-executors 1  --packages com.databricks:spark-avro_2.11:4.0.0
+Welcome to
+      ____              __
+     / __/__  ___ _____/ /__
+    _\ \/ _ \/ _ `/ __/  '_/
+   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
+      /_/
+
+Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
+Type in expressions to have them evaluated.
+Type :help for more information.
+
+scala> import org.apache.hudi.DataSourceReadOptions
+import org.apache.hudi.DataSourceReadOptions
+
+# In the below query, 20180925045257 is the first commit's timestamp
+scala> val hoodieIncViewDF =  spark.read.format("org.apache.hudi").option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY, DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL).option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "20180924064621").load("/user/hive/warehouse/stock_ticks_cow")
+SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
+SLF4J: Defaulting to no-operation (NOP) logger implementation
+SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
+hoodieIncViewDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 15 more fields]
+
+scala> hoodieIncViewDF.registerTempTable("stock_ticks_cow_incr_tmp1")
+warning: there was one deprecation warning; re-run with -deprecation for details
+
+scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow_incr_tmp1 where  symbol = 'GOOG'").show(100, false);
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924065039       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+
+```
+
+
+### Step 8: Schedule and Run Compaction for Merge-On-Read dataset
+
+Lets schedule and run a compaction to create a new version of columnar  file so that read-optimized readers will see fresher data.
+Again, You can use Hudi CLI to manually schedule and run compaction
+
+```java
+docker exec -it adhoc-1 /bin/bash
+root@adhoc-1:/opt#   /var/hoodie/ws/hudi-cli/hudi-cli.sh
+============================================
+*                                          *
+*     _    _           _   _               *
+*    | |  | |         | | (_)              *
+*    | |__| |       __| |  -               *
+*    |  __  ||   | / _` | ||               *
+*    | |  | ||   || (_| | ||               *
+*    |_|  |_|\___/ \____/ ||               *
+*                                          *
+============================================
+
+Welcome to Hoodie CLI. Please type help if you are looking for help.
+hudi->connect --path /user/hive/warehouse/stock_ticks_mor
+18/09/24 06:59:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
+18/09/24 06:59:35 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from /user/hive/warehouse/stock_ticks_mor
+18/09/24 06:59:35 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1261652683_11, ugi=root (auth:SIMPLE)]]]
+18/09/24 06:59:35 INFO table.HoodieTableConfig: Loading dataset properties from /user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
+18/09/24 06:59:36 INFO table.HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ from /user/hive/warehouse/stock_ticks_mor
+Metadata for table stock_ticks_mor loaded
+
+# Ensure no compactions are present
+
+hoodie:stock_ticks_mor->compactions show all
+18/09/24 06:59:54 INFO timeline.HoodieActiveTimeline: Loaded instants [[20180924064636__clean__COMPLETED], [20180924064636__deltacommit__COMPLETED], [20180924065057__clean__COMPLETED], [20180924065057__deltacommit__COMPLETED]]
+    ___________________________________________________________________
+    | Compaction Instant Time| State    | Total FileIds to be Compacted|
+    |==================================================================|
+
+
+
+
+# Schedule a compaction. This will use Spark Launcher to schedule compaction
+hoodie:stock_ticks_mor->compaction schedule
+....
+Compaction successfully completed for 20180924070031
+
+# Now refresh and check again. You will see that there is a new compaction requested
+
+hoodie:stock_ticks->connect --path /user/hive/warehouse/stock_ticks_mor
+18/09/24 07:01:16 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from /user/hive/warehouse/stock_ticks_mor
+18/09/24 07:01:16 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1261652683_11, ugi=root (auth:SIMPLE)]]]
+18/09/24 07:01:16 INFO table.HoodieTableConfig: Loading dataset properties from /user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
+18/09/24 07:01:16 INFO table.HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ from /user/hive/warehouse/stock_ticks_mor
+Metadata for table stock_ticks_mor loaded
+
+
+
+hoodie:stock_ticks_mor->compactions show all
+18/09/24 06:34:12 INFO timeline.HoodieActiveTimeline: Loaded instants [[20180924041125__clean__COMPLETED], [20180924041125__deltacommit__COMPLETED], [20180924042735__clean__COMPLETED], [20180924042735__deltacommit__COMPLETED], [==>20180924063245__compaction__REQUESTED]]
+    ___________________________________________________________________
+    | Compaction Instant Time| State    | Total FileIds to be Compacted|
+    |==================================================================|
+    | 20180924070031         | REQUESTED| 1                            |
+
+
+
+
+# Execute the compaction. The compaction instant value passed below must be the one displayed in the above "compactions show all" query
+hoodie:stock_ticks_mor->compaction run --compactionInstant  20180924070031 --parallelism 2 --sparkMemory 1G  --schemaFilePath /var/demo/config/schema.avsc --retry 1  
+....
+Compaction successfully completed for 20180924070031
+
+
+## Now check if compaction is completed
+
+hoodie:stock_ticks_mor->connect --path /user/hive/warehouse/stock_ticks_mor
+18/09/24 07:03:00 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from /user/hive/warehouse/stock_ticks_mor
+18/09/24 07:03:00 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1261652683_11, ugi=root (auth:SIMPLE)]]]
+18/09/24 07:03:00 INFO table.HoodieTableConfig: Loading dataset properties from /user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
+18/09/24 07:03:00 INFO table.HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ from /user/hive/warehouse/stock_ticks_mor
+Metadata for table stock_ticks_mor loaded
+
+
+
+hoodie:stock_ticks->compactions show all
+18/09/24 07:03:15 INFO timeline.HoodieActiveTimeline: Loaded instants [[20180924064636__clean__COMPLETED], [20180924064636__deltacommit__COMPLETED], [20180924065057__clean__COMPLETED], [20180924065057__deltacommit__COMPLETED], [20180924070031__commit__COMPLETED]]
+    ___________________________________________________________________
+    | Compaction Instant Time| State    | Total FileIds to be Compacted|
+    |==================================================================|
+    | 20180924070031         | COMPLETED| 1                            |
+
+```
+
+### Step 9: Run Hive Queries including incremental queries
+
+You will see that both ReadOptimized and Realtime Views will show the latest committed data.
+Lets also run the incremental query for MOR table.
+From looking at the below query output, it will be clear that the fist commit time for the MOR table is 20180924064636
+and the second commit time is 20180924070031
+
+```java
+docker exec -it adhoc-2 /bin/bash
+beeline -u jdbc:hive2://hiveserver:10000 --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat --hiveconf hive.stats.autogather=false
+
+# Read Optimized View
+0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG';
+WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
++---------+----------------------+--+
+| symbol  |         _c1          |
++---------+----------------------+--+
+| GOOG    | 2018-08-31 10:59:00  |
++---------+----------------------+--+
+1 row selected (1.6 seconds)
+
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor where  symbol = 'GOOG';
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924064636       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924070031       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+
+# Realtime View
+0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
+WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
++---------+----------------------+--+
+| symbol  |         _c1          |
++---------+----------------------+--+
+| GOOG    | 2018-08-31 10:59:00  |
++---------+----------------------+--+
+
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG';
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924064636       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924070031       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+
+# Incremental View:
+
+0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_mor.consume.mode=INCREMENTAL;
+No rows affected (0.008 seconds)
+# Max-Commits covers both second batch and compaction commit
+0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_mor.consume.max.commits=3;
+No rows affected (0.007 seconds)
+0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_mor.consume.start.timestamp=20180924064636;
+No rows affected (0.013 seconds)
+# Query:
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor where  symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064636';
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924070031       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+exit
+exit
+```
+
+### Step 10: Read Optimized and Realtime Views for MOR with Spark-SQL after compaction
+
+```java
+docker exec -it adhoc-1 /bin/bash
+bash-4.4# $SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client  --driver-memory 1G --master local[2] --executor-memory 3G --num-executors 1  --packages com.databricks:spark-avro_2.11:4.0.0
+
+# Read Optimized View
+scala> spark.sql("select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG'").show(100, false)
++---------+----------------------+--+
+| symbol  |         _c1          |
++---------+----------------------+--+
+| GOOG    | 2018-08-31 10:59:00  |
++---------+----------------------+--+
+1 row selected (1.6 seconds)
+
+scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor where  symbol = 'GOOG'").show(100, false)
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924064636       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924070031       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+
+# Realtime View
+scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG'").show(100, false)
++---------+----------------------+--+
+| symbol  |         _c1          |
++---------+----------------------+--+
+| GOOG    | 2018-08-31 10:59:00  |
++---------+----------------------+--+
+
+scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG'").show(100, false)
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924064636       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924070031       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+```
+
+### Step 11:  Presto queries over Read Optimized View on MOR dataset after compaction
+
+```java
+docker exec -it presto-worker-1 presto --server presto-coordinator-1:8090
+presto> use hive.default;
+USE
+
+# Read Optimized View
+resto:default> select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG';
+  symbol |        _col1
+--------+---------------------
+ GOOG   | 2018-08-31 10:59:00
+(1 row)
+
+Query 20190822_182319_00011_segyw, FINISHED, 1 node
+Splits: 49 total, 49 done (100.00%)
+0:01 [197 rows, 613B] [133 rows/s, 414B/s]
+
+presto:default> select "_hoodie_commit_time", symbol, ts, volume, open, close  from stock_ticks_mor where  symbol = 'GOOG';
+ _hoodie_commit_time | symbol |         ts          | volume |   open    |  close
+---------------------+--------+---------------------+--------+-----------+----------
+ 20190822180250      | GOOG   | 2018-08-31 09:59:00 |   6330 |    1230.5 |  1230.02
+ 20190822181944      | GOOG   | 2018-08-31 10:59:00 |   9021 | 1227.1993 | 1227.215
+(2 rows)
+
+Query 20190822_182333_00012_segyw, FINISHED, 1 node
+Splits: 17 total, 17 done (100.00%)
+0:02 [197 rows, 613B] [98 rows/s, 307B/s]
+
+presto:default>
+
+```
+
+
+This brings the demo to an end.
+
+## Testing Hudi in Local Docker environment
+
+You can bring up a hadoop docker environment containing Hadoop, Hive and Spark services with support for hudi.
+```java
+$ mvn pre-integration-test -DskipTests
+```
+The above command builds docker images for all the services with
+current Hudi source installed at /var/hoodie/ws and also brings up the services using a compose file. We
+currently use Hadoop (v2.8.4), Hive (v2.3.3) and Spark (v2.3.1) in docker images.
+
+To bring down the containers
+```java
+$ cd hudi-integ-test
+$ mvn docker-compose:down
+```
+
+If you want to bring up the docker containers, use
+```java
+$ cd hudi-integ-test
+$  mvn docker-compose:up -DdetachedMode=true
+```
+
+Hudi is a library that is operated in a broader data analytics/ingestion environment
+involving Hadoop, Hive and Spark. Interoperability with all these systems is a key objective for us. We are
+actively adding integration-tests under __hudi-integ-test/src/test/java__ that makes use of this
+docker environment (See __hudi-integ-test/src/test/java/org/apache/hudi/integ/ITTestHoodieSanity.java__ )
+
+
+### Building Local Docker Containers:
+
+The docker images required for demo and running integration test are already in docker-hub. The docker images
+and compose scripts are carefully implemented so that they serve dual-purpose
+
+1. The docker images have inbuilt hudi jar files with environment variable pointing to those jars (HUDI_HADOOP_BUNDLE, ...)
+2. For running integration-tests, we need the jars generated locally to be used for running services within docker. The
+   docker-compose scripts (see `docker/compose/docker-compose_hadoop284_hive233_spark231.yml`) ensures local jars override
+   inbuilt jars by mounting local HUDI workspace over the docker location
+
+This helps avoid maintaining separate docker images and avoids the costly step of building HUDI docker images locally.
+But if users want to test hudi from locations with lower network bandwidth, they can still build local images
+run the script
+`docker/build_local_docker_images.sh` to build local docker images before running `docker/setup_demo.sh`
+
+Here are the commands:
+
+```java
+cd docker
+./build_local_docker_images.sh
+.....
+
+[INFO] Reactor Summary:
+[INFO]
+[INFO] hoodie ............................................. SUCCESS [  1.709 s]
+[INFO] hudi-common ...................................... SUCCESS [  9.015 s]
+[INFO] hudi-hadoop-mr ................................... SUCCESS [  1.108 s]
+[INFO] hudi-client ...................................... SUCCESS [  4.409 s]
+[INFO] hudi-hive ........................................ SUCCESS [  0.976 s]
+[INFO] hudi-spark ....................................... SUCCESS [ 26.522 s]
+[INFO] hudi-utilities ................................... SUCCESS [ 16.256 s]
+[INFO] hudi-cli ......................................... SUCCESS [ 11.341 s]
+[INFO] hudi-hadoop-mr-bundle ............................ SUCCESS [  1.893 s]
+[INFO] hudi-hive-bundle ................................. SUCCESS [ 14.099 s]
+[INFO] hudi-spark-bundle ................................ SUCCESS [ 58.252 s]
+[INFO] hudi-hadoop-docker ............................... SUCCESS [  0.612 s]
+[INFO] hudi-hadoop-base-docker .......................... SUCCESS [04:04 min]
+[INFO] hudi-hadoop-namenode-docker ...................... SUCCESS [  6.142 s]
+[INFO] hudi-hadoop-datanode-docker ...................... SUCCESS [  7.763 s]
+[INFO] hudi-hadoop-history-docker ....................... SUCCESS [  5.922 s]
+[INFO] hudi-hadoop-hive-docker .......................... SUCCESS [ 56.152 s]
+[INFO] hudi-hadoop-sparkbase-docker ..................... SUCCESS [01:18 min]
+[INFO] hudi-hadoop-sparkmaster-docker ................... SUCCESS [  2.964 s]
+[INFO] hudi-hadoop-sparkworker-docker ................... SUCCESS [  3.032 s]
+[INFO] hudi-hadoop-sparkadhoc-docker .................... SUCCESS [  2.764 s]
+[INFO] hudi-integ-test .................................. SUCCESS [  1.785 s]
+[INFO] ------------------------------------------------------------------------
+[INFO] BUILD SUCCESS
+[INFO] ------------------------------------------------------------------------
+[INFO] Total time: 09:15 min
+[INFO] Finished at: 2018-09-10T17:47:37-07:00
+[INFO] Final Memory: 236M/1848M
+[INFO] ------------------------------------------------------------------------
+```
diff --git a/docs/_docs/0.5.2/0_4_docker_demo.md b/docs/_docs/0.5.2/0_4_docker_demo.md
new file mode 100644
index 0000000..099d7fe
--- /dev/null
+++ b/docs/_docs/0.5.2/0_4_docker_demo.md
@@ -0,0 +1,1232 @@
+---
+version: 0.5.2
+title: Docker Demo
+keywords: hudi, docker, demo
+permalink: /docs/0.5.2-docker_demo.html
+toc: true
+last_modified_at: 2019-12-30T15:59:57-04:00
+---
+
+## A Demo using docker containers
+
+Lets use a real world example to see how hudi works end to end. For this purpose, a self contained
+data infrastructure is brought up in a local docker cluster within your computer.
+
+The steps have been tested on a Mac laptop
+
+### Prerequisites
+
+  * Docker Setup :  For Mac, Please follow the steps as defined in [https://docs.docker.com/v17.12/docker-for-mac/install/]. For running Spark-SQL queries, please ensure atleast 6 GB and 4 CPUs are allocated to Docker (See Docker -> Preferences -> Advanced). Otherwise, spark-SQL queries could be killed because of memory issues.
+  * kafkacat : A command-line utility to publish/consume from kafka topics. Use `brew install kafkacat` to install kafkacat
+  * /etc/hosts : The demo references many services running in container by the hostname. Add the following settings to /etc/hosts
+
+    ```java
+    127.0.0.1 adhoc-1
+    127.0.0.1 adhoc-2
+    127.0.0.1 namenode
+    127.0.0.1 datanode1
+    127.0.0.1 hiveserver
+    127.0.0.1 hivemetastore
+    127.0.0.1 kafkabroker
+    127.0.0.1 sparkmaster
+    127.0.0.1 zookeeper
+    ```
+
+Also, this has not been tested on some environments like Docker on Windows.
+
+
+## Setting up Docker Cluster
+
+
+### Build Hudi
+
+The first step is to build hudi. **Note** This step builds hudi on default supported scala version - 2.11.
+```java
+cd <HUDI_WORKSPACE>
+mvn package -DskipTests
+```
+
+### Bringing up Demo Cluster
+
+The next step is to run the docker compose script and setup configs for bringing up the cluster.
+This should pull the docker images from docker hub and setup docker cluster.
+
+```java
+cd docker
+./setup_demo.sh
+....
+....
+....
+Stopping spark-worker-1            ... done
+Stopping hiveserver                ... done
+Stopping hivemetastore             ... done
+Stopping historyserver             ... done
+.......
+......
+Creating network "compose_default" with the default driver
+Creating volume "compose_namenode" with default driver
+Creating volume "compose_historyserver" with default driver
+Creating volume "compose_hive-metastore-postgresql" with default driver
+Creating hive-metastore-postgresql ... done
+Creating namenode                  ... done
+Creating zookeeper                 ... done
+Creating kafkabroker               ... done
+Creating hivemetastore             ... done
+Creating historyserver             ... done
+Creating hiveserver                ... done
+Creating datanode1                 ... done
+Creating presto-coordinator-1      ... done
+Creating sparkmaster               ... done
+Creating presto-worker-1           ... done
+Creating adhoc-1                   ... done
+Creating adhoc-2                   ... done
+Creating spark-worker-1            ... done
+Copying spark default config and setting up configs
+Copying spark default config and setting up configs
+Copying spark default config and setting up configs
+$ docker ps
+```
+
+At this point, the docker cluster will be up and running. The demo cluster brings up the following services
+
+   * HDFS Services (NameNode, DataNode)
+   * Spark Master and Worker
+   * Hive Services (Metastore, HiveServer2 along with PostgresDB)
+   * Kafka Broker and a Zookeeper Node (Kakfa will be used as upstream source for the demo)
+   * Adhoc containers to run Hudi/Hive CLI commands
+
+## Demo
+
+Stock Tracker data will be used to showcase different Hudi query types and the effects of Compaction.
+
+Take a look at the directory `docker/demo/data`. There are 2 batches of stock data - each at 1 minute granularity.
+The first batch contains stocker tracker data for some stock symbols during the first hour of trading window
+(9:30 a.m to 10:30 a.m). The second batch contains tracker data for next 30 mins (10:30 - 11 a.m). Hudi will
+be used to ingest these batches to a table which will contain the latest stock tracker data at hour level granularity.
+The batches are windowed intentionally so that the second batch contains updates to some of the rows in the first batch.
+
+### Step 1 : Publish the first batch to Kafka
+
+Upload the first batch to Kafka topic 'stock ticks' `cat docker/demo/data/batch_1.json | kafkacat -b kafkabroker -t stock_ticks -P`
+
+To check if the new topic shows up, use
+```java
+kafkacat -b kafkabroker -L -J | jq .
+{
+  "originating_broker": {
+    "id": 1001,
+    "name": "kafkabroker:9092/1001"
+  },
+  "query": {
+    "topic": "*"
+  },
+  "brokers": [
+    {
+      "id": 1001,
+      "name": "kafkabroker:9092"
+    }
+  ],
+  "topics": [
+    {
+      "topic": "stock_ticks",
+      "partitions": [
+        {
+          "partition": 0,
+          "leader": 1001,
+          "replicas": [
+            {
+              "id": 1001
+            }
+          ],
+          "isrs": [
+            {
+              "id": 1001
+            }
+          ]
+        }
+      ]
+    }
+  ]
+}
+```
+
+### Step 2: Incrementally ingest data from Kafka topic
+
+Hudi comes with a tool named DeltaStreamer. This tool can connect to variety of data sources (including Kafka) to
+pull changes and apply to Hudi table using upsert/insert primitives. Here, we will use the tool to download
+json data from kafka topic and ingest to both COW and MOR tables we initialized in the previous step. This tool
+automatically initializes the tables in the file-system if they do not exist yet.
+
+```java
+docker exec -it adhoc-2 /bin/bash
+
+# Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_cow table in HDFS
+spark-submit \
+  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
+  --table-type COPY_ON_WRITE \
+  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
+  --source-ordering-field ts  \
+  --target-base-path /user/hive/warehouse/stock_ticks_cow \
+  --target-table stock_ticks_cow --props /var/demo/config/kafka-source.properties \
+  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
+
+# Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_mor table in HDFS
+spark-submit \
+  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
+  --table-type MERGE_ON_READ \
+  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
+  --source-ordering-field ts \
+  --target-base-path /user/hive/warehouse/stock_ticks_mor \
+  --target-table stock_ticks_mor \
+  --props /var/demo/config/kafka-source.properties \
+  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
+  --disable-compaction
+
+# As part of the setup (Look at setup_demo.sh), the configs needed for DeltaStreamer is uploaded to HDFS. The configs
+# contain mostly Kafa connectivity settings, the avro-schema to be used for ingesting along with key and partitioning fields.
+
+exit
+```
+
+You can use HDFS web-browser to look at the tables
+`http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_cow`.
+
+You can explore the new partition folder created in the table along with a "commit" / "deltacommit"
+file under .hoodie which signals a successful commit.
+
+There will be a similar setup when you browse the MOR table
+`http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_mor`
+
+
+### Step 3: Sync with Hive
+
+At this step, the tables are available in HDFS. We need to sync with Hive to create new Hive tables and add partitions
+inorder to run Hive queries against those tables.
+
+```java
+docker exec -it adhoc-2 /bin/bash
+
+# THis command takes in HIveServer URL and COW Hudi table location in HDFS and sync the HDFS state to Hive
+/var/hoodie/ws/hudi-hive-sync/run_sync_tool.sh \
+  --jdbc-url jdbc:hive2://hiveserver:10000 \
+  --user hive \
+  --pass hive \
+  --partitioned-by dt \
+  --base-path /user/hive/warehouse/stock_ticks_cow \
+  --database default \
+  --table stock_ticks_cow
+.....
+2020-01-25 19:51:28,953 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(129)) - Sync complete for stock_ticks_cow
+.....
+
+# Now run hive-sync for the second data-set in HDFS using Merge-On-Read (MOR table type)
+/var/hoodie/ws/hudi-hive-sync/run_sync_tool.sh \
+  --jdbc-url jdbc:hive2://hiveserver:10000 \
+  --user hive \
+  --pass hive \
+  --partitioned-by dt \
+  --base-path /user/hive/warehouse/stock_ticks_mor \
+  --database default \
+  --table stock_ticks_mor
+...
+2020-01-25 19:51:51,066 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(129)) - Sync complete for stock_ticks_mor_ro
+...
+2020-01-25 19:51:51,569 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(129)) - Sync complete for stock_ticks_mor_rt
+....
+
+exit
+```
+After executing the above command, you will notice
+
+1. A hive table named `stock_ticks_cow` created which supports Snapshot and Incremental queries on Copy On Write table.
+2. Two new tables `stock_ticks_mor_rt` and `stock_ticks_mor_ro` created for the Merge On Read table. The former
+supports Snapshot and Incremental queries (providing near-real time data) while the later supports ReadOptimized queries.
+
+
+### Step 4 (a): Run Hive Queries
+
+Run a hive query to find the latest timestamp ingested for stock symbol 'GOOG'. You will notice that both snapshot 
+(for both COW and MOR _rt table) and read-optimized queries (for MOR _ro table) give the same value "10:29 a.m" as Hudi create a
+parquet file for the first batch of data.
+
+```java
+docker exec -it adhoc-2 /bin/bash
+beeline -u jdbc:hive2://hiveserver:10000 \
+  --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
+  --hiveconf hive.stats.autogather=false
+
+# List Tables
+0: jdbc:hive2://hiveserver:10000> show tables;
++---------------------+--+
+|      tab_name       |
++---------------------+--+
+| stock_ticks_cow     |
+| stock_ticks_mor_ro  |
+| stock_ticks_mor_rt  |
++---------------------+--+
+3 rows selected (1.199 seconds)
+0: jdbc:hive2://hiveserver:10000>
+
+
+# Look at partitions that were added
+0: jdbc:hive2://hiveserver:10000> show partitions stock_ticks_mor_rt;
++----------------+--+
+|   partition    |
++----------------+--+
+| dt=2018-08-31  |
++----------------+--+
+1 row selected (0.24 seconds)
+
+
+# COPY-ON-WRITE Queries:
+=========================
+
+
+0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
++---------+----------------------+--+
+| symbol  |         _c1          |
++---------+----------------------+--+
+| GOOG    | 2018-08-31 10:29:00  |
++---------+----------------------+--+
+
+Now, run a projection query:
+
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG';
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924221953       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924221953       | GOOG    | 2018-08-31 10:29:00  | 3391    | 1230.1899  | 1230.085  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+
+
+# Merge-On-Read Queries:
+==========================
+
+Lets run similar queries against M-O-R table. Lets look at both 
+ReadOptimized and Snapshot(realtime data) queries supported by M-O-R table
+
+# Run ReadOptimized Query. Notice that the latest timestamp is 10:29
+0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
+WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
++---------+----------------------+--+
+| symbol  |         _c1          |
++---------+----------------------+--+
+| GOOG    | 2018-08-31 10:29:00  |
++---------+----------------------+--+
+1 row selected (6.326 seconds)
+
+
+# Run Snapshot Query. Notice that the latest timestamp is again 10:29
+
+0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
+WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
++---------+----------------------+--+
+| symbol  |         _c1          |
++---------+----------------------+--+
+| GOOG    | 2018-08-31 10:29:00  |
++---------+----------------------+--+
+1 row selected (1.606 seconds)
+
+
+# Run Read Optimized and Snapshot project queries
+
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG';
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924222155       | GOOG    | 2018-08-31 10:29:00  | 3391    | 1230.1899  | 1230.085  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG';
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924222155       | GOOG    | 2018-08-31 10:29:00  | 3391    | 1230.1899  | 1230.085  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+
+exit
+```
+
+### Step 4 (b): Run Spark-SQL Queries
+Hudi support Spark as query processor just like Hive. Here are the same hive queries
+running in spark-sql
+
+```java
+docker exec -it adhoc-1 /bin/bash
+$SPARK_INSTALL/bin/spark-shell \
+  --jars $HUDI_SPARK_BUNDLE \
+  --master local[2] \
+  --driver-class-path $HADOOP_CONF_DIR \
+  --conf spark.sql.hive.convertMetastoreParquet=false \
+  --deploy-mode client \
+  --driver-memory 1G \
+  --executor-memory 3G \
+  --num-executors 1 \
+  --packages org.apache.spark:spark-avro_2.11:2.4.4
+...
+
+Welcome to
+      ____              __
+     / __/__  ___ _____/ /__
+    _\ \/ _ \/ _ `/ __/  '_/
+   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
+      /_/
+
+Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
+Type in expressions to have them evaluated.
+Type :help for more information.
+
+scala> spark.sql("show tables").show(100, false)
++--------+------------------+-----------+
+|database|tableName         |isTemporary|
++--------+------------------+-----------+
+|default |stock_ticks_cow   |false      |
+|default |stock_ticks_mor_ro|false      |
+|default |stock_ticks_mor_rt|false      |
++--------+------------------+-----------+
+
+# Copy-On-Write Table
+
+## Run max timestamp query against COW table
+
+scala> spark.sql("select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG'").show(100, false)
+[Stage 0:>                                                          (0 + 1) / 1]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
+SLF4J: Defaulting to no-operation (NOP) logger implementation
+SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
++------+-------------------+
+|symbol|max(ts)            |
++------+-------------------+
+|GOOG  |2018-08-31 10:29:00|
++------+-------------------+
+
+## Projection Query
+
+scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG'").show(100, false)
++-------------------+------+-------------------+------+---------+--------+
+|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
++-------------------+------+-------------------+------+---------+--------+
+|20180924221953     |GOOG  |2018-08-31 09:59:00|6330  |1230.5   |1230.02 |
+|20180924221953     |GOOG  |2018-08-31 10:29:00|3391  |1230.1899|1230.085|
++-------------------+------+-------------------+------+---------+--------+
+
+# Merge-On-Read Queries:
+==========================
+
+Lets run similar queries against M-O-R table. Lets look at both
+ReadOptimized and Snapshot queries supported by M-O-R table
+
+# Run ReadOptimized Query. Notice that the latest timestamp is 10:29
+scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG'").show(100, false)
++------+-------------------+
+|symbol|max(ts)            |
++------+-------------------+
+|GOOG  |2018-08-31 10:29:00|
++------+-------------------+
+
+
+# Run Snapshot Query. Notice that the latest timestamp is again 10:29
+
+scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG'").show(100, false)
++------+-------------------+
+|symbol|max(ts)            |
++------+-------------------+
+|GOOG  |2018-08-31 10:29:00|
++------+-------------------+
+
+# Run Read Optimized and Snapshot project queries
+
+scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG'").show(100, false)
++-------------------+------+-------------------+------+---------+--------+
+|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
++-------------------+------+-------------------+------+---------+--------+
+|20180924222155     |GOOG  |2018-08-31 09:59:00|6330  |1230.5   |1230.02 |
+|20180924222155     |GOOG  |2018-08-31 10:29:00|3391  |1230.1899|1230.085|
++-------------------+------+-------------------+------+---------+--------+
+
+scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG'").show(100, false)
++-------------------+------+-------------------+------+---------+--------+
+|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
++-------------------+------+-------------------+------+---------+--------+
+|20180924222155     |GOOG  |2018-08-31 09:59:00|6330  |1230.5   |1230.02 |
+|20180924222155     |GOOG  |2018-08-31 10:29:00|3391  |1230.1899|1230.085|
++-------------------+------+-------------------+------+---------+--------+
+```
+
+### Step 4 (c): Run Presto Queries
+
+Here are the Presto queries for similar Hive and Spark queries. Currently, Presto does not support snapshot or incremental queries on Hudi tables.
+
+```java
+docker exec -it presto-worker-1 presto --server presto-coordinator-1:8090
+presto> show catalogs;
+  Catalog
+-----------
+ hive
+ jmx
+ localfile
+ system
+(4 rows)
+
+Query 20190817_134851_00000_j8rcz, FINISHED, 1 node
+Splits: 19 total, 19 done (100.00%)
+0:04 [0 rows, 0B] [0 rows/s, 0B/s]
+
+presto> use hive.default;
+USE
+presto:default> show tables;
+       Table
+--------------------
+ stock_ticks_cow
+ stock_ticks_mor_ro
+ stock_ticks_mor_rt
+(3 rows)
+
+Query 20190822_181000_00001_segyw, FINISHED, 2 nodes
+Splits: 19 total, 19 done (100.00%)
+0:05 [3 rows, 99B] [0 rows/s, 18B/s]
+
+
+# COPY-ON-WRITE Queries:
+=========================
+
+
+presto:default> select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
+ symbol |        _col1
+--------+---------------------
+ GOOG   | 2018-08-31 10:29:00
+(1 row)
+
+Query 20190822_181011_00002_segyw, FINISHED, 1 node
+Splits: 49 total, 49 done (100.00%)
+0:12 [197 rows, 613B] [16 rows/s, 50B/s]
+
+presto:default> select "_hoodie_commit_time", symbol, ts, volume, open, close from stock_ticks_cow where symbol = 'GOOG';
+ _hoodie_commit_time | symbol |         ts          | volume |   open    |  close
+---------------------+--------+---------------------+--------+-----------+----------
+ 20190822180221      | GOOG   | 2018-08-31 09:59:00 |   6330 |    1230.5 |  1230.02
+ 20190822180221      | GOOG   | 2018-08-31 10:29:00 |   3391 | 1230.1899 | 1230.085
+(2 rows)
+
+Query 20190822_181141_00003_segyw, FINISHED, 1 node
+Splits: 17 total, 17 done (100.00%)
+0:02 [197 rows, 613B] [109 rows/s, 341B/s]
+
+
+# Merge-On-Read Queries:
+==========================
+
+Lets run similar queries against M-O-R table. 
+
+# Run ReadOptimized Query. Notice that the latest timestamp is 10:29
+    presto:default> select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
+ symbol |        _col1
+--------+---------------------
+ GOOG   | 2018-08-31 10:29:00
+(1 row)
+
+Query 20190822_181158_00004_segyw, FINISHED, 1 node
+Splits: 49 total, 49 done (100.00%)
+0:02 [197 rows, 613B] [110 rows/s, 343B/s]
+
+
+presto:default>  select "_hoodie_commit_time", symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG';
+ _hoodie_commit_time | symbol |         ts          | volume |   open    |  close
+---------------------+--------+---------------------+--------+-----------+----------
+ 20190822180250      | GOOG   | 2018-08-31 09:59:00 |   6330 |    1230.5 |  1230.02
+ 20190822180250      | GOOG   | 2018-08-31 10:29:00 |   3391 | 1230.1899 | 1230.085
+(2 rows)
+
+Query 20190822_181256_00006_segyw, FINISHED, 1 node
+Splits: 17 total, 17 done (100.00%)
+0:02 [197 rows, 613B] [92 rows/s, 286B/s]
+
+presto:default> exit
+```
+
+### Step 5: Upload second batch to Kafka and run DeltaStreamer to ingest
+
+Upload the second batch of data and ingest this batch using delta-streamer. As this batch does not bring in any new
+partitions, there is no need to run hive-sync
+
+```java
+cat docker/demo/data/batch_2.json | kafkacat -b kafkabroker -t stock_ticks -P
+
+# Within Docker container, run the ingestion command
+docker exec -it adhoc-2 /bin/bash
+
+# Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_cow table in HDFS
+spark-submit \
+  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
+  --table-type COPY_ON_WRITE \
+  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
+  --source-ordering-field ts \
+  --target-base-path /user/hive/warehouse/stock_ticks_cow \
+  --target-table stock_ticks_cow \
+  --props /var/demo/config/kafka-source.properties \
+  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
+
+# Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_mor table in HDFS
+spark-submit \
+  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
+  --table-type MERGE_ON_READ \
+  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
+  --source-ordering-field ts \
+  --target-base-path /user/hive/warehouse/stock_ticks_mor \
+  --target-table stock_ticks_mor \
+  --props /var/demo/config/kafka-source.properties \
+  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
+  --disable-compaction
+
+exit
+```
+
+With Copy-On-Write table, the second ingestion by DeltaStreamer resulted in a new version of Parquet file getting created.
+See `http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_cow/2018/08/31`
+
+With Merge-On-Read table, the second ingestion merely appended the batch to an unmerged delta (log) file.
+Take a look at the HDFS filesystem to get an idea: `http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_mor/2018/08/31`
+
+### Step 6 (a): Run Hive Queries
+
+With Copy-On-Write table, the Snapshot query immediately sees the changes as part of second batch once the batch
+got committed as each ingestion creates newer versions of parquet files.
+
+With Merge-On-Read table, the second ingestion merely appended the batch to an unmerged delta (log) file.
+This is the time, when ReadOptimized and Snapshot queries will provide different results. ReadOptimized query will still
+return "10:29 am" as it will only read from the Parquet file. Snapshot query will do on-the-fly merge and return
+latest committed data which is "10:59 a.m".
+
+```java
+docker exec -it adhoc-2 /bin/bash
+beeline -u jdbc:hive2://hiveserver:10000 \
+  --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
+  --hiveconf hive.stats.autogather=false
+
+# Copy On Write Table:
+
+0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
+WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
++---------+----------------------+--+
+| symbol  |         _c1          |
++---------+----------------------+--+
+| GOOG    | 2018-08-31 10:59:00  |
++---------+----------------------+--+
+1 row selected (1.932 seconds)
+
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG';
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924221953       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924224524       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+
+As you can notice, the above queries now reflect the changes that came as part of ingesting second batch.
+
+
+# Merge On Read Table:
+
+# Read Optimized Query
+0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
+WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
++---------+----------------------+--+
+| symbol  |         _c1          |
++---------+----------------------+--+
+| GOOG    | 2018-08-31 10:29:00  |
++---------+----------------------+--+
+1 row selected (1.6 seconds)
+
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG';
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924222155       | GOOG    | 2018-08-31 10:29:00  | 3391    | 1230.1899  | 1230.085  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+
+# Snapshot Query
+0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
+WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
++---------+----------------------+--+
+| symbol  |         _c1          |
++---------+----------------------+--+
+| GOOG    | 2018-08-31 10:59:00  |
++---------+----------------------+--+
+
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG';
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924224537       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+
+exit
+```
+
+### Step 6 (b): Run Spark SQL Queries
+
+Running the same queries in Spark-SQL:
+
+```java
+docker exec -it adhoc-1 /bin/bash
+$SPARK_INSTALL/bin/spark-shell \
+  --jars $HUDI_SPARK_BUNDLE \
+  --driver-class-path $HADOOP_CONF_DIR \
+  --conf spark.sql.hive.convertMetastoreParquet=false \
+  --deploy-mode client \
+  --driver-memory 1G \
+  --master local[2] \
+  --executor-memory 3G \
+  --num-executors 1 \
+  --packages org.apache.spark:spark-avro_2.11:2.4.4
+
+# Copy On Write Table:
+
+scala> spark.sql("select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG'").show(100, false)
++------+-------------------+
+|symbol|max(ts)            |
++------+-------------------+
+|GOOG  |2018-08-31 10:59:00|
++------+-------------------+
+
+scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG'").show(100, false)
+
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924221953       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924224524       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+
+As you can notice, the above queries now reflect the changes that came as part of ingesting second batch.
+
+
+# Merge On Read Table:
+
+# Read Optimized Query
+scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG'").show(100, false)
++---------+----------------------+
+| symbol  |         _c1          |
++---------+----------------------+
+| GOOG    | 2018-08-31 10:29:00  |
++---------+----------------------+
+1 row selected (1.6 seconds)
+
+scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG'").show(100, false)
++----------------------+---------+----------------------+---------+------------+-----------+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+
+| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924222155       | GOOG    | 2018-08-31 10:29:00  | 3391    | 1230.1899  | 1230.085  |
++----------------------+---------+----------------------+---------+------------+-----------+
+
+# Snapshot Query
+scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG'").show(100, false)
++---------+----------------------+
+| symbol  |         _c1          |
++---------+----------------------+
+| GOOG    | 2018-08-31 10:59:00  |
++---------+----------------------+
+
+scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG'").show(100, false)
++----------------------+---------+----------------------+---------+------------+-----------+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+
+| 20180924222155       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924224537       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
++----------------------+---------+----------------------+---------+------------+-----------+
+
+exit
+```
+
+### Step 6 (c): Run Presto Queries
+
+Running the same queries on Presto for ReadOptimized queries. 
+
+```java
+docker exec -it presto-worker-1 presto --server presto-coordinator-1:8090
+presto> use hive.default;
+USE
+
+# Copy On Write Table:
+
+presto:default>select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
+ symbol |        _col1
+--------+---------------------
+ GOOG   | 2018-08-31 10:59:00
+(1 row)
+
+Query 20190822_181530_00007_segyw, FINISHED, 1 node
+Splits: 49 total, 49 done (100.00%)
+0:02 [197 rows, 613B] [125 rows/s, 389B/s]
+
+presto:default>select "_hoodie_commit_time", symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG';
+ _hoodie_commit_time | symbol |         ts          | volume |   open    |  close
+---------------------+--------+---------------------+--------+-----------+----------
+ 20190822180221      | GOOG   | 2018-08-31 09:59:00 |   6330 |    1230.5 |  1230.02
+ 20190822181433      | GOOG   | 2018-08-31 10:59:00 |   9021 | 1227.1993 | 1227.215
+(2 rows)
+
+Query 20190822_181545_00008_segyw, FINISHED, 1 node
+Splits: 17 total, 17 done (100.00%)
+0:02 [197 rows, 613B] [106 rows/s, 332B/s]
+
+As you can notice, the above queries now reflect the changes that came as part of ingesting second batch.
+
+
+# Merge On Read Table:
+
+# Read Optimized Query
+presto:default> select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
+ symbol |        _col1
+--------+---------------------
+ GOOG   | 2018-08-31 10:29:00
+(1 row)
+
+Query 20190822_181602_00009_segyw, FINISHED, 1 node
+Splits: 49 total, 49 done (100.00%)
+0:01 [197 rows, 613B] [139 rows/s, 435B/s]
+
+presto:default>select "_hoodie_commit_time", symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG';
+ _hoodie_commit_time | symbol |         ts          | volume |   open    |  close
+---------------------+--------+---------------------+--------+-----------+----------
+ 20190822180250      | GOOG   | 2018-08-31 09:59:00 |   6330 |    1230.5 |  1230.02
+ 20190822180250      | GOOG   | 2018-08-31 10:29:00 |   3391 | 1230.1899 | 1230.085
+(2 rows)
+
+Query 20190822_181615_00010_segyw, FINISHED, 1 node
+Splits: 17 total, 17 done (100.00%)
+0:01 [197 rows, 613B] [154 rows/s, 480B/s]
+
+presto:default> exit
+```
+
+### Step 7 (a): Incremental Query for COPY-ON-WRITE Table
+
+With 2 batches of data ingested, lets showcase the support for incremental queries in Hudi Copy-On-Write tables
+
+Lets take the same projection query example
+
+```java
+docker exec -it adhoc-2 /bin/bash
+beeline -u jdbc:hive2://hiveserver:10000 \
+  --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
+  --hiveconf hive.stats.autogather=false
+
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG';
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924064621       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924065039       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+```
+
+As you notice from the above queries, there are 2 commits - 20180924064621 and 20180924065039 in timeline order.
+When you follow the steps, you will be getting different timestamps for commits. Substitute them
+in place of the above timestamps.
+
+To show the effects of incremental-query, let us assume that a reader has already seen the changes as part of
+ingesting first batch. Now, for the reader to see effect of the second batch, he/she has to keep the start timestamp to
+the commit time of the first batch (20180924064621) and run incremental query
+
+Hudi incremental mode provides efficient scanning for incremental queries by filtering out files that do not have any
+candidate rows using hudi-managed metadata.
+
+```java
+docker exec -it adhoc-2 /bin/bash
+beeline -u jdbc:hive2://hiveserver:10000 \
+  --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
+  --hiveconf hive.stats.autogather=false
+
+0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_cow.consume.mode=INCREMENTAL;
+No rows affected (0.009 seconds)
+0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_cow.consume.max.commits=3;
+No rows affected (0.009 seconds)
+0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_cow.consume.start.timestamp=20180924064621;
+```
+
+With the above setting, file-ids that do not have any updates from the commit 20180924065039 is filtered out without scanning.
+Here is the incremental query :
+
+```java
+0: jdbc:hive2://hiveserver:10000>
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064621';
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924065039       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+1 row selected (0.83 seconds)
+0: jdbc:hive2://hiveserver:10000>
+```
+
+### Step 7 (b): Incremental Query with Spark SQL:
+
+```java
+docker exec -it adhoc-1 /bin/bash
+$SPARK_INSTALL/bin/spark-shell \
+  --jars $HUDI_SPARK_BUNDLE \
+  --driver-class-path $HADOOP_CONF_DIR \
+  --conf spark.sql.hive.convertMetastoreParquet=false \
+  --deploy-mode client \
+  --driver-memory 1G \
+  --master local[2] \
+  --executor-memory 3G \
+  --num-executors 1 \
+  --packages org.apache.spark:spark-avro_2.11:2.4.4
+
+Welcome to
+      ____              __
+     / __/__  ___ _____/ /__
+    _\ \/ _ \/ _ `/ __/  '_/
+   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
+      /_/
+
+Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
+Type in expressions to have them evaluated.
+Type :help for more information.
+
+scala> import org.apache.hudi.DataSourceReadOptions
+import org.apache.hudi.DataSourceReadOptions
+
+# In the below query, 20180925045257 is the first commit's timestamp
+scala> val hoodieIncViewDF =  spark.read.format("org.apache.hudi").option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL).option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "20180924064621").load("/user/hive/warehouse/stock_ticks_cow")
+SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
+SLF4J: Defaulting to no-operation (NOP) logger implementation
+SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
+hoodieIncViewDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 15 more fields]
+
+scala> hoodieIncViewDF.registerTempTable("stock_ticks_cow_incr_tmp1")
+warning: there was one deprecation warning; re-run with -deprecation for details
+
+scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow_incr_tmp1 where  symbol = 'GOOG'").show(100, false);
++----------------------+---------+----------------------+---------+------------+-----------+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+
+| 20180924065039       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
++----------------------+---------+----------------------+---------+------------+-----------+
+```
+
+### Step 8: Schedule and Run Compaction for Merge-On-Read table
+
+Lets schedule and run a compaction to create a new version of columnar  file so that read-optimized readers will see fresher data.
+Again, You can use Hudi CLI to manually schedule and run compaction
+
+```java
+docker exec -it adhoc-1 /bin/bash
+root@adhoc-1:/opt# /var/hoodie/ws/hudi-cli/hudi-cli.sh
+...
+Table command getting loaded
+HoodieSplashScreen loaded
+===================================================================
+*         ___                          ___                        *
+*        /\__\          ___           /\  \           ___         *
+*       / /  /         /\__\         /  \  \         /\  \        *
+*      / /__/         / /  /        / /\ \  \        \ \  \       *
+*     /  \  \ ___    / /  /        / /  \ \__\       /  \__\      *
+*    / /\ \  /\__\  / /__/  ___   / /__/ \ |__|     / /\/__/      *
+*    \/  \ \/ /  /  \ \  \ /\__\  \ \  \ / /  /  /\/ /  /         *
+*         \  /  /    \ \  / /  /   \ \  / /  /   \  /__/          *
+*         / /  /      \ \/ /  /     \ \/ /  /     \ \__\          *
+*        / /  /        \  /  /       \  /  /       \/__/          *
+*        \/__/          \/__/         \/__/    Apache Hudi CLI    *
+*                                                                 *
+===================================================================
+
+Welcome to Apache Hudi CLI. Please type help if you are looking for help.
+hudi->connect --path /user/hive/warehouse/stock_ticks_mor
+18/09/24 06:59:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
+18/09/24 06:59:35 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from /user/hive/warehouse/stock_ticks_mor
+18/09/24 06:59:35 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1261652683_11, ugi=root (auth:SIMPLE)]]]
+18/09/24 06:59:35 INFO table.HoodieTableConfig: Loading table properties from /user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
+18/09/24 06:59:36 INFO table.HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1) from /user/hive/warehouse/stock_ticks_mor
+Metadata for table stock_ticks_mor loaded
+hoodie:stock_ticks_mor->compactions show all
+20/02/10 03:41:32 INFO timeline.HoodieActiveTimeline: Loaded instants [[20200210015059__clean__COMPLETED], [20200210015059__deltacommit__COMPLETED], [20200210022758__clean__COMPLETED], [20200210022758__deltacommit__COMPLETED], [==>20200210023843__compaction__REQUESTED]]
+___________________________________________________________________
+| Compaction Instant Time| State    | Total FileIds to be Compacted|
+|==================================================================|
+
+# Schedule a compaction. This will use Spark Launcher to schedule compaction
+hoodie:stock_ticks_mor->compaction schedule
+....
+Compaction successfully completed for 20180924070031
+
+# Now refresh and check again. You will see that there is a new compaction requested
+
+hoodie:stock_ticks->connect --path /user/hive/warehouse/stock_ticks_mor
+18/09/24 07:01:16 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from /user/hive/warehouse/stock_ticks_mor
+18/09/24 07:01:16 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1261652683_11, ugi=root (auth:SIMPLE)]]]
+18/09/24 07:01:16 INFO table.HoodieTableConfig: Loading table properties from /user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
+18/09/24 07:01:16 INFO table.HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1) from /user/hive/warehouse/stock_ticks_mor
+Metadata for table stock_ticks_mor loaded
+
+hoodie:stock_ticks_mor->compactions show all
+18/09/24 06:34:12 INFO timeline.HoodieActiveTimeline: Loaded instants [[20180924041125__clean__COMPLETED], [20180924041125__deltacommit__COMPLETED], [20180924042735__clean__COMPLETED], [20180924042735__deltacommit__COMPLETED], [==>20180924063245__compaction__REQUESTED]]
+___________________________________________________________________
+| Compaction Instant Time| State    | Total FileIds to be Compacted|
+|==================================================================|
+| 20180924070031         | REQUESTED| 1                            |
+
+# Execute the compaction. The compaction instant value passed below must be the one displayed in the above "compactions show all" query
+hoodie:stock_ticks_mor->compaction run --compactionInstant  20180924070031 --parallelism 2 --sparkMemory 1G  --schemaFilePath /var/demo/config/schema.avsc --retry 1  
+....
+Compaction successfully completed for 20180924070031
+
+## Now check if compaction is completed
+
+hoodie:stock_ticks_mor->connect --path /user/hive/warehouse/stock_ticks_mor
+18/09/24 07:03:00 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from /user/hive/warehouse/stock_ticks_mor
+18/09/24 07:03:00 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://namenode:8020], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1261652683_11, ugi=root (auth:SIMPLE)]]]
+18/09/24 07:03:00 INFO table.HoodieTableConfig: Loading table properties from /user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
+18/09/24 07:03:00 INFO table.HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1) from /user/hive/warehouse/stock_ticks_mor
+Metadata for table stock_ticks_mor loaded
+
+hoodie:stock_ticks->compactions show all
+18/09/24 07:03:15 INFO timeline.HoodieActiveTimeline: Loaded instants [[20180924064636__clean__COMPLETED], [20180924064636__deltacommit__COMPLETED], [20180924065057__clean__COMPLETED], [20180924065057__deltacommit__COMPLETED], [20180924070031__commit__COMPLETED]]
+___________________________________________________________________
+| Compaction Instant Time| State    | Total FileIds to be Compacted|
+|==================================================================|
+| 20180924070031         | COMPLETED| 1                            |
+
+```
+
+### Step 9: Run Hive Queries including incremental queries
+
+You will see that both ReadOptimized and Snapshot queries will show the latest committed data.
+Lets also run the incremental query for MOR table.
+From looking at the below query output, it will be clear that the fist commit time for the MOR table is 20180924064636
+and the second commit time is 20180924070031
+
+```java
+docker exec -it adhoc-2 /bin/bash
+beeline -u jdbc:hive2://hiveserver:10000 \
+  --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
+  --hiveconf hive.stats.autogather=false
+
+# Read Optimized Query
+0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
+WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
++---------+----------------------+--+
+| symbol  |         _c1          |
++---------+----------------------+--+
+| GOOG    | 2018-08-31 10:59:00  |
++---------+----------------------+--+
+1 row selected (1.6 seconds)
+
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG';
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924064636       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924070031       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+
+# Snapshot Query
+0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
+WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
++---------+----------------------+--+
+| symbol  |         _c1          |
++---------+----------------------+--+
+| GOOG    | 2018-08-31 10:59:00  |
++---------+----------------------+--+
+
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG';
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924064636       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924070031       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+
+# Incremental Query:
+
+0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_mor.consume.mode=INCREMENTAL;
+No rows affected (0.008 seconds)
+# Max-Commits covers both second batch and compaction commit
+0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_mor.consume.max.commits=3;
+No rows affected (0.007 seconds)
+0: jdbc:hive2://hiveserver:10000> set hoodie.stock_ticks_mor.consume.start.timestamp=20180924064636;
+No rows affected (0.013 seconds)
+# Query:
+0: jdbc:hive2://hiveserver:10000> select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064636';
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+| 20180924070031       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
++----------------------+---------+----------------------+---------+------------+-----------+--+
+
+exit
+```
+
+### Step 10: Read Optimized and Snapshot queries for MOR with Spark-SQL after compaction
+
+```java
+docker exec -it adhoc-1 /bin/bash
+$SPARK_INSTALL/bin/spark-shell \
+  --jars $HUDI_SPARK_BUNDLE \
+  --driver-class-path $HADOOP_CONF_DIR \
+  --conf spark.sql.hive.convertMetastoreParquet=false \
+  --deploy-mode client \
+  --driver-memory 1G \
+  --master local[2] \
+  --executor-memory 3G \
+  --num-executors 1 \
+  --packages org.apache.spark:spark-avro_2.11:2.4.4
+
+# Read Optimized Query
+scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG'").show(100, false)
++---------+----------------------+
+| symbol  |        max(ts)       |
++---------+----------------------+
+| GOOG    | 2018-08-31 10:59:00  |
++---------+----------------------+
+1 row selected (1.6 seconds)
+
+scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG'").show(100, false)
++----------------------+---------+----------------------+---------+------------+-----------+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+
+| 20180924064636       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924070031       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
++----------------------+---------+----------------------+---------+------------+-----------+
+
+# Snapshot Query
+scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG'").show(100, false)
++---------+----------------------+
+| symbol  |     max(ts)          |
++---------+----------------------+
+| GOOG    | 2018-08-31 10:59:00  |
++---------+----------------------+
+
+scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG'").show(100, false)
++----------------------+---------+----------------------+---------+------------+-----------+
+| _hoodie_commit_time  | symbol  |          ts          | volume  |    open    |   close   |
++----------------------+---------+----------------------+---------+------------+-----------+
+| 20180924064636       | GOOG    | 2018-08-31 09:59:00  | 6330    | 1230.5     | 1230.02   |
+| 20180924070031       | GOOG    | 2018-08-31 10:59:00  | 9021    | 1227.1993  | 1227.215  |
++----------------------+---------+----------------------+---------+------------+-----------+
+```
+
+### Step 11:  Presto Read Optimized queries on MOR table after compaction
+
+```java
+docker exec -it presto-worker-1 presto --server presto-coordinator-1:8090
+presto> use hive.default;
+USE
+
+# Read Optimized Query
+resto:default> select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
+  symbol |        _col1
+--------+---------------------
+ GOOG   | 2018-08-31 10:59:00
+(1 row)
+
+Query 20190822_182319_00011_segyw, FINISHED, 1 node
+Splits: 49 total, 49 done (100.00%)
+0:01 [197 rows, 613B] [133 rows/s, 414B/s]
+
+presto:default> select "_hoodie_commit_time", symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG';
+ _hoodie_commit_time | symbol |         ts          | volume |   open    |  close
+---------------------+--------+---------------------+--------+-----------+----------
+ 20190822180250      | GOOG   | 2018-08-31 09:59:00 |   6330 |    1230.5 |  1230.02
+ 20190822181944      | GOOG   | 2018-08-31 10:59:00 |   9021 | 1227.1993 | 1227.215
+(2 rows)
+
+Query 20190822_182333_00012_segyw, FINISHED, 1 node
+Splits: 17 total, 17 done (100.00%)
+0:02 [197 rows, 613B] [98 rows/s, 307B/s]
+
+presto:default>
+```
+
+
+This brings the demo to an end.
+
+## Testing Hudi in Local Docker environment
+
+You can bring up a hadoop docker environment containing Hadoop, Hive and Spark services with support for hudi.
+```java
+$ mvn pre-integration-test -DskipTests
+```
+The above command builds docker images for all the services with
+current Hudi source installed at /var/hoodie/ws and also brings up the services using a compose file. We
+currently use Hadoop (v2.8.4), Hive (v2.3.3) and Spark (v2.4.4) in docker images.
+
+To bring down the containers
+```java
+$ cd hudi-integ-test
+$ mvn docker-compose:down
+```
+
+If you want to bring up the docker containers, use
+```java
+$ cd hudi-integ-test
+$ mvn docker-compose:up -DdetachedMode=true
+```
+
+Hudi is a library that is operated in a broader data analytics/ingestion environment
+involving Hadoop, Hive and Spark. Interoperability with all these systems is a key objective for us. We are
+actively adding integration-tests under __hudi-integ-test/src/test/java__ that makes use of this
+docker environment (See __hudi-integ-test/src/test/java/org/apache/hudi/integ/ITTestHoodieSanity.java__ )
+
+
+### Building Local Docker Containers:
+
+The docker images required for demo and running integration test are already in docker-hub. The docker images
+and compose scripts are carefully implemented so that they serve dual-purpose
+
+1. The docker images have inbuilt hudi jar files with environment variable pointing to those jars (HUDI_HADOOP_BUNDLE, ...)
+2. For running integration-tests, we need the jars generated locally to be used for running services within docker. The
+   docker-compose scripts (see `docker/compose/docker-compose_hadoop284_hive233_spark231.yml`) ensures local jars override
+   inbuilt jars by mounting local HUDI workspace over the docker location
+
+This helps avoid maintaining separate docker images and avoids the costly step of building HUDI docker images locally.
+But if users want to test hudi from locations with lower network bandwidth, they can still build local images
+run the script
+`docker/build_local_docker_images.sh` to build local docker images before running `docker/setup_demo.sh`
+
+Here are the commands:
+
+```java
+cd docker
+./build_local_docker_images.sh
+.....
+
+[INFO] Reactor Summary:
+[INFO]
+[INFO] hoodie ............................................. SUCCESS [  1.709 s]
+[INFO] hudi-common ...................................... SUCCESS [  9.015 s]
+[INFO] hudi-hadoop-mr ................................... SUCCESS [  1.108 s]
+[INFO] hudi-client ...................................... SUCCESS [  4.409 s]
+[INFO] hudi-hive ........................................ SUCCESS [  0.976 s]
+[INFO] hudi-spark ....................................... SUCCESS [ 26.522 s]
+[INFO] hudi-utilities ................................... SUCCESS [ 16.256 s]
+[INFO] hudi-cli ......................................... SUCCESS [ 11.341 s]
+[INFO] hudi-hadoop-mr-bundle ............................ SUCCESS [  1.893 s]
+[INFO] hudi-hive-bundle ................................. SUCCESS [ 14.099 s]
+[INFO] hudi-spark-bundle ................................ SUCCESS [ 58.252 s]
+[INFO] hudi-hadoop-docker ............................... SUCCESS [  0.612 s]
+[INFO] hudi-hadoop-base-docker .......................... SUCCESS [04:04 min]
+[INFO] hudi-hadoop-namenode-docker ...................... SUCCESS [  6.142 s]
+[INFO] hudi-hadoop-datanode-docker ...................... SUCCESS [  7.763 s]
+[INFO] hudi-hadoop-history-docker ....................... SUCCESS [  5.922 s]
+[INFO] hudi-hadoop-hive-docker .......................... SUCCESS [ 56.152 s]
+[INFO] hudi-hadoop-sparkbase-docker ..................... SUCCESS [01:18 min]
+[INFO] hudi-hadoop-sparkmaster-docker ................... SUCCESS [  2.964 s]
+[INFO] hudi-hadoop-sparkworker-docker ................... SUCCESS [  3.032 s]
+[INFO] hudi-hadoop-sparkadhoc-docker .................... SUCCESS [  2.764 s]
+[INFO] hudi-integ-test .................................. SUCCESS [  1.785 s]
+[INFO] ------------------------------------------------------------------------
+[INFO] BUILD SUCCESS
+[INFO] ------------------------------------------------------------------------
+[INFO] Total time: 09:15 min
+[INFO] Finished at: 2018-09-10T17:47:37-07:00
+[INFO] Final Memory: 236M/1848M
+[INFO] ------------------------------------------------------------------------
+```
diff --git a/docs/_docs/0.5.2/1_1_quick_start_guide.cn.md b/docs/_docs/0.5.2/1_1_quick_start_guide.cn.md
new file mode 100644
index 0000000..a8f4d49
--- /dev/null
+++ b/docs/_docs/0.5.2/1_1_quick_start_guide.cn.md
@@ -0,0 +1,162 @@
+---
+version: 0.5.2
+title: "Quick-Start Guide"
+permalink: /cn/docs/0.5.2-quick-start-guide.html
+toc: true
+last_modified_at: 2019-12-30T15:59:57-04:00
+language: cn
+---
+
+本指南通过使用spark-shell简要介绍了Hudi功能。使用Spark数据源,我们将通过代码段展示如何插入和更新的Hudi默认存储类型数据集:
+[写时复制](/cn/docs/0.5.2-concepts.html#copy-on-write-storage)。每次写操作之后,我们还将展示如何读取快照和增量读取数据。 
+
+## 设置spark-shell
+Hudi适用于Spark-2.x版本。您可以按照[此处](https://spark.apache.org/downloads.html)的说明设置spark。
+在提取的目录中,使用spark-shell运行Hudi:
+
+```scala
+bin/spark-shell --packages org.apache.hudi:hudi-spark-bundle:0.5.0-incubating --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
+```
+
+设置表名、基本路径和数据生成器来为本指南生成记录。
+
+```scala
+import org.apache.hudi.QuickstartUtils._
+import scala.collection.JavaConversions._
+import org.apache.spark.sql.SaveMode._
+import org.apache.hudi.DataSourceReadOptions._
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.config.HoodieWriteConfig._
+
+val tableName = "hudi_cow_table"
+val basePath = "file:///tmp/hudi_cow_table"
+val dataGen = new DataGenerator
+```
+
+[数据生成器](https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L50)
+可以基于[行程样本模式](https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L57)
+生成插入和更新的样本。
+
+## 插入数据 {#inserts}
+生成一些新的行程样本,将其加载到DataFrame中,然后将DataFrame写入Hudi数据集中,如下所示。
+
+```scala
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+df.write.format("org.apache.hudi").
+    options(getQuickstartWriteConfigs).
+    option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+    option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+    option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+    option(TABLE_NAME, tableName).
+    mode(Overwrite).
+    save(basePath);
+```
+
+`mode(Overwrite)`覆盖并重新创建数据集(如果已经存在)。
+您可以检查在`/tmp/hudi_cow_table/<region>/<country>/<city>/`下生成的数据。我们提供了一个记录键
+([schema](#sample-schema)中的`uuid`),分区字段(`region/county/city`)和组合逻辑([schema](#sample-schema)中的`ts`)
+以确保行程记录在每个分区中都是唯一的。更多信息请参阅
+[对Hudi中的数据进行建模](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi),
+有关将数据提取到Hudi中的方法的信息,请参阅[写入Hudi数据集](/cn/docs/0.5.2-writing_data.html)。
+这里我们使用默认的写操作:`插入更新`。 如果您的工作负载没有`更新`,也可以使用更快的`插入`或`批量插入`操作。
+想了解更多信息,请参阅[写操作](/cn/docs/0.5.2-writing_data.html#write-operations)
+
+## 查询数据 {#query}
+
+将数据文件加载到DataFrame中。
+
+```scala
+val roViewDF = spark.
+    read.
+    format("org.apache.hudi").
+    load(basePath + "/*/*/*/*")
+roViewDF.registerTempTable("hudi_ro_table")
+spark.sql("select fare, begin_lon, begin_lat, ts from  hudi_ro_table where fare > 20.0").show()
+spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from  hudi_ro_table").show()
+```
+
+该查询提供已提取数据的读取优化视图。由于我们的分区路径(`region/country/city`)是嵌套的3个级别
+从基本路径开始,我们使用了`load(basePath + "/*/*/*/*")`。
+有关支持的所有存储类型和视图的更多信息,请参考[存储类型和视图](/cn/docs/0.5.2-concepts.html#storage-types--views)。
+
+## 更新数据 {#updates}
+
+这类似于插入新数据。使用数据生成器生成对现有行程的更新,加载到DataFrame中并将DataFrame写入hudi数据集。
+
+```scala
+val updates = convertToStringList(dataGen.generateUpdates(10))
+val df = spark.read.json(spark.sparkContext.parallelize(updates, 2));
+df.write.format("org.apache.hudi").
+    options(getQuickstartWriteConfigs).
+    option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+    option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+    option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+    option(TABLE_NAME, tableName).
+    mode(Append).
+    save(basePath);
+```
+
+注意,保存模式现在为`追加`。通常,除非您是第一次尝试创建数据集,否则请始终使用追加模式。
+[查询](#query)现在再次查询数据将显示更新的行程。每个写操作都会生成一个新的由时间戳表示的[commit](/cn/docs/0.5.2-concepts.html)
+。在之前提交的相同的`_hoodie_record_key`中寻找`_hoodie_commit_time`, `rider`, `driver`字段变更。
+
+## 增量查询
+
+Hudi还提供了获取给定提交时间戳以来已更改的记录流的功能。
+这可以通过使用Hudi的增量视图并提供所需更改的开始时间来实现。
+如果我们需要给定提交之后的所有更改(这是常见的情况),则无需指定结束时间。
+
+```scala
+// reload data
+spark.
+    read.
+    format("org.apache.hudi").
+    load(basePath + "/*/*/*/*").
+    createOrReplaceTempView("hudi_ro_table")
+
+val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime from  hudi_ro_table order by commitTime").map(k => k.getString(0)).take(50)
+val beginTime = commits(commits.length - 2) // commit time we are interested in
+
+// 增量查询数据
+val incViewDF = spark.
+    read.
+    format("org.apache.hudi").
+    option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL).
+    option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
+    load(basePath);
+incViewDF.registerTempTable("hudi_incr_table")
+spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from  hudi_incr_table where fare > 20.0").show()
+```
+
+这将提供在开始时间提交之后发生的所有更改,其中包含票价大于20.0的过滤器。关于此功能的独特之处在于,它现在使您可以在批量数据上创作流式管道。
+
+## 特定时间点查询
+
+让我们看一下如何查询特定时间的数据。可以通过将结束时间指向特定的提交时间,将开始时间指向"000"(表示最早的提交时间)来表示特定时间。
+
+```scala
+val beginTime = "000" // Represents all commits > this time.
+val endTime = commits(commits.length - 2) // commit time we are interested in
+
+// 增量查询数据
+val incViewDF = spark.read.format("org.apache.hudi").
+    option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL).
+    option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
+    option(END_INSTANTTIME_OPT_KEY, endTime).
+    load(basePath);
+incViewDF.registerTempTable("hudi_incr_table")
+spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from  hudi_incr_table where fare > 20.0").show()
+```
+
+## 从这开始下一步?
+
+您也可以通过[自己构建hudi](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source)来快速开始,
+并在spark-shell命令中使用`--jars <path to hudi_code>/packaging/hudi-spark-bundle/target/hudi-spark-bundle-*.*.*-SNAPSHOT.jar`,
+而不是`--packages org.apache.hudi:hudi-spark-bundle:0.5.0-incubating`
+
+
+这里我们使用Spark演示了Hudi的功能。但是,Hudi可以支持多种存储类型/视图,并且可以从Hive,Spark,Presto等查询引擎中查询Hudi数据集。
+我们制作了一个基于Docker设置、所有依赖系统都在本地运行的[演示视频](https://www.youtube.com/watch?v=VhNgUsxdrD0),
+我们建议您复制相同的设置然后按照[这里](/cn/docs/0.5.2-docker_demo.html)的步骤自己运行这个演示。
+另外,如果您正在寻找将现有数据迁移到Hudi的方法,请参考[迁移指南](/cn/docs/0.5.2-migration_guide.html)。
diff --git a/docs/_docs/0.5.2/1_1_quick_start_guide.md b/docs/_docs/0.5.2/1_1_quick_start_guide.md
new file mode 100644
index 0000000..d7bd0ff
--- /dev/null
+++ b/docs/_docs/0.5.2/1_1_quick_start_guide.md
@@ -0,0 +1,220 @@
+---
+version: 0.5.2
+title: "Quick-Start Guide"
+permalink: /docs/0.5.2-quick-start-guide.html
+toc: true
+last_modified_at: 2019-12-30T15:59:57-04:00
+---
+
+This guide provides a quick peek at Hudi's capabilities using spark-shell. Using Spark datasources, we will walk through 
+code snippets that allows you to insert and update a Hudi table of default table type: 
+[Copy on Write](/docs/0.5.2-concepts.html#copy-on-write-table). 
+After each write operation we will also show how to read the data both snapshot and incrementally.
+
+## Setup spark-shell
+
+Hudi works with Spark-2.x versions. You can follow instructions [here](https://spark.apache.org/downloads.html) for setting up spark. 
+From the extracted directory run spark-shell with Hudi as:
+
+```scala
+spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
+  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
+  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
+```
+
+<div class="notice--info">
+  <h4>Please note the following: </h4>
+<ul>
+  <li>spark-avro module needs to be specified in --packages as it is not included with spark-shell by default</li>
+  <li>spark-avro and spark versions must match (we have used 2.4.4 for both above)</li>
+  <li>we have used hudi-spark-bundle built for scala 2.11 since the spark-avro module used also depends on 2.11. 
+         If spark-avro_2.12 is used, correspondingly hudi-spark-bundle_2.12 needs to be used. </li>
+</ul>
+</div>
+
+Setup table name, base path and a data generator to generate records for this guide.
+
+```scala
+import org.apache.hudi.QuickstartUtils._
+import scala.collection.JavaConversions._
+import org.apache.spark.sql.SaveMode._
+import org.apache.hudi.DataSourceReadOptions._
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.config.HoodieWriteConfig._
+
+val tableName = "hudi_trips_cow"
+val basePath = "file:///tmp/hudi_trips_cow"
+val dataGen = new DataGenerator
+```
+
+The [DataGenerator](https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L50) 
+can generate sample inserts and updates based on the the sample trip schema [here](https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L57)
+{: .notice--info}
+
+
+## Insert data
+
+Generate some new trips, load them into a DataFrame and write the DataFrame into the Hudi table as below.
+
+```scala
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+df.write.format("hudi").
+  options(getQuickstartWriteConfigs).
+  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+  option(TABLE_NAME, tableName).
+  mode(Overwrite).
+  save(basePath)
+``` 
+
+`mode(Overwrite)` overwrites and recreates the table if it already exists.
+You can check the data generated under `/tmp/hudi_trips_cow/<region>/<country>/<city>/`. We provided a record key 
+(`uuid` in [schema](https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)), partition field (`region/county/city`) and combine logic (`ts` in 
+[schema](https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L58)) to ensure trip records are unique within each partition. For more info, refer to 
+[Modeling data stored in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi)
+and for info on ways to ingest data into Hudi, refer to [Writing Hudi Tables](/docs/0.5.2-writing_data.html).
+Here we are using the default write operation : `upsert`. If you have a workload without updates, you can also issue 
+`insert` or `bulk_insert` operations which could be faster. To know more, refer to [Write operations](/docs/0.5.2-writing_data#write-operations)
+{: .notice--info}
+ 
+## Query data 
+
+Load the data files into a DataFrame.
+
+```scala
+val tripsSnapshotDF = spark.
+  read.
+  format("hudi").
+  load(basePath + "/*/*/*/*")
+tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
+
+spark.sql("select fare, begin_lon, begin_lat, ts from  hudi_trips_snapshot where fare > 20.0").show()
+spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from  hudi_trips_snapshot").show()
+```
+
+This query provides snapshot querying of the ingested data. Since our partition path (`region/country/city`) is 3 levels nested 
+from base path we ve used `load(basePath + "/*/*/*/*")`. 
+Refer to [Table types and queries](/docs/0.5.2-concepts#table-types--queries) for more info on all table types and query types supported.
+{: .notice--info}
+
+## Update data
+
+This is similar to inserting new data. Generate updates to existing trips using the data generator, load into a DataFrame 
+and write DataFrame into the hudi table.
+
+```scala
+val updates = convertToStringList(dataGen.generateUpdates(10))
+val df = spark.read.json(spark.sparkContext.parallelize(updates, 2))
+df.write.format("hudi").
+  options(getQuickstartWriteConfigs).
+  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+  option(TABLE_NAME, tableName).
+  mode(Append).
+  save(basePath)
+```
+
+Notice that the save mode is now `Append`. In general, always use append mode unless you are trying to create the table for the first time.
+[Querying](#query-data) the data again will now show updated trips. Each write operation generates a new [commit](http://hudi.incubator.apache.org/docs/concepts.html) 
+denoted by the timestamp. Look for changes in `_hoodie_commit_time`, `rider`, `driver` fields for the same `_hoodie_record_key`s in previous commit. 
+{: .notice--info}
+
+## Incremental query
+
+Hudi also provides capability to obtain a stream of records that changed since given commit timestamp. 
+This can be achieved using Hudi's incremental querying and providing a begin time from which changes need to be streamed. 
+We do not need to specify endTime, if we want all changes after the given commit (as is the common case). 
+
+```scala
+// reload data
+spark.
+  read.
+  format("hudi").
+  load(basePath + "/*/*/*/*").
+  createOrReplaceTempView("hudi_trips_snapshot")
+
+val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime from  hudi_trips_snapshot order by commitTime").map(k => k.getString(0)).take(50)
+val beginTime = commits(commits.length - 2) // commit time we are interested in
+
+// incrementally query data
+val tripsIncrementalDF = spark.read.format("hudi").
+  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
+  option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
+  load(basePath)
+tripsIncrementalDF.createOrReplaceTempView("hudi_trips_incremental")
+
+spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from  hudi_trips_incremental where fare > 20.0").show()
+``` 
+
+This will give all changes that happened after the beginTime commit with the filter of fare > 20.0. The unique thing about this
+feature is that it now lets you author streaming pipelines on batch data.
+{: .notice--info}
+
+## Point in time query
+
+Lets look at how to query data as of a specific time. The specific time can be represented by pointing endTime to a 
+specific commit time and beginTime to "000" (denoting earliest possible commit time). 
+
+```scala
+val beginTime = "000" // Represents all commits > this time.
+val endTime = commits(commits.length - 2) // commit time we are interested in
+
+//incrementally query data
+val tripsPointInTimeDF = spark.read.format("hudi").
+  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
+  option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
+  option(END_INSTANTTIME_OPT_KEY, endTime).
+  load(basePath)
+tripsPointInTimeDF.createOrReplaceTempView("hudi_trips_point_in_time")
+spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from  hudi_trips_point_in_time where fare > 20.0").show()
+```
+
+## Delete data {#deletes}
+Delete records for the HoodieKeys passed in.
+
+```scala
+// fetch total records count
+spark.sql("select uuid, partitionPath from hudi_trips_snapshot").count()
+// fetch two records to be deleted
+val ds = spark.sql("select uuid, partitionPath from hudi_trips_snapshot").limit(2)
+
+// issue deletes
+val deletes = dataGen.generateDeletes(ds.collectAsList())
+val df = spark.read.json(spark.sparkContext.parallelize(deletes, 2));
+df.write.format("hudi").
+  options(getQuickstartWriteConfigs).
+  option(OPERATION_OPT_KEY,"delete").
+  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+  option(TABLE_NAME, tableName).
+  mode(Append).
+  save(basePath)
+
+// run the same read query as above.
+val roAfterDeleteViewDF = spark.
+  read.
+  format("hudi").
+  load(basePath + "/*/*/*/*")
+roAfterDeleteViewDF.registerTempTable("hudi_trips_snapshot")
+// fetch should return (total - 2) records
+spark.sql("select uuid, partitionPath from hudi_trips_snapshot").count()
+```
+Note: Only `Append` mode is supported for delete operation.
+
+## Where to go from here?
+
+You can also do the quickstart by [building hudi yourself](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source), 
+and using `--jars <path to hudi_code>/packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` in the spark-shell command above
+instead of `--packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`. Hudi also supports scala 2.12. Refer [build with scala 2.12](https://github.com/apache/incubator-hudi#build-with-scala-212)
+for more info.
+
+Also, we used Spark here to show case the capabilities of Hudi. However, Hudi can support multiple table types/query types and 
+Hudi tables can be queried from query engines like Hive, Spark, Presto and much more. We have put together a 
+[demo video](https://www.youtube.com/watch?v=VhNgUsxdrD0) that show cases all of this on a docker based setup with all 
+dependent systems running locally. We recommend you replicate the same setup and run the demo yourself, by following 
+steps [here](/docs/0.5.2-docker_demo.html) to get a taste for it. Also, if you are looking for ways to migrate your existing data 
+to Hudi, refer to [migration guide](/docs/0.5.2-migration_guide.html). 
diff --git a/docs/_docs/0.5.2/1_2_structure.md b/docs/_docs/0.5.2/1_2_structure.md
new file mode 100644
index 0000000..c65ab4d
--- /dev/null
+++ b/docs/_docs/0.5.2/1_2_structure.md
@@ -0,0 +1,22 @@
+---
+version: 0.5.2
+title: Structure
+keywords: big data, stream processing, cloud, hdfs, storage, upserts, change capture
+permalink: /docs/0.5.2-structure.html
+summary: "Hudi brings stream processing to big data, providing fresh data while being an order of magnitude efficient over traditional batch processing."
+last_modified_at: 2019-12-30T15:59:57-04:00
+---
+
+Hudi (pronounced “Hoodie”) ingests & manages storage of large analytical tables over DFS ([HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) or cloud stores) and provides three types of queries.
+
+ * **Read Optimized query** - Provides excellent query performance on pure columnar storage, much like plain [Parquet](https://parquet.apache.org/) tables.
+ * **Incremental query** - Provides a change stream out of the dataset to feed downstream jobs/ETLs.
+ * **Snapshot query** - Provides queries on real-time data, using a combination of columnar & row based storage (e.g Parquet + [Avro](http://avro.apache.org/docs/current/mr.html))
+
+<figure>
+    <img class="docimage" src="/assets/images/hudi_intro_1.png" alt="hudi_intro_1.png" />
+</figure>
+
+By carefully managing how data is laid out in storage & how it’s exposed to queries, Hudi is able to power a rich data ecosystem where external sources can be ingested in near real-time and made available for interactive SQL Engines like [Presto](https://prestodb.io) & [Spark](https://spark.apache.org/sql/), while at the same time capable of being consumed incrementally from processing/ETL frameworks like [Hive](https://hive.apache.org/) & [Spark](https://spark.apache.org/docs/latest/) t [...]
+
+Hudi broadly consists of a self contained Spark library to build tables and integrations with existing query engines for data access. See [quickstart](/docs/0.5.2-quick-start-guide) for a demo.
diff --git a/docs/_docs/0.5.2/1_3_use_cases.cn.md b/docs/_docs/0.5.2/1_3_use_cases.cn.md
new file mode 100644
index 0000000..4a15190
--- /dev/null
+++ b/docs/_docs/0.5.2/1_3_use_cases.cn.md
@@ -0,0 +1,69 @@
+---
+version: 0.5.2
+title: 使用案例
+keywords: hudi, data ingestion, etl, real time, use cases
+permalink: /cn/docs/0.5.2-use_cases.html
+summary: "Following are some sample use-cases for Hudi, which illustrate the benefits in terms of faster processing & increased efficiency"
+toc: true
+last_modified_at: 2019-12-30T15:59:57-04:00
+language: cn
+---
+
+以下是一些使用Hudi的示例,说明了加快处理速度和提高效率的好处
+
+## 近实时摄取
+
+将外部源(如事件日志、数据库、外部源)的数据摄取到[Hadoop数据湖](http://martinfowler.com/bliki/DataLake.html)是一个众所周知的问题。
+尽管这些数据对整个组织来说是最有价值的,但不幸的是,在大多数(如果不是全部)Hadoop部署中都使用零散的方式解决,即使用多个不同的摄取工具。
+
+
+对于RDBMS摄取,Hudi提供 __通过更新插入达到更快加载__,而不是昂贵且低效的批量加载。例如,您可以读取MySQL BIN日志或[Sqoop增量导入](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports)并将其应用于
+DFS上的等效Hudi表。这比[批量合并任务](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)及[复杂的手工合并工作流](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/)更快/更有效率。
+
+
+对于NoSQL数据存储,如[Cassandra](http://cassandra.apache.org/) / [Voldemort](http://www.project-voldemort.com/voldemort/) / [HBase](https://hbase.apache.org/),即使是中等规模大小也会存储数十亿行。
+毫无疑问, __全量加载不可行__,如果摄取需要跟上较高的更新量,那么则需要更有效的方法。
+
+
+即使对于像[Kafka](kafka.apache.org)这样的不可变数据源,Hudi也可以 __强制在HDFS上使用最小文件大小__, 这采取了综合方式解决[HDFS小文件问题](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/)来改善NameNode的健康状况。这对事件流来说更为重要,因为它通常具有较高容量(例如:点击流),如果管理不当,可能会对Hadoop集群造成严重损害。
+
+在所有源中,通过`commits`这一概念,Hudi增加了以原子方式向消费者发布新数据的功能,这种功能十分必要。
+
+## 近实时分析
+
+通常,实时[数据集市](https://en.wikipedia.org/wiki/Data_mart)由专业(实时)数据分析存储提供支持,例如[Druid](http://druid.io/)或[Memsql](http://www.memsql.com/)或[OpenTSDB](http://opentsdb.net/)。
+这对于较小规模的数据量来说绝对是完美的([相比于这样安装Hadoop](https://blog.twitter.com/2015/hadoop-filesystem-at-twitter)),这种情况需要在亚秒级响应查询,例如系统监控或交互式实时分析。
+但是,由于Hadoop上的数据太陈旧了,通常这些系统会被滥用于非交互式查询,这导致利用率不足和硬件/许可证成本的浪费。
+
+另一方面,Hadoop上的交互式SQL解决方案(如Presto和SparkSQL)表现出色,在 __几秒钟内完成查询__。
+通过将 __数据新鲜度提高到几分钟__,Hudi可以提供一个更有效的替代方案,并支持存储在DFS中的 __数量级更大的数据集__ 的实时分析。
+此外,Hudi没有外部依赖(如专用于实时分析的HBase集群),因此可以在更新的分析上实现更快的分析,而不会增加操作开销。
+
+
+## 增量处理管道
+
+Hadoop提供的一个基本能力是构建一系列数据集,这些数据集通过表示为工作流的DAG相互派生。
+工作流通常取决于多个上游工作流输出的新数据,新数据的可用性传统上由新的DFS文件夹/Hive分区指示。
+让我们举一个具体的例子来说明这点。上游工作流`U`可以每小时创建一个Hive分区,在每小时结束时(processing_time)使用该小时的数据(event_time),提供1小时的有效新鲜度。
+然后,下游工作流`D`在`U`结束后立即启动,并在下一个小时内自行处理,将有效延迟时间增加到2小时。
+
+上面的示例忽略了迟到的数据,即`processing_time`和`event_time`分开时。
+不幸的是,在今天的后移动和前物联网世界中,__来自间歇性连接的移动设备和传感器的延迟数据是常态,而不是异常__。
+在这种情况下,保证正确性的唯一补救措施是[重新处理最后几个小时](https://falcon.apache.org/FalconDocumentation.html#Handling_late_input_data)的数据,
+每小时一遍又一遍,这可能会严重影响整个生态系统的效率。例如; 试想一下,在数百个工作流中每小时重新处理TB数据。
+
+Hudi通过以单个记录为粒度的方式(而不是文件夹/分区)从上游 Hudi数据集`HU`消费新数据(包括迟到数据),来解决上面的问题。
+应用处理逻辑,并使用下游Hudi数据集`HD`高效更新/协调迟到数据。在这里,`HU`和`HD`可以以更频繁的时间被连续调度
+比如15分钟,并且`HD`提供端到端30分钟的延迟。
+
+为实现这一目标,Hudi采用了类似于[Spark Streaming](https://spark.apache.org/docs/latest/streaming-programming-guide.html#join-operations)、发布/订阅系统等流处理框架,以及像[Kafka](http://kafka.apache.org/documentation/#theconsumer)
+或[Oracle XStream](https://docs.oracle.com/cd/E11882_01/server.112/e16545/xstrm_cncpt.htm#XSTRM187)等数据库复制技术的类似概念。
+如果感兴趣,可以在[这里](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop)找到有关增量处理(相比于流处理和批处理)好处的更详细解释。
+
+## DFS的数据分发
+
+一个常用场景是先在Hadoop上处理数据,然后将其分发回在线服务存储层,以供应用程序使用。
+例如,一个Spark管道可以[确定Hadoop上的紧急制动事件](https://eng.uber.com/telematics/)并将它们加载到服务存储层(如ElasticSearch)中,供Uber应用程序使用以增加安全驾驶。这种用例中,通常架构会在Hadoop和服务存储之间引入`队列`,以防止目标服务存储被压垮。
+对于队列的选择,一种流行的选择是Kafka,这个模型经常导致 __在DFS上存储相同数据的冗余(用于计算结果的离线分析)和Kafka(用于分发)__
+
+通过将每次运行的Spark管道更新插入的输出转换为Hudi数据集,Hudi可以再次有效地解决这个问题,然后可以以增量方式获取尾部数据(就像Kafka topic一样)然后写入服务存储层。
diff --git a/docs/_docs/0.5.2/1_3_use_cases.md b/docs/_docs/0.5.2/1_3_use_cases.md
new file mode 100644
index 0000000..de33ab7
--- /dev/null
+++ b/docs/_docs/0.5.2/1_3_use_cases.md
@@ -0,0 +1,68 @@
+---
+version: 0.5.2
+title: "Use Cases"
+keywords: hudi, data ingestion, etl, real time, use cases
+permalink: /docs/0.5.2-use_cases.html
+summary: "Following are some sample use-cases for Hudi, which illustrate the benefits in terms of faster processing & increased efficiency"
+toc: true
+last_modified_at: 2019-12-30T15:59:57-04:00
+---
+
+## Near Real-Time Ingestion
+
+Ingesting data from external sources like (event logs, databases, external sources) into a [Hadoop Data Lake](http://martinfowler.com/bliki/DataLake.html) is a well known problem.
+In most (if not all) Hadoop deployments, it is unfortunately solved in a piecemeal fashion, using a medley of ingestion tools,
+even though this data is arguably the most valuable for the entire organization.
+
+For RDBMS ingestion, Hudi provides __faster loads via Upserts__, as opposed costly & inefficient bulk loads. For e.g, you can read the MySQL BIN log or [Sqoop Incremental Import](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports) and apply them to an
+equivalent Hudi table on DFS. This would be much faster/efficient than a [bulk merge job](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)
+or [complicated handcrafted merge workflows](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/)
+
+For NoSQL datastores like [Cassandra](http://cassandra.apache.org/) / [Voldemort](http://www.project-voldemort.com/voldemort/) / [HBase](https://hbase.apache.org/), even moderately big installations store billions of rows.
+It goes without saying that __full bulk loads are simply infeasible__ and more efficient approaches are needed if ingestion is to keep up with the typically high update volumes.
+
+Even for immutable data sources like [Kafka](https://kafka.apache.org) , Hudi helps __enforces a minimum file size on HDFS__, which improves NameNode health by solving one of the [age old problems in Hadoop land](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) in a holistic way. This is all the more important for event streams, since typically its higher volume (eg: click streams) and if not managed well, can cause serious damage to your Hadoop cluster.
+
+Across all sources, Hudi adds the much needed ability to atomically publish new data to consumers via notion of commits, shielding them from partial ingestion failures
+
+
+## Near Real-time Analytics
+
+Typically, real-time [datamarts](https://en.wikipedia.org/wiki/Data_mart) are powered by specialized analytical stores such as [Druid](http://druid.io/) or [Memsql](http://www.memsql.com/) or [even OpenTSDB](http://opentsdb.net/) .
+This is absolutely perfect for lower scale ([relative to Hadoop installations like this](https://blog.twitter.com/2015/hadoop-filesystem-at-twitter)) data, that needs sub-second query responses such as system monitoring or interactive real-time analysis.
+But, typically these systems end up getting abused for less interactive queries also since data on Hadoop is intolerably stale. This leads to under utilization & wasteful hardware/license costs.
+
+On the other hand, interactive SQL solutions on Hadoop such as Presto & SparkSQL excel in __queries that finish within few seconds__.
+By bringing __data freshness to a few minutes__, Hudi can provide a much efficient alternative, as well unlock real-time analytics on __several magnitudes larger tables__ stored in DFS.
+Also, Hudi has no external dependencies (like a dedicated HBase cluster, purely used for real-time analytics) and thus enables faster analytics on much fresher analytics, without increasing the operational overhead.
+
+
+## Incremental Processing Pipelines
+
+One fundamental ability Hadoop provides is to build a chain of tables derived from each other via DAGs expressed as workflows.
+Workflows often depend on new data being output by multiple upstream workflows and traditionally, availability of new data is indicated by a new DFS Folder/Hive Partition.
+Let's take a concrete example to illustrate this. An upstream workflow `U` can create a Hive partition for every hour, with data for that hour (event_time) at the end of each hour (processing_time), providing effective freshness of 1 hour.
+Then, a downstream workflow `D`, kicks off immediately after `U` finishes, and does its own processing for the next hour, increasing the effective latency to 2 hours.
+
+The above paradigm simply ignores late arriving data i.e when `processing_time` and `event_time` drift apart.
+Unfortunately, in today's post-mobile & pre-IoT world, __late data from intermittently connected mobile devices & sensors are the norm, not an anomaly__.
+In such cases, the only remedy to guarantee correctness is to [reprocess the last few hours](https://falcon.apache.org/FalconDocumentation.html#Handling_late_input_data) worth of data,
+over and over again each hour, which can significantly hurt the efficiency across the entire ecosystem. For e.g; imagine reprocessing TBs worth of data every hour across hundreds of workflows.
+
+Hudi comes to the rescue again, by providing a way to consume new data (including late data) from an upsteam Hudi table `HU` at a record granularity (not folders/partitions),
+apply the processing logic, and efficiently update/reconcile late data with a downstream Hudi table `HD`. Here, `HU` and `HD` can be continuously scheduled at a much more frequent schedule
+like 15 mins, and providing an end-end latency of 30 mins at `HD`.
+
+To achieve this, Hudi has embraced similar concepts from stream processing frameworks like [Spark Streaming](https://spark.apache.org/docs/latest/streaming-programming-guide.html#join-operations) , Pub/Sub systems like [Kafka](http://kafka.apache.org/documentation/#theconsumer)
+or database replication technologies like [Oracle XStream](https://docs.oracle.com/cd/E11882_01/server.112/e16545/xstrm_cncpt.htm#XSTRM187).
+For the more curious, a more detailed explanation of the benefits of Incremental Processing (compared to Stream Processing & Batch Processing) can be found [here](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop)
+
+
+## Data Dispersal From DFS
+
+A popular use-case for Hadoop, is to crunch data and then disperse it back to an online serving store, to be used by an application.
+For e.g, a Spark Pipeline can [determine hard braking events on Hadoop](https://eng.uber.com/telematics/) and load them into a serving store like ElasticSearch, to be used by the Uber application to increase safe driving. Typical architectures for this employ a `queue` between Hadoop and serving store, to prevent overwhelming the target serving store.
+A popular choice for this queue is Kafka and this model often results in __redundant storage of same data on DFS (for offline analysis on computed results) and Kafka (for dispersal)__
+
+Once again Hudi can efficiently solve this problem, by having the Spark Pipeline upsert output from
+each run into a Hudi table, which can then be incrementally tailed (just like a Kafka topic) for new data & written into the serving store.
diff --git a/docs/_docs/0.5.2/1_4_powered_by.cn.md b/docs/_docs/0.5.2/1_4_powered_by.cn.md
new file mode 100644
index 0000000..65bf47e
--- /dev/null
+++ b/docs/_docs/0.5.2/1_4_powered_by.cn.md
@@ -0,0 +1,59 @@
+---
+version: 0.5.2
+title: 演讲 & Hudi 用户
+keywords: hudi, talks, presentation
+permalink: /cn/docs/0.5.2-powered_by.html
+last_modified_at: 2019-12-31T15:59:57-04:00
+language: cn
+---
+
+## 已使用
+
+### Uber
+
+Hudi最初由[Uber](https://uber.com)开发,用于实现[低延迟、高效率的数据库摄取](http://www.slideshare.net/vinothchandar/hadoop-strata-talk-uber-your-hadoop-has-arrived/32)。
+Hudi自2016年8月开始在生产环境上线,在Hadoop上驱动约100个非常关键的业务表,支撑约几百TB的数据规模(前10名包括行程、乘客、司机)。
+Hudi还支持几个增量的Hive ETL管道,并且目前已集成到Uber的数据分发系统中。
+
+### EMIS Health
+
+[EMIS Health](https://www.emishealth.com/)是英国最大的初级保健IT软件提供商,其数据集包括超过5000亿的医疗保健记录。HUDI用于管理生产中的分析数据集,并使其与上游源保持同步。Presto用于查询以HUDI格式写入的数据。
+
+### Yields.io
+
+Yields.io是第一个使用AI在企业范围内进行自动模型验证和实时监控的金融科技平台。他们的数据湖由Hudi管理,他们还积极使用Hudi为增量式、跨语言/平台机器学习构建基础架构。
+
+### Yotpo
+
+Hudi在Yotpo有不少用途。首先,在他们的[开源ETL框架](https://github.com/YotpoLtd/metorikku)中集成了Hudi作为CDC管道的输出写入程序,即从数据库binlog生成的事件流到Kafka然后再写入S3。
+
+## 演讲 & 报告
+
+1. ["Hoodie: Incremental processing on Hadoop at Uber"](https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/56511) -  By Vinoth Chandar & Prasanna Rajaperumal
+   Mar 2017, Strata + Hadoop World, San Jose, CA
+
+2. ["Hoodie: An Open Source Incremental Processing Framework From Uber"](http://www.dataengconf.com/hoodie-an-open-source-incremental-processing-framework-from-uber) - By Vinoth Chandar.
+   Apr 2017, DataEngConf, San Francisco, CA [Slides](https://www.slideshare.net/vinothchandar/hoodie-dataengconf-2017) [Video](https://www.youtube.com/watch?v=7Wudjc-v7CA)
+
+
+3. ["Incremental Processing on Large Analytical Datasets"](https://spark-summit.org/2017/events/incremental-processing-on-large-analytical-datasets/) - By Prasanna Rajaperumal
+   June 2017, Spark Summit 2017, San Francisco, CA. [Slides](https://www.slideshare.net/databricks/incremental-processing-on-large-analytical-datasets-with-prasanna-rajaperumal-and-vinoth-chandar) [Video](https://www.youtube.com/watch?v=3HS0lQX-cgo&feature=youtu.be)
+
+4. ["Hudi: Unifying storage and serving for batch and near-real-time analytics"](https://conferences.oreilly.com/strata/strata-ny/public/schedule/detail/70937) - By Nishith Agarwal & Balaji Vardarajan
+   September 2018, Strata Data Conference, New York, NY
+
+5. ["Hudi: Large-Scale, Near Real-Time Pipelines at Uber"](https://databricks.com/session/hudi-near-real-time-spark-pipelines-at-petabyte-scale) - By Vinoth Chandar & Nishith Agarwal
+   October 2018, Spark+AI Summit Europe, London, UK
+
+6. ["Powering Uber's global network analytics pipelines in real-time with Apache Hudi"](https://www.youtube.com/watch?v=1w3IpavhSWA) - By Ethan Guo & Nishith Agarwal, April 2019, Data Council SF19, San Francisco, CA.
+
+7. ["Building highly efficient data lakes using Apache Hudi (Incubating)"](https://www.slideshare.net/ChesterChen/sf-big-analytics-20190612-building-highly-efficient-data-lakes-using-apache-hudi) - By Vinoth Chandar 
+   June 2019, SF Big Analytics Meetup, San Mateo, CA
+   
+8. ["Apache Hudi (Incubating) - The Past, Present and Future Of Efficient Data Lake Architectures"](https://docs.google.com/presentation/d/1FHhsvh70ZP6xXlHdVsAI0g__B_6Mpto5KQFlZ0b8-mM) - By Vinoth Chandar & Balaji Varadarajan
+   September 2019, ApacheCon NA 19, Las Vegas, NV, USA
+
+## 文章
+
+1. ["The Case for incremental processing on Hadoop"](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop) - O'reilly Ideas article by Vinoth Chandar
+2. ["Hoodie: Uber Engineering's Incremental Processing Framework on Hadoop"](https://eng.uber.com/hoodie/) - Engineering Blog By Prasanna Rajaperumal
diff --git a/docs/_docs/0.5.2/1_4_powered_by.md b/docs/_docs/0.5.2/1_4_powered_by.md
new file mode 100644
index 0000000..4f6d94d
--- /dev/null
+++ b/docs/_docs/0.5.2/1_4_powered_by.md
@@ -0,0 +1,70 @@
+---
+version: 0.5.2
+title: "Talks & Powered By"
+keywords: hudi, talks, presentation
+permalink: /docs/0.5.2-powered_by.html
+last_modified_at: 2019-12-31T15:59:57-04:00
+---
+
+## Adoption
+
+### Uber
+
+Apache Hudi was originally developed at [Uber](https://uber.com), to achieve [low latency database ingestion, with high efficiency](http://www.slideshare.net/vinothchandar/hadoop-strata-talk-uber-your-hadoop-has-arrived/32).
+It has been in production since Aug 2016, powering the massive [100PB data lake](https://eng.uber.com/uber-big-data-platform/), including highly business critical tables like core trips,riders,partners. It also 
+powers several incremental Hive ETL pipelines and being currently integrated into Uber's data dispersal system.
+
+### Amazon Web Services
+Amazon Web Services is the World's leading cloud services provider. Apache Hudi is [pre-installed](https://aws.amazon.com/emr/features/hudi/) with the AWS Elastic Map Reduce 
+offering, providing means for AWS users to perform record-level updates/deletes and manage storage efficiently.
+
+### EMIS Health
+
+[EMIS Health](https://www.emishealth.com/) is the largest provider of Primary Care IT software in the UK with datasets including more than 500Bn healthcare records. HUDI is used to manage their analytics dataset in production and keeping them up-to-date with their upstream source. Presto is being used to query the data written in HUDI format.
+
+### Yields.io
+
+Yields.io is the first FinTech platform that uses AI for automated model validation and real-time monitoring on an enterprise-wide scale. Their data lake is managed by Hudi. They are also actively building their infrastructure for incremental, cross language/platform machine learning using Hudi.
+
+### Yotpo
+
+Using Hudi at Yotpo for several usages. Firstly, integrated Hudi as a writer in their open source ETL framework https://github.com/YotpoLtd/metorikku and using as an output writer for a CDC pipeline, with events that are being generated from a database binlog streams to Kafka and then are written to S3. 
+ 
+### Tathastu.ai
+
+[Tathastu.ai](https://www.tathastu.ai) offers the largest AI/ML playground of consumer data for data scientists, AI experts and technologists to build upon. They have built a CDC pipeline using Apache Hudi and Debezium. Data from Hudi datasets is being queried using Hive, Presto and Spark.
+
+## Talks & Presentations
+
+1. ["Hoodie: Incremental processing on Hadoop at Uber"](https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/56511) -  By Vinoth Chandar & Prasanna Rajaperumal
+   Mar 2017, Strata + Hadoop World, San Jose, CA
+
+2. ["Hoodie: An Open Source Incremental Processing Framework From Uber"](http://www.dataengconf.com/hoodie-an-open-source-incremental-processing-framework-from-uber) - By Vinoth Chandar.
+   Apr 2017, DataEngConf, San Francisco, CA [Slides](https://www.slideshare.net/vinothchandar/hoodie-dataengconf-2017) [Video](https://www.youtube.com/watch?v=7Wudjc-v7CA)
+
+3. ["Incremental Processing on Large Analytical Datasets"](https://spark-summit.org/2017/events/incremental-processing-on-large-analytical-datasets/) - By Prasanna Rajaperumal
+   June 2017, Spark Summit 2017, San Francisco, CA. [Slides](https://www.slideshare.net/databricks/incremental-processing-on-large-analytical-datasets-with-prasanna-rajaperumal-and-vinoth-chandar) [Video](https://www.youtube.com/watch?v=3HS0lQX-cgo&feature=youtu.be)
+
+4. ["Hudi: Unifying storage and serving for batch and near-real-time analytics"](https://conferences.oreilly.com/strata/strata-ny/public/schedule/detail/70937) - By Nishith Agarwal & Balaji Vardarajan
+   September 2018, Strata Data Conference, New York, NY
+
+5. ["Hudi: Large-Scale, Near Real-Time Pipelines at Uber"](https://databricks.com/session/hudi-near-real-time-spark-pipelines-at-petabyte-scale) - By Vinoth Chandar & Nishith Agarwal
+   October 2018, Spark+AI Summit Europe, London, UK
+
+6. ["Powering Uber's global network analytics pipelines in real-time with Apache Hudi"](https://www.youtube.com/watch?v=1w3IpavhSWA) - By Ethan Guo & Nishith Agarwal, April 2019, Data Council SF19, San Francisco, CA.
+
+7. ["Building highly efficient data lakes using Apache Hudi (Incubating)"](https://www.slideshare.net/ChesterChen/sf-big-analytics-20190612-building-highly-efficient-data-lakes-using-apache-hudi) - By Vinoth Chandar 
+   June 2019, SF Big Analytics Meetup, San Mateo, CA
+
+8. ["Apache Hudi (Incubating) - The Past, Present and Future Of Efficient Data Lake Architectures"](https://docs.google.com/presentation/d/1FHhsvh70ZP6xXlHdVsAI0g__B_6Mpto5KQFlZ0b8-mM) - By Vinoth Chandar & Balaji Varadarajan
+   September 2019, ApacheCon NA 19, Las Vegas, NV, USA
+  
+9. ["Insert, upsert, and delete data in Amazon S3 using Amazon EMR"](https://www.portal.reinvent.awsevents.com/connect/sessionDetail.ww?SESSION_ID=98662&csrftkn=YS67-AG7B-QIAV-ZZBK-E6TT-MD4Q-1HEP-747P) - By Paul Codding & Vinoth Chandar
+   December 2019, AWS re:Invent 2019, Las Vegas, NV, USA  
+       
+10. ["Building Robust CDC Pipeline With Apache Hudi And Debezium"](https://www.slideshare.net/SyedKather/building-robust-cdc-pipeline-with-apache-hudi-and-debezium) - By Pratyaksh, Purushotham, Syed and Shaik December 2019, Hadoop Summit Bangalore, India
+
+## Articles
+
+1. ["The Case for incremental processing on Hadoop"](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop) - O'reilly Ideas article by Vinoth Chandar
+2. ["Hoodie: Uber Engineering's Incremental Processing Framework on Hadoop"](https://eng.uber.com/hoodie/) - Engineering Blog By Prasanna Rajaperumal
diff --git a/docs/_docs/0.5.2/1_5_comparison.cn.md b/docs/_docs/0.5.2/1_5_comparison.cn.md
new file mode 100644
index 0000000..67ea8d1
--- /dev/null
+++ b/docs/_docs/0.5.2/1_5_comparison.cn.md
@@ -0,0 +1,50 @@
+---
+version: 0.5.2
+title: 对比
+keywords: apache, hudi, kafka, kudu, hive, hbase, stream processing
+permalink: /cn/docs/0.5.2-comparison.html
+last_modified_at: 2019-12-30T15:59:57-04:00
+language: cn
+---
+
+Apache Hudi填补了在DFS上处理数据的巨大空白,并可以和这些技术很好地共存。然而,
+通过将Hudi与一些相关系统进行对比,来了解Hudi如何适应当前的大数据生态系统,并知晓这些系统在设计中做的不同权衡仍将非常有用。
+
+## Kudu
+
+[Apache Kudu](https://kudu.apache.org)是一个与Hudi具有相似目标的存储系统,该系统通过对`upserts`支持来对PB级数据进行实时分析。
+一个关键的区别是Kudu还试图充当OLTP工作负载的数据存储,而Hudi并不希望这样做。
+因此,Kudu不支持增量拉取(截至2017年初),而Hudi支持以便进行增量处理。
+
+Kudu与分布式文件系统抽象和HDFS完全不同,它自己的一组存储服务器通过RAFT相互通信。
+与之不同的是,Hudi旨在与底层Hadoop兼容的文件系统(HDFS,S3或Ceph)一起使用,并且没有自己的存储服务器群,而是依靠Apache Spark来完成繁重的工作。
+因此,Hudi可以像其他Spark作业一样轻松扩展,而Kudu则需要硬件和运营支持,特别是HBase或Vertica等数据存储系统。
+到目前为止,我们还没有做任何直接的基准测试来比较Kudu和Hudi(鉴于RTTable正在进行中)。
+但是,如果我们要使用[CERN](https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines),
+我们预期Hudi在摄取parquet上有更卓越的性能。
+
+## Hive事务
+
+[Hive事务/ACID](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions)是另一项类似的工作,它试图实现在ORC文件格式之上的存储`读取时合并`。
+可以理解,此功能与Hive以及[LLAP](https://cwiki.apache.org/confluence/display/Hive/LLAP)之类的其他工作紧密相关。
+Hive事务不提供Hudi提供的读取优化存储选项或增量拉取。
+在实现选择方面,Hudi充分利用了类似Spark的处理框架的功能,而Hive事务特性则在用户或Hive Metastore启动的Hive任务/查询的下实现。
+根据我们的生产经验,与其他方法相比,将Hudi作为库嵌入到现有的Spark管道中要容易得多,并且操作不会太繁琐。
+Hudi还设计用于与Presto/Spark等非Hive引擎合作,并计划引入除parquet以外的文件格式。
+
+## HBase
+
+尽管[HBase](https://hbase.apache.org)最终是OLTP工作负载的键值存储层,但由于与Hadoop的相似性,用户通常倾向于将HBase与分析相关联。
+鉴于HBase经过严格的写优化,它支持开箱即用的亚秒级更新,Hive-on-HBase允许用户查询该数据。 但是,就分析工作负载的实际性能而言,Parquet/ORC之类的混合列式存储格式可以轻松击败HBase,因为这些工作负载主要是读取繁重的工作。
+Hudi弥补了更快的数据与分析存储格式之间的差距。从运营的角度来看,与管理分析使用的HBase region服务器集群相比,为用户提供可更快给出数据的库更具可扩展性。
+最终,HBase不像Hudi这样重点支持`提交时间`、`增量拉取`之类的增量处理原语。
+
+## 流式处理
+
+一个普遍的问题:"Hudi与流处理系统有何关系?",我们将在这里尝试回答。简而言之,Hudi可以与当今的批处理(`写时复制存储`)和流处理(`读时合并存储`)作业集成,以将计算结果存储在Hadoop中。
+对于Spark应用程序,这可以通过将Hudi库与Spark/Spark流式DAG直接集成来实现。在非Spark处理系统(例如Flink、Hive)情况下,可以在相应的系统中进行处理,然后通过Kafka主题/DFS中间文件将其发送到Hudi表中。从概念上讲,数据处理
+管道仅由三个部分组成:`输入`,`处理`,`输出`,用户最终针对输出运行查询以便使用管道的结果。Hudi可以充当将数据存储在DFS上的输入或输出。Hudi在给定流处理管道上的适用性最终归结为你的查询在Presto/SparkSQL/Hive的适用性。
+
+更高级的用例围绕[增量处理](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop)的概念展开,
+甚至在`处理`引擎内部也使用Hudi来加速典型的批处理管道。例如:Hudi可用作DAG内的状态存储(类似Flink使用的[rocksDB(https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html#the-rocksdbstatebackend))。
+这是路线图上的一个项目并将最终以[Beam Runner](https://issues.apache.org/jira/browse/HUDI-60)的形式呈现。
diff --git a/docs/_docs/0.5.2/1_5_comparison.md b/docs/_docs/0.5.2/1_5_comparison.md
new file mode 100644
index 0000000..81681d2
--- /dev/null
+++ b/docs/_docs/0.5.2/1_5_comparison.md
@@ -0,0 +1,58 @@
+---
+version: 0.5.2
+title: "Comparison"
+keywords: apache, hudi, kafka, kudu, hive, hbase, stream processing
+permalink: /docs/0.5.2-comparison.html
+last_modified_at: 2019-12-30T15:59:57-04:00
+---
+
+Apache Hudi fills a big void for processing data on top of DFS, and thus mostly co-exists nicely with these technologies. However,
+it would be useful to understand how Hudi fits into the current big data ecosystem, contrasting it with a few related systems
+and bring out the different tradeoffs these systems have accepted in their design.
+
+## Kudu
+
+[Apache Kudu](https://kudu.apache.org) is a storage system that has similar goals as Hudi, which is to bring real-time analytics on petabytes of data via first
+class support for `upserts`. A key differentiator is that Kudu also attempts to serve as a datastore for OLTP workloads, something that Hudi does not aspire to be.
+Consequently, Kudu does not support incremental pulling (as of early 2017), something Hudi does to enable incremental processing use cases.
+
+
+Kudu diverges from a distributed file system abstraction and HDFS altogether, with its own set of storage servers talking to each  other via RAFT.
+Hudi, on the other hand, is designed to work with an underlying Hadoop compatible filesystem (HDFS,S3 or Ceph) and does not have its own fleet of storage servers,
+instead relying on Apache Spark to do the heavy-lifting. Thu, Hudi can be scaled easily, just like other Spark jobs, while Kudu would require hardware
+& operational support, typical to datastores like HBase or Vertica. We have not at this point, done any head to head benchmarks against Kudu (given RTTable is WIP).
+But, if we were to go with results shared by [CERN](https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines) ,
+we expect Hudi to positioned at something that ingests parquet with superior performance.
+
+
+## Hive Transactions
+
+[Hive Transactions/ACID](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions) is another similar effort, which tries to implement storage like
+`merge-on-read`, on top of ORC file format. Understandably, this feature is heavily tied to Hive and other efforts like [LLAP](https://cwiki.apache.org/confluence/display/Hive/LLAP).
+Hive transactions does not offer the read-optimized storage option or the incremental pulling, that Hudi does. In terms of implementation choices, Hudi leverages
+the full power of a processing framework like Spark, while Hive transactions feature is implemented underneath by Hive tasks/queries kicked off by user or the Hive metastore.
+Based on our production experience, embedding Hudi as a library into existing Spark pipelines was much easier and less operationally heavy, compared with the other approach.
+Hudi is also designed to work with non-hive enginers like Presto/Spark and will incorporate file formats other than parquet over time.
+
+## HBase
+
+Even though [HBase](https://hbase.apache.org) is ultimately a key-value store for OLTP workloads, users often tend to associate HBase with analytics given the proximity to Hadoop.
+Given HBase is heavily write-optimized, it supports sub-second upserts out-of-box and Hive-on-HBase lets users query that data. However, in terms of actual performance for analytical workloads,
+hybrid columnar storage formats like Parquet/ORC handily beat HBase, since these workloads are predominantly read-heavy. Hudi bridges this gap between faster data and having
+analytical storage formats. From an operational perspective, arming users with a library that provides faster data, is more scalable, than managing a big farm of HBase region servers,
+just for analytics. Finally, HBase does not support incremental processing primitives like `commit times`, `incremental pull` as first class citizens like Hudi.
+
+## Stream Processing
+
+A popular question, we get is : "How does Hudi relate to stream processing systems?", which we will try to answer here. Simply put, Hudi can integrate with
+batch (`copy-on-write table`) and streaming (`merge-on-read table`) jobs of today, to store the computed results in Hadoop. For Spark apps, this can happen via direct
+integration of Hudi library with Spark/Spark streaming DAGs. In case of Non-Spark processing systems (eg: Flink, Hive), the processing can be done in the respective systems
+and later sent into a Hudi table via a Kafka topic/DFS intermediate file. In more conceptual level, data processing
+pipelines just consist of three components : `source`, `processing`, `sink`, with users ultimately running queries against the sink to use the results of the pipeline.
+Hudi can act as either a source or sink, that stores data on DFS. Applicability of Hudi to a given stream processing pipeline ultimately boils down to suitability
+of Presto/SparkSQL/Hive for your queries.
+
+More advanced use cases revolve around the concepts of [incremental processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop), which effectively
+uses Hudi even inside the `processing` engine to speed up typical batch pipelines. For e.g: Hudi can be used as a state store inside a processing DAG (similar
+to how [rocksDB](https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html#the-rocksdbstatebackend) is used by Flink). This is an item on the roadmap
+and will eventually happen as a [Beam Runner](https://issues.apache.org/jira/browse/HUDI-60)
diff --git a/docs/_docs/0.5.2/2_1_concepts.cn.md b/docs/_docs/0.5.2/2_1_concepts.cn.md
new file mode 100644
index 0000000..ce7a760
--- /dev/null
+++ b/docs/_docs/0.5.2/2_1_concepts.cn.md
@@ -0,0 +1,156 @@
+---
+version: 0.5.2
+title: 概念
+keywords: hudi, design, storage, views, timeline
+permalink: /cn/docs/0.5.2-concepts.html
+summary: "Here we introduce some basic concepts & give a broad technical overview of Hudi"
+toc: true
+last_modified_at: 2019-12-30T15:59:57-04:00
+language: cn
+---
+
+Apache Hudi(发音为“Hudi”)在DFS的数据集上提供以下流原语
+
+ * 插入更新           (如何改变数据集?)
+ * 增量拉取           (如何获取变更的数据?)
+
+在本节中,我们将讨论重要的概念和术语,这些概念和术语有助于理解并有效使用这些原语。
+
+## 时间轴
+在它的核心,Hudi维护一条包含在不同的`即时`时间所有对数据集操作的`时间轴`,从而提供,从不同时间点出发得到不同的视图下的数据集。Hudi即时包含以下组件
+
+ * `操作类型` : 对数据集执行的操作类型
+ * `即时时间` : 即时时间通常是一个时间戳(例如:20190117010349),该时间戳按操作开始时间的顺序单调增加。
+ * `状态` : 即时的状态
+
+Hudi保证在时间轴上执行的操作的原子性和基于即时时间的时间轴一致性。
+
+执行的关键操作包括
+
+ * `COMMITS` - 一次提交表示将一组记录**原子写入**到数据集中。
+ * `CLEANS` - 删除数据集中不再需要的旧文件版本的后台活动。
+ * `DELTA_COMMIT` - 增量提交是指将一批记录**原子写入**到MergeOnRead存储类型的数据集中,其中一些/所有数据都可以只写到增量日志中。
+ * `COMPACTION` - 协调Hudi中差异数据结构的后台活动,例如:将更新从基于行的日志文件变成列格式。在内部,压缩表现为时间轴上的特殊提交。
+ * `ROLLBACK` - 表示提交/增量提交不成功且已回滚,删除在写入过程中产生的所有部分文件。
+ * `SAVEPOINT` - 将某些文件组标记为"已保存",以便清理程序不会将其删除。在发生灾难/数据恢复的情况下,它有助于将数据集还原到时间轴上的某个点。
+
+任何给定的即时都可以处于以下状态之一
+
+ * `REQUESTED` - 表示已调度但尚未启动的操作。
+ * `INFLIGHT` - 表示当前正在执行该操作。
+ * `COMPLETED` - 表示在时间轴上完成了该操作。
+
+<figure>
+    <img class="docimage" src="/assets/images/hudi_timeline.png" alt="hudi_timeline.png" />
+</figure>
+
+上面的示例显示了在Hudi数据集上大约10:00到10:20之间发生的更新事件,大约每5分钟一次,将提交元数据以及其他后台清理/压缩保留在Hudi时间轴上。
+观察的关键点是:提交时间指示数据的`到达时间`(上午10:20),而实际数据组织则反映了实际时间或`事件时间`,即数据所反映的(从07:00开始的每小时时段)。在权衡数据延迟和完整性时,这是两个关键概念。
+
+如果有延迟到达的数据(事件时间为9:00的数据在10:20达到,延迟 >1 小时),我们可以看到upsert将新数据生成到更旧的时间段/文件夹中。
+在时间轴的帮助下,增量查询可以只提取10:00以后成功提交的新数据,并非常高效地只消费更改过的文件,且无需扫描更大的文件范围,例如07:00后的所有时间段。
+
+## 文件组织
+Hudi将DFS上的数据集组织到`基本路径`下的目录结构中。数据集分为多个分区,这些分区是包含该分区的数据文件的文件夹,这与Hive表非常相似。
+每个分区被相对于基本路径的特定`分区路径`区分开来。
+
+在每个分区内,文件被组织为`文件组`,由`文件id`唯一标识。
+每个文件组包含多个`文件切片`,其中每个切片包含在某个提交/压缩即时时间生成的基本列文件(`*.parquet`)以及一组日志文件(`*.log*`),该文件包含自生成基本文件以来对基本文件的插入/更新。
+Hudi采用MVCC设计,其中压缩操作将日志和基本文件合并以产生新的文件片,而清理操作则将未使用的/较旧的文件片删除以回收DFS上的空间。
+
+Hudi通过索引机制将给定的hoodie键(记录键+分区路径)映射到文件组,从而提供了高效的Upsert。
+一旦将记录的第一个版本写入文件,记录键和文件组/文件id之间的映射就永远不会改变。 简而言之,映射的文件组包含一组记录的所有版本。
+
+## 存储类型和视图
+Hudi存储类型定义了如何在DFS上对数据进行索引和布局以及如何在这种组织之上实现上述原语和时间轴活动(即如何写入数据)。
+反过来,`视图`定义了基础数据如何暴露给查询(即如何读取数据)。
+
+| 存储类型  | 支持的视图 |
+|-------------- |------------------|
+| 写时复制 | 读优化 + 增量   |
+| 读时合并 | 读优化 + 增量 + 近实时 |
+
+### 存储类型
+Hudi支持以下存储类型。
+
+  - [写时复制](#copy-on-write-storage) : 仅使用列文件格式(例如parquet)存储数据。通过在写入过程中执行同步合并以更新版本并重写文件。
+
+  - [读时合并](#merge-on-read-storage) : 使用列式(例如parquet)+ 基于行(例如avro)的文件格式组合来存储数据。 更新记录到增量文件中,然后进行同步或异步压缩以生成列文件的新版本。
+    
+下表总结了这两种存储类型之间的权衡
+
+| 权衡 | 写时复制 | 读时合并 |
+|-------------- |------------------| ------------------|
+| 数据延迟 | 更高   | 更低 |
+| 更新代价(I/O) | 更高(重写整个parquet文件) | 更低(追加到增量日志) |
+| Parquet文件大小 | 更小(高更新代价(I/o)) | 更大(低更新代价) |
+| 写放大 | 更高 | 更低(取决于压缩策略) |
+
+
+### 视图
+Hudi支持以下存储数据的视图
+
+ - **读优化视图** : 在此视图上的查询将查看给定提交或压缩操作中数据集的最新快照。
+    该视图仅将最新文件切片中的基本/列文件暴露给查询,并保证与非Hudi列式数据集相比,具有相同的列式查询性能。
+ - **增量视图** : 对该视图的查询只能看到从某个提交/压缩后写入数据集的新数据。该视图有效地提供了更改流,来支持增量数据管道。
+ - **实时视图** : 在此视图上的查询将查看某个增量提交操作中数据集的最新快照。该视图通过动态合并最新的基本文件(例如parquet)和增量文件(例如avro)来提供近实时数据集(几分钟的延迟)。
+
+
+下表总结了不同视图之间的权衡。
+
+| 权衡 | 读优化 | 实时 |
+|-------------- |------------------| ------------------|
+| 数据延迟 | 更高   | 更低 |
+| 查询延迟 | 更低(原始列式性能)| 更高(合并列式 + 基于行的增量) |
+
+
+## 写时复制存储 {#copy-on-write-storage}
+
+写时复制存储中的文件片仅包含基本/列文件,并且每次提交都会生成新版本的基本文件。
+换句话说,我们压缩每个提交,从而所有的数据都是以列数据的形式储存。在这种情况下,写入数据非常昂贵(我们需要重写整个列数据文件,即使只有一个字节的新数据被提交),而读取数据的成本则没有增加。
+这种视图有利于读取繁重的分析工作。
+
+以下内容说明了将数据写入写时复制存储并在其上运行两个查询时,它是如何工作的。
+
+<figure>
+    <img class="docimage" src="/assets/images/hudi_cow.png" alt="hudi_cow.png" />
+</figure>
+
+
+随着数据的写入,对现有文件组的更新将为该文件组生成一个带有提交即时时间标记的新切片,而插入分配一个新文件组并写入该文件组的第一个切片。
+这些文件切片及其提交即时时间在上面用颜色编码。
+针对这样的数据集运行SQL查询(例如:`select count(*)`统计该分区中的记录数目),首先检查时间轴上的最新提交并过滤每个文件组中除最新文件片以外的所有文件片。
+如您所见,旧查询不会看到以粉红色标记的当前进行中的提交的文件,但是在该提交后的新查询会获取新数据。因此,查询不受任何写入失败/部分写入的影响,仅运行在已提交数据上。
+
+写时复制存储的目的是从根本上改善当前管理数据集的方式,通过以下方法来实现
+
+  - 优先支持在文件级原子更新数据,而无需重写整个表/分区
+  - 能够只读取更新的部分,而不是进行低效的扫描或搜索
+  - 严格控制文件大小来保持出色的查询性能(小的文件会严重损害查询性能)。
+
+## 读时合并存储 {#merge-on-read-storage}
+
+读时合并存储是写时复制的升级版,从某种意义上说,它仍然可以通过读优化表提供数据集的读取优化视图(写时复制的功能)。
+此外,它将每个文件组的更新插入存储到基于行的增量日志中,通过文件id,将增量日志和最新版本的基本文件进行合并,从而提供近实时的数据查询。因此,此存储类型智能地平衡了读和写的成本,以提供近乎实时的查询。
+这里最重要的一点是压缩器,它现在可以仔细挑选需要压缩到其列式基础文件中的增量日志(根据增量日志的文件大小),以保持查询性能(较大的增量日志将会提升近实时的查询时间,并同时需要更长的合并时间)。
+
+以下内容说明了存储的工作方式,并显示了对近实时表和读优化表的查询。
+
+<figure>
+    <img class="docimage" src="/assets/images/hudi_mor.png" alt="hudi_mor.png" style="max-width: 100%" />
+</figure>
+
+此示例中发生了很多有趣的事情,这些带出了该方法的微妙之处。
+
+ - 现在,我们每1分钟左右就有一次提交,这是其他存储类型无法做到的。
+ - 现在,在每个文件id组中,都有一个增量日志,其中包含对基础列文件中记录的更新。
+ 在示例中,增量日志包含10:05至10:10的所有数据。与以前一样,基本列式文件仍使用提交进行版本控制。
+ 因此,如果只看一眼基本文件,那么存储布局看起来就像是写时复制表的副本。
+ - 定期压缩过程会从增量日志中合并这些更改,并生成基础文件的新版本,就像示例中10:05发生的情况一样。
+ - 有两种查询同一存储的方式:读优化(RO)表和近实时(RT)表,具体取决于我们选择查询性能还是数据新鲜度。
+ - 对于RO表来说,提交数据在何时可用于查询将有些许不同。 请注意,以10:10运行的(在RO表上的)此类查询将不会看到10:05之后的数据,而在RT表上的查询总会看到最新的数据。
+ - 何时触发压缩以及压缩什么是解决这些难题的关键。
+ 通过实施压缩策略,在该策略中,与较旧的分区相比,我们会积极地压缩最新的分区,从而确保RO表能够以一致的方式看到几分钟内发布的数据。
+
+读时合并存储上的目的是直接在DFS上启用近实时处理,而不是将数据复制到专用系统,后者可能无法处理大数据量。
+该存储还有一些其他方面的好处,例如通过避免数据的同步合并来减少写放大,即批量数据中每1字节数据需要的写入数据量。
diff --git a/docs/_docs/0.5.2/2_1_concepts.md b/docs/_docs/0.5.2/2_1_concepts.md
new file mode 100644
index 0000000..e37bd97
--- /dev/null
+++ b/docs/_docs/0.5.2/2_1_concepts.md
@@ -0,0 +1,173 @@
+---
+version: 0.5.2
+title: "Concepts"
+keywords: hudi, design, table, queries, timeline
+permalink: /docs/0.5.2-concepts.html
+summary: "Here we introduce some basic concepts & give a broad technical overview of Hudi"
+toc: true
+last_modified_at: 2019-12-30T15:59:57-04:00
+---
+
+Apache Hudi (pronounced “Hudi”) provides the following streaming primitives over hadoop compatible storages
+
+ * Update/Delete Records      (how do I change records in a table?)
+ * Change Streams             (how do I fetch records that changed?)
+
+In this section, we will discuss key concepts & terminologies that are important to understand, to be able to effectively use these primitives.
+
+## Timeline
+At its core, Hudi maintains a `timeline` of all actions performed on the table at different `instants` of time that helps provide instantaneous views of the table,
+while also efficiently supporting retrieval of data in the order of arrival. A Hudi instant consists of the following components 
+
+ * `Instant action` : Type of action performed on the table
+ * `Instant time` : Instant time is typically a timestamp (e.g: 20190117010349), which monotonically increases in the order of action's begin time.
+ * `state` : current state of the instant
+ 
+Hudi guarantees that the actions performed on the timeline are atomic & timeline consistent based on the instant time.
+
+Key actions performed include
+
+ * `COMMITS` - A commit denotes an **atomic write** of a batch of records into a table.
+ * `CLEANS` - Background activity that gets rid of older versions of files in the table, that are no longer needed.
+ * `DELTA_COMMIT` - A delta commit refers to an **atomic write** of a batch of records into a  MergeOnRead type table, where some/all of the data could be just written to delta logs.
+ * `COMPACTION` - Background activity to reconcile differential data structures within Hudi e.g: moving updates from row based log files to columnar formats. Internally, compaction manifests as a special commit on the timeline
+ * `ROLLBACK` - Indicates that a commit/delta commit was unsuccessful & rolled back, removing any partial files produced during such a write
+ * `SAVEPOINT` - Marks certain file groups as "saved", such that cleaner will not delete them. It helps restore the table to a point on the timeline, in case of disaster/data recovery scenarios.
+
+Any given instant can be 
+in one of the following states
+
+ * `REQUESTED` - Denotes an action has been scheduled, but has not initiated
+ * `INFLIGHT` - Denotes that the action is currently being performed
+ * `COMPLETED` - Denotes completion of an action on the timeline
+
+<figure>
+    <img class="docimage" src="/assets/images/hudi_timeline.png" alt="hudi_timeline.png" />
+</figure>
+
+Example above shows upserts happenings between 10:00 and 10:20 on a Hudi table, roughly every 5 mins, leaving commit metadata on the Hudi timeline, along
+with other background cleaning/compactions. One key observation to make is that the commit time indicates the `arrival time` of the data (10:20AM), while the actual data
+organization reflects the actual time or `event time`, the data was intended for (hourly buckets from 07:00). These are two key concepts when reasoning about tradeoffs between latency and completeness of data.
+
+When there is late arriving data (data intended for 9:00 arriving >1 hr late at 10:20), we can see the upsert producing new data into even older time buckets/folders.
+With the help of the timeline, an incremental query attempting to get all new data that was committed successfully since 10:00 hours, is able to very efficiently consume
+only the changed files without say scanning all the time buckets > 07:00.
+
+## File management
+Hudi organizes a table into a directory structure under a `basepath` on DFS. Table is broken up into partitions, which are folders containing data files for that partition,
+very similar to Hive tables. Each partition is uniquely identified by its `partitionpath`, which is relative to the basepath.
+
+Within each partition, files are organized into `file groups`, uniquely identified by a `file id`. Each file group contains several
+`file slices`, where each slice contains a base file (`*.parquet`) produced at a certain commit/compaction instant time,
+ along with set of log files (`*.log.*`) that contain inserts/updates to the base file since the base file was produced. 
+Hudi adopts a MVCC design, where compaction action merges logs and base files to produce new file slices and cleaning action gets rid of 
+unused/older file slices to reclaim space on DFS. 
+
+## Index
+Hudi provides efficient upserts, by mapping a given hoodie key (record key + partition path) consistently to a file id, via an indexing mechanism. 
+This mapping between record key and file group/file id, never changes once the first version of a record has been written to a file. In short, the 
+mapped file group contains all versions of a group of records.
+
+## Table Types & Queries
+Hudi table types define how data is indexed & laid out on the DFS and how the above primitives and timeline activities are implemented on top of such organization (i.e how data is written). 
+In turn, `query types` define how the underlying data is exposed to the queries (i.e how data is read). 
+
+| Table Type    | Supported Query types |
+|-------------- |------------------|
+| Copy On Write | Snapshot Queries + Incremental Queries  |
+| Merge On Read | Snapshot Queries + Incremental Queries + Read Optimized Queries |
+
+### Table Types
+Hudi supports the following table types.
+
+  - [Copy On Write](#copy-on-write-table) : Stores data using exclusively columnar file formats (e.g parquet). Updates simply version & rewrite the files by performing a synchronous merge during write.
+  - [Merge On Read](#merge-on-read-table) : Stores data using a combination of columnar (e.g parquet) + row based (e.g avro) file formats. Updates are logged to delta files & later compacted to produce new versions of columnar files synchronously or asynchronously.
+    
+Following table summarizes the trade-offs between these two table types
+
+| Trade-off     | CopyOnWrite      | MergeOnRead |
+|-------------- |------------------| ------------------|
+| Data Latency | Higher   | Lower |
+| Update cost (I/O) | Higher (rewrite entire parquet) | Lower (append to delta log) |
+| Parquet File Size | Smaller (high update(I/0) cost) | Larger (low update cost) |
+| Write Amplification | Higher | Lower (depending on compaction strategy) |
+
+
+### Query types
+Hudi supports the following query types
+
+ - **Snapshot Queries** : Queries see the latest snapshot of the table as of a given commit or compaction action. In case of merge on read table, it exposes near-real time data(few mins) by merging 
+    the base and delta files of the latest file slice on-the-fly. For copy on write table,  it provides a drop-in replacement for existing parquet tables, while providing upsert/delete and other write side features. 
+ - **Incremental Queries** : Queries only see new data written to the table, since a given commit/compaction. This effectively provides change streams to enable incremental data pipelines. 
+ - **Read Optimized Queries** : Queries see the latest snapshot of table as of a given commit/compaction action. Exposes only the base/columnar files in latest file slices and guarantees the 
+    same columnar query performance compared to a non-hudi columnar table.
+
+Following table summarizes the trade-offs between the different query types.
+
+| Trade-off     | Snapshot    | Read Optimized |
+|-------------- |-------------| ------------------|
+| Data Latency  | Lower | Higher
+| Query Latency | Higher (merge base / columnar file + row based delta / log files) | Lower (raw base / columnar file performance)
+
+
+## Copy On Write Table
+
+File slices in Copy-On-Write table only contain the base/columnar file and each commit produces new versions of base files. 
+In other words, we implicitly compact on every commit, such that only columnar data exists. As a result, the write amplification 
+(number of bytes written for 1 byte of incoming data) is much higher, where read amplification is zero. 
+This is a much desired property for analytical workloads, which is predominantly read-heavy.
+
+Following illustrates how this works conceptually, when data written into copy-on-write table  and two queries running on top of it.
+
+
+<figure>
+    <img class="docimage" src="/assets/images/hudi_cow.png" alt="hudi_cow.png" />
+</figure>
+
+
+As data gets written, updates to existing file groups produce a new slice for that file group stamped with the commit instant time, 
+while inserts allocate a new file group and write its first slice for that file group. These file slices and their commit instant times are color coded above.
+SQL queries running against such a table (eg: `select count(*)` counting the total records in that partition), first checks the timeline for the latest commit
+and filters all but latest file slices of each file group. As you can see, an old query does not see the current inflight commit's files color coded in pink,
+but a new query starting after the commit picks up the new data. Thus queries are immune to any write failures/partial writes and only run on committed data.
+
+The intention of copy on write table, is to fundamentally improve how tables are managed today through
+
+  - First class support for atomically updating data at file-level, instead of rewriting whole tables/partitions
+  - Ability to incremental consume changes, as opposed to wasteful scans or fumbling with heuristics
+  - Tight control of file sizes to keep query performance excellent (small files hurt query performance considerably).
+
+
+## Merge On Read Table
+
+Merge on read table is a superset of copy on write, in the sense it still supports read optimized queries of the table by exposing only the base/columnar files in latest file slices.
+Additionally, it stores incoming upserts for each file group, onto a row based delta log, to support snapshot queries by applying the delta log, 
+onto the latest version of each file id on-the-fly during query time. Thus, this table type attempts to balance read and write amplification intelligently, to provide near real-time data.
+The most significant change here, would be to the compactor, which now carefully chooses which delta log files need to be compacted onto
+their columnar base file, to keep the query performance in check (larger delta log files would incur longer merge times with merge data on query side)
+
+Following illustrates how the table works, and shows two types of queries - snapshot query and read optimized query.
+
+<figure>
+    <img class="docimage" src="/assets/images/hudi_mor.png" alt="hudi_mor.png" style="max-width: 100%" />
+</figure>
+
+There are lot of interesting things happening in this example, which bring out the subtleties in the approach.
+
+ - We now have commits every 1 minute or so, something we could not do in the other table type.
+ - Within each file id group, now there is an delta log file, which holds incoming updates to records in the base columnar files. In the example, the delta log files hold
+ all the data from 10:05 to 10:10. The base columnar files are still versioned with the commit, as before.
+ Thus, if one were to simply look at base files alone, then the table layout looks exactly like a copy on write table.
+ - A periodic compaction process reconciles these changes from the delta log and produces a new version of base file, just like what happened at 10:05 in the example.
+ - There are two ways of querying the same underlying table: Read Optimized query and Snapshot query, depending on whether we chose query performance or freshness of data.
+ - The semantics around when data from a commit is available to a query changes in a subtle way for a read optimized query. Note, that such a query
+ running at 10:10, wont see data after 10:05 above, while a snapshot query always sees the freshest data.
+ - When we trigger compaction & what it decides to compact hold all the key to solving these hard problems. By implementing a compacting
+ strategy, where we aggressively compact the latest partitions compared to older partitions, we could ensure the read optimized queries see data
+ published within X minutes in a consistent fashion.
+
+The intention of merge on read table is to enable near real-time processing directly on top of DFS, as opposed to copying
+data out to specialized systems, which may not be able to handle the data volume. There are also a few secondary side benefits to 
+this table such as reduced write amplification by avoiding synchronous merge of data, i.e, the amount of data written per 1 bytes of data in a batch
+
+
diff --git a/docs/_docs/0.5.2/2_2_writing_data.cn.md b/docs/_docs/0.5.2/2_2_writing_data.cn.md
new file mode 100644
index 0000000..acdb08c
--- /dev/null
+++ b/docs/_docs/0.5.2/2_2_writing_data.cn.md
@@ -0,0 +1,224 @@
+---
+version: 0.5.2
+title: 写入 Hudi 数据集
+keywords: hudi, incremental, batch, stream, processing, Hive, ETL, Spark SQL
+permalink: /cn/docs/0.5.2-writing_data.html
+summary: In this page, we will discuss some available tools for incrementally ingesting & storing data.
+toc: true
+last_modified_at: 2019-12-30T15:59:57-04:00
+language: cn
+---
+
+这一节我们将介绍使用[DeltaStreamer](#deltastreamer)工具从外部源甚至其他Hudi数据集摄取新更改的方法,
+以及通过使用[Hudi数据源](#datasource-writer)的upserts加快大型Spark作业的方法。
+对于此类数据集,我们可以使用各种查询引擎[查询](/cn/docs/0.5.2-querying_data.html)它们。
+
+## 写操作
+
+在此之前,了解Hudi数据源及delta streamer工具提供的三种不同的写操作以及如何最佳利用它们可能会有所帮助。
+这些操作可以在针对数据集发出的每个提交/增量提交中进行选择/更改。
+
+ - **UPSERT(插入更新)** :这是默认操作,在该操作中,通过查找索引,首先将输入记录标记为插入或更新。
+ 在运行启发式方法以确定如何最好地将这些记录放到存储上,如优化文件大小之类后,这些记录最终会被写入。
+ 对于诸如数据库更改捕获之类的用例,建议该操作,因为输入几乎肯定包含更新。
+ - **INSERT(插入)** :就使用启发式方法确定文件大小而言,此操作与插入更新(UPSERT)非常相似,但此操作完全跳过了索引查找步骤。
+ 因此,对于日志重复数据删除等用例(结合下面提到的过滤重复项的选项),它可以比插入更新快得多。
+ 插入也适用于这种用例,这种情况数据集可以允许重复项,但只需要Hudi的事务写/增量提取/存储管理功能。
+ - **BULK_INSERT(批插入)** :插入更新和插入操作都将输入记录保存在内存中,以加快存储优化启发式计算的速度(以及其它未提及的方面)。
+ 所以对Hudi数据集进行初始加载/引导时这两种操作会很低效。批量插入提供与插入相同的语义,但同时实现了基于排序的数据写入算法,
+ 该算法可以很好地扩展数百TB的初始负载。但是,相比于插入和插入更新能保证文件大小,批插入在调整文件大小上只能尽力而为。
+
+## DeltaStreamer
+
+`HoodieDeltaStreamer`实用工具 (hudi-utilities-bundle中的一部分) 提供了从DFS或Kafka等不同来源进行摄取的方式,并具有以下功能。
+
+ - 从Kafka单次摄取新事件,从Sqoop、HiveIncrementalPuller输出或DFS文件夹中的多个文件
+ [增量导入](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports)
+ - 支持json、avro或自定义记录类型的传入数据
+ - 管理检查点,回滚和恢复
+ - 利用DFS或Confluent [schema注册表](https://github.com/confluentinc/schema-registry)的Avro模式。
+ - 支持自定义转换操作
+
+命令行选项更详细地描述了这些功能:
+
+```java
+[hoodie]$ spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` --help
+Usage: <main class> [options]
+  Options:
+    --commit-on-errors
+        Commit even when some records failed to be written
+      Default: false
+    --enable-hive-sync
+          Enable syncing to hive
+       Default: false
+    --filter-dupes
+          Should duplicate records from source be dropped/filtered outbefore 
+          insert/bulk-insert 
+      Default: false
+    --help, -h
+    --hudi-conf
+          Any configuration that can be set in the properties file (using the CLI 
+          parameter "--propsFilePath") can also be passed command line using this 
+          parameter 
+          Default: []
+    --op
+      Takes one of these values : UPSERT (default), INSERT (use when input is
+      purely new data/inserts to gain speed)
+      Default: UPSERT
+      Possible Values: [UPSERT, INSERT, BULK_INSERT]
+    --payload-class
+      subclass of HoodieRecordPayload, that works off a GenericRecord.
+      Implement your own, if you want to do something other than overwriting
+      existing value
+      Default: org.apache.hudi.OverwriteWithLatestAvroPayload
+    --props
+      path to properties file on localfs or dfs, with configurations for
+      Hudi client, schema provider, key generator and data source. For
+      Hudi client props, sane defaults are used, but recommend use to
+      provide basic things like metrics endpoints, hive configs etc. For
+      sources, referto individual classes, for supported properties.
+      Default: file:///Users/vinoth/bin/hoodie/src/test/resources/delta-streamer-config/dfs-source.properties
+    --schemaprovider-class
+      subclass of org.apache.hudi.utilities.schema.SchemaProvider to attach
+      schemas to input & target table data, built in options:
+      FilebasedSchemaProvider
+      Default: org.apache.hudi.utilities.schema.FilebasedSchemaProvider
+    --source-class
+      Subclass of org.apache.hudi.utilities.sources to read data. Built-in
+      options: org.apache.hudi.utilities.sources.{JsonDFSSource (default),
+      AvroDFSSource, JsonKafkaSource, AvroKafkaSource, HiveIncrPullSource}
+      Default: org.apache.hudi.utilities.sources.JsonDFSSource
+    --source-limit
+      Maximum amount of data to read from source. Default: No limit For e.g:
+      DFSSource => max bytes to read, KafkaSource => max events to read
+      Default: 9223372036854775807
+    --source-ordering-field
+      Field within source record to decide how to break ties between records
+      with same key in input data. Default: 'ts' holding unix timestamp of
+      record
+      Default: ts
+    --spark-master
+      spark master to use.
+      Default: local[2]
+  * --target-base-path
+      base path for the target Hudi dataset. (Will be created if did not
+      exist first time around. If exists, expected to be a Hudi dataset)
+  * --target-table
+      name of the target table in Hive
+    --transformer-class
+      subclass of org.apache.hudi.utilities.transform.Transformer. UDF to
+      transform raw source dataset to a target dataset (conforming to target
+      schema) before writing. Default : Not set. E:g -
+      org.apache.hudi.utilities.transform.SqlQueryBasedTransformer (which
+      allows a SQL query template to be passed as a transformation function)
+```
+
+该工具采用层次结构组成的属性文件,并具有可插拔的接口,用于提取数据、生成密钥和提供模式。
+从Kafka和DFS摄取数据的示例配置在这里:`hudi-utilities/src/test/resources/delta-streamer-config`。
+
+例如:当您让Confluent Kafka、Schema注册表启动并运行后,可以用这个命令产生一些测试数据
+([impressions.avro](https://docs.confluent.io/current/ksql/docs/tutorials/generate-custom-test-data.html),
+由schema-registry代码库提供)
+
+```java
+[confluent-5.0.0]$ bin/ksql-datagen schema=../impressions.avro format=avro topic=impressions key=impressionid
+```
+
+然后用如下命令摄取这些数据。
+
+```java
+[hoodie]$ spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+  --props file://${PWD}/hudi-utilities/src/test/resources/delta-streamer-config/kafka-source.properties \
+  --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
+  --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
+  --source-ordering-field impresssiontime \
+  --target-base-path file:///tmp/hudi-deltastreamer-op --target-table uber.impressions \
+  --op BULK_INSERT
+```
+
+在某些情况下,您可能需要预先将现有数据集迁移到Hudi。 请参考[迁移指南](/cn/docs/0.5.2-migration_guide.html)。
+
+## Datasource Writer
+
+`hudi-spark`模块提供了DataSource API,可以将任何DataFrame写入(也可以读取)到Hudi数据集中。
+以下是在指定需要使用的字段名称的之后,如何插入更新DataFrame的方法,这些字段包括
+`recordKey => _row_key`、`partitionPath => partition`和`precombineKey => timestamp`
+
+```java
+inputDF.write()
+       .format("org.apache.hudi")
+       .options(clientOpts) // 可以传入任何Hudi客户端参数
+       .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
+       .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
+       .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
+       .option(HoodieWriteConfig.TABLE_NAME, tableName)
+       .mode(SaveMode.Append)
+       .save(basePath);
+```
+
+## 与Hive同步
+
+上面的两个工具都支持将数据集的最新模式同步到Hive Metastore,以便查询新的列和分区。
+如果需要从命令行或在独立的JVM中运行它,Hudi提供了一个`HiveSyncTool`,
+在构建了hudi-hive模块之后,可以按以下方式调用它。
+
+```java
+cd hudi-hive
+./run_sync_tool.sh
+ [hudi-hive]$ ./run_sync_tool.sh --help
+Usage: <main class> [options]
+  Options:
+  * --base-path
+       Basepath of Hudi dataset to sync
+  * --database
+       name of the target database in Hive
+    --help, -h
+       Default: false
+  * --jdbc-url
+       Hive jdbc connect url
+  * --pass
+       Hive password
+  * --table
+       name of the target table in Hive
+  * --user
+       Hive username
+```
+
+## 删除数据 
+
+通过允许用户指定不同的数据记录负载实现,Hudi支持对存储在Hudi数据集中的数据执行两种类型的删除。
+
+ - **Soft Deletes(软删除)** :使用软删除时,用户希望保留键,但仅使所有其他字段的值都为空。
+ 通过确保适当的字段在数据集模式中可以为空,并在将这些字段设置为null之后直接向数据集插入更新这些记录,即可轻松实现这一点。
+ - **Hard Deletes(硬删除)** :这种更强形式的删除是从数据集中彻底删除记录在存储上的任何痕迹。 
+ 这可以通过触发一个带有自定义负载实现的插入更新来实现,这种实现可以使用总是返回Optional.Empty作为组合值的DataSource或DeltaStreamer。 
+ Hudi附带了一个内置的`org.apache.hudi.EmptyHoodieRecordPayload`类,它就是实现了这一功能。
+ 
+```java
+ deleteDF // 仅包含要删除的记录的DataFrame
+   .write().format("org.apache.hudi")
+   .option(...) // 根据设置需要添加HUDI参数,例如记录键、分区路径和其他参数
+   // 指定record_key,partition_key,precombine_fieldkey和常规参数
+   .option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY, "org.apache.hudi.EmptyHoodieRecordPayload")
+ 
+```
+
+## 存储管理
+
+Hudi还对存储在Hudi数据集中的数据执行几个关键的存储管理功能。在DFS上存储数据的关键方面是管理文件大小和数量以及回收存储空间。 
+例如,HDFS在处理小文件上性能很差,这会对Name Node的内存及RPC施加很大的压力,并可能破坏整个集群的稳定性。
+通常,查询引擎可在较大的列文件上提供更好的性能,因为它们可以有效地摊销获得列统计信息等的成本。
+即使在某些云数据存储上,列出具有大量小文件的目录也常常比较慢。
+
+以下是一些有效管理Hudi数据集存储的方法。
+
+ - Hudi中的[小文件处理功能](/cn/docs/0.5.2-configurations.html#compactionSmallFileSize),可以分析传入的工作负载并将插入内容分配到现有文件组中,
+ 而不是创建新文件组。新文件组会生成小文件。
+ - 可以[配置](/cn/docs/0.5.2-configurations.html#retainCommits)Cleaner来清理较旧的文件片,清理的程度可以调整,
+ 具体取决于查询所需的最长时间和增量拉取所需的回溯。
+ - 用户还可以调整[基础/parquet文件](/cn/docs/0.5.2-configurations.html#limitFileSize)、[日志文件](/cn/docs/0.5.2-configurations.html#logFileMaxSize)的大小
+ 和预期的[压缩率](/cn/docs/0.5.2-configurations.html#parquetCompressionRatio),使足够数量的插入被分到同一个文件组中,最终产生大小合适的基础文件。
+ - 智能调整[批插入并行度](/cn/docs/0.5.2-configurations.html#withBulkInsertParallelism),可以产生大小合适的初始文件组。
+ 实际上,正确执行此操作非常关键,因为文件组一旦创建后就不能删除,只能如前所述对其进行扩展。
+ - 对于具有大量更新的工作负载,[读取时合并存储](/cn/docs/0.5.2-concepts.html#merge-on-read-storage)提供了一种很好的机制,
+ 可以快速将其摄取到较小的文件中,之后通过压缩将它们合并为较大的基础文件。
diff --git a/docs/_docs/0.5.2/2_2_writing_data.md b/docs/_docs/0.5.2/2_2_writing_data.md
new file mode 100644
index 0000000..3dc85d0
--- /dev/null
+++ b/docs/_docs/0.5.2/2_2_writing_data.md
@@ -0,0 +1,253 @@
+---
+version: 0.5.2
+title: Writing Hudi Tables
+keywords: hudi, incremental, batch, stream, processing, Hive, ETL, Spark SQL
+permalink: /docs/0.5.2-writing_data.html
+summary: In this page, we will discuss some available tools for incrementally ingesting & storing data.
+toc: true
+last_modified_at: 2019-12-30T15:59:57-04:00
+---
+
+In this section, we will cover ways to ingest new changes from external sources or even other Hudi tables using the [DeltaStreamer](#deltastreamer) tool, as well as 
+speeding up large Spark jobs via upserts using the [Hudi datasource](#datasource-writer). Such tables can then be [queried](/docs/0.5.2-querying_data.html) using various query engines.
+
+
+## Write Operations
+
+Before that, it may be helpful to understand the 3 different write operations provided by Hudi datasource or the delta streamer tool and how best to leverage them. These operations
+can be chosen/changed across each commit/deltacommit issued against the table.
+
+
+ - **UPSERT** : This is the default operation where the input records are first tagged as inserts or updates by looking up the index and 
+ the records are ultimately written after heuristics are run to determine how best to pack them on storage to optimize for things like file sizing. 
+ This operation is recommended for use-cases like database change capture where the input almost certainly contains updates.
+ - **INSERT** : This operation is very similar to upsert in terms of heuristics/file sizing but completely skips the index lookup step. Thus, it can be a lot faster than upserts 
+ for use-cases like log de-duplication (in conjunction with options to filter duplicates mentioned below). This is also suitable for use-cases where the table can tolerate duplicates, but just 
+ need the transactional writes/incremental pull/storage management capabilities of Hudi.
+ - **BULK_INSERT** : Both upsert and insert operations keep input records in memory to speed up storage heuristics computations faster (among other things) and thus can be cumbersome for 
+ initial loading/bootstrapping a Hudi table at first. Bulk insert provides the same semantics as insert, while implementing a sort-based data writing algorithm, which can scale very well for several hundred TBs 
+ of initial load. However, this just does a best-effort job at sizing files vs guaranteeing file sizes like inserts/upserts do. 
+
+
+## DeltaStreamer
+
+The `HoodieDeltaStreamer` utility (part of hudi-utilities-bundle) provides the way to ingest from different sources such as DFS or Kafka, with the following capabilities.
+
+ - Exactly once ingestion of new events from Kafka, [incremental imports](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports) from Sqoop or output of `HiveIncrementalPuller` or files under a DFS folder
+ - Support json, avro or a custom record types for the incoming data
+ - Manage checkpoints, rollback & recovery 
+ - Leverage Avro schemas from DFS or Confluent [schema registry](https://github.com/confluentinc/schema-registry).
+ - Support for plugging in transformations
+
+Command line options describe capabilities in more detail
+
+```java
+[hoodie]$ spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` --help
+Usage: <main class> [options]
+Options:
+    --checkpoint
+      Resume Delta Streamer from this checkpoint.
+    --commit-on-errors
+      Commit even when some records failed to be written
+      Default: false
+    --compact-scheduling-minshare
+      Minshare for compaction as defined in
+      https://spark.apache.org/docs/latest/job-scheduling.html
+      Default: 0
+    --compact-scheduling-weight
+      Scheduling weight for compaction as defined in
+      https://spark.apache.org/docs/latest/job-scheduling.html
+      Default: 1
+    --continuous
+      Delta Streamer runs in continuous mode running source-fetch -> Transform
+      -> Hudi Write in loop
+      Default: false
+    --delta-sync-scheduling-minshare
+      Minshare for delta sync as defined in
+      https://spark.apache.org/docs/latest/job-scheduling.html
+      Default: 0
+    --delta-sync-scheduling-weight
+      Scheduling weight for delta sync as defined in
+      https://spark.apache.org/docs/latest/job-scheduling.html
+      Default: 1
+    --disable-compaction
+      Compaction is enabled for MoR table by default. This flag disables it
+      Default: false
+    --enable-hive-sync
+      Enable syncing to hive
+      Default: false
+    --filter-dupes
+      Should duplicate records from source be dropped/filtered out before
+      insert/bulk-insert
+      Default: false
+    --help, -h
+
+    --hoodie-conf
+      Any configuration that can be set in the properties file (using the CLI
+      parameter "--propsFilePath") can also be passed command line using this
+      parameter
+      Default: []
+    --max-pending-compactions
+      Maximum number of outstanding inflight/requested compactions. Delta Sync
+      will not happen unlessoutstanding compactions is less than this number
+      Default: 5
+    --min-sync-interval-seconds
+      the min sync interval of each sync in continuous mode
+      Default: 0
+    --op
+      Takes one of these values : UPSERT (default), INSERT (use when input is
+      purely new data/inserts to gain speed)
+      Default: UPSERT
+      Possible Values: [UPSERT, INSERT, BULK_INSERT]
+    --payload-class
+      subclass of HoodieRecordPayload, that works off a GenericRecord.
+      Implement your own, if you want to do something other than overwriting
+      existing value
+      Default: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
+    --props
+      path to properties file on localfs or dfs, with configurations for
+      hoodie client, schema provider, key generator and data source. For
+      hoodie client props, sane defaults are used, but recommend use to
+      provide basic things like metrics endpoints, hive configs etc. For
+      sources, referto individual classes, for supported properties.
+      Default: file:///Users/vinoth/bin/hoodie/src/test/resources/delta-streamer-config/dfs-source.properties
+    --schemaprovider-class
+      subclass of org.apache.hudi.utilities.schema.SchemaProvider to attach
+      schemas to input & target table data, built in options:
+      org.apache.hudi.utilities.schema.FilebasedSchemaProvider.Source (See
+      org.apache.hudi.utilities.sources.Source) implementation can implement
+      their own SchemaProvider. For Sources that return Dataset<Row>, the
+      schema is obtained implicitly. However, this CLI option allows
+      overriding the schemaprovider returned by Source.
+    --source-class
+      Subclass of org.apache.hudi.utilities.sources to read data. Built-in
+      options: org.apache.hudi.utilities.sources.{JsonDFSSource (default),
+      AvroDFSSource, JsonKafkaSource, AvroKafkaSource, HiveIncrPullSource}
+      Default: org.apache.hudi.utilities.sources.JsonDFSSource
+    --source-limit
+      Maximum amount of data to read from source. Default: No limit For e.g:
+      DFS-Source => max bytes to read, Kafka-Source => max events to read
+      Default: 9223372036854775807
+    --source-ordering-field
+      Field within source record to decide how to break ties between records
+      with same key in input data. Default: 'ts' holding unix timestamp of
+      record
+      Default: ts
+    --spark-master
+      spark master to use.
+      Default: local[2]
+  * --table-type
+      Type of table. COPY_ON_WRITE (or) MERGE_ON_READ
+  * --target-base-path
+      base path for the target hoodie table. (Will be created if did not exist
+      first time around. If exists, expected to be a hoodie table)
+  * --target-table
+      name of the target table in Hive
+    --transformer-class
+      subclass of org.apache.hudi.utilities.transform.Transformer. Allows
+      transforming raw source Dataset to a target Dataset (conforming to
+      target schema) before writing. Default : Not set. E:g -
+      org.apache.hudi.utilities.transform.SqlQueryBasedTransformer (which
+      allows a SQL query templated to be passed as a transformation function)
+```
+
+The tool takes a hierarchically composed property file and has pluggable interfaces for extracting data, key generation and providing schema. Sample configs for ingesting from kafka and dfs are
+provided under `hudi-utilities/src/test/resources/delta-streamer-config`.
+
+For e.g: once you have Confluent Kafka, Schema registry up & running, produce some test data using ([impressions.avro](https://docs.confluent.io/current/ksql/docs/tutorials/generate-custom-test-data.html) provided by schema-registry repo)
+
+```java
+[confluent-5.0.0]$ bin/ksql-datagen schema=../impressions.avro format=avro topic=impressions key=impressionid
+```
+
+and then ingest it as follows.
+
+```java
+[hoodie]$ spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+  --props file://${PWD}/hudi-utilities/src/test/resources/delta-streamer-config/kafka-source.properties \
+  --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
+  --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
+  --source-ordering-field impresssiontime \
+  --target-base-path file:\/\/\/tmp/hudi-deltastreamer-op \ 
+  --target-table uber.impressions \
+  --op BULK_INSERT
+```
+
+In some cases, you may want to migrate your existing table into Hudi beforehand. Please refer to [migration guide](/docs/0.5.2-migration_guide.html). 
+
+## Datasource Writer
+
+The `hudi-spark` module offers the DataSource API to write (and also read) any data frame into a Hudi table.
+Following is how we can upsert a dataframe, while specifying the field names that need to be used
+for `recordKey => _row_key`, `partitionPath => partition` and `precombineKey => timestamp`
+
+
+```java
+inputDF.write()
+       .format("org.apache.hudi")
+       .options(clientOpts) // any of the Hudi client opts can be passed in as well
+       .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
+       .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
+       .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
+       .option(HoodieWriteConfig.TABLE_NAME, tableName)
+       .mode(SaveMode.Append)
+       .save(basePath);
+```
+
+## Syncing to Hive
+
+Both tools above support syncing of the table's latest schema to Hive metastore, such that queries can pick up new columns and partitions.
+In case, its preferable to run this from commandline or in an independent jvm, Hudi provides a `HiveSyncTool`, which can be invoked as below, 
+once you have built the hudi-hive module. Following is how we sync the above Datasource Writer written table to Hive metastore.
+
+```java
+cd hudi-hive
+./run_sync_tool.sh  --jdbc-url jdbc:hive2:\/\/hiveserver:10000 --user hive --pass hive --partitioned-by partition --base-path <basePath> --database default --table <tableName>
+```
+
+Starting with Hudi 0.5.1 version read optimized version of merge-on-read tables are suffixed '_ro' by default. For backwards compatibility with older Hudi versions, 
+an optional HiveSyncConfig - `--skip-ro-suffix`, has been provided to turn off '_ro' suffixing if desired. Explore other hive sync options using the following command:
+
+```java
+cd hudi-hive
+./run_sync_tool.sh
+ [hudi-hive]$ ./run_sync_tool.sh --help
+```
+
+## Deletes 
+
+Hudi supports implementing two types of deletes on data stored in Hudi tables, by enabling the user to specify a different record payload implementation. 
+For more info refer to [Delete support in Hudi](https://cwiki.apache.org/confluence/x/6IqvC).
+
+ - **Soft Deletes** : With soft deletes, user wants to retain the key but just null out the values for all other fields. 
+ This can be simply achieved by ensuring the appropriate fields are nullable in the table schema and simply upserting the table after setting these fields to null.
+ - **Hard Deletes** : A stronger form of delete is to physically remove any trace of the record from the table. This can be achieved by issuing an upsert with a custom payload implementation
+ via either DataSource or DeltaStreamer which always returns Optional.Empty as the combined value. Hudi ships with a built-in `org.apache.hudi.EmptyHoodieRecordPayload` class that does exactly this.
+ 
+```java
+ deleteDF // dataframe containing just records to be deleted
+   .write().format("org.apache.hudi")
+   .option(...) // Add HUDI options like record-key, partition-path and others as needed for your setup
+   // specify record_key, partition_key, precombine_fieldkey & usual params
+   .option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY, "org.apache.hudi.EmptyHoodieRecordPayload")
+ 
+```
+
+
+## Optimized DFS Access
+
+Hudi also performs several key storage management functions on the data stored in a Hudi table. A key aspect of storing data on DFS is managing file sizes and counts
+and reclaiming storage space. For e.g HDFS is infamous for its handling of small files, which exerts memory/RPC pressure on the Name Node and can potentially destabilize
+the entire cluster. In general, query engines provide much better performance on adequately sized columnar files, since they can effectively amortize cost of obtaining 
+column statistics etc. Even on some cloud data stores, there is often cost to listing directories with large number of small files.
+
+Here are some ways to efficiently manage the storage of your Hudi tables.
+
+ - The [small file handling feature](/docs/0.5.2-configurations.html#compactionSmallFileSize) in Hudi, profiles incoming workload 
+   and distributes inserts to existing file groups instead of creating new file groups, which can lead to small files. 
+ - Cleaner can be [configured](/docs/0.5.2-configurations.html#retainCommits) to clean up older file slices, more or less aggressively depending on maximum time for queries to run & lookback needed for incremental pull
+ - User can also tune the size of the [base/parquet file](/docs/0.5.2-configurations.html#limitFileSize), [log files](/docs/0.5.2-configurations.html#logFileMaxSize) & expected [compression ratio](/docs/0.5.2-configurations.html#parquetCompressionRatio), 
+   such that sufficient number of inserts are grouped into the same file group, resulting in well sized base files ultimately.
+ - Intelligently tuning the [bulk insert parallelism](/docs/0.5.2-configurations.html#withBulkInsertParallelism), can again in nicely sized initial file groups. It is in fact critical to get this right, since the file groups
+   once created cannot be deleted, but simply expanded as explained before.
+ - For workloads with heavy updates, the [merge-on-read table](/docs/0.5.2-concepts.html#merge-on-read-table) provides a nice mechanism for ingesting quickly into smaller files and then later merging them into larger base files via compaction.
diff --git a/docs/_docs/0.5.2/2_3_querying_data.cn.md b/docs/_docs/0.5.2/2_3_querying_data.cn.md
new file mode 100644
index 0000000..74afcef
--- /dev/null
+++ b/docs/_docs/0.5.2/2_3_querying_data.cn.md
@@ -0,0 +1,178 @@
+---
+version: 0.5.2
+title: 查询 Hudi 数据集
+keywords: hudi, hive, spark, sql, presto
+permalink: /cn/docs/0.5.2-querying_data.html
+summary: In this page, we go over how to enable SQL queries on Hudi built tables.
+toc: true
+last_modified_at: 2019-12-30T15:59:57-04:00
+language: cn
+---
+
+从概念上讲,Hudi物理存储一次数据到DFS上,同时在其上提供三个逻辑视图,如[之前](/cn/docs/0.5.2-concepts.html#views)所述。
+数据集同步到Hive Metastore后,它将提供由Hudi的自定义输入格式支持的Hive外部表。一旦提供了适当的Hudi捆绑包,
+就可以通过Hive、Spark和Presto之类的常用查询引擎来查询数据集。
+
+具体来说,在写入过程中传递了两个由[table name](/cn/docs/0.5.2-configurations.html#TABLE_NAME_OPT_KEY)命名的Hive表。
+例如,如果`table name = hudi_tbl`,我们得到
+
+ - `hudi_tbl` 实现了由 `HoodieParquetInputFormat` 支持的数据集的读优化视图,从而提供了纯列式数据。
+ - `hudi_tbl_rt` 实现了由 `HoodieParquetRealtimeInputFormat` 支持的数据集的实时视图,从而提供了基础数据和日志数据的合并视图。
+
+如概念部分所述,[增量处理](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop)所需要的
+一个关键原语是`增量拉取`(以从数据集中获取更改流/日志)。您可以增量提取Hudi数据集,这意味着自指定的即时时间起,
+您可以只获得全部更新和新行。 这与插入更新一起使用,对于构建某些数据管道尤其有用,包括将1个或多个源Hudi表(数据流/事实)以增量方式拉出(流/事实)
+并与其他表(数据集/维度)结合以[写出增量](/cn/docs/0.5.2-writing_data.html)到目标Hudi数据集。增量视图是通过查询上表之一实现的,并具有特殊配置,
+该特殊配置指示查询计划仅需要从数据集中获取增量数据。
+
+接下来,我们将详细讨论在每个查询引擎上如何访问所有三个视图。
+
+## Hive
+
+为了使Hive能够识别Hudi数据集并正确查询,
+HiveServer2需要在其[辅助jars路径](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr)中提供`hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar`。 
+这将确保输入格式类及其依赖项可用于查询计划和执行。
+
+### 读优化表 {#hive-ro-view}
+除了上述设置之外,对于beeline cli访问,还需要将`hive.input.format`变量设置为`org.apache.hudi.hadoop.HoodieParquetInputFormat`输入格式的完全限定路径名。
+对于Tez,还需要将`hive.tez.input.format`设置为`org.apache.hadoop.hive.ql.io.HiveInputFormat`。
+
+### 实时表 {#hive-rt-view}
+除了在HiveServer2上安装Hive捆绑jars之外,还需要将其放在整个集群的hadoop/hive安装中,这样查询也可以使用自定义RecordReader。
+
+### 增量拉取 {#hive-incr-pull}
+
+`HiveIncrementalPuller`允许通过HiveQL从大型事实/维表中增量提取更改,
+结合了Hive(可靠地处理复杂的SQL查询)和增量原语的好处(通过增量拉取而不是完全扫描来加快查询速度)。
+该工具使用Hive JDBC运行hive查询并将其结果保存在临时表中,这个表可以被插入更新。
+Upsert实用程序(`HoodieDeltaStreamer`)具有目录结构所需的所有状态,以了解目标表上的提交时间应为多少。
+例如:`/app/incremental-hql/intermediate/{source_table_name}_temp/{last_commit_included}`。
+已注册的Delta Hive表的格式为`{tmpdb}.{source_table}_{last_commit_included}`。
+
+以下是HiveIncrementalPuller的配置选项
+
+| **配置** | **描述** | **默认值** |
+|-------|--------|--------|
+|hiveUrl| 要连接的Hive Server 2的URL |  |
+|hiveUser| Hive Server 2 用户名 |  |
+|hivePass| Hive Server 2 密码 |  |
+|queue| YARN 队列名称 |  |
+|tmp| DFS中存储临时增量数据的目录。目录结构将遵循约定。请参阅以下部分。  |  |
+|extractSQLFile| 在源表上要执行的提取数据的SQL。提取的数据将是自特定时间点以来已更改的所有行。 |  |
+|sourceTable| 源表名称。在Hive环境属性中需要设置。 |  |
+|targetTable| 目标表名称。中间存储目录结构需要。  |  |
+|sourceDataPath| 源DFS基本路径。这是读取Hudi元数据的地方。 |  |
+|targetDataPath| 目标DFS基本路径。 这是计算fromCommitTime所必需的。 如果显式指定了fromCommitTime,则不需要设置这个参数。 |  |
+|tmpdb| 用来创建中间临时增量表的数据库 | hoodie_temp |
+|fromCommitTime| 这是最重要的参数。 这是从中提取更改的记录的时间点。 |  |
+|maxCommits| 要包含在拉取中的提交数。将此设置为-1将包括从fromCommitTime开始的所有提交。将此设置为大于0的值,将包括在fromCommitTime之后仅更改指定提交次数的记录。如果您需要一次赶上两次提交,则可能需要这样做。| 3 |
+|help| 实用程序帮助 |  |
+
+
+设置fromCommitTime=0和maxCommits=-1将提取整个源数据集,可用于启动Backfill。
+如果目标数据集是Hudi数据集,则该实用程序可以确定目标数据集是否没有提交或延迟超过24小时(这是可配置的),
+它将自动使用Backfill配置,因为增量应用最近24小时的更改会比Backfill花费更多的时间。
+该工具当前的局限性在于缺乏在混合模式(正常模式和增量模式)下自联接同一表的支持。
+
+**关于使用Fetch任务执行的Hive查询的说明:**
+由于Fetch任务为每个分区调用InputFormat.listStatus(),每个listStatus()调用都会列出Hoodie元数据。
+为了避免这种情况,如下操作可能是有用的,即使用Hive session属性对增量查询禁用Fetch任务:
+`set hive.fetch.task.conversion = none;`。这将确保Hive查询使用Map Reduce执行,
+合并分区(用逗号分隔),并且对所有这些分区仅调用一次InputFormat.listStatus()。
+
+## Spark
+
+Spark可将Hudi jars和捆绑包轻松部署和管理到作业/笔记本中。简而言之,通过Spark有两种方法可以访问Hudi数据集。
+
+ - **Hudi DataSource**:支持读取优化和增量拉取,类似于标准数据源(例如:`spark.read.parquet`)的工作方式。
+ - **以Hive表读取**:支持所有三个视图,包括实时视图,依赖于自定义的Hudi输入格式(再次类似Hive)。
+ 
+通常,您的spark作业需要依赖`hudi-spark`或`hudi-spark-bundle-x.y.z.jar`,
+它们必须位于驱动程序和执行程序的类路径上(提示:使用`--jars`参数)。
+ 
+### 读优化表 {#spark-ro-view}
+
+要使用SparkSQL将RO表读取为Hive表,只需按如下所示将路径过滤器推入sparkContext。
+对于Hudi表,该方法保留了Spark内置的读取Parquet文件的优化功能,例如进行矢量化读取。
+
+```scala
+spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter], classOf[org.apache.hadoop.fs.PathFilter]);
+```
+
+如果您希望通过数据源在DFS上使用全局路径,则只需执行以下类似操作即可得到Spark DataFrame。
+
+```scala
+Dataset<Row> hoodieROViewDF = spark.read().format("org.apache.hudi")
+// pass any path glob, can include hudi & non-hudi datasets
+.load("/glob/path/pattern");
+```
+ 
+### 实时表 {#spark-rt-view}
+当前,实时表只能在Spark中作为Hive表进行查询。为了做到这一点,设置`spark.sql.hive.convertMetastoreParquet = false`,
+迫使Spark回退到使用Hive Serde读取数据(计划/执行仍然是Spark)。
+
+```scala
+$ spark-shell --jars hudi-spark-bundle-x.y.z-SNAPSHOT.jar --driver-class-path /etc/hive/conf  --packages com.databricks:spark-avro_2.11:4.0.0 --conf spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory 7g --executor-memory 2g  --master yarn-client
+
+scala> sqlContext.sql("select count(*) from hudi_rt where datestr = '2016-10-02'").show()
+```
+
+### 增量拉取 {#spark-incr-pull}
+`hudi-spark`模块提供了DataSource API,这是一种从Hudi数据集中提取数据并通过Spark处理数据的更优雅的方法。
+如下所示是一个示例增量拉取,它将获取自`beginInstantTime`以来写入的所有记录。
+
+```java
+ Dataset<Row> hoodieIncViewDF = spark.read()
+     .format("org.apache.hudi")
+     .option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY(),
+             DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL())
+     .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(),
+            <beginInstantTime>)
+     .load(tablePath); // For incremental view, pass in the root/base path of dataset
+```
+
+请参阅[设置](/cn/docs/0.5.2-configurations.html#spark-datasource)部分,以查看所有数据源选项。
+
+另外,`HoodieReadClient`通过Hudi的隐式索引提供了以下功能。
+
+| **API** | **描述** |
+|-------|--------|
+| read(keys) | 使用Hudi自己的索通过快速查找将与键对应的数据作为DataFrame读出 |
+| filterExists() | 从提供的RDD[HoodieRecord]中过滤出已经存在的记录。对删除重复数据有用 |
+| checkExists(keys) | 检查提供的键是否存在于Hudi数据集中 |
+
+
+## Presto
+
+Presto是一种常用的查询引擎,可提供交互式查询性能。 Hudi RO表可以在Presto中无缝查询。
+这需要在整个安装过程中将`hudi-presto-bundle` jar放入`<presto_install>/plugin/hive-hadoop2/`中。
+
+## Impala(此功能还未正式发布)
+
+### 读优化表
+
+Impala可以在HDFS上查询Hudi读优化表,作为一种 [EXTERNAL TABLE](https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_tables.html#external_tables) 的形式。  
+可以通过以下方式在Impala上建立Hudi读优化表:
+```
+CREATE EXTERNAL TABLE database.table_name
+LIKE PARQUET '/path/to/load/xxx.parquet'
+STORED AS HUDIPARQUET
+LOCATION '/path/to/load';
+```
+Impala可以利用合理的文件分区来提高查询的效率。
+如果想要建立分区的表,文件夹命名需要根据此种方式`year=2020/month=1`.
+Impala使用`=`来区分分区名和分区值.  
+可以通过以下方式在Impala上建立分区Hudi读优化表:
+```
+CREATE EXTERNAL TABLE database.table_name
+LIKE PARQUET '/path/to/load/xxx.parquet'
+PARTITION BY (year int, month int, day int)
+STORED AS HUDIPARQUET
+LOCATION '/path/to/load';
+ALTER TABLE database.table_name RECOVER PARTITIONS;
+```
+在Hudi成功写入一个新的提交后, 刷新Impala表来得到最新的结果.
+```
+REFRESH database.table_name
+```
+
diff --git a/docs/_docs/0.5.2/2_3_querying_data.md b/docs/_docs/0.5.2/2_3_querying_data.md
new file mode 100644
index 0000000..242c84a
--- /dev/null
+++ b/docs/_docs/0.5.2/2_3_querying_data.md
@@ -0,0 +1,201 @@
+---
+version: 0.5.2
+title: Querying Hudi Tables
+keywords: hudi, hive, spark, sql, presto
+permalink: /docs/0.5.2-querying_data.html
+summary: In this page, we go over how to enable SQL queries on Hudi built tables.
+toc: true
+last_modified_at: 2019-12-30T15:59:57-04:00
+---
+
+Conceptually, Hudi stores data physically once on DFS, while providing 3 different ways of querying, as explained [before](/docs/0.5.2-concepts.html#query-types). 
+Once the table is synced to the Hive metastore, it provides external Hive tables backed by Hudi's custom inputformats. Once the proper hudi
+bundle has been installed, the table can be queried by popular query engines like Hive, Spark SQL, Spark Datasource API and Presto.
+
+Specifically, following Hive tables are registered based off [table name](/docs/0.5.2-configurations.html#TABLE_NAME_OPT_KEY) 
+and [table type](/docs/0.5.2-configurations.html#TABLE_TYPE_OPT_KEY) configs passed during write.   
+
+If `table name = hudi_trips` and `table type = COPY_ON_WRITE`, then we get: 
+ - `hudi_trips` supports snapshot query and incremental query on the table backed by `HoodieParquetInputFormat`, exposing purely columnar data.
+
+
+If `table name = hudi_trips` and `table type = MERGE_ON_READ`, then we get:
+ - `hudi_trips_rt` supports snapshot query and incremental query (providing near-real time data) on the table  backed by `HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
+ - `hudi_trips_ro` supports read optimized query on the table backed by `HoodieParquetInputFormat`, exposing purely columnar data stored in base files.
+
+As discussed in the concepts section, the one key capability needed for [incrementally processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
+is obtaining a change stream/log from a table. Hudi tables can be queried incrementally, which means you can get ALL and ONLY the updated & new rows 
+since a specified instant time. This, together with upserts, is particularly useful for building data pipelines where 1 or more source Hudi tables are incrementally queried (streams/facts),
+joined with other tables (tables/dimensions), to [write out deltas](/docs/0.5.2-writing_data.html) to a target Hudi table. Incremental queries are realized by querying one of the tables above, 
+with special configurations that indicates to query planning that only incremental data needs to be fetched out of the table. 
+
+
+## Support Matrix
+
+Following tables show whether a given query is supported on specific query engine.
+
+### Copy-On-Write tables
+  
+|Query Engine|Snapshot Queries|Incremental Queries|
+|------------|--------|-----------|
+|**Hive**|Y|Y|
+|**Spark SQL**|Y|Y|
+|**Spark Datasource**|Y|Y|
+|**Presto**|Y|N|
+|**Impala**|Y|N|
+
+
+Note that `Read Optimized` queries are not applicable for COPY_ON_WRITE tables.
+
+### Merge-On-Read tables
+
+|Query Engine|Snapshot Queries|Incremental Queries|Read Optimized Queries|
+|------------|--------|-----------|--------------|
+|**Hive**|Y|Y|Y|
+|**Spark SQL**|Y|Y|Y|
+|**Spark Datasource**|N|N|Y|
+|**Presto**|N|N|Y|
+|**Impala**|N|N|N|
+
+
+In sections, below we will discuss specific setup to access different query types from different query engines. 
+
+## Hive
+
+In order for Hive to recognize Hudi tables and query correctly, 
+ - the HiveServer2 needs to be provided with the `hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar` in its [aux jars path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr). This will ensure the input format 
+classes with its dependencies are available for query planning & execution. 
+ - For MERGE_ON_READ tables, additionally the bundle needs to be put on the hadoop/hive installation across the cluster, so that queries can pick up the custom RecordReader as well.
+
+In addition to setup above, for beeline cli access, the `hive.input.format` variable needs to be set to the fully qualified path name of the 
+inputformat `org.apache.hudi.hadoop.HoodieParquetInputFormat`. For Tez, additionally the `hive.tez.input.format` needs to be set 
+to `org.apache.hadoop.hive.ql.io.HiveInputFormat`. Then proceed to query the table like any other Hive table.
+
+### Incremental query
+`HiveIncrementalPuller` allows incrementally extracting changes from large fact/dimension tables via HiveQL, combining the benefits of Hive (reliably process complex SQL queries) and 
+incremental primitives (speed up querying tables incrementally instead of scanning fully). The tool uses Hive JDBC to run the hive query and saves its results in a temp table.
+that can later be upserted. Upsert utility (`HoodieDeltaStreamer`) has all the state it needs from the directory structure to know what should be the commit time on the target table.
+e.g: `/app/incremental-hql/intermediate/{source_table_name}_temp/{last_commit_included}`.The Delta Hive table registered will be of the form `{tmpdb}.{source_table}_{last_commit_included}`.
+
+The following are the configuration options for HiveIncrementalPuller
+
+| **Config** | **Description** | **Default** |
+|-------|--------|--------|
+|hiveUrl| Hive Server 2 URL to connect to |  |
+|hiveUser| Hive Server 2 Username |  |
+|hivePass| Hive Server 2 Password |  |
+|queue| YARN Queue name |  |
+|tmp| Directory where the temporary delta data is stored in DFS. The directory structure will follow conventions. Please see the below section.  |  |
+|extractSQLFile| The SQL to execute on the source table to extract the data. The data extracted will be all the rows that changed since a particular point in time. |  |
+|sourceTable| Source Table Name. Needed to set hive environment properties. |  |
+|sourceDb| Source DB name. Needed to set hive environment properties.| |
+|targetTable| Target Table Name. Needed for the intermediate storage directory structure.  |  |
+|targetDb| Target table's DB name.| |
+|tmpdb| The database to which the intermediate temp delta table will be created | hoodie_temp |
+|fromCommitTime| This is the most important parameter. This is the point in time from which the changed records are queried from.  |  |
+|maxCommits| Number of commits to include in the query. Setting this to -1 will include all the commits from fromCommitTime. Setting this to a value > 0, will include records that ONLY changed in the specified number of commits after fromCommitTime. This may be needed if you need to catch up say 2 commits at a time. | 3 |
+|help| Utility Help |  |
+
+
+Setting fromCommitTime=0 and maxCommits=-1 will fetch the entire source table and can be used to initiate backfills. If the target table is a Hudi table,
+then the utility can determine if the target table has no commits or is behind more than 24 hour (this is configurable),
+it will automatically use the backfill configuration, since applying the last 24 hours incrementally could take more time than doing a backfill. The current limitation of the tool
+is the lack of support for self-joining the same table in mixed mode (snapshot and incremental modes).
+
+**NOTE on Hive incremental queries that are executed using Fetch task:**
+Since Fetch tasks invoke InputFormat.listStatus() per partition, Hoodie metadata can be listed in
+every such listStatus() call. In order to avoid this, it might be useful to disable fetch tasks
+using the hive session property for incremental queries: `set hive.fetch.task.conversion=none;` This
+would ensure Map Reduce execution is chosen for a Hive query, which combines partitions (comma
+separated) and calls InputFormat.listStatus() only once with all those partitions.
+
+## Spark SQL
+Once the Hudi tables have been registered to the Hive metastore, it can be queried using the Spark-Hive integration. It supports all query types across both Hudi table types, 
+relying on the custom Hudi input formats again like Hive. Typically notebook users and spark-shell users leverage spark sql for querying Hudi tables. Please add hudi-spark-bundle as described above via --jars or --packages.
+ 
+By default, Spark SQL will try to use its own parquet reader instead of Hive SerDe when reading from Hive metastore parquet tables. However, for MERGE_ON_READ tables which has 
+both parquet and avro data, this default setting needs to be turned off using set `spark.sql.hive.convertMetastoreParquet=false`. 
+This will force Spark to fallback to using the Hive Serde to read the data (planning/executions is still Spark). 
+
+```java
+$ spark-shell --driver-class-path /etc/hive/conf  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 --conf spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory 7g --executor-memory 2g  --master yarn-client
+
+scala> sqlContext.sql("select count(*) from hudi_trips_mor_rt where datestr = '2016-10-02'").show()
+scala> sqlContext.sql("select count(*) from hudi_trips_mor_rt where datestr = '2016-10-02'").show()
+```
+
+For COPY_ON_WRITE tables, either Hive SerDe can be used by turning off `spark.sql.hive.convertMetastoreParquet=false` as described above or Spark's built in support can be leveraged. 
+If using spark's built in support, additionally a path filter needs to be pushed into sparkContext as follows. This method retains Spark built-in optimizations for reading parquet files like vectorized reading on Hudi Hive tables.
+
+```scala
+spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter], classOf[org.apache.hadoop.fs.PathFilter]);
+```
+
+## Spark Datasource
+
+The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how standard 
+datasources work (e.g: `spark.read.parquet`). Both snapshot querying and incremental querying are supported here. Typically spark jobs require adding `--jars <path to jar>/hudi-spark-bundle_2.11-<hudi version>.jar` to classpath of drivers 
+and executors. Alternatively, hudi-spark-bundle can also fetched via the `--packages` options (e.g: `--packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`).
+
+
+### Incremental query {#spark-incr-query}
+Of special interest to spark pipelines, is Hudi's ability to support incremental queries, like below. A sample incremental query, that will obtain all records written since `beginInstantTime`, looks like below.
+Thanks to Hudi's support for record level change streams, these incremental pipelines often offer 10x efficiency over batch counterparts, by only processing the changed records.
+The following snippet shows how to obtain all records changed after `beginInstantTime` and run some SQL on them.
+
+```java
+ Dataset<Row> hudiIncQueryDF = spark.read()
+     .format("org.apache.hudi")
+     .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY(), DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL())
+     .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(), <beginInstantTime>)
+     .load(tablePath); // For incremental query, pass in the root/base path of table
+     
+hudiIncQueryDF.createOrReplaceTempView("hudi_trips_incremental")
+spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from  hudi_trips_incremental where fare > 20.0").show()
+```
+
+For examples, refer to [Setup spark-shell in quickstart](/docs/0.5.2-quick-start-guide.html#setup-spark-shell). 
+Please refer to [configurations](/docs/0.5.2-configurations.html#spark-datasource) section, to view all datasource options.
+
+Additionally, `HoodieReadClient` offers the following functionality using Hudi's implicit indexing.
+
+| **API** | **Description** |
+|-------|--------|
+| read(keys) | Read out the data corresponding to the keys as a DataFrame, using Hudi's own index for faster lookup |
+| filterExists() | Filter out already existing records from the provided `RDD[HoodieRecord]`. Useful for de-duplication |
+| checkExists(keys) | Check if the provided keys exist in a Hudi table |
+
+## Presto
+
+Presto is a popular query engine, providing interactive query performance. Presto currently supports snapshot queries on COPY_ON_WRITE and read optimized queries 
+on MERGE_ON_READ Hudi tables. This requires the `hudi-presto-bundle` jar to be placed into `<presto_install>/plugin/hive-hadoop2/`, across the installation.
+
+## Impala (Not Officially Released)
+
+### Snapshot Query
+
+Impala is able to query Hudi Copy-on-write table as an [EXTERNAL TABLE](https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_tables.html#external_tables) on HDFS.  
+
+To create a Hudi read optimized table on Impala:
+```
+CREATE EXTERNAL TABLE database.table_name
+LIKE PARQUET '/path/to/load/xxx.parquet'
+STORED AS HUDIPARQUET
+LOCATION '/path/to/load';
+```
+Impala is able to take advantage of the physical partition structure to improve the query performance.
+To create a partitioned table, the folder should follow the naming convention like `year=2020/month=1`.
+Impala use `=` to separate partition name and partition value.  
+To create a partitioned Hudi read optimized table on Impala:
+```
+CREATE EXTERNAL TABLE database.table_name
+LIKE PARQUET '/path/to/load/xxx.parquet'
+PARTITION BY (year int, month int, day int)
+STORED AS HUDIPARQUET
+LOCATION '/path/to/load';
+ALTER TABLE database.table_name RECOVER PARTITIONS;
+```
+After Hudi made a new commit, refresh the Impala table to get the latest results.
+```
+REFRESH database.table_name
+```
diff --git a/docs/_docs/0.5.2/2_4_configurations.cn.md b/docs/_docs/0.5.2/2_4_configurations.cn.md
new file mode 100644
index 0000000..67f29a7
--- /dev/null
+++ b/docs/_docs/0.5.2/2_4_configurations.cn.md
@@ -0,0 +1,474 @@
+---
+version: 0.5.2
+title: 配置
+keywords: garbage collection, hudi, jvm, configs, tuning
+permalink: /cn/docs/0.5.2-configurations.html
+summary: "Here we list all possible configurations and what they mean"
+toc: true
+last_modified_at: 2019-12-30T15:59:57-04:00
+language: cn
+---
+
+该页面介绍了几种配置写入或读取Hudi数据集的作业的方法。
+简而言之,您可以在几个级别上控制行为。
+
+- **[Spark数据源配置](#spark-datasource)** : 这些配置控制Hudi Spark数据源,提供如下功能:
+   定义键和分区、选择写操作、指定如何合并记录或选择要读取的视图类型。
+- **[WriteClient 配置](#writeclient-configs)** : 在内部,Hudi数据源使用基于RDD的`HoodieWriteClient` API
+   真正执行对存储的写入。 这些配置可对文件大小、压缩(compression)、并行度、压缩(compaction)、写入模式、清理等底层方面进行完全控制。
+   尽管Hudi提供了合理的默认设置,但在不同情形下,可能需要对这些配置进行调整以针对特定的工作负载进行优化。
+- **[RecordPayload 配置](#PAYLOAD_CLASS_OPT_KEY)** : 这是Hudi提供的最底层的定制。
+   RecordPayload定义了如何根据传入的新记录和存储的旧记录来产生新值以进行插入更新。
+   Hudi提供了诸如`OverwriteWithLatestAvroPayload`的默认实现,该实现仅使用最新或最后写入的记录来更新存储。
+   在数据源和WriteClient级别,都可以将其重写为扩展`HoodieRecordPayload`类的自定义类。
+ 
+## 与云存储连接
+
+无论使用RDD/WriteClient API还是数据源,以下信息都有助于配置对云存储的访问。
+
+ * [AWS S3](/cn/docs/0.5.2-s3_hoodie.html) <br/>
+   S3和Hudi协同工作所需的配置。
+ * [Google Cloud Storage](/cn/docs/0.5.2-gcs_hoodie.html) <br/>
+   GCS和Hudi协同工作所需的配置。
+
+## Spark数据源配置 {#spark-datasource}
+
+可以通过将以下选项传递到`option(k,v)`方法中来配置使用数据源的Spark作业。
+实际的数据源级别配置在下面列出。
+
+### 写选项
+
+另外,您可以使用`options()`或`option(k,v)`方法直接传递任何WriteClient级别的配置。
+
+```java
+inputDF.write()
+.format("org.apache.hudi")
+.options(clientOpts) // 任何Hudi客户端选项都可以传入
+.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
+.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
+.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
+.option(HoodieWriteConfig.TABLE_NAME, tableName)
+.mode(SaveMode.Append)
+.save(basePath);
+```
+
+用于通过`write.format.option(...)`写入数据集的选项
+
+#### TABLE_NAME_OPT_KEY {#TABLE_NAME_OPT_KEY}
+  属性:`hoodie.datasource.write.table.name` [必须]<br/>
+  <span style="color:grey">Hive表名,用于将数据集注册到其中。</span>
+  
+#### OPERATION_OPT_KEY {#OPERATION_OPT_KEY}
+  属性:`hoodie.datasource.write.operation`, 默认值:`upsert`<br/>
+  <span style="color:grey">是否为写操作进行插入更新、插入或批量插入。使用`bulkinsert`将新数据加载到表中,之后使用`upsert`或`insert`。
+  批量插入使用基于磁盘的写入路径来扩展以加载大量输入,而无需对其进行缓存。</span>
+  
+#### STORAGE_TYPE_OPT_KEY {#STORAGE_TYPE_OPT_KEY}
+  属性:`hoodie.datasource.write.storage.type`, 默认值:`COPY_ON_WRITE` <br/>
+  <span style="color:grey">此写入的基础数据的存储类型。两次写入之间不能改变。</span>
+  
+#### PRECOMBINE_FIELD_OPT_KEY {#PRECOMBINE_FIELD_OPT_KEY}
+  属性:`hoodie.datasource.write.precombine.field`, 默认值:`ts` <br/>
+  <span style="color:grey">实际写入之前在preCombining中使用的字段。
+  当两个记录具有相同的键值时,我们将使用Object.compareTo(..)从precombine字段中选择一个值最大的记录。</span>
+
+#### PAYLOAD_CLASS_OPT_KEY {#PAYLOAD_CLASS_OPT_KEY}
+  属性:`hoodie.datasource.write.payload.class`, 默认值:`org.apache.hudi.OverwriteWithLatestAvroPayload` <br/>
+  <span style="color:grey">使用的有效载荷类。如果您想在插入更新或插入时使用自己的合并逻辑,请重写此方法。
+  这将使得`PRECOMBINE_FIELD_OPT_VAL`设置的任何值无效</span>
+  
+#### RECORDKEY_FIELD_OPT_KEY {#RECORDKEY_FIELD_OPT_KEY}
+  属性:`hoodie.datasource.write.recordkey.field`, 默认值:`uuid` <br/>
+  <span style="color:grey">记录键字段。用作`HoodieKey`中`recordKey`部分的值。
+  实际值将通过在字段值上调用.toString()来获得。可以使用点符号指定嵌套字段,例如:`a.b.c`</span>
+
+#### PARTITIONPATH_FIELD_OPT_KEY {#PARTITIONPATH_FIELD_OPT_KEY}
+  属性:`hoodie.datasource.write.partitionpath.field`, 默认值:`partitionpath` <br/>
+  <span style="color:grey">分区路径字段。用作`HoodieKey`中`partitionPath`部分的值。
+  通过调用.toString()获得实际的值</span>
+
+#### KEYGENERATOR_CLASS_OPT_KEY {#KEYGENERATOR_CLASS_OPT_KEY}
+  属性:`hoodie.datasource.write.keygenerator.class`, 默认值:`org.apache.hudi.SimpleKeyGenerator` <br/>
+  <span style="color:grey">键生成器类,实现从输入的`Row`对象中提取键</span>
+  
+#### COMMIT_METADATA_KEYPREFIX_OPT_KEY {#COMMIT_METADATA_KEYPREFIX_OPT_KEY}
+  属性:`hoodie.datasource.write.commitmeta.key.prefix`, 默认值:`_` <br/>
+  <span style="color:grey">以该前缀开头的选项键会自动添加到提交/增量提交的元数据中。
+  这对于与hudi时间轴一致的方式存储检查点信息很有用</span>
+
+#### INSERT_DROP_DUPS_OPT_KEY {#INSERT_DROP_DUPS_OPT_KEY}
+  属性:`hoodie.datasource.write.insert.drop.duplicates`, 默认值:`false` <br/>
+  <span style="color:grey">如果设置为true,则在插入操作期间从传入DataFrame中过滤掉所有重复记录。</span>
+  
+#### HIVE_SYNC_ENABLED_OPT_KEY {#HIVE_SYNC_ENABLED_OPT_KEY}
+  属性:`hoodie.datasource.hive_sync.enable`, 默认值:`false` <br/>
+  <span style="color:grey">设置为true时,将数据集注册并同步到Apache Hive Metastore</span>
+  
+#### HIVE_DATABASE_OPT_KEY {#HIVE_DATABASE_OPT_KEY}
+  属性:`hoodie.datasource.hive_sync.database`, 默认值:`default` <br/>
+  <span style="color:grey">要同步到的数据库</span>
+  
+#### HIVE_TABLE_OPT_KEY {#HIVE_TABLE_OPT_KEY}
+  属性:`hoodie.datasource.hive_sync.table`, [Required] <br/>
+  <span style="color:grey">要同步到的表</span>
+  
+#### HIVE_USER_OPT_KEY {#HIVE_USER_OPT_KEY}
+  属性:`hoodie.datasource.hive_sync.username`, 默认值:`hive` <br/>
+  <span style="color:grey">要使用的Hive用户名</span>
+  
+#### HIVE_PASS_OPT_KEY {#HIVE_PASS_OPT_KEY}
+  属性:`hoodie.datasource.hive_sync.password`, 默认值:`hive` <br/>
+  <span style="color:grey">要使用的Hive密码</span>
+  
+#### HIVE_URL_OPT_KEY {#HIVE_URL_OPT_KEY}
+  属性:`hoodie.datasource.hive_sync.jdbcurl`, 默认值:`jdbc:hive2://localhost:10000` <br/>
+  <span style="color:grey">Hive metastore url</span>
+  
+#### HIVE_PARTITION_FIELDS_OPT_KEY {#HIVE_PARTITION_FIELDS_OPT_KEY}
+  属性:`hoodie.datasource.hive_sync.partition_fields`, 默认值:` ` <br/>
+  <span style="color:grey">数据集中用于确定Hive分区的字段。</span>
+  
+#### HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY {#HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY}
+  属性:`hoodie.datasource.hive_sync.partition_extractor_class`, 默认值:`org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor` <br/>
+  <span style="color:grey">用于将分区字段值提取到Hive分区列中的类。</span>
+  
+#### HIVE_ASSUME_DATE_PARTITION_OPT_KEY {#HIVE_ASSUME_DATE_PARTITION_OPT_KEY}
+  属性:`hoodie.datasource.hive_sync.assume_date_partitioning`, 默认值:`false` <br/>
+  <span style="color:grey">假设分区格式是yyyy/mm/dd</span>
+
+### 读选项
+
+用于通过`read.format.option(...)`读取数据集的选项
+
+#### VIEW_TYPE_OPT_KEY {#VIEW_TYPE_OPT_KEY}
+属性:`hoodie.datasource.view.type`, 默认值:`read_optimized` <br/>
+<span style="color:grey">是否需要以某种模式读取数据,增量模式(自InstantTime以来的新数据)
+(或)读优化模式(基于列数据获取最新视图)
+(或)实时模式(基于行和列数据获取最新视图)</span>
+
+#### BEGIN_INSTANTTIME_OPT_KEY {#BEGIN_INSTANTTIME_OPT_KEY} 
+属性:`hoodie.datasource.read.begin.instanttime`, [在增量模式下必须] <br/>
+<span style="color:grey">开始增量提取数据的即时时间。这里的instanttime不必一定与时间轴上的即时相对应。
+取出以`instant_time > BEGIN_INSTANTTIME`写入的新数据。
+例如:'20170901080000'将获取2017年9月1日08:00 AM之后写入的所有新数据。</span>
+ 
+#### END_INSTANTTIME_OPT_KEY {#END_INSTANTTIME_OPT_KEY}
+属性:`hoodie.datasource.read.end.instanttime`, 默认值:最新即时(即从开始即时获取所有新数据) <br/>
+<span style="color:grey">限制增量提取的数据的即时时间。取出以`instant_time <= END_INSTANTTIME`写入的新数据。</span>
+
+
+## WriteClient 配置 {#writeclient-configs}
+
+直接使用RDD级别api进行编程的Jobs可以构建一个`HoodieWriteConfig`对象,并将其传递给`HoodieWriteClient`构造函数。
+HoodieWriteConfig可以使用以下构建器模式构建。
+
+```java
+HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
+        .withPath(basePath)
+        .forTable(tableName)
+        .withSchema(schemaStr)
+        .withProps(props) // 从属性文件传递原始k、v对。
+        .withCompactionConfig(HoodieCompactionConfig.newBuilder().withXXX(...).build())
+        .withIndexConfig(HoodieIndexConfig.newBuilder().withXXX(...).build())
+        ...
+        .build();
+```
+
+以下各节介绍了写配置的不同方面,并解释了最重要的配置及其属性名称和默认值。
+
+#### withPath(hoodie_base_path) {#withPath}
+属性:`hoodie.base.path` [必须] <br/>
+<span style="color:grey">创建所有数据分区所依据的基本DFS路径。
+始终在前缀中明确指明存储方式(例如hdfs://,s3://等)。
+Hudi将有关提交、保存点、清理审核日志等的所有主要元数据存储在基本目录下的.hoodie目录中。</span>
+
+#### withSchema(schema_str) {#withSchema} 
+属性:`hoodie.avro.schema` [必须]<br/>
+<span style="color:grey">这是数据集的当前读取器的avro模式(schema)。
+这是整个模式的字符串。HoodieWriteClient使用此模式传递到HoodieRecordPayload的实现,以从源格式转换为avro记录。
+在更新过程中重写记录时也使用此模式。</span>
+
+#### forTable(table_name) {#forTable} 
+属性:`hoodie.table.name` [必须] <br/>
+ <span style="color:grey">数据集的表名,将用于在Hive中注册。每次运行需要相同。</span>
+
+#### withBulkInsertParallelism(bulk_insert_parallelism = 1500) {#withBulkInsertParallelism} 
+属性:`hoodie.bulkinsert.shuffle.parallelism`<br/>
+<span style="color:grey">批量插入旨在用于较大的初始导入,而此处的并行度决定了数据集中文件的初始数量。
+调整此值以达到在初始导入期间所需的最佳尺寸。</span>
+
+#### withParallelism(insert_shuffle_parallelism = 1500, upsert_shuffle_parallelism = 1500) {#withParallelism} 
+属性:`hoodie.insert.shuffle.parallelism`, `hoodie.upsert.shuffle.parallelism`<br/>
+<span style="color:grey">最初导入数据后,此并行度将控制用于读取输入记录的初始并行度。
+确保此值足够高,例如:1个分区用于1 GB的输入数据</span>
+
+#### combineInput(on_insert = false, on_update=true) {#combineInput} 
+属性:`hoodie.combine.before.insert`, `hoodie.combine.before.upsert`<br/>
+<span style="color:grey">在DFS中插入或更新之前先组合输入RDD并将多个部分记录合并为单个记录的标志</span>
+
+#### withWriteStatusStorageLevel(level = MEMORY_AND_DISK_SER) {#withWriteStatusStorageLevel} 
+属性:`hoodie.write.status.storage.level`<br/>
+<span style="color:grey">HoodieWriteClient.insert和HoodieWriteClient.upsert返回一个持久的RDD[WriteStatus],
+这是因为客户端可以选择检查WriteStatus并根据失败选择是否提交。这是此RDD的存储级别的配置</span>
+
+#### withAutoCommit(autoCommit = true) {#withAutoCommit} 
+属性:`hoodie.auto.commit`<br/>
+<span style="color:grey">插入和插入更新后,HoodieWriteClient是否应该自动提交。
+客户端可以选择关闭自动提交,并在"定义的成功条件"下提交</span>
+
+#### withAssumeDatePartitioning(assumeDatePartitioning = false) {#withAssumeDatePartitioning} 
+属性:`hoodie.assume.date.partitioning`<br/>
+<span style="color:grey">HoodieWriteClient是否应该假设数据按日期划分,即从基本路径划分为三个级别。
+这是支持<0.3.1版本创建的表的一个补丁。最终将被删除</span>
+
+#### withConsistencyCheckEnabled(enabled = false) {#withConsistencyCheckEnabled} 
+属性:`hoodie.consistency.check.enabled`<br/>
+<span style="color:grey">HoodieWriteClient是否应该执行其他检查,以确保写入的文件在基础文件系统/存储上可列出。
+将其设置为true可以解决S3的最终一致性模型,并确保作为提交的一部分写入的所有数据均能准确地用于查询。</span>
+
+### 索引配置
+以下配置控制索引行为,该行为将传入记录标记为对较旧记录的插入或更新。
+
+[withIndexConfig](#withIndexConfig) (HoodieIndexConfig) <br/>
+<span style="color:grey">可插入以具有外部索引(HBase)或使用存储在Parquet文件中的默认布隆过滤器(bloom filter)</span>
+        
+#### withIndexType(indexType = BLOOM) {#withIndexType}
+属性:`hoodie.index.type` <br/>
+<span style="color:grey">要使用的索引类型。默认为布隆过滤器。可能的选项是[BLOOM | HBASE | INMEMORY]。
+布隆过滤器消除了对外部系统的依赖,并存储在Parquet数据文件的页脚中</span>
+
+#### bloomFilterNumEntries(numEntries = 60000) {#bloomFilterNumEntries}
+属性:`hoodie.index.bloom.num_entries` <br/>
+<span style="color:grey">仅在索引类型为BLOOM时适用。<br/>这是要存储在布隆过滤器中的条目数。
+我们假设maxParquetFileSize为128MB,averageRecordSize为1024B,因此,一个文件中的记录总数约为130K。
+默认值(60000)大约是此近似值的一半。[HUDI-56](https://issues.apache.org/jira/browse/HUDI-56)
+描述了如何动态地对此进行计算。
+警告:将此值设置得太低,将产生很多误报,并且索引查找将必须扫描比其所需的更多的文件;如果将其设置得非常高,将线性增加每个数据文件的大小(每50000个条目大约4KB)。</span>
+
+#### bloomFilterFPP(fpp = 0.000000001) {#bloomFilterFPP}
+属性:`hoodie.index.bloom.fpp` <br/>
+<span style="color:grey">仅在索引类型为BLOOM时适用。<br/>根据条目数允许的错误率。
+这用于计算应为布隆过滤器分配多少位以及哈希函数的数量。通常将此值设置得很低(默认值:0.000000001),我们希望在磁盘空间上进行权衡以降低误报率</span>
+
+#### bloomIndexPruneByRanges(pruneRanges = true) {#bloomIndexPruneByRanges}
+属性:`hoodie.bloom.index.prune.by.ranges` <br/>
+<span style="color:grey">仅在索引类型为BLOOM时适用。<br/>为true时,从文件框定信息,可以加快索引查找的速度。 如果键具有单调递增的前缀,例如时间戳,则特别有用。</span>
+
+#### bloomIndexUseCaching(useCaching = true) {#bloomIndexUseCaching}
+属性:`hoodie.bloom.index.use.caching` <br/>
+<span style="color:grey">仅在索引类型为BLOOM时适用。<br/>为true时,将通过减少用于计算并行度或受影响分区的IO来缓存输入的RDD以加快索引查找</span>
+
+#### bloomIndexTreebasedFilter(useTreeFilter = true) {#bloomIndexTreebasedFilter}
+属性:`hoodie.bloom.index.use.treebased.filter` <br/>
+<span style="color:grey">仅在索引类型为BLOOM时适用。<br/>为true时,启用基于间隔树的文件过滤优化。与暴力模式相比,此模式可根据键范围加快文件过滤速度</span>
+
+#### bloomIndexBucketizedChecking(bucketizedChecking = true) {#bloomIndexBucketizedChecking}
+属性:`hoodie.bloom.index.bucketized.checking` <br/>
+<span style="color:grey">仅在索引类型为BLOOM时适用。<br/>为true时,启用了桶式布隆过滤。这减少了在基于排序的布隆索引查找中看到的偏差</span>
+
+#### bloomIndexKeysPerBucket(keysPerBucket = 10000000) {#bloomIndexKeysPerBucket}
+属性:`hoodie.bloom.index.keys.per.bucket` <br/>
+<span style="color:grey">仅在启用bloomIndexBucketizedChecking并且索引类型为bloom的情况下适用。<br/>
+此配置控制“存储桶”的大小,该大小可跟踪对单个文件进行的记录键检查的次数,并且是分配给执行布隆过滤器查找的每个分区的工作单位。
+较高的值将分摊将布隆过滤器读取到内存的固定成本。</span>
+
+#### bloomIndexParallelism(0) {#bloomIndexParallelism}
+属性:`hoodie.bloom.index.parallelism` <br/>
+<span style="color:grey">仅在索引类型为BLOOM时适用。<br/>这是索引查找的并行度,其中涉及Spark Shuffle。 默认情况下,这是根据输入的工作负载特征自动计算的</span>
+
+#### hbaseZkQuorum(zkString) [必须] {#hbaseZkQuorum}  
+属性:`hoodie.index.hbase.zkquorum` <br/>
+<span style="color:grey">仅在索引类型为HBASE时适用。要连接的HBase ZK Quorum URL。</span>
+
+#### hbaseZkPort(port) [必须] {#hbaseZkPort}  
+属性:`hoodie.index.hbase.zkport` <br/>
+<span style="color:grey">仅在索引类型为HBASE时适用。要连接的HBase ZK Quorum端口。</span>
+
+#### hbaseZkZnodeParent(zkZnodeParent)  [必须] {#hbaseTableName}
+属性:`hoodie.index.hbase.zknode.path` <br/>
+<span style="color:grey">仅在索引类型为HBASE时适用。这是根znode,它将包含HBase创建及使用的所有znode。</span>
+
+#### hbaseTableName(tableName)  [必须] {#hbaseTableName}
+属性:`hoodie.index.hbase.table` <br/>
+<span style="color:grey">仅在索引类型为HBASE时适用。HBase表名称,用作索引。Hudi将row_key和[partition_path, fileID, commitTime]映射存储在表中。</span>
+
+##### bloomIndexUpdatePartitionPath(updatePartitionPath = false) {#bloomIndexUpdatePartitionPath}
+属性:`hoodie.bloom.index.update.partition.path` <br/>
+<span style="color:grey">仅在索引类型为GLOBAL_BLOOM时适用。<br/>为true时,当对一个已有记录执行包含分区路径的更新操作时,将会导致把新记录插入到新分区,而把原有记录从旧分区里删除。为false时,只对旧分区的原有记录进行更新。</span>
+
+
+### 存储选项
+控制有关调整parquet和日志文件大小的方面。
+
+[withStorageConfig](#withStorageConfig) (HoodieStorageConfig) <br/>
+
+#### limitFileSize (size = 120MB) {#limitFileSize}
+属性:`hoodie.parquet.max.file.size` <br/>
+<span style="color:grey">Hudi写阶段生成的parquet文件的目标大小。对于DFS,这需要与基础文件系统块大小保持一致,以实现最佳性能。</span>
+
+#### parquetBlockSize(rowgroupsize = 120MB) {#parquetBlockSize} 
+属性:`hoodie.parquet.block.size` <br/>
+<span style="color:grey">Parquet行组大小。最好与文件大小相同,以便将文件中的单个列连续存储在磁盘上</span>
+
+#### parquetPageSize(pagesize = 1MB) {#parquetPageSize} 
+属性:`hoodie.parquet.page.size` <br/>
+<span style="color:grey">Parquet页面大小。页面是parquet文件中的读取单位。 在一个块内,页面被分别压缩。</span>
+
+#### parquetCompressionRatio(parquetCompressionRatio = 0.1) {#parquetCompressionRatio} 
+属性:`hoodie.parquet.compression.ratio` <br/>
+<span style="color:grey">当Hudi尝试调整新parquet文件的大小时,预期对parquet数据进行压缩的比例。
+如果bulk_insert生成的文件小于预期大小,请增加此值</span>
+
+#### parquetCompressionCodec(parquetCompressionCodec = gzip) {#parquetCompressionCodec}
+属性:`hoodie.parquet.compression.codec` <br/>
+<span style="color:grey">Parquet压缩编解码方式名称。默认值为gzip。可能的选项是[gzip | snappy | uncompressed | lzo]</span>
+
+#### logFileMaxSize(logFileSize = 1GB) {#logFileMaxSize} 
+属性:`hoodie.logfile.max.size` <br/>
+<span style="color:grey">LogFile的最大大小。这是在将日志文件移到下一个版本之前允许的最大大小。</span>
+
+#### logFileDataBlockMaxSize(dataBlockSize = 256MB) {#logFileDataBlockMaxSize} 
+属性:`hoodie.logfile.data.block.max.size` <br/>
+<span style="color:grey">LogFile数据块的最大大小。这是允许将单个数据块附加到日志文件的最大大小。
+这有助于确保附加到日志文件的数据被分解为可调整大小的块,以防止发生OOM错误。此大小应大于JVM内存。</span>
+
+#### logFileToParquetCompressionRatio(logFileToParquetCompressionRatio = 0.35) {#logFileToParquetCompressionRatio} 
+属性:`hoodie.logfile.to.parquet.compression.ratio` <br/>
+<span style="color:grey">随着记录从日志文件移动到parquet,预期会进行额外压缩的比例。
+用于merge_on_read存储,以将插入内容发送到日志文件中并控制压缩parquet文件的大小。</span>
+ 
+#### parquetCompressionCodec(parquetCompressionCodec = gzip) {#parquetCompressionCodec} 
+属性:`hoodie.parquet.compression.codec` <br/>
+<span style="color:grey">Parquet文件的压缩编解码方式</span>
+
+### 压缩配置
+压缩配置用于控制压缩(将日志文件合并到新的parquet基本文件中)、清理(回收较旧及未使用的文件组)。
+[withCompactionConfig](#withCompactionConfig) (HoodieCompactionConfig) <br/>
+
+#### withCleanerPolicy(policy = KEEP_LATEST_COMMITS) {#withCleanerPolicy} 
+属性:`hoodie.cleaner.policy` <br/>
+<span style="color:grey">要使用的清理政策。Hudi将删除旧版本的parquet文件以回收空间。
+任何引用此版本文件的查询和计算都将失败。最好确保数据保留的时间超过最大查询执行时间。</span>
+
+#### retainCommits(no_of_commits_to_retain = 24) {#retainCommits} 
+属性:`hoodie.cleaner.commits.retained` <br/>
+<span style="color:grey">保留的提交数。因此,数据将保留为num_of_commits * time_between_commits(计划的)。
+这也直接转化为您可以逐步提取此数据集的数量</span>
+
+#### archiveCommitsWith(minCommits = 96, maxCommits = 128) {#archiveCommitsWith} 
+属性:`hoodie.keep.min.commits`, `hoodie.keep.max.commits` <br/>
+<span style="color:grey">每个提交都是`.hoodie`目录中的一个小文件。由于DFS通常不支持大量小文件,因此Hudi将较早的提交归档到顺序日志中。
+提交通过重命名提交文件以原子方式发布。</span>
+
+#### withCommitsArchivalBatchSize(batch = 10) {#withCommitsArchivalBatchSize}
+属性:`hoodie.commits.archival.batch` <br/>
+<span style="color:grey">这控制着批量读取并一起归档的提交即时的数量。</span>
+
+#### compactionSmallFileSize(size = 0) {#compactionSmallFileSize} 
+属性:`hoodie.parquet.small.file.limit` <br/>
+<span style="color:grey">该值应小于maxFileSize,如果将其设置为0,会关闭此功能。
+由于批处理中分区中插入记录的数量众多,总会出现小文件。
+Hudi提供了一个选项,可以通过将对该分区中的插入作为对现有小文件的更新来解决小文件的问题。
+此处的大小是被视为“小文件大小”的最小文件大小。</span>
+
+#### insertSplitSize(size = 500000) {#insertSplitSize} 
+属性:`hoodie.copyonwrite.insert.split.size` <br/>
+<span style="color:grey">插入写入并行度。为单个分区的总共插入次数。
+写出100MB的文件,至少1kb大小的记录,意味着每个文件有100K记录。默认值是超额配置为500K。
+为了改善插入延迟,请对其进行调整以匹配单个文件中的记录数。
+将此值设置为较小的值将导致文件变小(尤其是当compactionSmallFileSize为0时)</span>
+
+#### autoTuneInsertSplits(true) {#autoTuneInsertSplits} 
+属性:`hoodie.copyonwrite.insert.auto.split` <br/>
+<span style="color:grey">Hudi是否应该基于最后24个提交的元数据动态计算insertSplitSize。默认关闭。</span>
+
+#### approxRecordSize(size = 1024) {#approxRecordSize} 
+属性:`hoodie.copyonwrite.record.size.estimate` <br/>
+<span style="color:grey">平均记录大小。如果指定,hudi将使用它,并且不会基于最后24个提交的元数据动态地计算。
+没有默认值设置。这对于计算插入并行度以及将插入打包到小文件中至关重要。如上所述。</span>
+
+#### withInlineCompaction(inlineCompaction = false) {#withInlineCompaction} 
+属性:`hoodie.compact.inline` <br/>
+<span style="color:grey">当设置为true时,紧接在插入或插入更新或批量插入的提交或增量提交操作之后由摄取本身触发压缩</span>
+
+#### withMaxNumDeltaCommitsBeforeCompaction(maxNumDeltaCommitsBeforeCompaction = 10) {#withMaxNumDeltaCommitsBeforeCompaction} 
+属性:`hoodie.compact.inline.max.delta.commits` <br/>
+<span style="color:grey">触发内联压缩之前要保留的最大增量提交数</span>
+
+#### withCompactionLazyBlockReadEnabled(true) {#withCompactionLazyBlockReadEnabled} 
+属性:`hoodie.compaction.lazy.block.read` <br/>
+<span style="color:grey">当CompactedLogScanner合并所有日志文件时,此配置有助于选择是否应延迟读取日志块。
+选择true以使用I/O密集型延迟块读取(低内存使用),或者为false来使用内存密集型立即块读取(高内存使用)</span>
+
+#### withCompactionReverseLogReadEnabled(false) {#withCompactionReverseLogReadEnabled} 
+属性:`hoodie.compaction.reverse.log.read` <br/>
+<span style="color:grey">HoodieLogFormatReader会从pos=0到pos=file_length向前读取日志文件。
+如果此配置设置为true,则Reader会从pos=file_length到pos=0反向读取日志文件</span>
+
+#### withCleanerParallelism(cleanerParallelism = 200) {#withCleanerParallelism} 
+属性:`hoodie.cleaner.parallelism` <br/>
+<span style="color:grey">如果清理变慢,请增加此值。</span>
+
+#### withCompactionStrategy(compactionStrategy = org.apache.hudi.io.compact.strategy.LogFileSizeBasedCompactionStrategy) {#withCompactionStrategy} 
+属性:`hoodie.compaction.strategy` <br/>
+<span style="color:grey">用来决定在每次压缩运行期间选择要压缩的文件组的压缩策略。
+默认情况下,Hudi选择具有累积最多未合并数据的日志文件</span>
+
+#### withTargetIOPerCompactionInMB(targetIOPerCompactionInMB = 500000) {#withTargetIOPerCompactionInMB} 
+属性:`hoodie.compaction.target.io` <br/>
+<span style="color:grey">LogFileSizeBasedCompactionStrategy的压缩运行期间要花费的MB量。当压缩以内联模式运行时,此值有助于限制摄取延迟。</span>
+
+#### withTargetPartitionsPerDayBasedCompaction(targetPartitionsPerCompaction = 10) {#withTargetPartitionsPerDayBasedCompaction} 
+属性:`hoodie.compaction.daybased.target` <br/>
+<span style="color:grey">由org.apache.hudi.io.compact.strategy.DayBasedCompactionStrategy使用,表示在压缩运行期间要压缩的最新分区数。</span>    
+
+#### withPayloadClass(payloadClassName = org.apache.hudi.common.model.HoodieAvroPayload) {#payloadClassName} 
+属性:`hoodie.compaction.payload.class` <br/>
+<span style="color:grey">这需要与插入/插入更新过程中使用的类相同。
+就像写入一样,压缩也使用记录有效负载类将日志中的记录彼此合并,再次与基本文件合并,并生成压缩后要写入的最终记录。</span>
+
+
+    
+### 指标配置
+能够将Hudi指标报告给graphite。
+[withMetricsConfig](#withMetricsConfig) (HoodieMetricsConfig) <br/>
+<span style="color:grey">Hudi会发布有关每次提交、清理、回滚等的指标。</span>
+
+#### on(metricsOn = true) {#on} 
+属性:`hoodie.metrics.on` <br/>
+<span style="color:grey">打开或关闭发送指标。默认情况下处于启用状态。</span>
+
+#### withReporterType(reporterType = GRAPHITE) {#withReporterType} 
+属性:`hoodie.metrics.reporter.type` <br/>
+<span style="color:grey">指标报告者的类型。默认使用graphite,也是唯一支持的类型。</span>
+
+#### toGraphiteHost(host = localhost) {#toGraphiteHost} 
+属性:`hoodie.metrics.graphite.host` <br/>
+<span style="color:grey">要连接的graphite主机</span>
+
+#### onGraphitePort(port = 4756) {#onGraphitePort} 
+属性:`hoodie.metrics.graphite.port` <br/>
+<span style="color:grey">要连接的graphite端口</span>
+
+#### usePrefix(prefix = "") {#usePrefix} 
+属性:`hoodie.metrics.graphite.metric.prefix` <br/>
+<span style="color:grey">适用于所有指标的标准前缀。这有助于添加如数据中心、环境等信息</span>
+    
+### 内存配置
+控制由Hudi内部执行的压缩和合并的内存使用情况
+[withMemoryConfig](#withMemoryConfig) (HoodieMemoryConfig) <br/>
+<span style="color:grey">内存相关配置</span>
+
+#### withMaxMemoryFractionPerPartitionMerge(maxMemoryFractionPerPartitionMerge = 0.6) {#withMaxMemoryFractionPerPartitionMerge} 
+属性:`hoodie.memory.merge.fraction` <br/>
+<span style="color:grey">该比例乘以用户内存比例(1-spark.memory.fraction)以获得合并期间要使用的堆空间的最终比例</span>
+
+#### withMaxMemorySizePerCompactionInBytes(maxMemorySizePerCompactionInBytes = 1GB) {#withMaxMemorySizePerCompactionInBytes} 
+属性:`hoodie.memory.compaction.fraction` <br/>
+<span style="color:grey">HoodieCompactedLogScanner读取日志块,将记录转换为HoodieRecords,然后合并这些日志块和记录。
+在任何时候,日志块中的条目数可以小于或等于相应的parquet文件中的条目数。这可能导致Scanner出现OOM。
+因此,可溢出的映射有助于减轻内存压力。使用此配置来设置可溢出映射的最大允许inMemory占用空间。</span>
+
+#### withWriteStatusFailureFraction(failureFraction = 0.1) {#withWriteStatusFailureFraction}
+属性:`hoodie.memory.writestatus.failure.fraction` <br/>
+<span style="color:grey">此属性控制报告给驱动程序的失败记录和异常的比例</span>
diff --git a/docs/_docs/0.5.2/2_4_configurations.md b/docs/_docs/0.5.2/2_4_configurations.md
new file mode 100644
index 0000000..6005102
--- /dev/null
+++ b/docs/_docs/0.5.2/2_4_configurations.md
@@ -0,0 +1,437 @@
+---
+version: 0.5.2
+title: Configurations
+keywords: garbage collection, hudi, jvm, configs, tuning
+permalink: /docs/0.5.2-configurations.html
+summary: "Here we list all possible configurations and what they mean"
+toc: true
+last_modified_at: 2019-12-30T15:59:57-04:00
+---
+
+This page covers the different ways of configuring your job to write/read Hudi tables. 
+At a high level, you can control behaviour at few levels. 
+
+- **[Spark Datasource Configs](#spark-datasource)** : These configs control the Hudi Spark Datasource, providing ability to define keys/partitioning, pick out the write operation, specify how to merge records or choosing query type to read.
+- **[WriteClient Configs](#writeclient-configs)** : Internally, the Hudi datasource uses a RDD based `HoodieWriteClient` api to actually perform writes to storage. These configs provide deep control over lower level aspects like 
+   file sizing, compression, parallelism, compaction, write schema, cleaning etc. Although Hudi provides sane defaults, from time-time these configs may need to be tweaked to optimize for specific workloads.
+- **[RecordPayload Config](#PAYLOAD_CLASS_OPT_KEY)** : This is the lowest level of customization offered by Hudi. Record payloads define how to produce new values to upsert based on incoming new record and 
+   stored old record. Hudi provides default implementations such as `OverwriteWithLatestAvroPayload` which simply update table with the latest/last-written record. 
+   This can be overridden to a custom class extending `HoodieRecordPayload` class, on both datasource and WriteClient levels.
+ 
+## Talking to Cloud Storage
+
+Immaterial of whether RDD/WriteClient APIs or Datasource is used, the following information helps configure access
+to cloud stores.
+
+ * [AWS S3](/docs/0.5.2-s3_hoodie) <br/>
+   Configurations required for S3 and Hudi co-operability.
+ * [Google Cloud Storage](/docs/0.5.2-gcs_hoodie) <br/>
+   Configurations required for GCS and Hudi co-operability.
+
+## Spark Datasource Configs {#spark-datasource}
+
+Spark jobs using the datasource can be configured by passing the below options into the `option(k,v)` method as usual.
+The actual datasource level configs are listed below.
+
+
+### Write Options
+
+Additionally, you can pass down any of the WriteClient level configs directly using `options()` or `option(k,v)` methods.
+
+```java
+inputDF.write()
+.format("org.apache.hudi")
+.options(clientOpts) // any of the Hudi client opts can be passed in as well
+.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
+.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
+.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
+.option(HoodieWriteConfig.TABLE_NAME, tableName)
+.mode(SaveMode.Append)
+.save(basePath);
+```
+
+Options useful for writing tables via `write.format.option(...)`
+
+#### TABLE_NAME_OPT_KEY {#TABLE_NAME_OPT_KEY}
+  Property: `hoodie.datasource.write.table.name` [Required]<br/>
+  <span style="color:grey">Hive table name, to register the table into.</span>
+  
+#### OPERATION_OPT_KEY {#OPERATION_OPT_KEY}
+  Property: `hoodie.datasource.write.operation`, Default: `upsert`<br/>
+  <span style="color:grey">whether to do upsert, insert or bulkinsert for the write operation. Use `bulkinsert` to load new data into a table, and there on use `upsert`/`insert`. 
+  bulk insert uses a disk based write path to scale to load large inputs without need to cache it.</span>
+  
+#### TABLE_TYPE_OPT_KEY {#TABLE_TYPE_OPT_KEY}
+  Property: `hoodie.datasource.write.table.type`, Default: `COPY_ON_WRITE` <br/>
+  <span style="color:grey">The table type for the underlying data, for this write. This can't change between writes.</span>
+  
+#### PRECOMBINE_FIELD_OPT_KEY {#PRECOMBINE_FIELD_OPT_KEY}
+  Property: `hoodie.datasource.write.precombine.field`, Default: `ts` <br/>
+  <span style="color:grey">Field used in preCombining before actual write. When two records have the same key value,
+we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..)</span>
+
+#### PAYLOAD_CLASS_OPT_KEY {#PAYLOAD_CLASS_OPT_KEY}
+  Property: `hoodie.datasource.write.payload.class`, Default: `org.apache.hudi.OverwriteWithLatestAvroPayload` <br/>
+  <span style="color:grey">Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting. 
+  This will render any value set for `PRECOMBINE_FIELD_OPT_VAL` in-effective</span>
+  
+#### RECORDKEY_FIELD_OPT_KEY {#RECORDKEY_FIELD_OPT_KEY}
+  Property: `hoodie.datasource.write.recordkey.field`, Default: `uuid` <br/>
+  <span style="color:grey">Record key field. Value to be used as the `recordKey` component of `HoodieKey`. Actual value
+will be obtained by invoking .toString() on the field value. Nested fields can be specified using
+the dot notation eg: `a.b.c`</span>
+
+#### PARTITIONPATH_FIELD_OPT_KEY {#PARTITIONPATH_FIELD_OPT_KEY}
+  Property: `hoodie.datasource.write.partitionpath.field`, Default: `partitionpath` <br/>
+  <span style="color:grey">Partition path field. Value to be used at the `partitionPath` component of `HoodieKey`.
+Actual value ontained by invoking .toString()</span>
+
+#### KEYGENERATOR_CLASS_OPT_KEY {#KEYGENERATOR_CLASS_OPT_KEY}
+  Property: `hoodie.datasource.write.keygenerator.class`, Default: `org.apache.hudi.SimpleKeyGenerator` <br/>
+  <span style="color:grey">Key generator class, that implements will extract the key out of incoming `Row` object</span>
+  
+#### COMMIT_METADATA_KEYPREFIX_OPT_KEY {#COMMIT_METADATA_KEYPREFIX_OPT_KEY}
+  Property: `hoodie.datasource.write.commitmeta.key.prefix`, Default: `_` <br/>
+  <span style="color:grey">Option keys beginning with this prefix, are automatically added to the commit/deltacommit metadata.
+This is useful to store checkpointing information, in a consistent way with the hudi timeline</span>
+
+#### INSERT_DROP_DUPS_OPT_KEY {#INSERT_DROP_DUPS_OPT_KEY}
+  Property: `hoodie.datasource.write.insert.drop.duplicates`, Default: `false` <br/>
+  <span style="color:grey">If set to true, filters out all duplicate records from incoming dataframe, during insert operations. </span>
+  
+#### HIVE_SYNC_ENABLED_OPT_KEY {#HIVE_SYNC_ENABLED_OPT_KEY}
+  Property: `hoodie.datasource.hive_sync.enable`, Default: `false` <br/>
+  <span style="color:grey">When set to true, register/sync the table to Apache Hive metastore</span>
+  
+#### HIVE_DATABASE_OPT_KEY {#HIVE_DATABASE_OPT_KEY}
+  Property: `hoodie.datasource.hive_sync.database`, Default: `default` <br/>
+  <span style="color:grey">database to sync to</span>
+  
+#### HIVE_TABLE_OPT_KEY {#HIVE_TABLE_OPT_KEY}
+  Property: `hoodie.datasource.hive_sync.table`, [Required] <br/>
+  <span style="color:grey">table to sync to</span>
+  
+#### HIVE_USER_OPT_KEY {#HIVE_USER_OPT_KEY}
+  Property: `hoodie.datasource.hive_sync.username`, Default: `hive` <br/>
+  <span style="color:grey">hive user name to use</span>
+  
+#### HIVE_PASS_OPT_KEY {#HIVE_PASS_OPT_KEY}
+  Property: `hoodie.datasource.hive_sync.password`, Default: `hive` <br/>
+  <span style="color:grey">hive password to use</span>
+  
+#### HIVE_URL_OPT_KEY {#HIVE_URL_OPT_KEY}
+  Property: `hoodie.datasource.hive_sync.jdbcurl`, Default: `jdbc:hive2://localhost:10000` <br/>
+  <span style="color:grey">Hive metastore url</span>
+  
+#### HIVE_PARTITION_FIELDS_OPT_KEY {#HIVE_PARTITION_FIELDS_OPT_KEY}
+  Property: `hoodie.datasource.hive_sync.partition_fields`, Default: ` ` <br/>
+  <span style="color:grey">field in the table to use for determining hive partition columns.</span>
+  
+#### HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY {#HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY}
+  Property: `hoodie.datasource.hive_sync.partition_extractor_class`, Default: `org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor` <br/>
+  <span style="color:grey">Class used to extract partition field values into hive partition columns.</span>
+  
+#### HIVE_ASSUME_DATE_PARTITION_OPT_KEY {#HIVE_ASSUME_DATE_PARTITION_OPT_KEY}
+  Property: `hoodie.datasource.hive_sync.assume_date_partitioning`, Default: `false` <br/>
+  <span style="color:grey">Assume partitioning is yyyy/mm/dd</span>
+
+### Read Options
+
+Options useful for reading tables via `read.format.option(...)`
+
+#### QUERY_TYPE_OPT_KEY {#QUERY_TYPE_OPT_KEY}
+Property: `hoodie.datasource.query.type`, Default: `snapshot` <br/>
+<span style="color:grey">Whether data needs to be read, in incremental mode (new data since an instantTime)
+(or) Read Optimized mode (obtain latest view, based on columnar data)
+(or) Snapshot mode (obtain latest view, based on row & columnar data)</span>
+
+#### BEGIN_INSTANTTIME_OPT_KEY {#BEGIN_INSTANTTIME_OPT_KEY} 
+Property: `hoodie.datasource.read.begin.instanttime`, [Required in incremental mode] <br/>
+<span style="color:grey">Instant time to start incrementally pulling data from. The instanttime here need not
+necessarily correspond to an instant on the timeline. New data written with an
+ `instant_time > BEGIN_INSTANTTIME` are fetched out. For e.g: '20170901080000' will get
+ all new data written after Sep 1, 2017 08:00AM.</span>
+ 
+#### END_INSTANTTIME_OPT_KEY {#END_INSTANTTIME_OPT_KEY}
+Property: `hoodie.datasource.read.end.instanttime`, Default: latest instant (i.e fetches all new data since begin instant time) <br/>
+<span style="color:grey"> Instant time to limit incrementally fetched data to. New data written with an
+`instant_time <= END_INSTANTTIME` are fetched out.</span>
+
+
+## WriteClient Configs {#writeclient-configs}
+
+Jobs programming directly against the RDD level apis can build a `HoodieWriteConfig` object and pass it in to the `HoodieWriteClient` constructor. 
+HoodieWriteConfig can be built using a builder pattern as below. 
+
+```java
+HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
+        .withPath(basePath)
+        .forTable(tableName)
+        .withSchema(schemaStr)
+        .withProps(props) // pass raw k,v pairs from a property file.
+        .withCompactionConfig(HoodieCompactionConfig.newBuilder().withXXX(...).build())
+        .withIndexConfig(HoodieIndexConfig.newBuilder().withXXX(...).build())
+        ...
+        .build();
+```
+
+Following subsections go over different aspects of write configs, explaining most important configs with their property names, default values.
+
+#### withPath(hoodie_base_path) {#withPath}
+Property: `hoodie.base.path` [Required] <br/>
+<span style="color:grey">Base DFS path under which all the data partitions are created. Always prefix it explicitly with the storage scheme (e.g hdfs://, s3:// etc). Hudi stores all the main meta-data about commits, savepoints, cleaning audit logs etc in .hoodie directory under the base directory. </span>
+
+#### withSchema(schema_str) {#withSchema} 
+Property: `hoodie.avro.schema` [Required]<br/>
+<span style="color:grey">This is the current reader avro schema for the table. This is a string of the entire schema. HoodieWriteClient uses this schema to pass on to implementations of HoodieRecordPayload to convert from the source format to avro record. This is also used when re-writing records during an update. </span>
+
+#### forTable(table_name) {#forTable} 
+Property: `hoodie.table.name` [Required] <br/>
+ <span style="color:grey">Table name that will be used for registering with Hive. Needs to be same across runs.</span>
+
+#### withBulkInsertParallelism(bulk_insert_parallelism = 1500) {#withBulkInsertParallelism} 
+Property: `hoodie.bulkinsert.shuffle.parallelism`<br/>
+<span style="color:grey">Bulk insert is meant to be used for large initial imports and this parallelism determines the initial number of files in your table. Tune this to achieve a desired optimal size during initial import.</span>
+
+#### withParallelism(insert_shuffle_parallelism = 1500, upsert_shuffle_parallelism = 1500) {#withParallelism} 
+Property: `hoodie.insert.shuffle.parallelism`, `hoodie.upsert.shuffle.parallelism`<br/>
+<span style="color:grey">Once data has been initially imported, this parallelism controls initial parallelism for reading input records. Ensure this value is high enough say: 1 partition for 1 GB of input data</span>
+
+#### combineInput(on_insert = false, on_update=true) {#combineInput} 
+Property: `hoodie.combine.before.insert`, `hoodie.combine.before.upsert`<br/>
+<span style="color:grey">Flag which first combines the input RDD and merges multiple partial records into a single record before inserting or updating in DFS</span>
+
+#### withWriteStatusStorageLevel(level = MEMORY_AND_DISK_SER) {#withWriteStatusStorageLevel} 
+Property: `hoodie.write.status.storage.level`<br/>
+<span style="color:grey">HoodieWriteClient.insert and HoodieWriteClient.upsert returns a persisted RDD[WriteStatus], this is because the Client can choose to inspect the WriteStatus and choose and commit or not based on the failures. This is a configuration for the storage level for this RDD </span>
+
+#### withAutoCommit(autoCommit = true) {#withAutoCommit} 
+Property: `hoodie.auto.commit`<br/>
+<span style="color:grey">Should HoodieWriteClient autoCommit after insert and upsert. The client can choose to turn off auto-commit and commit on a "defined success condition"</span>
+
+#### withAssumeDatePartitioning(assumeDatePartitioning = false) {#withAssumeDatePartitioning} 
+Property: `hoodie.assume.date.partitioning`<br/>
+<span style="color:grey">Should HoodieWriteClient assume the data is partitioned by dates, i.e three levels from base path. This is a stop-gap to support tables created by versions < 0.3.1. Will be removed eventually </span>
+
+#### withConsistencyCheckEnabled(enabled = false) {#withConsistencyCheckEnabled} 
+Property: `hoodie.consistency.check.enabled`<br/>
+<span style="color:grey">Should HoodieWriteClient perform additional checks to ensure written files' are listable on the underlying filesystem/storage. Set this to true, to workaround S3's eventual consistency model and ensure all data written as a part of a commit is faithfully available for queries. </span>
+
+### Index configs
+Following configs control indexing behavior, which tags incoming records as either inserts or updates to older records. 
+
+[withIndexConfig](#withIndexConfig) (HoodieIndexConfig) <br/>
+<span style="color:grey">This is pluggable to have a external index (HBase) or use the default bloom filter stored in the Parquet files</span>
+        
+#### withIndexType(indexType = BLOOM) {#withIndexType}
+Property: `hoodie.index.type` <br/>
+<span style="color:grey">Type of index to use. Default is Bloom filter. Possible options are [BLOOM | HBASE | INMEMORY]. Bloom filters removes the dependency on a external system and is stored in the footer of the Parquet Data Files</span>
+
+#### bloomFilterNumEntries(numEntries = 60000) {#bloomFilterNumEntries}
+Property: `hoodie.index.bloom.num_entries` <br/>
+<span style="color:grey">Only applies if index type is BLOOM. <br/>This is the number of entries to be stored in the bloom filter. We assume the maxParquetFileSize is 128MB and averageRecordSize is 1024B and hence we approx a total of 130K records in a file. The default (60000) is roughly half of this approximation. [HUDI-56](https://issues.apache.org/jira/browse/HUDI-56) tracks computing this dynamically. Warning: Setting this very low, will generate a lot of false positives and index l [...]
+
+#### bloomFilterFPP(fpp = 0.000000001) {#bloomFilterFPP}
+Property: `hoodie.index.bloom.fpp` <br/>
+<span style="color:grey">Only applies if index type is BLOOM. <br/> Error rate allowed given the number of entries. This is used to calculate how many bits should be assigned for the bloom filter and the number of hash functions. This is usually set very low (default: 0.000000001), we like to tradeoff disk space for lower false positives</span>
+
+#### bloomIndexPruneByRanges(pruneRanges = true) {#bloomIndexPruneByRanges}
+Property: `hoodie.bloom.index.prune.by.ranges` <br/>
+<span style="color:grey">Only applies if index type is BLOOM. <br/> When true, range information from files to leveraged speed up index lookups. Particularly helpful, if the key has a monotonously increasing prefix, such as timestamp.</span>
+
+#### bloomIndexUseCaching(useCaching = true) {#bloomIndexUseCaching}
+Property: `hoodie.bloom.index.use.caching` <br/>
+<span style="color:grey">Only applies if index type is BLOOM. <br/> When true, the input RDD will cached to speed up index lookup by reducing IO for computing parallelism or affected partitions</span>
+
+#### bloomIndexTreebasedFilter(useTreeFilter = true) {#bloomIndexTreebasedFilter}
+Property: `hoodie.bloom.index.use.treebased.filter` <br/>
+<span style="color:grey">Only applies if index type is BLOOM. <br/> When true, interval tree based file pruning optimization is enabled. This mode speeds-up file-pruning based on key ranges when compared with the brute-force mode</span>
+
+#### bloomIndexBucketizedChecking(bucketizedChecking = true) {#bloomIndexBucketizedChecking}
+Property: `hoodie.bloom.index.bucketized.checking` <br/>
+<span style="color:grey">Only applies if index type is BLOOM. <br/> When true, bucketized bloom filtering is enabled. This reduces skew seen in sort based bloom index lookup</span>
+
+#### bloomIndexKeysPerBucket(keysPerBucket = 10000000) {#bloomIndexKeysPerBucket}
+Property: `hoodie.bloom.index.keys.per.bucket` <br/>
+<span style="color:grey">Only applies if bloomIndexBucketizedChecking is enabled and index type is bloom. <br/> This configuration controls the "bucket" size which tracks the number of record-key checks made against a single file and is the unit of work allocated to each partition performing bloom filter lookup. A higher value would amortize the fixed cost of reading a bloom filter to memory. </span>
+
+#### bloomIndexParallelism(0) {#bloomIndexParallelism}
+Property: `hoodie.bloom.index.parallelism` <br/>
+<span style="color:grey">Only applies if index type is BLOOM. <br/> This is the amount of parallelism for index lookup, which involves a Spark Shuffle. By default, this is auto computed based on input workload characteristics</span>
+
+#### hbaseZkQuorum(zkString) [Required] {#hbaseZkQuorum}  
+Property: `hoodie.index.hbase.zkquorum` <br/>
+<span style="color:grey">Only applies if index type is HBASE. HBase ZK Quorum url to connect to.</span>
+
+#### hbaseZkPort(port) [Required] {#hbaseZkPort}  
+Property: `hoodie.index.hbase.zkport` <br/>
+<span style="color:grey">Only applies if index type is HBASE. HBase ZK Quorum port to connect to.</span>
+
+#### hbaseZkZnodeParent(zkZnodeParent)  [Required] {#hbaseTableName}
+Property: `hoodie.index.hbase.zknode.path` <br/>
+<span style="color:grey">Only applies if index type is HBASE. This is the root znode that will contain all the znodes created/used by HBase.</span>
+
+#### hbaseTableName(tableName)  [Required] {#hbaseTableName}
+Property: `hoodie.index.hbase.table` <br/>
+<span style="color:grey">Only applies if index type is HBASE. HBase Table name to use as the index. Hudi stores the row_key and [partition_path, fileID, commitTime] mapping in the table.</span>
+
+##### bloomIndexUpdatePartitionPath(updatePartitionPath = false) {#bloomIndexUpdatePartitionPath}
+Property: `hoodie.bloom.index.update.partition.path` <br/>
+<span style="color:grey">Only applies if index type is GLOBAL_BLOOM. <br/>When set to true, an update including the partition path of a record that already exists will result in inserting the incoming record into the new partition and deleting the original record in the old partition. When set to false, the original record will only be updated in the old partition.</span>
+
+
+### Storage configs
+Controls aspects around sizing parquet and log files.
+
+[withStorageConfig](#withStorageConfig) (HoodieStorageConfig) <br/>
+
+#### limitFileSize (size = 120MB) {#limitFileSize}
+Property: `hoodie.parquet.max.file.size` <br/>
+<span style="color:grey">Target size for parquet files produced by Hudi write phases. For DFS, this needs to be aligned with the underlying filesystem block size for optimal performance. </span>
+
+#### parquetBlockSize(rowgroupsize = 120MB) {#parquetBlockSize} 
+Property: `hoodie.parquet.block.size` <br/>
+<span style="color:grey">Parquet RowGroup size. Its better this is same as the file size, so that a single column within a file is stored continuously on disk</span>
+
+#### parquetPageSize(pagesize = 1MB) {#parquetPageSize} 
+Property: `hoodie.parquet.page.size` <br/>
+<span style="color:grey">Parquet page size. Page is the unit of read within a parquet file. Within a block, pages are compressed seperately. </span>
+
+#### parquetCompressionRatio(parquetCompressionRatio = 0.1) {#parquetCompressionRatio} 
+Property: `hoodie.parquet.compression.ratio` <br/>
+<span style="color:grey">Expected compression of parquet data used by Hudi, when it tries to size new parquet files. Increase this value, if bulk_insert is producing smaller than expected sized files</span>
+
+#### parquetCompressionCodec(parquetCompressionCodec = gzip) {#parquetCompressionCodec}
+Property: `hoodie.parquet.compression.codec` <br/>
+<span style="color:grey">Parquet compression codec name. Default is gzip. Possible options are [gzip | snappy | uncompressed | lzo]</span>
+
+#### logFileMaxSize(logFileSize = 1GB) {#logFileMaxSize} 
+Property: `hoodie.logfile.max.size` <br/>
+<span style="color:grey">LogFile max size. This is the maximum size allowed for a log file before it is rolled over to the next version. </span>
+
+#### logFileDataBlockMaxSize(dataBlockSize = 256MB) {#logFileDataBlockMaxSize} 
+Property: `hoodie.logfile.data.block.max.size` <br/>
+<span style="color:grey">LogFile Data block max size. This is the maximum size allowed for a single data block to be appended to a log file. This helps to make sure the data appended to the log file is broken up into sizable blocks to prevent from OOM errors. This size should be greater than the JVM memory. </span>
+
+#### logFileToParquetCompressionRatio(logFileToParquetCompressionRatio = 0.35) {#logFileToParquetCompressionRatio} 
+Property: `hoodie.logfile.to.parquet.compression.ratio` <br/>
+<span style="color:grey">Expected additional compression as records move from log files to parquet. Used for merge_on_read table to send inserts into log files & control the size of compacted parquet file.</span>
+ 
+#### parquetCompressionCodec(parquetCompressionCodec = gzip) {#parquetCompressionCodec} 
+Property: `hoodie.parquet.compression.codec` <br/>
+<span style="color:grey">Compression Codec for parquet files </span>
+
+### Compaction configs
+Configs that control compaction (merging of log files onto a new parquet base file), cleaning (reclamation of older/unused file groups).
+[withCompactionConfig](#withCompactionConfig) (HoodieCompactionConfig) <br/>
+
+#### withCleanerPolicy(policy = KEEP_LATEST_COMMITS) {#withCleanerPolicy} 
+Property: `hoodie.cleaner.policy` <br/>
+<span style="color:grey"> Cleaning policy to be used. Hudi will delete older versions of parquet files to re-claim space. Any Query/Computation referring to this version of the file will fail. It is good to make sure that the data is retained for more than the maximum query execution time.</span>
+
+#### retainCommits(no_of_commits_to_retain = 24) {#retainCommits} 
+Property: `hoodie.cleaner.commits.retained` <br/>
+<span style="color:grey">Number of commits to retain. So data will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much you can incrementally pull on this table</span>
+
+#### archiveCommitsWith(minCommits = 96, maxCommits = 128) {#archiveCommitsWith} 
+Property: `hoodie.keep.min.commits`, `hoodie.keep.max.commits` <br/>
+<span style="color:grey">Each commit is a small file in the `.hoodie` directory. Since DFS typically does not favor lots of small files, Hudi archives older commits into a sequential log. A commit is published atomically by a rename of the commit file.</span>
+
+#### withCommitsArchivalBatchSize(batch = 10) {#withCommitsArchivalBatchSize}
+Property: `hoodie.commits.archival.batch` <br/>
+<span style="color:grey">This controls the number of commit instants read in memory as a batch and archived together.</span>
+
+#### compactionSmallFileSize(size = 100MB) {#compactionSmallFileSize} 
+Property: `hoodie.parquet.small.file.limit` <br/>
+<span style="color:grey">This should be less < maxFileSize and setting it to 0, turns off this feature. Small files can always happen because of the number of insert records in a partition in a batch. Hudi has an option to auto-resolve small files by masking inserts into this partition as updates to existing small files. The size here is the minimum file size considered as a "small file size".</span>
+
+#### insertSplitSize(size = 500000) {#insertSplitSize} 
+Property: `hoodie.copyonwrite.insert.split.size` <br/>
+<span style="color:grey">Insert Write Parallelism. Number of inserts grouped for a single partition. Writing out 100MB files, with atleast 1kb records, means 100K records per file. Default is to overprovision to 500K. To improve insert latency, tune this to match the number of records in a single file. Setting this to a low number, will result in small files (particularly when compactionSmallFileSize is 0)</span>
+
+#### autoTuneInsertSplits(true) {#autoTuneInsertSplits} 
+Property: `hoodie.copyonwrite.insert.auto.split` <br/>
+<span style="color:grey">Should hudi dynamically compute the insertSplitSize based on the last 24 commit's metadata. Turned off by default. </span>
+
+#### approxRecordSize(size = 1024) {#approxRecordSize} 
+Property: `hoodie.copyonwrite.record.size.estimate` <br/>
+<span style="color:grey">The average record size. If specified, hudi will use this and not compute dynamically based on the last 24 commit's metadata. No value set as default. This is critical in computing the insert parallelism and bin-packing inserts into small files. See above.</span>
+
+#### withInlineCompaction(inlineCompaction = false) {#withInlineCompaction} 
+Property: `hoodie.compact.inline` <br/>
+<span style="color:grey">When set to true, compaction is triggered by the ingestion itself, right after a commit/deltacommit action as part of insert/upsert/bulk_insert</span>
+
+#### withMaxNumDeltaCommitsBeforeCompaction(maxNumDeltaCommitsBeforeCompaction = 10) {#withMaxNumDeltaCommitsBeforeCompaction} 
+Property: `hoodie.compact.inline.max.delta.commits` <br/>
+<span style="color:grey">Number of max delta commits to keep before triggering an inline compaction</span>
+
+#### withCompactionLazyBlockReadEnabled(true) {#withCompactionLazyBlockReadEnabled} 
+Property: `hoodie.compaction.lazy.block.read` <br/>
+<span style="color:grey">When a CompactedLogScanner merges all log files, this config helps to choose whether the logblocks should be read lazily or not. Choose true to use I/O intensive lazy block reading (low memory usage) or false for Memory intensive immediate block read (high memory usage)</span>
+
+#### withCompactionReverseLogReadEnabled(false) {#withCompactionReverseLogReadEnabled} 
+Property: `hoodie.compaction.reverse.log.read` <br/>
+<span style="color:grey">HoodieLogFormatReader reads a logfile in the forward direction starting from pos=0 to pos=file_length. If this config is set to true, the Reader reads the logfile in reverse direction, from pos=file_length to pos=0</span>
+
+#### withCleanerParallelism(cleanerParallelism = 200) {#withCleanerParallelism} 
+Property: `hoodie.cleaner.parallelism` <br/>
+<span style="color:grey">Increase this if cleaning becomes slow.</span>
+
+#### withCompactionStrategy(compactionStrategy = org.apache.hudi.io.compact.strategy.LogFileSizeBasedCompactionStrategy) {#withCompactionStrategy} 
+Property: `hoodie.compaction.strategy` <br/>
+<span style="color:grey">Compaction strategy decides which file groups are picked up for compaction during each compaction run. By default. Hudi picks the log file with most accumulated unmerged data</span>
+
+#### withTargetIOPerCompactionInMB(targetIOPerCompactionInMB = 500000) {#withTargetIOPerCompactionInMB} 
+Property: `hoodie.compaction.target.io` <br/>
+<span style="color:grey">Amount of MBs to spend during compaction run for the LogFileSizeBasedCompactionStrategy. This value helps bound ingestion latency while compaction is run inline mode.</span>
+
+#### withTargetPartitionsPerDayBasedCompaction(targetPartitionsPerCompaction = 10) {#withTargetPartitionsPerDayBasedCompaction} 
+Property: `hoodie.compaction.daybased.target` <br/>
+<span style="color:grey">Used by org.apache.hudi.io.compact.strategy.DayBasedCompactionStrategy to denote the number of latest partitions to compact during a compaction run.</span>    
+
+#### withPayloadClass(payloadClassName = org.apache.hudi.common.model.HoodieAvroPayload) {#payloadClassName} 
+Property: `hoodie.compaction.payload.class` <br/>
+<span style="color:grey">This needs to be same as class used during insert/upserts. Just like writing, compaction also uses the record payload class to merge records in the log against each other, merge again with the base file and produce the final record to be written after compaction.</span>
+
+
+### Metrics configs
+Enables reporting of Hudi metrics to graphite.
+[withMetricsConfig](#withMetricsConfig) (HoodieMetricsConfig) <br/>
+<span style="color:grey">Hudi publishes metrics on every commit, clean, rollback etc.</span>
+
+#### on(metricsOn = true) {#on} 
+Property: `hoodie.metrics.on` <br/>
+<span style="color:grey">Turn sending metrics on/off. on by default.</span>
+
+#### withReporterType(reporterType = GRAPHITE) {#withReporterType} 
+Property: `hoodie.metrics.reporter.type` <br/>
+<span style="color:grey">Type of metrics reporter. Graphite is the default and the only value suppported.</span>
+
+#### toGraphiteHost(host = localhost) {#toGraphiteHost} 
+Property: `hoodie.metrics.graphite.host` <br/>
+<span style="color:grey">Graphite host to connect to</span>
+
+#### onGraphitePort(port = 4756) {#onGraphitePort} 
+Property: `hoodie.metrics.graphite.port` <br/>
+<span style="color:grey">Graphite port to connect to</span>
+
+#### usePrefix(prefix = "") {#usePrefix} 
+Property: `hoodie.metrics.graphite.metric.prefix` <br/>
+<span style="color:grey">Standard prefix applied to all metrics. This helps to add datacenter, environment information for e.g</span>
+    
+### Memory configs
+Controls memory usage for compaction and merges, performed internally by Hudi
+[withMemoryConfig](#withMemoryConfig) (HoodieMemoryConfig) <br/>
+<span style="color:grey">Memory related configs</span>
+
+#### withMaxMemoryFractionPerPartitionMerge(maxMemoryFractionPerPartitionMerge = 0.6) {#withMaxMemoryFractionPerPartitionMerge} 
+Property: `hoodie.memory.merge.fraction` <br/>
+<span style="color:grey">This fraction is multiplied with the user memory fraction (1 - spark.memory.fraction) to get a final fraction of heap space to use during merge </span>
+
+#### withMaxMemorySizePerCompactionInBytes(maxMemorySizePerCompactionInBytes = 1GB) {#withMaxMemorySizePerCompactionInBytes} 
+Property: `hoodie.memory.compaction.fraction` <br/>
+<span style="color:grey">HoodieCompactedLogScanner reads logblocks, converts records to HoodieRecords and then merges these log blocks and records. At any point, the number of entries in a log block can be less than or equal to the number of entries in the corresponding parquet file. This can lead to OOM in the Scanner. Hence, a spillable map helps alleviate the memory pressure. Use this config to set the max allowable inMemory footprint of the spillable map.</span>
+
+#### withWriteStatusFailureFraction(failureFraction = 0.1) {#withWriteStatusFailureFraction}
+Property: `hoodie.memory.writestatus.failure.fraction` <br/>
+<span style="color:grey">This property controls what fraction of the failed record, exceptions we report back to driver</span>
diff --git a/docs/_docs/0.5.2/2_5_performance.cn.md b/docs/_docs/0.5.2/2_5_performance.cn.md
new file mode 100644
index 0000000..8da1813
--- /dev/null
+++ b/docs/_docs/0.5.2/2_5_performance.cn.md
@@ -0,0 +1,64 @@
+---
+version: 0.5.2
+title: 性能
+keywords: hudi, index, storage, compaction, cleaning, implementation
+permalink: /cn/docs/0.5.2-performance.html
+toc: false
+last_modified_at: 2019-12-30T15:59:57-04:00
+language: cn
+---
+
+在本节中,我们将介绍一些有关Hudi插入更新、增量提取的实际性能数据,并将其与实现这些任务的其它传统工具进行比较。
+
+## 插入更新
+
+下面显示了从NoSQL数据库摄取获得的速度提升,这些速度提升数据是通过在写入时复制存储上的Hudi数据集上插入更新而获得的,
+数据集包括5个从小到大的表(相对于批量加载表)。
+
+<figure>
+    <img class="docimage" src="/assets/images/hudi_upsert_perf1.png" alt="hudi_upsert_perf1.png" style="max-width: 1000px" />
+</figure>
+
+由于Hudi可以通过增量构建数据集,它也为更频繁地调度摄取提供了可能性,从而减少了延迟,并显著节省了总体计算成本。
+
+<figure>
+    <img class="docimage" src="/assets/images/hudi_upsert_perf2.png" alt="hudi_upsert_perf2.png" style="max-width: 1000px" />
+</figure>
+
+Hudi插入更新在t1表的一次提交中就进行了高达4TB的压力测试。
+有关一些调优技巧,请参见[这里](https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide)。
+
+## 索引
+
+为了有效地插入更新数据,Hudi需要将要写入的批量数据中的记录分类为插入和更新(并标记它所属的文件组)。
+为了加快此操作的速度,Hudi采用了可插拔索引机制,该机制存储了recordKey和它所属的文件组ID之间的映射。
+默认情况下,Hudi使用内置索引,该索引使用文件范围和布隆过滤器来完成此任务,相比于Spark Join,其速度最高可提高10倍。
+
+当您将recordKey建模为单调递增时(例如时间戳前缀),Hudi提供了最佳的索引性能,从而进行范围过滤来避免与许多文件进行比较。
+即使对于基于UUID的键,也有[已知技术](https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/)来达到同样目的。
+例如,在具有80B键、3个分区、11416个文件、10TB数据的事件表上使用100M个时间戳前缀的键(5%的更新,95%的插入)时,
+相比于原始Spark Join,Hudi索引速度的提升**约为7倍(440秒相比于2880秒)**。
+即使对于具有挑战性的工作负载,如使用300个核对3.25B UUID键、30个分区、6180个文件的“100%更新”的数据库摄取工作负载,Hudi索引也可以提供**80-100%的加速**。
+
+## 读优化查询
+
+读优化视图的主要设计目标是在不影响查询的情况下实现上一节中提到的延迟减少和效率提高。
+下图比较了对Hudi和非Hudi数据集的Hive、Presto、Spark查询,并对此进行说明。
+
+**Hive**
+
+<figure>
+    <img class="docimage" src="/assets/images/hudi_query_perf_hive.png" alt="hudi_query_perf_hive.png" style="max-width: 800px" />
+</figure>
+
+**Spark**
+
+<figure>
+    <img class="docimage" src="/assets/images/hudi_query_perf_spark.png" alt="hudi_query_perf_spark.png" style="max-width: 1000px" />
+</figure>
+
+**Presto**
+
+<figure>
+    <img class="docimage" src="/assets/images/hudi_query_perf_presto.png" alt="hudi_query_perf_presto.png" style="max-width: 1000px" />
+</figure>
diff --git a/docs/_docs/0.5.2/2_5_performance.md b/docs/_docs/0.5.2/2_5_performance.md
new file mode 100644
index 0000000..cb13ad1
--- /dev/null
+++ b/docs/_docs/0.5.2/2_5_performance.md
@@ -0,0 +1,66 @@
+---
+version: 0.5.2
+title: Performance
+keywords: hudi, index, storage, compaction, cleaning, implementation
+permalink: /docs/0.5.2-performance.html
+toc: false
+last_modified_at: 2019-12-30T15:59:57-04:00
+---
+
+In this section, we go over some real world performance numbers for Hudi upserts, incremental pull and compare them against
+the conventional alternatives for achieving these tasks. 
+
+## Upserts
+
+Following shows the speed up obtained for NoSQL database ingestion, from incrementally upserting on a Hudi table on the copy-on-write storage,
+on 5 tables ranging from small to huge (as opposed to bulk loading the tables)
+
+<figure>
+    <img class="docimage" src="/assets/images/hudi_upsert_perf1.png" alt="hudi_upsert_perf1.png" style="max-width: 1000px" />
+</figure>
+
+Given Hudi can build the table incrementally, it opens doors for also scheduling ingesting more frequently thus reducing latency, with
+significant savings on the overall compute cost.
+
+<figure>
+    <img class="docimage" src="/assets/images/hudi_upsert_perf2.png" alt="hudi_upsert_perf2.png" style="max-width: 1000px" />
+</figure>
+
+Hudi upserts have been stress tested upto 4TB in a single commit across the t1 table. 
+See [here](https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide) for some tuning tips.
+
+## Indexing
+
+In order to efficiently upsert data, Hudi needs to classify records in a write batch into inserts & updates (tagged with the file group 
+it belongs to). In order to speed this operation, Hudi employs a pluggable index mechanism that stores a mapping between recordKey and 
+the file group id it belongs to. By default, Hudi uses a built in index that uses file ranges and bloom filters to accomplish this, with
+upto 10x speed up over a spark join to do the same. 
+
+Hudi provides best indexing performance when you model the recordKey to be monotonically increasing (e.g timestamp prefix), leading to range pruning filtering
+out a lot of files for comparison. Even for UUID based keys, there are [known techniques](https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/) to achieve this.
+For e.g , with 100M timestamp prefixed keys (5% updates, 95% inserts) on a event table with 80B keys/3 partitions/11416 files/10TB data, Hudi index achieves a 
+**~7X (2880 secs vs 440 secs) speed up** over vanilla spark join. Even for a challenging workload like an '100% update' database ingestion workload spanning 
+3.25B UUID keys/30 partitions/6180 files using 300 cores, Hudi indexing offers a **80-100% speedup**.
+
+## Snapshot Queries
+
+The major design goal for snapshot queries is to achieve the latency reduction & efficiency gains in previous section,
+with no impact on queries. Following charts compare the Hudi vs non-Hudi tables across Hive/Presto/Spark queries and demonstrate this.
+
+**Hive**
+
+<figure>
+    <img class="docimage" src="/assets/images/hudi_query_perf_hive.png" alt="hudi_query_perf_hive.png" style="max-width: 800px" />
+</figure>
+
+**Spark**
+
+<figure>
+    <img class="docimage" src="/assets/images/hudi_query_perf_spark.png" alt="hudi_query_perf_spark.png" style="max-width: 1000px" />
+</figure>
+
+**Presto**
+
+<figure>
+    <img class="docimage" src="/assets/images/hudi_query_perf_presto.png" alt="hudi_query_perf_presto.png" style="max-width: 1000px" />
+</figure>
diff --git a/docs/_docs/0.5.2/2_6_deployment.cn.md b/docs/_docs/0.5.2/2_6_deployment.cn.md
new file mode 100644
index 0000000..35c0868
--- /dev/null
+++ b/docs/_docs/0.5.2/2_6_deployment.cn.md
@@ -0,0 +1,435 @@
+---
+version: 0.5.2
+title: 管理 Hudi Pipelines
+keywords: hudi, administration, operation, devops
+permalink: /cn/docs/0.5.2-deployment.html
+summary: This section offers an overview of tools available to operate an ecosystem of Hudi datasets
+toc: true
+last_modified_at: 2019-12-30T15:59:57-04:00
+language: cn
+---
+
+管理员/运维人员可以通过以下方式了解Hudi数据集/管道
+
+ - [通过Admin CLI进行管理](#admin-cli)
+ - [Graphite指标](#metrics)
+ - [Hudi应用程序的Spark UI](#spark-ui)
+
+本节简要介绍了每一种方法,并提供了有关[故障排除](#troubleshooting)的一些常规指南
+
+## Admin CLI {#admin-cli}
+
+一旦构建了hudi,就可以通过`cd hudi-cli && ./hudi-cli.sh`启动shell。
+一个hudi数据集位于DFS上的**basePath**位置,我们需要该位置才能连接到Hudi数据集。
+Hudi库使用.hoodie子文件夹跟踪所有元数据,从而有效地在内部管理该数据集。
+
+初始化hudi表,可使用如下命令。
+
+```java
+18/09/06 15:56:52 INFO annotation.AutowiredAnnotationBeanPostProcessor: JSR-330 'javax.inject.Inject' annotation found and supported for autowiring
+============================================
+*                                          *
+*     _    _           _   _               *
+*    | |  | |         | | (_)              *
+*    | |__| |       __| |  -               *
+*    |  __  ||   | / _` | ||               *
+*    | |  | ||   || (_| | ||               *
+*    |_|  |_|\___/ \____/ ||               *
+*                                          *
+============================================
+
+Welcome to Hoodie CLI. Please type help if you are looking for help.
+hudi->create --path /user/hive/warehouse/table1 --tableName hoodie_table_1 --tableType COPY_ON_WRITE
+.....
+18/09/06 15:57:15 INFO table.HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from ...
+```
+
+使用desc命令可以查看hudi表的描述信息:
+
+```java
+hoodie:hoodie_table_1->desc
+18/09/06 15:57:19 INFO timeline.HoodieActiveTimeline: Loaded instants []
+    _________________________________________________________
+    | Property                | Value                        |
+    |========================================================|
+    | basePath                | ...                          |
+    | metaPath                | ...                          |
+    | fileSystem              | hdfs                         |
+    | hoodie.table.name       | hoodie_table_1               |
+    | hoodie.table.type       | COPY_ON_WRITE                |
+    | hoodie.archivelog.folder|                              |
+```
+
+以下是连接到包含uber trips的Hudi数据集的示例命令。
+
+```java
+hoodie:trips->connect --path /app/uber/trips
+
+16/10/05 23:20:37 INFO model.HoodieTableMetadata: Attempting to load the commits under /app/uber/trips/.hoodie with suffix .commit
+16/10/05 23:20:37 INFO model.HoodieTableMetadata: Attempting to load the commits under /app/uber/trips/.hoodie with suffix .inflight
+16/10/05 23:20:37 INFO model.HoodieTableMetadata: All commits :HoodieCommits{commitList=[20161002045850, 20161002052915, 20161002055918, 20161002065317, 20161002075932, 20161002082904, 20161002085949, 20161002092936, 20161002105903, 20161002112938, 20161002123005, 20161002133002, 20161002155940, 20161002165924, 20161002172907, 20161002175905, 20161002190016, 20161002192954, 20161002195925, 20161002205935, 20161002215928, 20161002222938, 20161002225915, 20161002232906, 20161003003028, 201 [...]
+Metadata for table trips loaded
+hoodie:trips->
+```
+
+连接到数据集后,便可使用许多其他命令。该shell程序具有上下文自动完成帮助(按TAB键),下面是所有命令的列表,本节中对其中的一些命令进行了详细示例。
+
+
+```java
+hoodie:trips->help
+* ! - Allows execution of operating system (OS) commands
+* // - Inline comment markers (start of line only)
+* ; - Inline comment markers (start of line only)
+* addpartitionmeta - Add partition metadata to a dataset, if not present
+* clear - Clears the console
+* cls - Clears the console
+* commit rollback - Rollback a commit
+* commits compare - Compare commits with another Hoodie dataset
+* commit showfiles - Show file level details of a commit
+* commit showpartitions - Show partition level details of a commit
+* commits refresh - Refresh the commits
+* commits show - Show the commits
+* commits sync - Compare commits with another Hoodie dataset
+* connect - Connect to a hoodie dataset
+* date - Displays the local date and time
+* exit - Exits the shell
+* help - List all commands usage
+* quit - Exits the shell
+* records deduplicate - De-duplicate a partition path contains duplicates & produce repaired files to replace with
+* script - Parses the specified resource file and executes its commands
+* stats filesizes - File Sizes. Display summary stats on sizes of files
+* stats wa - Write Amplification. Ratio of how many records were upserted to how many records were actually written
+* sync validate - Validate the sync by counting the number of records
+* system properties - Shows the shell's properties
+* utils loadClass - Load a class
+* version - Displays shell version
+
+hoodie:trips->
+```
+
+
+### 检查提交
+
+在Hudi中,更新或插入一批记录的任务被称为**提交**。提交可提供基本的原子性保证,即只有提交的数据可用于查询。
+每个提交都有一个单调递增的字符串/数字,称为**提交编号**。通常,这是我们开始提交的时间。
+
+查看有关最近10次提交的一些基本信息,
+
+
+```java
+hoodie:trips->commits show --sortBy "Total Bytes Written" --desc true --limit 10
+    ________________________________________________________________________________________________________________________________________________________________________
+    | CommitTime    | Total Bytes Written| Total Files Added| Total Files Updated| Total Partitions Written| Total Records Written| Total Update Records Written| Total Errors|
+    |=======================================================================================================================================================================|
+    ....
+    ....
+    ....
+hoodie:trips->
+```
+
+在每次写入开始时,Hudi还将.inflight提交写入.hoodie文件夹。您可以使用那里的时间戳来估计正在进行的提交已经花费的时间
+
+```java
+$ hdfs dfs -ls /app/uber/trips/.hoodie/*.inflight
+-rw-r--r--   3 vinoth supergroup     321984 2016-10-05 23:18 /app/uber/trips/.hoodie/20161005225920.inflight
+```
+
+
+### 深入到特定的提交
+
+了解写入如何分散到特定分区,
+
+
+```java
+hoodie:trips->commit showpartitions --commit 20161005165855 --sortBy "Total Bytes Written" --desc true --limit 10
+    __________________________________________________________________________________________________________________________________________
+    | Partition Path| Total Files Added| Total Files Updated| Total Records Inserted| Total Records Updated| Total Bytes Written| Total Errors|
+    |=========================================================================================================================================|
+     ....
+     ....
+```
+
+如果您需要文件级粒度,我们可以执行以下操作
+
+```java
+hoodie:trips->commit showfiles --commit 20161005165855 --sortBy "Partition Path"
+    ________________________________________________________________________________________________________________________________________________________
+    | Partition Path| File ID                             | Previous Commit| Total Records Updated| Total Records Written| Total Bytes Written| Total Errors|
+    |=======================================================================================================================================================|
+    ....
+    ....
+```
+
+
+### 文件系统视图
+
+Hudi将每个分区视为文件组的集合,每个文件组包含按提交顺序排列的文件切片列表(请参阅概念)。以下命令允许用户查看数据集的文件切片。
+
+```java
+ hoodie:stock_ticks_mor->show fsview all
+ ....
+  _______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
+ | Partition | FileId | Base-Instant | Data-File | Data-File Size| Num Delta Files| Total Delta File Size| Delta Files |
+ |==============================================================================================================================================================================================================================================================================================================================================================================================================|
+ | 2018/08/31| 111415c3-f26d-4639-86c8-f9956f245ac3| 20181002180759| hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/111415c3-f26d-4639-86c8-f9956f245ac3_0_20181002180759.parquet| 432.5 KB | 1 | 20.8 KB | [HoodieLogFile {hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/.111415c3-f26d-4639-86c8-f9956f245ac3_20181002180759.log.1}]|
+
+
+
+ hoodie:stock_ticks_mor->show fsview latest --partitionPath "2018/08/31"
+ ......
+ ___________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________ [...]
+ | Partition | FileId | Base-Instant | Data-File | Data-File Size| Num Delta Files| Total Delta Size| Delta Size - compaction scheduled| Delta Size - compaction unscheduled| Delta To Base Ratio - compaction scheduled| Delta To Base Ratio - compaction unscheduled| Delta Files - compaction scheduled | Delta Files - compaction unscheduled|
+ |========================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================== [...]
+ | 2018/08/31| 111415c3-f26d-4639-86c8-f9956f245ac3| 20181002180759| hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/111415c3-f26d-4639-86c8-f9956f245ac3_0_20181002180759.parquet| 432.5 KB | 1 | 20.8 KB | 20.8 KB | 0.0 B | 0.0 B | 0.0 B | [HoodieLogFile {hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/.111415c3-f26d-4639-86c8-f9956f245ac3_20181002180759.log.1}]| [] |
+
+ hoodie:stock_ticks_mor->
+```
+
+
+### 统计信息
+
+由于Hudi直接管理DFS数据集的文件大小,这些信息会帮助你全面了解Hudi的运行状况
+
+
+```java
+hoodie:trips->stats filesizes --partitionPath 2016/09/01 --sortBy "95th" --desc true --limit 10
+    ________________________________________________________________________________________________
+    | CommitTime    | Min     | 10th    | 50th    | avg     | 95th    | Max     | NumFiles| StdDev  |
+    |===============================================================================================|
+    | <COMMIT_ID>   | 93.9 MB | 93.9 MB | 93.9 MB | 93.9 MB | 93.9 MB | 93.9 MB | 2       | 2.3 KB  |
+    ....
+    ....
+```
+
+如果Hudi写入花费的时间更长,那么可以通过观察写放大指标来发现任何异常
+
+```java
+hoodie:trips->stats wa
+    __________________________________________________________________________
+    | CommitTime    | Total Upserted| Total Written| Write Amplifiation Factor|
+    |=========================================================================|
+    ....
+    ....
+```
+
+
+### 归档的提交
+
+为了限制DFS上.commit文件的增长量,Hudi将较旧的.commit文件(适当考虑清理策略)归档到commits.archived文件中。
+这是一个序列文件,其包含commitNumber => json的映射,及有关提交的原始信息(上面已很好地汇总了相同的信息)。
+
+### 压缩
+
+要了解压缩和写程序之间的时滞,请使用以下命令列出所有待处理的压缩。
+
+```java
+hoodie:trips->compactions show all
+     ___________________________________________________________________
+    | Compaction Instant Time| State    | Total FileIds to be Compacted|
+    |==================================================================|
+    | <INSTANT_1>            | REQUESTED| 35                           |
+    | <INSTANT_2>            | INFLIGHT | 27                           |
+```
+
+要检查特定的压缩计划,请使用
+
+```java
+hoodie:trips->compaction show --instant <INSTANT_1>
+    _________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
+    | Partition Path| File Id | Base Instant  | Data File Path                                    | Total Delta Files| getMetrics                                                                                                                    |
+    |================================================================================================================================================================================================================================================
+    | 2018/07/17    | <UUID>  | <INSTANT_1>   | viewfs://ns-default/.../../UUID_<INSTANT>.parquet | 1                | {TOTAL_LOG_FILES=1.0, TOTAL_IO_READ_MB=1230.0, TOTAL_LOG_FILES_SIZE=2.51255751E8, TOTAL_IO_WRITE_MB=991.0, TOTAL_IO_MB=2221.0}|
+
+```
+
+要手动调度或运行压缩,请使用以下命令。该命令使用spark启动器执行压缩操作。
+注意:确保没有其他应用程序正在同时调度此数据集的压缩
+
+```java
+hoodie:trips->help compaction schedule
+Keyword:                   compaction schedule
+Description:               Schedule Compaction
+ Keyword:                  sparkMemory
+   Help:                   Spark executor memory
+   Mandatory:              false
+   Default if specified:   '__NULL__'
+   Default if unspecified: '1G'
+
+* compaction schedule - Schedule Compaction
+```
+
+```java
+hoodie:trips->help compaction run
+Keyword:                   compaction run
+Description:               Run Compaction for given instant time
+ Keyword:                  tableName
+   Help:                   Table name
+   Mandatory:              true
+   Default if specified:   '__NULL__'
+   Default if unspecified: '__NULL__'
+
+ Keyword:                  parallelism
+   Help:                   Parallelism for hoodie compaction
+   Mandatory:              true
+   Default if specified:   '__NULL__'
+   Default if unspecified: '__NULL__'
+
+ Keyword:                  schemaFilePath
+   Help:                   Path for Avro schema file
+   Mandatory:              true
+   Default if specified:   '__NULL__'
+   Default if unspecified: '__NULL__'
+
+ Keyword:                  sparkMemory
+   Help:                   Spark executor memory
+   Mandatory:              true
+   Default if specified:   '__NULL__'
+   Default if unspecified: '__NULL__'
+
+ Keyword:                  retry
+   Help:                   Number of retries
+   Mandatory:              true
+   Default if specified:   '__NULL__'
+   Default if unspecified: '__NULL__'
+
+ Keyword:                  compactionInstant
+   Help:                   Base path for the target hoodie dataset
+   Mandatory:              true
+   Default if specified:   '__NULL__'
+   Default if unspecified: '__NULL__'
+
+* compaction run - Run Compaction for given instant time
+```
+
+### 验证压缩
+
+验证压缩计划:检查压缩所需的所有文件是否都存在且有效
+
+```java
+hoodie:stock_ticks_mor->compaction validate --instant 20181005222611
+...
+
+   COMPACTION PLAN VALID
+
+    ___________________________________________________________________________________________________________________________________________________________________________________________________________________________
+    | File Id                             | Base Instant Time| Base Data File                                                                                                                   | Num Delta Files| Valid| Error|
+    |==========================================================================================================================================================================================================================|
+    | 05320e98-9a57-4c38-b809-a6beaaeb36bd| 20181005222445   | hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/05320e98-9a57-4c38-b809-a6beaaeb36bd_0_20181005222445.parquet| 1              | true |      |
+
+
+
+hoodie:stock_ticks_mor->compaction validate --instant 20181005222601
+
+   COMPACTION PLAN INVALID
+
+    _______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
+    | File Id                             | Base Instant Time| Base Data File                                                                                                                   | Num Delta Files| Valid| Error                                                                           |
+    |=====================================================================================================================================================================================================================================================================================================|
+    | 05320e98-9a57-4c38-b809-a6beaaeb36bd| 20181005222445   | hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/05320e98-9a57-4c38-b809-a6beaaeb36bd_0_20181005222445.parquet| 1              | false| All log files specified in compaction operation is not present. Missing ....    |
+```
+
+### 注意
+
+必须在其他写入/摄取程序没有运行的情况下执行以下命令。
+
+有时,有必要从压缩计划中删除fileId以便加快或取消压缩操作。
+压缩计划之后在此文件上发生的所有新日志文件都将被安全地重命名以便进行保留。Hudi提供以下CLI来支持
+
+
+### 取消调度压缩
+
+```java
+hoodie:trips->compaction unscheduleFileId --fileId <FileUUID>
+....
+No File renames needed to unschedule file from pending compaction. Operation successful.
+```
+
+在其他情况下,需要撤销整个压缩计划。以下CLI支持此功能
+
+```java
+hoodie:trips->compaction unschedule --compactionInstant <compactionInstant>
+.....
+No File renames needed to unschedule pending compaction. Operation successful.
+```
+
+### 修复压缩
+
+上面的压缩取消调度操作有时可能会部分失败(例如:DFS暂时不可用)。
+如果发生部分故障,则压缩操作可能与文件切片的状态不一致。
+当您运行`压缩验证`时,您会注意到无效的压缩操作(如果有的话)。
+在这种情况下,修复命令将立即执行,它将重新排列文件切片,以使文件不丢失,并且文件切片与压缩计划一致
+
+```java
+hoodie:stock_ticks_mor->compaction repair --instant 20181005222611
+......
+Compaction successfully repaired
+.....
+```
+
+
+## 指标 {#metrics}
+
+为Hudi Client配置正确的数据集名称和指标环境后,它将生成以下graphite指标,以帮助调试hudi数据集
+
+ - **提交持续时间** - 这是成功提交一批记录所花费的时间
+ - **回滚持续时间** - 同样,撤消失败的提交所剩余的部分数据所花费的时间(每次写入失败后都会自动发生)
+ - **文件级别指标** - 显示每次提交中新增、版本、删除(清除)的文件数量
+ - **记录级别指标** - 每次提交插入/更新的记录总数
+ - **分区级别指标** - 更新的分区数量(对于了解提交持续时间的突然峰值非常有用)
+
+然后可以将这些指标绘制在grafana等标准工具上。以下是提交持续时间图表示例。
+
+<figure>
+    <img class="docimage" src="/assets/images/hudi_commit_duration.png" alt="hudi_commit_duration.png" style="max-width: 100%" />
+</figure>
+
+
+## 故障排除 {#troubleshooting}
+
+以下部分通常有助于调试Hudi故障。以下元数据已被添加到每条记录中,可以通过标准Hadoop SQL引擎(Hive/Presto/Spark)检索,来更容易地诊断问题的严重性。
+
+ - **_hoodie_record_key** - 作为每个DFS分区内的主键,是所有更新/插入的基础
+ - **_hoodie_commit_time** - 该记录上次的提交
+ - **_hoodie_file_name** - 包含记录的实际文件名(对检查重复非常有用)
+ - **_hoodie_partition_path** - basePath的路径,该路径标识包含此记录的分区
+
+请注意,到目前为止,Hudi假定应用程序为给定的recordKey传递相同的确定性分区路径。即仅在每个分区内保证recordKey(主键)的唯一性。
+
+### 缺失记录
+
+请在可能写入记录的窗口中,使用上面的admin命令检查是否存在任何写入错误。
+如果确实发现错误,那么记录实际上不是由Hudi写入的,而是交还给应用程序来决定如何处理。
+
+### 重复
+
+首先,请确保访问Hudi数据集的查询是[没有问题的](sql_queries.html),并之后确认的确有重复。
+
+ - 如果确认,请使用上面的元数据字段来标识包含记录的物理文件和分区文件。
+ - 如果重复的记录存在于不同分区路径下的文件,则意味着您的应用程序正在为同一recordKey生成不同的分区路径,请修复您的应用程序.
+ - 如果重复的记录存在于同一分区路径下的多个文件,请使用邮件列表汇报这个问题。这不应该发生。您可以使用`records deduplicate`命令修复数据。
+
+### Spark故障 {#spark-ui}
+
+典型的upsert() DAG如下所示。请注意,Hudi客户端会缓存中间的RDD,以智能地并调整文件大小和Spark并行度。
+另外,由于还显示了探针作业,Spark UI显示了两次sortByKey,但它只是一个排序。
+<figure>
+    <img class="docimage" src="/assets/images/hudi_upsert_dag.png" alt="hudi_upsert_dag.png" style="max-width: 100%" />
+</figure>
+
+
+概括地说,有两个步骤
+
+**索引查找以标识要更改的文件**
+
+ - Job 1 : 触发输入数据读取,转换为HoodieRecord对象,然后根据输入记录拿到目标分区路径。
+ - Job 2 : 加载我们需要检查的文件名集。
+ - Job 3  & 4 : 通过联合上面1和2中的RDD,智能调整spark join并行度,然后进行实际查找。
+ - Job 5 : 生成带有位置的recordKeys作为标记的RDD。
+
+**执行数据的实际写入**
+
+ - Job 6 : 将记录与recordKey(位置)进行懒惰连接,以提供最终的HoodieRecord集,现在它包含每条记录的文件/分区路径信息(如果插入,则为null)。然后还要再次分析工作负载以确定文件的大小。
+ - Job 7 : 实际写入数据(更新 + 插入 + 插入转为更新以保持文件大小)
+
+根据异常源(Hudi/Spark),上述关于DAG的信息可用于查明实际问题。最常遇到的故障是由YARN/DFS临时故障引起的。
+将来,将在项目中添加更复杂的调试/管理UI,以帮助自动进行某些调试。
diff --git a/docs/_docs/0.5.2/2_6_deployment.md b/docs/_docs/0.5.2/2_6_deployment.md
new file mode 100644
index 0000000..c5cac7b
--- /dev/null
+++ b/docs/_docs/0.5.2/2_6_deployment.md
@@ -0,0 +1,598 @@
+---
+version: 0.5.2
+title: Deployment Guide
+keywords: hudi, administration, operation, devops, deployment
+permalink: /docs/0.5.2-deployment.html
+summary: This section offers an overview of tools available to operate an ecosystem of Hudi
+toc: true
+last_modified_at: 2019-12-30T15:59:57-04:00
+---
+
+This section provides all the help you need to deploy and operate Hudi tables at scale. 
+Specifically, we will cover the following aspects.
+
+ - [Deployment Model](#deploying) : How various Hudi components are deployed and managed.
+ - [Upgrading Versions](#upgrading) : Picking up new releases of Hudi, guidelines and general best-practices.
+ - [Migrating to Hudi](#migrating) : How to migrate your existing tables to Apache Hudi.
+ - [Interacting via CLI](#cli) : Using the CLI to perform maintenance or deeper introspection.
+ - [Monitoring](#monitoring) : Tracking metrics from your hudi tables using popular tools.
+ - [Troubleshooting](#troubleshooting) : Uncovering, triaging and resolving issues in production.
+ 
+## Deploying
+
+All in all, Hudi deploys with no long running servers or additional infrastructure cost to your data lake. In fact, Hudi pioneered this model of building a transactional distributed storage layer
+using existing infrastructure and its heartening to see other systems adopting similar approaches as well. Hudi writing is done via Spark jobs (DeltaStreamer or custom Spark datasource jobs), deployed per standard Apache Spark [recommendations](https://spark.apache.org/docs/latest/cluster-overview.html).
+Querying Hudi tables happens via libraries installed into Apache Hive, Apache Spark or Presto and hence no additional infrastructure is necessary. 
+
+A typical Hudi data ingestion can be achieved in 2 modes. In a singe run mode, Hudi ingestion reads next batch of data, ingest them to Hudi table and exits. In continuous mode, Hudi ingestion runs as a long-running service executing ingestion in a loop.
+
+With Merge_On_Read Table, Hudi ingestion needs to also take care of compacting delta files. Again, compaction can be performed in an asynchronous-mode by letting compaction run concurrently with ingestion or in a serial fashion with one after another.
+
+### DeltaStreamer
+
+[DeltaStreamer](/docs/0.5.2-writing_data.html#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in 2 modes.
+
+ - **Run Once Mode** : In this mode, Deltastreamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for eve [...]
+
+Here is an example invocation for reading from kafka topic in a single-run mode and writing to Merge On Read table type in a yarn cluster.
+
+```java
+[hoodie]$ spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
+ --master yarn \
+ --deploy-mode cluster \
+ --num-executors 10 \
+ --executor-memory 3g \
+ --driver-memory 6g \
+ --conf spark.driver.extraJavaOptions="-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/varadarb_ds_driver.hprof" \
+ --conf spark.executor.extraJavaOptions="-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/varadarb_ds_executor.hprof" \
+ --queue hadoop-platform-queue \
+ --conf spark.scheduler.mode=FAIR \
+ --conf spark.yarn.executor.memoryOverhead=1072 \
+ --conf spark.yarn.driver.memoryOverhead=2048 \
+ --conf spark.task.cpus=1 \
+ --conf spark.executor.cores=1 \
+ --conf spark.task.maxFailures=10 \
+ --conf spark.memory.fraction=0.4 \
+ --conf spark.rdd.compress=true \
+ --conf spark.kryoserializer.buffer.max=200m \
+ --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
+ --conf spark.memory.storageFraction=0.1 \
+ --conf spark.shuffle.service.enabled=true \
+ --conf spark.sql.hive.convertMetastoreParquet=false \
+ --conf spark.ui.port=5555 \
+ --conf spark.driver.maxResultSize=3g \
+ --conf spark.executor.heartbeatInterval=120s \
+ --conf spark.network.timeout=600s \
+ --conf spark.eventLog.overwrite=true \
+ --conf spark.eventLog.enabled=true \
+ --conf spark.eventLog.dir=hdfs:///user/spark/applicationHistory \
+ --conf spark.yarn.max.executor.failures=10 \
+ --conf spark.sql.catalogImplementation=hive \
+ --conf spark.sql.shuffle.partitions=100 \
+ --driver-class-path $HADOOP_CONF_DIR \
+ --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
+ --table-type MERGE_ON_READ \
+ --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
+ --source-ordering-field ts  \
+ --target-base-path /user/hive/warehouse/stock_ticks_mor \
+ --target-table stock_ticks_mor \
+ --props /var/demo/config/kafka-source.properties \
+ --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
+```
+
+ - **Continuous Mode** :  Here, deltastreamer runs an infinite loop with each round performing one ingestion round as described in **Run Once Mode**. The frequency of data ingestion can be controlled by the configuration "--min-sync-interval-seconds". For Merge-On-Read tables, Compaction is run in asynchronous fashion concurrently with ingestion unless disabled by passing the flag "--disable-compaction". Every ingestion run triggers a compaction request asynchronously and this frequency  [...]
+
+Here is an example invocation for reading from kafka topic in a continuous mode and writing to Merge On Read table type in a yarn cluster.
+
+```java
+[hoodie]$ spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
+ --master yarn \
+ --deploy-mode cluster \
+ --num-executors 10 \
+ --executor-memory 3g \
+ --driver-memory 6g \
+ --conf spark.driver.extraJavaOptions="-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/varadarb_ds_driver.hprof" \
+ --conf spark.executor.extraJavaOptions="-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/varadarb_ds_executor.hprof" \
+ --queue hadoop-platform-queue \
+ --conf spark.scheduler.mode=FAIR \
+ --conf spark.yarn.executor.memoryOverhead=1072 \
+ --conf spark.yarn.driver.memoryOverhead=2048 \
+ --conf spark.task.cpus=1 \
+ --conf spark.executor.cores=1 \
+ --conf spark.task.maxFailures=10 \
+ --conf spark.memory.fraction=0.4 \
+ --conf spark.rdd.compress=true \
+ --conf spark.kryoserializer.buffer.max=200m \
+ --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
+ --conf spark.memory.storageFraction=0.1 \
+ --conf spark.shuffle.service.enabled=true \
+ --conf spark.sql.hive.convertMetastoreParquet=false \
+ --conf spark.ui.port=5555 \
+ --conf spark.driver.maxResultSize=3g \
+ --conf spark.executor.heartbeatInterval=120s \
+ --conf spark.network.timeout=600s \
+ --conf spark.eventLog.overwrite=true \
+ --conf spark.eventLog.enabled=true \
+ --conf spark.eventLog.dir=hdfs:///user/spark/applicationHistory \
+ --conf spark.yarn.max.executor.failures=10 \
+ --conf spark.sql.catalogImplementation=hive \
+ --conf spark.sql.shuffle.partitions=100 \
+ --driver-class-path $HADOOP_CONF_DIR \
+ --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
+ --table-type MERGE_ON_READ \
+ --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
+ --source-ordering-field ts  \
+ --target-base-path /user/hive/warehouse/stock_ticks_mor \
+ --target-table stock_ticks_mor \
+ --props /var/demo/config/kafka-source.properties \
+ --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
+ --continuous
+```
+
+### Spark Datasource Writer Jobs
+
+As described in [Writing Data](/docs/0.5.2-writing_data.html#datasource-writer), you can use spark datasource to ingest to hudi table. This mechanism allows you to ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports spark streaming to ingest a streaming source to Hudi table. For Merge On Read table types, inline compaction is turned on by default which runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compac [...]
+
+Here is an example invocation using spark datasource
+
+```java
+inputDF.write()
+       .format("org.apache.hudi")
+       .options(clientOpts) // any of the Hudi client opts can be passed in as well
+       .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
+       .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
+       .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
+       .option(HoodieWriteConfig.TABLE_NAME, tableName)
+       .mode(SaveMode.Append)
+       .save(basePath);
+```
+ 
+## Upgrading 
+
+New Hudi releases are listed on the [releases page](/releases), with detailed notes which list all the changes, with highlights in each release. 
+At the end of the day, Hudi is a storage system and with that comes a lot of responsibilities, which we take seriously. 
+
+As general guidelines, 
+
+ - We strive to keep all changes backwards compatible (i.e new code can read old data/timeline files) and when we cannot, we will provide upgrade/downgrade tools via the CLI
+ - We cannot always guarantee forward compatibility (i.e old code being able to read data/timeline files written by a greater version). This is generally the norm, since no new features can be built otherwise.
+   However any large such changes, will be turned off by default, for smooth transition to newer release. After a few releases and once enough users deem the feature stable in production, we will flip the defaults in a subsequent release.
+ - Always upgrade the query bundles (mr-bundle, presto-bundle, spark-bundle) first and then upgrade the writers (deltastreamer, spark jobs using datasource). This often provides the best experience and it's easy to fix 
+   any issues by rolling forward/back the writer code (which typically you might have more control over)
+ - With large, feature rich releases we recommend migrating slowly, by first testing in staging environments and running your own tests. Upgrading Hudi is no different than upgrading any database system.
+
+Note that release notes can override this information with specific instructions, applicable on case-by-case basis.
+
+## Migrating
+
+Currently migrating to Hudi can be done using two approaches 
+
+- **Convert newer partitions to Hudi** : This model is suitable for large event tables (e.g: click streams, ad impressions), which also typically receive writes for the last few days alone. You can convert the last 
+   N partitions to Hudi and proceed writing as if it were a Hudi table to begin with. The Hudi query side code is able to correctly handle both hudi and non-hudi data partitions.
+- **Full conversion to Hudi** : This model is suitable if you are currently bulk/full loading the table few times a day (e.g database ingestion). The full conversion of Hudi is simply a one-time step (akin to 1 run of your existing job),
+   which moves all of the data into the Hudi format and provides the ability to incrementally update for future writes.
+
+For more details, refer to the detailed [migration guide](/docs/0.5.2-migration_guide.html). In the future, we will be supporting seamless zero-copy bootstrap of existing tables with all the upsert/incremental query capabilities fully supported.
+
+## CLI
+
+Once hudi has been built, the shell can be fired by via  `cd hudi-cli && ./hudi-cli.sh`. A hudi table resides on DFS, in a location referred to as the `basePath` and 
+we would need this location in order to connect to a Hudi table. Hudi library effectively manages this table internally, using `.hoodie` subfolder to track all metadata.
+
+To initialize a hudi table, use the following command.
+
+```java
+===================================================================
+*         ___                          ___                        *
+*        /\__\          ___           /\  \           ___         *
+*       / /  /         /\__\         /  \  \         /\  \        *
+*      / /__/         / /  /        / /\ \  \        \ \  \       *
+*     /  \  \ ___    / /  /        / /  \ \__\       /  \__\      *
+*    / /\ \  /\__\  / /__/  ___   / /__/ \ |__|     / /\/__/      *
+*    \/  \ \/ /  /  \ \  \ /\__\  \ \  \ / /  /  /\/ /  /         *
+*         \  /  /    \ \  / /  /   \ \  / /  /   \  /__/          *
+*         / /  /      \ \/ /  /     \ \/ /  /     \ \__\          *
+*        / /  /        \  /  /       \  /  /       \/__/          *
+*        \/__/          \/__/         \/__/    Apache Hudi CLI    *
+*                                                                 *
+===================================================================
+
+hudi->create --path /user/hive/warehouse/table1 --tableName hoodie_table_1 --tableType COPY_ON_WRITE
+.....
+```
+
+To see the description of hudi table, use the command:
+
+```java
+hudi:hoodie_table_1->desc
+18/09/06 15:57:19 INFO timeline.HoodieActiveTimeline: Loaded instants []
+    _________________________________________________________
+    | Property                | Value                        |
+    |========================================================|
+    | basePath                | ...                          |
+    | metaPath                | ...                          |
+    | fileSystem              | hdfs                         |
+    | hoodie.table.name       | hoodie_table_1               |
+    | hoodie.table.type       | COPY_ON_WRITE                |
+    | hoodie.archivelog.folder|                              |
+```
+
+Following is a sample command to connect to a Hudi table contains uber trips.
+
+```java
+hudi:trips->connect --path /app/uber/trips
+
+16/10/05 23:20:37 INFO model.HoodieTableMetadata: All commits :HoodieCommits{commitList=[20161002045850, 20161002052915, 20161002055918, 20161002065317, 20161002075932, 20161002082904, 20161002085949, 20161002092936, 20161002105903, 20161002112938, 20161002123005, 20161002133002, 20161002155940, 20161002165924, 20161002172907, 20161002175905, 20161002190016, 20161002192954, 20161002195925, 20161002205935, 20161002215928, 20161002222938, 20161002225915, 20161002232906, 20161003003028, 201 [...]
+Metadata for table trips loaded
+```
+
+Once connected to the table, a lot of other commands become available. The shell has contextual autocomplete help (press TAB) and below is a list of all commands, few of which are reviewed in this section
+are reviewed
+
+```java
+hudi:trips->help
+* ! - Allows execution of operating system (OS) commands
+* // - Inline comment markers (start of line only)
+* ; - Inline comment markers (start of line only)
+* addpartitionmeta - Add partition metadata to a table, if not present
+* clear - Clears the console
+* cls - Clears the console
+* commit rollback - Rollback a commit
+* commits compare - Compare commits with another Hoodie table
+* commit showfiles - Show file level details of a commit
+* commit showpartitions - Show partition level details of a commit
+* commits refresh - Refresh the commits
+* commits show - Show the commits
+* commits sync - Compare commits with another Hoodie table
+* connect - Connect to a hoodie table
+* date - Displays the local date and time
+* exit - Exits the shell
+* help - List all commands usage
+* quit - Exits the shell
+* records deduplicate - De-duplicate a partition path contains duplicates & produce repaired files to replace with
+* script - Parses the specified resource file and executes its commands
+* stats filesizes - File Sizes. Display summary stats on sizes of files
+* stats wa - Write Amplification. Ratio of how many records were upserted to how many records were actually written
+* sync validate - Validate the sync by counting the number of records
+* system properties - Shows the shell's properties
+* utils loadClass - Load a class
+* version - Displays shell version
+
+hudi:trips->
+```
+
+
+### Inspecting Commits
+
+The task of upserting or inserting a batch of incoming records is known as a **commit** in Hudi. A commit provides basic atomicity guarantees such that only committed data is available for querying.
+Each commit has a monotonically increasing string/number called the **commit number**. Typically, this is the time at which we started the commit.
+
+To view some basic information about the last 10 commits,
+
+
+```java
+hudi:trips->commits show --sortBy "Total Bytes Written" --desc true --limit 10
+    ________________________________________________________________________________________________________________________________________________________________________
+    | CommitTime    | Total Bytes Written| Total Files Added| Total Files Updated| Total Partitions Written| Total Records Written| Total Update Records Written| Total Errors|
+    |=======================================================================================================================================================================|
+    ....
+    ....
+    ....
+```
+
+At the start of each write, Hudi also writes a .inflight commit to the .hoodie folder. You can use the timestamp there to estimate how long the commit has been inflight
+
+
+```java
+$ hdfs dfs -ls /app/uber/trips/.hoodie/*.inflight
+-rw-r--r--   3 vinoth supergroup     321984 2016-10-05 23:18 /app/uber/trips/.hoodie/20161005225920.inflight
+```
+
+
+### Drilling Down to a specific Commit
+
+To understand how the writes spread across specific partiions,
+
+
+```java
+hudi:trips->commit showpartitions --commit 20161005165855 --sortBy "Total Bytes Written" --desc true --limit 10
+    __________________________________________________________________________________________________________________________________________
+    | Partition Path| Total Files Added| Total Files Updated| Total Records Inserted| Total Records Updated| Total Bytes Written| Total Errors|
+    |=========================================================================================================================================|
+     ....
+     ....
+```
+
+If you need file level granularity , we can do the following
+
+
+```java
+hudi:trips->commit showfiles --commit 20161005165855 --sortBy "Partition Path"
+    ________________________________________________________________________________________________________________________________________________________
+    | Partition Path| File ID                             | Previous Commit| Total Records Updated| Total Records Written| Total Bytes Written| Total Errors|
+    |=======================================================================================================================================================|
+    ....
+    ....
+```
+
+
+### FileSystem View
+
+Hudi views each partition as a collection of file-groups with each file-group containing a list of file-slices in commit order (See [concepts]()). 
+The below commands allow users to view the file-slices for a data-set.
+
+```java
+hudi:stock_ticks_mor->show fsview all
+ ....
+  _______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
+ | Partition | FileId | Base-Instant | Data-File | Data-File Size| Num Delta Files| Total Delta File Size| Delta Files |
+ |==============================================================================================================================================================================================================================================================================================================================================================================================================|
+ | 2018/08/31| 111415c3-f26d-4639-86c8-f9956f245ac3| 20181002180759| hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/111415c3-f26d-4639-86c8-f9956f245ac3_0_20181002180759.parquet| 432.5 KB | 1 | 20.8 KB | [HoodieLogFile {hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/.111415c3-f26d-4639-86c8-f9956f245ac3_20181002180759.log.1}]|
+
+
+
+hudi:stock_ticks_mor->show fsview latest --partitionPath "2018/08/31"
+ ......
+ ___________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________ [...]
+ | Partition | FileId | Base-Instant | Data-File | Data-File Size| Num Delta Files| Total Delta Size| Delta Size - compaction scheduled| Delta Size - compaction unscheduled| Delta To Base Ratio - compaction scheduled| Delta To Base Ratio - compaction unscheduled| Delta Files - compaction scheduled | Delta Files - compaction unscheduled|
+ |========================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================== [...]
+ | 2018/08/31| 111415c3-f26d-4639-86c8-f9956f245ac3| 20181002180759| hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/111415c3-f26d-4639-86c8-f9956f245ac3_0_20181002180759.parquet| 432.5 KB | 1 | 20.8 KB | 20.8 KB | 0.0 B | 0.0 B | 0.0 B | [HoodieLogFile {hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/.111415c3-f26d-4639-86c8-f9956f245ac3_20181002180759.log.1}]| [] |
+
+```
+
+
+### Statistics
+
+Since Hudi directly manages file sizes for DFS table, it might be good to get an overall picture
+
+
+```java
+hudi:trips->stats filesizes --partitionPath 2016/09/01 --sortBy "95th" --desc true --limit 10
+    ________________________________________________________________________________________________
+    | CommitTime    | Min     | 10th    | 50th    | avg     | 95th    | Max     | NumFiles| StdDev  |
+    |===============================================================================================|
+    | <COMMIT_ID>   | 93.9 MB | 93.9 MB | 93.9 MB | 93.9 MB | 93.9 MB | 93.9 MB | 2       | 2.3 KB  |
+    ....
+    ....
+```
+
+In case of Hudi write taking much longer, it might be good to see the write amplification for any sudden increases
+
+
+```java
+hudi:trips->stats wa
+    __________________________________________________________________________
+    | CommitTime    | Total Upserted| Total Written| Write Amplifiation Factor|
+    |=========================================================================|
+    ....
+    ....
+```
+
+
+### Archived Commits
+
+In order to limit the amount of growth of .commit files on DFS, Hudi archives older .commit files (with due respect to the cleaner policy) into a commits.archived file.
+This is a sequence file that contains a mapping from commitNumber => json with raw information about the commit (same that is nicely rolled up above).
+
+
+### Compactions
+
+To get an idea of the lag between compaction and writer applications, use the below command to list down all
+pending compactions.
+
+```java
+hudi:trips->compactions show all
+     ___________________________________________________________________
+    | Compaction Instant Time| State    | Total FileIds to be Compacted|
+    |==================================================================|
+    | <INSTANT_1>            | REQUESTED| 35                           |
+    | <INSTANT_2>            | INFLIGHT | 27                           |
+```
+
+To inspect a specific compaction plan, use
+
+```java
+hudi:trips->compaction show --instant <INSTANT_1>
+    _________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
+    | Partition Path| File Id | Base Instant  | Data File Path                                    | Total Delta Files| getMetrics                                                                                                                    |
+    |================================================================================================================================================================================================================================================
+    | 2018/07/17    | <UUID>  | <INSTANT_1>   | viewfs://ns-default/.../../UUID_<INSTANT>.parquet | 1                | {TOTAL_LOG_FILES=1.0, TOTAL_IO_READ_MB=1230.0, TOTAL_LOG_FILES_SIZE=2.51255751E8, TOTAL_IO_WRITE_MB=991.0, TOTAL_IO_MB=2221.0}|
+
+```
+
+To manually schedule or run a compaction, use the below command. This command uses spark launcher to perform compaction
+operations. 
+
+**NOTE:** Make sure no other application is scheduling compaction for this table concurrently
+{: .notice--info}
+
+```java
+hudi:trips->help compaction schedule
+Keyword:                   compaction schedule
+Description:               Schedule Compaction
+ Keyword:                  sparkMemory
+   Help:                   Spark executor memory
+   Mandatory:              false
+   Default if specified:   '__NULL__'
+   Default if unspecified: '1G'
+
+* compaction schedule - Schedule Compaction
+```
+
+```java
+hudi:trips->help compaction run
+Keyword:                   compaction run
+Description:               Run Compaction for given instant time
+ Keyword:                  tableName
+   Help:                   Table name
+   Mandatory:              true
+   Default if specified:   '__NULL__'
+   Default if unspecified: '__NULL__'
+
+ Keyword:                  parallelism
+   Help:                   Parallelism for hoodie compaction
+   Mandatory:              true
+   Default if specified:   '__NULL__'
+   Default if unspecified: '__NULL__'
+
+ Keyword:                  schemaFilePath
+   Help:                   Path for Avro schema file
+   Mandatory:              true
+   Default if specified:   '__NULL__'
+   Default if unspecified: '__NULL__'
+
+ Keyword:                  sparkMemory
+   Help:                   Spark executor memory
+   Mandatory:              true
+   Default if specified:   '__NULL__'
+   Default if unspecified: '__NULL__'
+
+ Keyword:                  retry
+   Help:                   Number of retries
+   Mandatory:              true
+   Default if specified:   '__NULL__'
+   Default if unspecified: '__NULL__'
+
+ Keyword:                  compactionInstant
+   Help:                   Base path for the target hoodie table
+   Mandatory:              true
+   Default if specified:   '__NULL__'
+   Default if unspecified: '__NULL__'
+
+* compaction run - Run Compaction for given instant time
+```
+
+### Validate Compaction
+
+Validating a compaction plan : Check if all the files necessary for compactions are present and are valid
+
+```java
+hudi:stock_ticks_mor->compaction validate --instant 20181005222611
+...
+
+   COMPACTION PLAN VALID
+
+    ___________________________________________________________________________________________________________________________________________________________________________________________________________________________
+    | File Id                             | Base Instant Time| Base Data File                                                                                                                   | Num Delta Files| Valid| Error|
+    |==========================================================================================================================================================================================================================|
+    | 05320e98-9a57-4c38-b809-a6beaaeb36bd| 20181005222445   | hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/05320e98-9a57-4c38-b809-a6beaaeb36bd_0_20181005222445.parquet| 1              | true |      |
+
+
+
+hudi:stock_ticks_mor->compaction validate --instant 20181005222601
+
+   COMPACTION PLAN INVALID
+
+    _______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
+    | File Id                             | Base Instant Time| Base Data File                                                                                                                   | Num Delta Files| Valid| Error                                                                           |
+    |=====================================================================================================================================================================================================================================================================================================|
+    | 05320e98-9a57-4c38-b809-a6beaaeb36bd| 20181005222445   | hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/05320e98-9a57-4c38-b809-a6beaaeb36bd_0_20181005222445.parquet| 1              | false| All log files specified in compaction operation is not present. Missing ....    |
+```
+
+**NOTE:** The following commands must be executed without any other writer/ingestion application running.
+{: .notice--warning}
+
+Sometimes, it becomes necessary to remove a fileId from a compaction-plan inorder to speed-up or unblock compaction
+operation. Any new log-files that happened on this file after the compaction got scheduled will be safely renamed
+so that are preserved. Hudi provides the following CLI to support it
+
+
+### Unscheduling Compaction
+
+```java
+hudi:trips->compaction unscheduleFileId --fileId <FileUUID>
+....
+No File renames needed to unschedule file from pending compaction. Operation successful.
+```
+
+In other cases, an entire compaction plan needs to be reverted. This is supported by the following CLI
+
+```java
+hudi:trips->compaction unschedule --compactionInstant <compactionInstant>
+.....
+No File renames needed to unschedule pending compaction. Operation successful.
+```
+
+### Repair Compaction
+
+The above compaction unscheduling operations could sometimes fail partially (e:g -> DFS temporarily unavailable). With
+partial failures, the compaction operation could become inconsistent with the state of file-slices. When you run
+`compaction validate`, you can notice invalid compaction operations if there is one.  In these cases, the repair
+command comes to the rescue, it will rearrange the file-slices so that there is no loss and the file-slices are
+consistent with the compaction plan
+
+```java
+hudi:stock_ticks_mor->compaction repair --instant 20181005222611
+......
+Compaction successfully repaired
+.....
+```
+
+
+## Monitoring
+
+Once the Hudi writer is configured with the right table and environment for metrics, it produces the following graphite metrics, that aid in debugging hudi tables
+
+ - **Commit Duration** - This is amount of time it took to successfully commit a batch of records
+ - **Rollback Duration** - Similarly, amount of time taken to undo partial data left over by a failed commit (happens everytime automatically after a failing write)
+ - **File Level metrics** - Shows the amount of new files added, versions, deleted (cleaned) in each commit
+ - **Record Level Metrics** - Total records inserted/updated etc per commit
+ - **Partition Level metrics** - number of partitions upserted (super useful to understand sudden spikes in commit duration)
+
+These metrics can then be plotted on a standard tool like grafana. Below is a sample commit duration chart.
+
+<figure>
+    <img class="docimage" src="/assets/images/hudi_commit_duration.png" alt="hudi_commit_duration.png" style="max-width: 100%" />
+</figure>
+
+
+## Troubleshooting
+
+Section below generally aids in debugging Hudi failures. Off the bat, the following metadata is added to every record to help triage  issues easily using standard Hadoop SQL engines (Hive/Presto/Spark)
+
+ - **_hoodie_record_key** - Treated as a primary key within each DFS partition, basis of all updates/inserts
+ - **_hoodie_commit_time** - Last commit that touched this record
+ - **_hoodie_file_name** - Actual file name containing the record (super useful to triage duplicates)
+ - **_hoodie_partition_path** - Path from basePath that identifies the partition containing this record
+ 
+ For performance related issues, please refer to the [tuning guide](https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide)
+
+
+### Missing records
+
+Please check if there were any write errors using the admin commands above, during the window at which the record could have been written.
+If you do find errors, then the record was not actually written by Hudi, but handed back to the application to decide what to do with it.
+
+### Duplicates
+
+First of all, please confirm if you do indeed have duplicates **AFTER** ensuring the query is accessing the Hudi table [properly](/docs/0.5.2-sql_queries.html) .
+
+ - If confirmed, please use the metadata fields above, to identify the physical files & partition files containing the records .
+ - If duplicates span files across partitionpath, then this means your application is generating different partitionPaths for same recordKey, Please fix your app
+ - if duplicates span multiple files within the same partitionpath, please engage with mailing list. This should not happen. You can use the `records deduplicate` command to fix your data.
+
+### Spark failures {#spark-ui}
+
+Typical upsert() DAG looks like below. Note that Hudi client also caches intermediate RDDs to intelligently profile workload and size files and spark parallelism.
+Also Spark UI shows sortByKey twice due to the probe job also being shown, nonetheless its just a single sort.
+
+<figure>
+    <img class="docimage" src="/assets/images/hudi_upsert_dag.png" alt="hudi_upsert_dag.png" style="max-width: 100%" />
+</figure>
+
+At a high level, there are two steps
+
+**Index Lookup to identify files to be changed**
+
+ - Job 1 : Triggers the input data read, converts to HoodieRecord object and then stops at obtaining a spread of input records to target partition paths
+ - Job 2 : Load the set of file names which we need check against
+ - Job 3  & 4 : Actual lookup after smart sizing of spark join parallelism, by joining RDDs in 1 & 2 above
+ - Job 5 : Have a tagged RDD of recordKeys with locations
+
+**Performing the actual writing of data**
+
+ - Job 6 : Lazy join of incoming records against recordKey, location to provide a final set of HoodieRecord which now contain the information about which file/partitionpath they are found at (or null if insert). Then also profile the workload again to determine sizing of files
+ - Job 7 : Actual writing of data (update + insert + insert turned to updates to maintain file size)
+
+Depending on the exception source (Hudi/Spark), the above knowledge of the DAG can be used to pinpoint the actual issue. The most often encountered failures result from YARN/DFS temporary failures.
+In the future, a more sophisticated debug/management UI would be added to the project, that can help automate some of this debugging.
diff --git a/docs/_docs/0.5.2/3_1_privacy.cn.md b/docs/_docs/0.5.2/3_1_privacy.cn.md
new file mode 100644
index 0000000..7db7cdc
--- /dev/null
+++ b/docs/_docs/0.5.2/3_1_privacy.cn.md
@@ -0,0 +1,25 @@
+---
+version: 0.5.2
+title: Privacy Policy
+keywords: hudi, privacy
+permalink: /cn/docs/0.5.2-privacy.html
+last_modified_at: 2019-12-30T15:59:57-04:00
+language: cn
+---
+
+Information about your use of this website is collected using server access logs and a tracking cookie.
+The collected information consists of the following:
+
+* The IP address from which you access the website;
+* The type of browser and operating system you use to access our site;
+* The date and time you access our site;
+* The pages you visit;
+* The addresses of pages from where you followed a link to our site.
+
+Part of this information is gathered using a tracking cookie set by the [Google Analytics](http://www.google.com/analytics) service and handled by Google as described in their [privacy policy](http://www.google.com/privacy.html). See your browser documentation for instructions on how to disable the cookie if you prefer not to share this data with Google.
+
+We use the gathered information to help us make our site more useful to visitors and to better understand how and when our site is used. We do not track or collect personally identifiable information or associate gathered data with any personally identifying information from other sources.
+
+By using this website, you consent to the collection of this data in the manner and for the purpose described above.
+
+The Hudi development community welcomes your questions or comments regarding this Privacy Policy. Send them to dev@hudi.apache.org
diff --git a/docs/_docs/0.5.2/3_1_privacy.md b/docs/_docs/0.5.2/3_1_privacy.md
new file mode 100644
index 0000000..137a072
--- /dev/null
+++ b/docs/_docs/0.5.2/3_1_privacy.md
@@ -0,0 +1,24 @@
+---
+version: 0.5.2
+title: Privacy Policy
+keywords: hudi, privacy
+permalink: /docs/0.5.2-privacy.html
+last_modified_at: 2019-12-30T15:59:57-04:00
+---
+
+Information about your use of this website is collected using server access logs and a tracking cookie.
+The collected information consists of the following:
+
+* The IP address from which you access the website;
+* The type of browser and operating system you use to access our site;
+* The date and time you access our site;
+* The pages you visit;
+* The addresses of pages from where you followed a link to our site.
+
+Part of this information is gathered using a tracking cookie set by the [Google Analytics](http://www.google.com/analytics) service and handled by Google as described in their [privacy policy](http://www.google.com/privacy.html). See your browser documentation for instructions on how to disable the cookie if you prefer not to share this data with Google.
+
+We use the gathered information to help us make our site more useful to visitors and to better understand how and when our site is used. We do not track or collect personally identifiable information or associate gathered data with any personally identifying information from other sources.
+
+By using this website, you consent to the collection of this data in the manner and for the purpose described above.
+
+The Hudi development community welcomes your questions or comments regarding this Privacy Policy. Send them to dev@hudi.apache.org
diff --git a/docs/_docs/0.5.2/3_2_docs_versions.cn.md b/docs/_docs/0.5.2/3_2_docs_versions.cn.md
new file mode 100644
index 0000000..58cc636
--- /dev/null
+++ b/docs/_docs/0.5.2/3_2_docs_versions.cn.md
@@ -0,0 +1,21 @@
+---
+version: 0.5.2
+title: 文档版本
+keywords: hudi, privacy
+permalink: /cn/docs/0.5.2-docs-versions.html
+last_modified_at: 2019-12-30T15:59:57-04:00
+language: cn
+---
+
+<table class="docversions">
+    <tbody>
+      {% for d in site.previous_docs %}
+        <tr>
+            <th>{{ d.version }}</th>
+            <td><a href="{{ d.en }}">英文版</a></td>
+            <td><a href="{{ d.cn }}">中文版</a></td>
+        </tr>
+      {% endfor %}
+    </tbody>
+</table>
+
diff --git a/docs/_docs/0.5.2/3_2_docs_versions.md b/docs/_docs/0.5.2/3_2_docs_versions.md
new file mode 100644
index 0000000..f045e53
--- /dev/null
+++ b/docs/_docs/0.5.2/3_2_docs_versions.md
@@ -0,0 +1,19 @@
+---
+version: 0.5.2
+title: Docs Versions
+keywords: hudi, privacy
+permalink: /docs/0.5.2-docs-versions.html
+last_modified_at: 2019-12-30T15:59:57-04:00
+---
+
+<table class="docversions">
+    <tbody>
+      {% for d in site.previous_docs %}
+        <tr>
+            <th>{{ d.version }}</th>
+            <td><a href="{{ d.en }}">English Version</a></td>
+            <td><a href="{{ d.cn }}">Chinese Version</a></td>
+        </tr>
+      {% endfor %}
+    </tbody>
+</table>
diff --git a/docs/_includes/nav_list b/docs/_includes/nav_list
index fe8aaef..59d3f4e 100644
--- a/docs/_includes/nav_list
+++ b/docs/_includes/nav_list
@@ -25,6 +25,14 @@
                 {% assign menu_label = "文档菜单" %}
             {% endif %}
     {% endif %}
+    {% elsif page.version == "0.5.2" %}
+            {% assign navigation = site.data.navigation["0.5.2_docs"] %}
+
+            {% if page.language == "cn" %}
+                {% assign navigation = site.data.navigation["0.5.2_cn_docs"] %}
+                {% assign menu_label = "文档菜单" %}
+            {% endif %}
+    {% endif %}
 {% endif %}
 
 <nav class="nav__list">
diff --git a/docs/_includes/quick_link.html b/docs/_includes/quick_link.html
index 7b979cc..2e3fe43 100644
--- a/docs/_includes/quick_link.html
+++ b/docs/_includes/quick_link.html
@@ -9,6 +9,8 @@
   {%- assign author = site.0.5.0_author -%}
 {%- elsif page.version == "0.5.1" -%}
   {%- assign author = site.0.5.1_author -%}
+{%- elsif page.version == "0.5.2" -%}
+  {%- assign author = site.0.5.2_author -%}
 {%- else -%}
   {%- assign author = site.author -%}
 {%- endif -%}
diff --git a/docs/_pages/index.md b/docs/_pages/index.md
index 7190d6b..802ac8b 100644
--- a/docs/_pages/index.md
+++ b/docs/_pages/index.md
@@ -4,7 +4,7 @@ permalink: /
 title: Welcome to Apache Hudi !
 excerpt: >
   Apache Hudi ingests & manages storage of large analytical datasets over DFS (hdfs or cloud stores).<br />
-  <small><a href="https://github.com/apache/incubator-hudi/releases/tag/release-0.5.1-incubating" target="_blank">Latest release 0.5.1-incubating</a></small>
+  <small><a href="https://github.com/apache/incubator-hudi/releases/tag/release-0.5.2-incubating" target="_blank">Latest release 0.5.2-incubating</a></small>
 power_items:
   - img_path: /assets/images/powers/aws.jpg
   - img_path: /assets/images/powers/emis.jpg
diff --git a/docs/_pages/releases.cn.md b/docs/_pages/releases.cn.md
index 879ee34..181df17 100644
--- a/docs/_pages/releases.cn.md
+++ b/docs/_pages/releases.cn.md
@@ -7,12 +7,41 @@ last_modified_at: 2019-12-30T15:59:57-04:00
 language: cn
 ---
 
-## [Release 0.5.0-incubating](https://github.com/apache/incubator-hudi/releases/tag/release-0.5.0-incubating)
+## [Release 0.5.2-incubating](https://github.com/apache/incubator-hudi/releases/tag/release-0.5.2-incubating) ([docs](/docs/0.5.2-quick-start-guide.html))
 
 ### Download Information
- * Source Release : [Apache Hudi(incubating) 0.5.0-incubating Source Release](https://www.apache.org/dist/incubator/hudi/0.5.0-incubating/hudi-0.5.0-incubating.src.tgz) ([asc](https://www.apache.org/dist/incubator/hudi/0.5.0-incubating/hudi-0.5.0-incubating.src.tgz.asc), [sha512](https://www.apache.org/dist/incubator/hudi/0.5.0-incubating/hudi-0.5.0-incubating.src.tgz.sha512))
+ * Source Release : [Apache Hudi(incubating) 0.5.2-incubating Source Release](https://www.apache.org/dist/incubator/hudi/0.5.2-incubating/hudi-0.5.2-incubating.src.tgz) ([asc](https://www.apache.org/dist/incubator/hudi/0.5.2-incubating/hudi-0.5.2-incubating.src.tgz.asc), [sha512](https://www.apache.org/dist/incubator/hudi/0.5.2-incubating/hudi-0.5.2-incubating.src.tgz.sha512))
  * Apache Hudi (incubating) jars corresponding to this release is available [here](https://repository.apache.org/#nexus-search;quick~hudi)
 
+### Migration Guide for this release
+ * Write Client restructuring has moved classes around ([HUDI-554](https://issues.apache.org/jira/browse/HUDI-554)). Package `client` now has all the various client classes, that do the transaction management. `func` renamed to `execution` and some helpers moved to `client/utils`. All compaction code under `io` now under `table/compact`. Rollback code under `table/rollback` and in general all code for individual operations under `table`. This change only affects the apps/projects dependi [...]
+
+### Release Highlights
+ * Support for overwriting the payload implementation in `hoodie.properties` via specifying the `hoodie.compaction.payload.class` config option. Previously, once the payload class is set once in `hoodie.properties`, it cannot be changed. In some cases, if a code refactor is done and the jar updated, one may need to pass the new payload class name.
+ * `TimestampBasedKeyGenerator` supports for `CharSequence`  types. Previously `TimestampBasedKeyGenerator` only supports `Double`, `Long`, `Float` and `String` 4 data types for the partition key. Now, after data type extending, `CharSequence` has been supported in `TimestampBasedKeyGenerator`.
+ * Hudi now supports incremental pulling from defined partitions via the `hoodie.datasource.read.incr.path.glob` [config option](https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/DataSourceOptions.scala#L111). For some use case that users only need to pull the incremental part of certain partitions, it can run faster by only loading relevant parquet files.
+ * With 0.5.2, hudi allows partition path to be updated with `GLOBAL_BLOOM` index. Previously, when a record is to be updated with a new partition path, and when set to `GLOBAL_BLOOM` as index, hudi ignores the new partition path and update the record in the original partition path. Now, hudi allows records to be inserted into their new partition paths and delete the records in the old partition paths. A configuration (e.g. `hoodie.index.bloom.update.partition.path=true`) can be added to [...]
+ * A `JdbcbasedSchemaProvider` schema provider has been provided to get metadata through JDBC. For the use case that users want to synchronize data from MySQL, and at the same time, want to get the schema from the database, it's very helpful.
+ * Simplify `HoodieBloomIndex` without the need for 2GB limit handling. Prior to spark 2.4.0, each spark partition has a limit of 2GB. In Hudi 0.5.1, after we upgraded to spark 2.4.4, we don't have the limitation anymore. Hence removing the safe parallelism constraint we had in` HoodieBloomIndex`.
+ * CLI related changes:
+   - Allows users to specify option to print additional commit metadata, e.g. *Total Log Blocks*, *Total Rollback Blocks*, *Total Updated Records Compacted* and so on.
+   - Supports `temp_query` and `temp_delete` to query and delete temp view. This command creates a temp table. Users can write HiveQL queries against the table to filter the desired row. For example,
+```
+temp_query --sql "select Instant, NumInserts, NumWrites from satishkotha_debug where FileId='ed33bd99-466f-4417-bd92-5d914fa58a8f' and Instant > '20200123211217' order by Instant"
+```
+
+### Raw Release Notes
+  The raw release notes are available [here](https://issues.apache.org/jira/projects/HUDI/versions/12346606#release-report-tab-body)
+
+## [Release 0.5.1-incubating](https://github.com/apache/incubator-hudi/releases/tag/release-0.5.1-incubating) ([docs](/docs/0.5.1-quick-start-guide.html))
+
+### Download Information
+ * Source Release : [Apache Hudi(incubating) 0.5.1-incubating Source Release](https://www.apache.org/dist/incubator/hudi/0.5.1-incubating/hudi-0.5.1-incubating.src.tgz) ([asc](https://www.apache.org/dist/incubator/hudi/0.5.1-incubating/hudi-0.5.1-incubating.src.tgz.asc), [sha512](https://www.apache.org/dist/incubator/hudi/0.5.1-incubating/hudi-0.5.1-incubating.src.tgz.sha512))
+ * Apache Hudi (incubating) jars corresponding to this release is available [here](https://repository.apache.org/#nexus-search;quick~hudi)
+
+### Migration Guide for this release
+ * In 0.5.1, the community restructured the package of key generators. The key generator related classes have been moved from `org.apache.hudi` to `org.apache.hudi.keygen`. 
+
 ### Release Highlights
  * Package and format renaming from com.uber.hoodie to org.apache.hudi (See migration guide section below)
  * Major redo of Hudi bundles to address class and jar version mismatches in different environments
diff --git a/docs/_pages/releases.md b/docs/_pages/releases.md
index 65f989f..c4d34bb 100644
--- a/docs/_pages/releases.md
+++ b/docs/_pages/releases.md
@@ -6,12 +6,41 @@ toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
+## [Release 0.5.2-incubating](https://github.com/apache/incubator-hudi/releases/tag/release-0.5.2-incubating) ([docs](/docs/0.5.2-quick-start-guide.html))
+
+### Download Information
+ * Source Release : [Apache Hudi(incubating) 0.5.2-incubating Source Release](https://www.apache.org/dist/incubator/hudi/0.5.2-incubating/hudi-0.5.2-incubating.src.tgz) ([asc](https://www.apache.org/dist/incubator/hudi/0.5.2-incubating/hudi-0.5.2-incubating.src.tgz.asc), [sha512](https://www.apache.org/dist/incubator/hudi/0.5.2-incubating/hudi-0.5.2-incubating.src.tgz.sha512))
+ * Apache Hudi (incubating) jars corresponding to this release is available [here](https://repository.apache.org/#nexus-search;quick~hudi)
+
+### Migration Guide for this release
+ * Write Client restructuring has moved classes around ([HUDI-554](https://issues.apache.org/jira/browse/HUDI-554)). Package `client` now has all the various client classes, that do the transaction management. `func` renamed to `execution` and some helpers moved to `client/utils`. All compaction code under `io` now under `table/compact`. Rollback code under `table/rollback` and in general all code for individual operations under `table`. This change only affects the apps/projects dependi [...]
+
+### Release Highlights
+ * Support for overwriting the payload implementation in `hoodie.properties` via specifying the `hoodie.compaction.payload.class` config option. Previously, once the payload class is set once in `hoodie.properties`, it cannot be changed. In some cases, if a code refactor is done and the jar updated, one may need to pass the new payload class name.
+ * `TimestampBasedKeyGenerator` supports for `CharSequence`  types. Previously `TimestampBasedKeyGenerator` only supports `Double`, `Long`, `Float` and `String` 4 data types for the partition key. Now, after data type extending, `CharSequence` has been supported in `TimestampBasedKeyGenerator`.
+ * Hudi now supports incremental pulling from defined partitions via the `hoodie.datasource.read.incr.path.glob` [config option](https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/DataSourceOptions.scala#L111). For some use case that users only need to pull the incremental part of certain partitions, it can run faster by only loading relevant parquet files.
+ * With 0.5.2, hudi allows partition path to be updated with `GLOBAL_BLOOM` index. Previously, when a record is to be updated with a new partition path, and when set to `GLOBAL_BLOOM` as index, hudi ignores the new partition path and update the record in the original partition path. Now, hudi allows records to be inserted into their new partition paths and delete the records in the old partition paths. A configuration (e.g. `hoodie.index.bloom.update.partition.path=true`) can be added to [...]
+ * A `JdbcbasedSchemaProvider` schema provider has been provided to get metadata through JDBC. For the use case that users want to synchronize data from MySQL, and at the same time, want to get the schema from the database, it's very helpful.
+ * Simplify `HoodieBloomIndex` without the need for 2GB limit handling. Prior to spark 2.4.0, each spark partition has a limit of 2GB. In Hudi 0.5.1, after we upgraded to spark 2.4.4, we don't have the limitation anymore. Hence removing the safe parallelism constraint we had in` HoodieBloomIndex`.
+ * CLI related changes:
+   - Allows users to specify option to print additional commit metadata, e.g. *Total Log Blocks*, *Total Rollback Blocks*, *Total Updated Records Compacted* and so on.
+   - Supports `temp_query` and `temp_delete` to query and delete temp view. This command creates a temp table. Users can write HiveQL queries against the table to filter the desired row. For example,
+```
+temp_query --sql "select Instant, NumInserts, NumWrites from satishkotha_debug where FileId='ed33bd99-466f-4417-bd92-5d914fa58a8f' and Instant > '20200123211217' order by Instant"
+```
+
+### Raw Release Notes
+  The raw release notes are available [here](https://issues.apache.org/jira/projects/HUDI/versions/12346606#release-report-tab-body)
+
 ## [Release 0.5.1-incubating](https://github.com/apache/incubator-hudi/releases/tag/release-0.5.1-incubating) ([docs](/docs/0.5.1-quick-start-guide.html))
 
 ### Download Information
  * Source Release : [Apache Hudi(incubating) 0.5.1-incubating Source Release](https://www.apache.org/dist/incubator/hudi/0.5.1-incubating/hudi-0.5.1-incubating.src.tgz) ([asc](https://www.apache.org/dist/incubator/hudi/0.5.1-incubating/hudi-0.5.1-incubating.src.tgz.asc), [sha512](https://www.apache.org/dist/incubator/hudi/0.5.1-incubating/hudi-0.5.1-incubating.src.tgz.sha512))
  * Apache Hudi (incubating) jars corresponding to this release is available [here](https://repository.apache.org/#nexus-search;quick~hudi)
 
+### Migration Guide for this release
+ * In 0.5.1, the community restructured the package of key generators. The key generator related classes have been moved from `org.apache.hudi` to `org.apache.hudi.keygen`. 
+
 ### Release Highlights
  * Dependency Version Upgrades
    - Upgrade from Spark 2.1.0 to Spark 2.4.4


Mime
View raw message