From: yozie@apache.org
To: commits@hawq.incubator.apache.org
Reply-To: dev@hawq.incubator.apache.org
Subject: incubator-hawq-docs git commit: HAWQ-1394 document HiveORC profile (closes #104)
Date: Mon, 20 Mar 2017 21:19:49 +0000 (UTC)

Repository: incubator-hawq-docs
Updated Branches:
  refs/heads/develop ce67adabd -> 8c37d043a


HAWQ-1394 document HiveORC profile (closes #104)


Project: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/commit/8c37d043
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/tree/8c37d043
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/diff/8c37d043

Branch: refs/heads/develop
Commit: 8c37d043ab8b6ef59315a7bd92acb2c9b45dcedb
Parents: ce67ada
Author: Lisa Owen
Authored: Mon Mar 20 14:19:45 2017 -0700
Committer: David Yozie
Committed: Mon Mar 20 14:19:45 2017 -0700
----------------------------------------------------------------------
 markdown/pxf/HivePXF.html.md.erb | 183 ++++++++++++++++++++++++++++++++--
 1 file changed, 175 insertions(+), 8 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/8c37d043/markdown/pxf/HivePXF.html.md.erb
----------------------------------------------------------------------
diff --git a/markdown/pxf/HivePXF.html.md.erb b/markdown/pxf/HivePXF.html.md.erb
index bc4e9f6..345bbca 100644
--- a/markdown/pxf/HivePXF.html.md.erb
+++ b/markdown/pxf/HivePXF.html.md.erb
@@ -50,7 +50,7 @@ The PXF Hive plug-in supports several file formats and profiles for accessing th
 | TextFile | Flat file with data in comma-, tab-, or space-separated value format or JSON notation. | Hive, HiveText |
 | SequenceFile | Flat file consisting of binary key/value pairs. | Hive |
 | RCFile | Record columnar data consisting of binary key/value pairs; high row compression rate. | Hive, HiveRC |
-| ORCFile | Optimized row columnar data with stripe, footer, and postscript sections; reduces data size. | Hive |
+| ORCFile | Optimized row columnar data with stripe, footer, and postscript sections; reduces data size. | Hive, HiveORC |
 | Parquet | Compressed columnar data representation. | Hive |
 | Avro | JSON-defined, schema-based data serialization format. | Hive |
 
@@ -82,7 +82,7 @@ The following table summarizes external mapping rules for Hive primitive types.
 
 Hive supports complex data types including array, struct, map, and union. PXF maps each of these complex types to `text`. While HAWQ does not natively support these types, you can create HAWQ functions or application code to extract subcomponents of these complex data types.
 
-An example using complex data types with the `Hive` profile is provided later in this topic.
+Examples using complex data types with the `Hive` and `HiveORC` profiles are provided later in this topic.
 
 ## Sample Data Set
 
@@ -316,7 +316,7 @@ HCatalog integration has the following limitations:
 
 In the previous section, you used HCatalog integration to query a Hive table. You can also create a PXF/HAWQ external table to access Hive table data. This Hive table access mechanism requires that you identify an appropriate Hive profile.
 
-The PXF Hive plug-in supports several Hive-related profiles. These include `Hive`, `HiveText`, and `HiveRC`. The `HiveText` and `HiveRC` profiles are specifically optimized for text and RC file formats, respectively. The `Hive` profile is optimized for all file storage types; use the `Hive` profile when the underlying Hive table is composed of multiple partitions with differing file formats.
+The PXF Hive plug-in supports several Hive-related profiles. These include `Hive`, `HiveText`, `HiveRC`, and `HiveORC`. The `HiveText` and `HiveRC` profiles are specifically optimized for text and RC file formats, respectively, and the `HiveORC` profile is optimized for the ORC file format. The `Hive` profile is optimized for all file storage types; use the `Hive` profile when the underlying Hive table is composed of multiple partitions with differing file formats.
 
 Use the following syntax to create a HAWQ external table representing Hive data:
 
@@ -324,7 +324,7 @@ Use the following syntax to create a HAWQ external table representing Hive data:
 CREATE EXTERNAL TABLE <table_name>
     ( <column_name> <data_type> [, ...] | LIKE <other_table> )
 LOCATION ('pxf://<host>[:<port>]/<hive-db-name>.<hive-table-name>
-    ?PROFILE=Hive|HiveText|HiveRC[&DELIMITER=<delim>'])
+    ?PROFILE=Hive|HiveText|HiveRC|HiveORC[&DELIMITER=<delim>'])
 FORMAT 'CUSTOM|TEXT' (formatter='pxfwritable_import' | delimiter='<delim>')
 ```
 
@@ -336,9 +336,9 @@ Hive-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](..
 
 | \<port\> | The PXF port. If \<port\> is omitted, PXF assumes \<host\> identifies a High Availability HDFS Nameservice and connects to the port number designated by the `pxf_service_port` server configuration parameter value. Default is 51200. |
 | \<hive-db-name\> | The name of the Hive database. If omitted, defaults to the Hive database named `default`. |
 | \<hive-table-name\> | The name of the Hive table. |
-| PROFILE | The `PROFILE` keyword must specify one of the values `Hive`, `HiveText`, or `HiveRC`. |
+| PROFILE | The `PROFILE` keyword must specify one of the values `Hive`, `HiveText`, `HiveRC`, or `HiveORC`. |
 | DELIMITER | The `DELIMITER` clause is required for both the `HiveText` and `HiveRC` profiles and identifies the field delimiter used in the Hive data set. \<delim\> must be a single ASCII character or specified in hexadecimal representation. |
-| FORMAT (`Hive` profile) | The `FORMAT` clause must specify `CUSTOM`. The `CUSTOM` format supports only the built-in `pxfwritable_import` `formatter`. |
+| FORMAT (`Hive` and `HiveORC` profiles) | The `FORMAT` clause must specify `CUSTOM`. The `CUSTOM` format supports only the built-in `pxfwritable_import` `formatter`. |
 | FORMAT (`HiveText` and `HiveRC` profiles) | The `FORMAT` clause must specify `TEXT`. The `delimiter` must be specified a second time in '\<delim\>'. |
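
As a concrete sketch of the `DELIMITER` rule in the table above: the two statements below define the same hypothetical `HiveText` external table, once with a literal comma and once with the hexadecimal form of the delimiter. The host, port, and table names are illustrative only, not part of this commit.

``` sql
-- Sketch only: literal single-character delimiter, given twice as the table
-- above requires -- once in the LOCATION URI and once in the FORMAT options.
CREATE EXTERNAL TABLE sales_delim_literal(location text, month text, number_of_orders int, total_sales float8)
    LOCATION ('pxf://namenode:51200/default.sales_info?PROFILE=HiveText&DELIMITER=,')
    FORMAT 'TEXT' (delimiter=',');

-- The same delimiter in hexadecimal representation: \x2c is a comma.
CREATE EXTERNAL TABLE sales_delim_hex(location text, month text, number_of_orders int, total_sales float8)
    LOCATION ('pxf://namenode:51200/default.sales_info?PROFILE=HiveText&DELIMITER=\x2c')
    FORMAT 'TEXT' (delimiter=E'\x2c');
```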
@@ -401,7 +401,7 @@ Use the PXF `HiveText` profile to create a queryable HAWQ external table from th
 2. Query the external table:
 
    ``` sql
-   postgres=# SELECT * FROM salesinfo_hivetextprofile WHERE location="Beijing";
+   postgres=# SELECT * FROM salesinfo_hivetextprofile WHERE location='Beijing';
    ```
 
    ``` shell
@@ -475,6 +475,96 @@ Use the `HiveRC` profile to query RCFile-formatted data in Hive tables.
    ...
    ```
 
+## HiveORC Profile
+
+The Optimized Row Columnar (ORC) file format is a columnar file format that provides a highly efficient way to both store and access HDFS data. The ORC format offers improvements over the text and RCFile formats in both compression and performance. HAWQ/PXF supports ORC version 1.2.1.
+
+ORC is type-aware and specifically designed for Hadoop workloads. ORC files store both type and encoding information for the data in the file. All columns within a single group of row data (also known as a stripe) are stored together on disk in ORC format files. The columnar nature of the ORC format enables read projection, helping avoid access to unnecessary columns during a query.
+
+ORC also supports predicate pushdown with built-in indexes at the file, stripe, and row levels, moving the filter operation to the data loading phase.
+
+Refer to the [Apache ORC](https://orc.apache.org/docs/) documentation and the Apache Hive [LanguageManual ORC](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC) website for detailed information about the ORC file format.
+
+Use the `HiveORC` profile to access ORC format data. The `HiveORC` profile provides:
+
+- Enhanced query performance - Column projection information is leveraged to reduce disk I/O and data payload.
+
+- Optimized predicate pushdown - Predicate pushdown is optimized for:
+    - `int2`, `int4`, `int8`, `float8`, `text`, `bpchar`, and `boolean` data types in combination with the `=`, `>`, `<`, `>=`, `<=`, `IS NULL`, and `IS NOT NULL` operators
+    - the `=`, `>`, `<`, `>=`, `<=`, `IS NULL`, and `IS NOT NULL` operators and comparisons between the integer types
+    - the `=`, `>`, `<`, `>=`, `<=`, `IS NULL`, and `IS NOT NULL` operators and comparisons between the `float8` and `float4` types
    - the `IN` operator on arrays of `int2`, `int4`, `int8`, `boolean`, and `text`
+
+- Complex type support - You can access Hive tables composed of array, map, struct, and union data types. PXF serializes each of these complex types to `text`.
+
+**Note**: The `HiveORC` profile currently supports access to data stored in ORC format only through a Hive mapped table.
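
To make the pushdown rules above concrete, here is a minimal sketch against the `salesinfo_hiveORCprofile` external table defined in the example that follows. Each condition uses only types and operators from the lists above (`int4`, `float8`, `text` with `>`, `<=`, `IS NOT NULL`), so the filter is a candidate for pushdown into the data loading phase rather than row-by-row evaluation in HAWQ.

``` sql
-- Sketch only: number_of_orders (int4) and total_sales (float8) are
-- pushdown-eligible types; >, <=, and IS NOT NULL are eligible operators.
SELECT location, total_sales
FROM   salesinfo_hiveORCprofile
WHERE  number_of_orders > 100
AND    total_sales <= 5000.00
AND    location IS NOT NULL;
```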
+### Example: Using the HiveORC Profile
+
+In the following example, you will create a Hive table stored in ORC format and use the `HiveORC` profile to query this Hive table.
+
+1. Create a Hive table with the ORC file format:
+
+    ``` shell
+    $ HADOOP_USER_NAME=hdfs hive
+    ```
+
+    ``` sql
+    hive> CREATE TABLE sales_info_ORC (location string, month string,
+            number_of_orders int, total_sales double)
+          STORED AS ORC;
+    ```
+
+2. Insert the data from the `sales_info` table into `sales_info_ORC`:
+
+    ``` sql
+    hive> INSERT INTO TABLE sales_info_ORC SELECT * FROM sales_info;
+    ```
+
+    A copy of the sample data set is now stored in ORC format in `sales_info_ORC`.
+
+3. Perform a Hive query on `sales_info_ORC` to verify that the data was loaded successfully:
+
+    ``` sql
+    hive> SELECT * FROM sales_info_ORC;
+    ```
+
+4. Start the `psql` subsystem and turn on timing:
+
+    ``` shell
+    $ psql -d postgres
+    ```
+
+    ``` sql
+    postgres=> \timing
+    Timing is on.
+    ```
+
+5. Use the PXF `HiveORC` profile to create a queryable HAWQ external table from the Hive table named `sales_info_ORC` that you created in Step 1. The `FORMAT` clause must specify `'CUSTOM'`. The `HiveORC` `CUSTOM` format supports only the built-in `'pxfwritable_import'` `formatter`.
+
+    ``` sql
+    postgres=> CREATE EXTERNAL TABLE salesinfo_hiveORCprofile(location text, month text, number_of_orders int, total_sales float8)
+                 LOCATION ('pxf://namenode:51200/default.sales_info_ORC?PROFILE=HiveORC')
+                 FORMAT 'CUSTOM' (formatter='pxfwritable_import');
+    ```
+
+6. Query the external table:
+
+    ``` sql
+    postgres=> SELECT * FROM salesinfo_hiveORCprofile;
+    ```
+
+    ``` pre
+       location    | month | number_of_orders | total_sales
+    ---------------+-------+------------------+-------------
+     Prague        | Jan   |              101 |     4875.33
+     Rome          | Mar   |               87 |     1557.39
+     Bangalore     | May   |              317 |     8936.99
+     ...
+
+    Time: 425.416 ms
+    ```
+
 ## Accessing Parquet-Format Hive Tables
 
 The PXF `Hive` profile supports both non-partitioned and partitioned Hive tables that use the Parquet storage format in HDFS. Simply map the table columns using equivalent HAWQ data types. For example, if a Hive table is created using:
 
@@ -498,8 +588,9 @@ And query the HAWQ external table using:
 ``` sql
 postgres=# SELECT fname,lname FROM pxf_parquet_table;
 ```
+
+## Complex Data Types
-## Hive Profile Complex Data Type Example
+### Example: Using the Hive Profile with Complex Data Types
 
 This example employs the `Hive` profile and the array and map complex types, specifically an array of integers and a string key/value pair map.
 
@@ -607,6 +698,82 @@ When specifying an array field in a Hive table, you must identify the terminator
 
 `intarray` and `propmap` are each serialized as text strings.
 
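Because these complex fields arrive in HAWQ as plain `text`, ordinary string functions are one way to extract their subcomponents, as noted earlier in this topic. The following is a rough sketch only: the external table name `complextypes_hiveprofile` is assumed, the serialized forms are assumed to look like `[1,2,3]` and `{"zone":"euro",...}` as in the output shown below, and standard PostgreSQL string functions (`string_to_array`, `substring` with a POSIX pattern) are assumed to be available in HAWQ.

``` sql
-- Sketch only: table name and serialized formats are assumptions.
SELECT name,
       -- strip the [ ] wrapper, split on commas, cast the first element
       (string_to_array(trim(both '[]' from intarray), ','))[1]::int AS first_int,
       -- pull the value stored under the "zone" key out of the map text
       substring(propmap from '"zone":"([^"]*)"') AS zone
FROM   complextypes_hiveprofile;
```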
+### Example: Using the HiveORC Profile with Complex Data Types
+
+In the following example, you will create a Hive table stored in ORC format. You will use the `HiveORC` profile to query the complex types in the `table_complextypes` Hive table you created in the previous exercise.
+
+1. Create a Hive table with the ORC file format:
+
+    ``` shell
+    $ HADOOP_USER_NAME=hdfs hive
+    ```
+
+    ``` sql
+    hive> CREATE TABLE table_complextypes_ORC( index int, name string, intarray ARRAY<int>, propmap MAP<string, string>)
+            ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
+            COLLECTION ITEMS TERMINATED BY '%'
+            MAP KEYS TERMINATED BY ':'
+          STORED AS ORC;
+    ```
+
+2. Insert the data from the `table_complextypes` table into `table_complextypes_ORC`:
+
+    ``` sql
+    hive> INSERT INTO TABLE table_complextypes_ORC SELECT * FROM table_complextypes;
+    ```
+
+    A copy of the sample data set is now stored in ORC format in `table_complextypes_ORC`.
+
+3. Perform a Hive query on `table_complextypes_ORC` to verify that the data was loaded successfully:
+
+    ``` sql
+    hive> SELECT * FROM table_complextypes_ORC;
+    ```
+
+    ``` pre
+    OK
+    3       Prague       [1,2,3]    {"zone":"euro","status":"up"}
+    89      Rome         [4,5,6]    {"zone":"euro"}
+    400     Bangalore    [7,8,9]    {"zone":"apac","status":"pending"}
+    ...
+    ```
+
+4. Start the `psql` subsystem:
+
+    ``` shell
+    $ psql -d postgres
+    ```
+
+5. Use the PXF `HiveORC` profile to create a queryable HAWQ external table from the Hive table named `table_complextypes_ORC` that you created in Step 1. The `FORMAT` clause must specify `'CUSTOM'`. The `HiveORC` `CUSTOM` format supports only the built-in `'pxfwritable_import'` `formatter`.
+
+    ``` sql
+    postgres=> CREATE EXTERNAL TABLE complextypes_hiveorc(index int, name text, intarray text, propmap text)
+                 LOCATION ('pxf://namenode:51200/default.table_complextypes_ORC?PROFILE=HiveORC')
+                 FORMAT 'CUSTOM' (formatter='pxfwritable_import');
+    ```
+
+    Notice that the integer array and map complex types are mapped to type `text`.
+
+6. Query the external table:
+
+    ``` sql
+    postgres=> SELECT * FROM complextypes_hiveorc;
+    ```
+
+    ``` pre
+     index |    name    | intarray |              propmap
+    -------+------------+----------+------------------------------------
+         3 | Prague     | [1,2,3]  | {"zone":"euro","status":"up"}
+        89 | Rome       | [4,5,6]  | {"zone":"euro"}
+       400 | Bangalore  | [7,8,9]  | {"zone":"apac","status":"pending"}
+    ...
+    ```
+
+`intarray` and `propmap` are each serialized as text strings.
+
 
 ## Partition Filtering
 
 The PXF Hive plug-in supports the Hive partitioning feature and directory structure. This enables partition exclusion on selected HDFS files comprising the Hive table. To use the partition filtering feature to reduce network traffic and I/O, run a PXF query using a `WHERE` clause that refers to a specific partition in the partitioned Hive table.
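
As a sketch of partition filtering in practice, assume a Hive table partitioned by a `delivery_state` column and a HAWQ external table named `pxf_sales_part` mapped over it (both names hypothetical, not from this commit). A query of the following shape refers to a single partition, letting PXF exclude every other partition directory instead of scanning them all:

``` sql
-- Sketch only: delivery_state is assumed to be the Hive partition column,
-- so this WHERE clause selects exactly one partition.
SELECT * FROM pxf_sales_part WHERE delivery_state = 'CALIFORNIA';
```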