hawq-commits mailing list archives

From yo...@apache.org
Subject incubator-hawq-docs git commit: HAWQ-1394 document HiveORC profile (closes #104)
Date Mon, 20 Mar 2017 21:19:49 GMT
Repository: incubator-hawq-docs
Updated Branches:
  refs/heads/develop ce67adabd -> 8c37d043a


HAWQ-1394 document HiveORC profile (closes #104)


Project: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/commit/8c37d043
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/tree/8c37d043
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/diff/8c37d043

Branch: refs/heads/develop
Commit: 8c37d043ab8b6ef59315a7bd92acb2c9b45dcedb
Parents: ce67ada
Author: Lisa Owen <lowen@pivotal.io>
Authored: Mon Mar 20 14:19:45 2017 -0700
Committer: David Yozie <yozie@apache.org>
Committed: Mon Mar 20 14:19:45 2017 -0700

----------------------------------------------------------------------
 markdown/pxf/HivePXF.html.md.erb | 183 ++++++++++++++++++++++++++++++++--
 1 file changed, 175 insertions(+), 8 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/8c37d043/markdown/pxf/HivePXF.html.md.erb
----------------------------------------------------------------------
diff --git a/markdown/pxf/HivePXF.html.md.erb b/markdown/pxf/HivePXF.html.md.erb
index bc4e9f6..345bbca 100644
--- a/markdown/pxf/HivePXF.html.md.erb
+++ b/markdown/pxf/HivePXF.html.md.erb
@@ -50,7 +50,7 @@ The PXF Hive plug-in supports several file formats and profiles for accessing th
 | TextFile | Flat file with data in comma-, tab-, or space-separated value format or JSON notation. | Hive, HiveText |
 | SequenceFile | Flat file consisting of binary key/value pairs. | Hive |
 | RCFile | Record columnar data consisting of binary key/value pairs; high row compression rate. | Hive, HiveRC |
-| ORCFile | Optimized row columnar data with stripe, footer, and postscript sections; reduces data size. | Hive |
+| ORCFile | Optimized row columnar data with stripe, footer, and postscript sections; reduces data size. | Hive, HiveORC |
 | Parquet | Compressed columnar data representation. | Hive |
 | Avro | JSON-defined, schema-based data serialization format. | Hive |
 
@@ -82,7 +82,7 @@ The following table summarizes external mapping rules for Hive primitive types.
 
 Hive supports complex data types including array, struct, map, and union. PXF maps each of these complex types to `text`. While HAWQ does not natively support these types, you can create HAWQ functions or application code to extract subcomponents of these complex data types.
 
-An example using complex data types with the `Hive` profile is provided later in this topic.
+Examples using complex data types with the `Hive` and `HiveORC` profiles are provided later in this topic.
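+
+For instance (an illustrative sketch, not part of the committed documentation), standard HAWQ string functions can extract a single key from a map value that PXF has serialized to `text`:
+
+``` sql
+-- propmap is a text column holding a serialized map such as
+-- {"zone":"euro","status":"up"}; a regular expression pulls out one key
+SELECT substring(propmap FROM '"zone":"([^"]+)"') AS zone
+FROM complextypes_hiveorc;  -- external table defined later in this topic
+```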
 
 
 ## <a id="hive_sampledataset"></a>Sample Data Set
@@ -316,7 +316,7 @@ HCatalog integration has the following limitations:
 
 In the previous section, you used HCatalog integration to query a Hive table. You can also create a PXF/HAWQ external table to access Hive table data. This Hive table access mechanism requires that you identify an appropriate Hive profile.
 
-The PXF Hive plug-in supports several Hive-related profiles. These include `Hive`, `HiveText`, and `HiveRC`. The `HiveText` and `HiveRC` profiles are specifically optimized for text and RC file formats, respectively. The `Hive` profile is optimized for all file storage types; use the `Hive` profile when the underlying Hive table is composed of multiple partitions with differing file formats.
+The PXF Hive plug-in supports several Hive-related profiles. These include `Hive`, `HiveText`, `HiveRC`, and `HiveORC`. The `HiveText` and `HiveRC` profiles are specifically optimized for text and RC file formats, respectively, and the `HiveORC` profile is optimized for the ORC file format. The `Hive` profile is optimized for all file storage types; use the `Hive` profile when the underlying Hive table is composed of multiple partitions with differing file formats.
 
 Use the following syntax to create a HAWQ external table representing Hive data:
 
@@ -324,7 +324,7 @@ Use the following syntax to create a HAWQ external table representing Hive data:
 CREATE EXTERNAL TABLE <table_name>
     ( <column_name> <data_type> [, ...] | LIKE <other_table> )
 LOCATION ('pxf://<host>[:<port>]/<hive-db-name>.<hive-table-name>
-    ?PROFILE=Hive|HiveText|HiveRC[&DELIMITER=<delim>'])
+    ?PROFILE=Hive|HiveText|HiveRC|HiveORC[&DELIMITER=<delim>]')
 FORMAT 'CUSTOM|TEXT' (formatter='pxfwritable_import' | delimiter='<delim>')
 ```
 
@@ -336,9 +336,9 @@ Hive-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](..
 | \<port\>    | The PXF port. If \<port\> is omitted, PXF assumes \<host\> identifies a High Availability HDFS Nameservice and connects to the port number designated by the `pxf_service_port` server configuration parameter value. Default is 51200. |
 | \<hive-db-name\>    | The name of the Hive database. If omitted, defaults to the Hive database named `default`. |
 | \<hive-table-name\>    | The name of the Hive table. |
-| PROFILE    | The `PROFILE` keyword must specify one of the values `Hive`, `HiveText`, or `HiveRC`. |
+| PROFILE    | The `PROFILE` keyword must specify one of the values `Hive`, `HiveText`, `HiveRC`, or `HiveORC`. |
 | DELIMITER    | The `DELIMITER` clause is required for both the `HiveText` and `HiveRC` profiles and identifies the field delimiter used in the Hive data set. \<delim\> must be a single ASCII character or specified in hexadecimal representation. |
-| FORMAT (`Hive` profile)   | The `FORMAT` clause must specify `CUSTOM`. The `CUSTOM` format supports only the built-in `pxfwritable_import` `formatter`.   |
+| FORMAT (`Hive` and `HiveORC` profiles)   | The `FORMAT` clause must specify `CUSTOM`. The `CUSTOM` format supports only the built-in `pxfwritable_import` `formatter`.   |
 | FORMAT (`HiveText` and `HiveRC` profiles) | The `FORMAT` clause must specify `TEXT`. The `delimiter` must be specified a second time in '\<delim\>'. |
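+
+For example (an illustrative sketch; the host, port, and table names are placeholders), an external table over an RCFile-backed Hive table supplies the delimiter twice, once in the `LOCATION` URI and once in the `FORMAT` clause:
+
+``` sql
+-- HiveRC requires the field delimiter in both clauses; \x2c is a comma
+CREATE EXTERNAL TABLE sales_hiverc_example(location text, month text, num_orders int, total_sales float8)
+  LOCATION ('pxf://namenode:51200/default.sales_info_rcfile?PROFILE=HiveRC&DELIMITER=\x2c')
+  FORMAT 'TEXT' (delimiter=E',');
+```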
 
 
@@ -401,7 +401,7 @@ Use the PXF `HiveText` profile to create a queryable HAWQ external table from th
 2. Query the external table:
 
     ``` sql
-    postgres=# SELECT * FROM salesinfo_hivetextprofile WHERE location="Beijing";
+    postgres=# SELECT * FROM salesinfo_hivetextprofile WHERE location='Beijing';
     ```
 
     ``` shell
@@ -475,6 +475,96 @@ Use the `HiveRC` profile to query RCFile-formatted data in Hive tables.
      ...
     ```
 
+## <a id="hiveorc-intro"></a>HiveORC Profile
+
+The Optimized Row Columnar (ORC) file format is a columnar file format that provides a highly efficient way to both store and access HDFS data. The ORC format offers improvements over the text and RCFile formats in terms of both compression and performance. HAWQ/PXF supports ORC version 1.2.1.
+
+ORC is type-aware and specifically designed for Hadoop workloads. ORC files store both the type and encoding information for the data in the file. All columns within a single group of row data (also known as a stripe) are stored together on disk in ORC format files. The columnar nature of the ORC format enables read projection, which helps avoid accessing unnecessary columns during a query.
+
+ORC also supports predicate pushdown with built-in indexes at the file, stripe, and row levels, moving the filter operation to the data loading phase.
+
+Refer to the [Apache ORC](https://orc.apache.org/docs/) and Apache Hive [LanguageManual ORC](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC) websites for detailed information about the ORC file format.
+
+Use the `HiveORC` profile to access ORC format data. The `HiveORC` profile provides:
+
+- Enhanced query performance - Column projection information is leveraged to enhance query performance by reducing disk I/O and data payload.
+
+- Optimized predicate pushdown - Predicate pushdown is optimized for the following combinations (see the example query after this list):
+    - `int2`, `int4`, `int8`, `float8`, `text`, `bpchar`, and `boolean` data types and `=`, `>`, `<`, `>=`, `<=`, `IS NULL`, and `IS NOT NULL` operator combinations
+    - `=`, `>`, `<`, `>=`, `<=`, `IS NULL`, and `IS NOT NULL` operators and comparisons between the integer types
+    - `=`, `>`, `<`, `>=`, `<=`, `IS NULL`, and `IS NOT NULL` operators and comparisons between the `float8` and `float4` types
+    - `IN` operator on arrays of `int2`, `int4`, `int8`, `boolean`, and `text`
+
+- Complex type support - You can access Hive tables composed of array, map, struct, and union data types. PXF serializes each of these complex types to `text`.
+
+**Note**: The `HiveORC` profile currently supports access to data stored in ORC format only through a Hive mapped table.
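+
+As an illustration (a hypothetical query against the external table defined in the first example below), a filter that uses a supported type and operator combination can be pushed down into the ORC reader:
+
+``` sql
+-- the predicate on an int4 column with <= can be evaluated against ORC's
+-- built-in min/max statistics, skipping stripes that cannot match
+SELECT location, total_sales FROM salesinfo_hiveORCprofile WHERE num_orders <= 100;
+```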
+
+### <a id="using-hiveorc-profile"></a>Example: Using the HiveORC Profile
+
+In the following example, you will create a Hive table stored in ORC format and use the `HiveORC` profile to query this Hive table.
+
+1. Create a Hive table with ORC file format:
+
+    ``` shell
+    $ HADOOP_USER_NAME=hdfs hive
+    ```
+
+    ``` sql
+    hive> CREATE TABLE sales_info_ORC (location string, month string,
+            number_of_orders int, total_sales double)
+          STORED AS ORC;
+    ```
+
+2. Insert the data from the `sales_info` table into `sales_info_ORC`:
+
+    ``` sql
+    hive> INSERT INTO TABLE sales_info_ORC SELECT * FROM sales_info;
+    ```
+
+    A copy of the sample data set is now stored in ORC format in `sales_info_ORC`.
+
+3. Perform a Hive query on `sales_info_ORC` to verify that the data was loaded successfully:
+
+    ``` sql
+    hive> SELECT * FROM sales_info_ORC;
+    ```
+
+4. Start the `psql` subsystem and turn on timing:
+
+    ``` shell
+    $ psql -d postgres
+    ```
+
+    ``` sql
+    postgres=> \timing
+    Timing is on.
+    ```
+
+5. Use the PXF `HiveORC` profile to create a queryable HAWQ external table from the Hive table named `sales_info_ORC` that you created in Step 1. The `FORMAT` clause must specify `'CUSTOM'`. The `HiveORC` `CUSTOM` format supports only the built-in `'pxfwritable_import'` `formatter`.
+
+    ``` sql
+    postgres=> CREATE EXTERNAL TABLE salesinfo_hiveORCprofile(location text, month text, num_orders int, total_sales float8)
+                 LOCATION ('pxf://namenode:51200/default.sales_info_ORC?PROFILE=HiveORC')
+                 FORMAT 'CUSTOM' (formatter='pxfwritable_import');
+    ```
+
+6. Query the external table:
+
+    ``` sql
+    postgres=> SELECT * FROM salesinfo_hiveORCprofile;
+    ```
+
+    ``` pre
+       location    | month | num_orders | total_sales 
+    ---------------+-------+------------+-------------
+     Prague        | Jan   |        101 |     4875.33
+     Rome          | Mar   |         87 |     1557.39
+     Bangalore     | May   |        317 |     8936.99
+     ...
+
+    Time: 425.416 ms
+    ```
+
 ## <a id="topic_dbb_nz3_ts"></a>Accessing Parquet-Format Hive Tables
 
 The PXF `Hive` profile supports both non-partitioned and partitioned Hive tables that use the Parquet storage format in HDFS. Simply map the table columns using equivalent HAWQ data types. For example, if a Hive table is created using:
@@ -498,8 +588,9 @@ And query the HAWQ external table using:
 postgres=# SELECT fname,lname FROM pxf_parquet_table;
 ```
 
+## <a id="complex_dt_example"></a> Complex Data Types
 
-## <a id="complex_dt_example"></a>Hive Profile Complex Data Type Example
+### <a id="complex_dt_example"></a>Example: Using the Hive Profile with Complex
Data Types
 
 This example employs the `Hive` profile and the array and map complex types, specifically an array of integers and a string key/value pair map.
 
@@ -607,6 +698,82 @@ When specifying an array field in a Hive table, you must identify the terminator
 
     `intarray` and `propmap` are each serialized as text strings.
 
+### <a id="using-hiveorc-profile-complex"></a>Example: Using the HiveORC Profile
with Complex Data Types
+
+In the following example, you will create a Hive table stored in ORC format and populate it from the `table_complextypes` Hive table you created in the previous exercise. You will then use the `HiveORC` profile to query the complex types in this new table.
+
+1. Create a Hive table with ORC file format:
+
+    ``` shell
+    $ HADOOP_USER_NAME=hdfs hive
+    ```
+
+    ``` sql
+    hive> CREATE TABLE table_complextypes_ORC( index int, name string, intarray ARRAY<int>, propmap MAP<string, string>)
+            ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
+            COLLECTION ITEMS TERMINATED BY '%'
+            MAP KEYS TERMINATED BY ':'
+          STORED AS ORC;
+    ```
+
+2. Insert the data from the `table_complextypes` table into `table_complextypes_ORC`:
+
+    ``` sql
+    hive> INSERT INTO TABLE table_complextypes_ORC SELECT * FROM table_complextypes;
+    ```
+
+    A copy of the sample data set is now stored in ORC format in `table_complextypes_ORC`.
+
+3. Perform a Hive query on `table_complextypes_ORC` to verify that the data was loaded successfully:
+
+    ``` sql
+    hive> SELECT * FROM table_complextypes_ORC;
+    ```
+
+    ``` pre
+    OK
+    3       Prague       [1,2,3]    {"zone":"euro","status":"up"}
+    89      Rome         [4,5,6]    {"zone":"euro"}
+    400     Bangalore    [7,8,9]    {"zone":"apac","status":"pending"}
+    ...
+    ```
+
+4. Start the `psql` subsystem:
+
+    ``` shell
+    $ psql -d postgres
+    ```
+
+5. Use the PXF `HiveORC` profile to create a queryable HAWQ external table from the Hive table named `table_complextypes_ORC` that you created in Step 1. The `FORMAT` clause must specify `'CUSTOM'`. The `HiveORC` `CUSTOM` format supports only the built-in `'pxfwritable_import'` `formatter`.
+
+    ``` sql
+    postgres=> CREATE EXTERNAL TABLE complextypes_hiveorc(index int, name text, intarray text, propmap text)
+               LOCATION ('pxf://namenode:51200/default.table_complextypes_ORC?PROFILE=HiveORC')
+                 FORMAT 'CUSTOM' (formatter='pxfwritable_import');
+    ```
+
+    Notice that the integer array and map complex types are mapped to type `text`.
+
+6. Query the external table:
+
+    ``` sql
+    postgres=> SELECT * FROM complextypes_hiveorc;
+    ```
+
+    ``` pre
+     index |    name    | intarray |              propmap               
+    -------+------------+----------+------------------------------------
+         3 | Prague     | [1,2,3]  | {"zone":"euro","status":"up"}
+        89 | Rome       | [4,5,6]  | {"zone":"euro"}
+       400 | Bangalore  | [7,8,9]  | {"zone":"apac","status":"pending"}
+     ...
+
+    ```
+
+    `intarray` and `propmap` are each serialized as text strings.
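+
+    For instance (an illustrative sketch, not part of the committed documentation), you can convert the serialized array back into a native HAWQ array with standard string functions:
+
+    ``` sql
+    -- intarray arrives as text such as [1,2,3]; strip the brackets,
+    -- split on commas, and cast the result to a native integer array
+    SELECT name, string_to_array(btrim(intarray, '[]'), ',')::int[] AS int_elements
+    FROM complextypes_hiveorc;
+    ```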
+
+
+
 ## <a id="partitionfiltering"></a>Partition Filtering
 
 The PXF Hive plug-in supports the Hive partitioning feature and directory structure. This enables partition exclusion on selected HDFS files comprising the Hive table. To use the partition filtering feature to reduce network traffic and I/O, run a PXF query using a `WHERE` clause that refers to a specific partition in the partitioned Hive table.
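+
+For example (an illustrative sketch; the table and partition column are hypothetical), a query such as the following reads only the HDFS files belonging to the matching partition:
+
+``` sql
+-- assumes a Hive table partitioned by a delivery_state column and
+-- exposed through a PXF external table named pxf_sales_part
+SELECT * FROM pxf_sales_part WHERE delivery_state = 'CALIFORNIA';
+```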

