hawq-commits mailing list archives

From yo...@apache.org
Subject [02/14] incubator-hawq-docs git commit: PXF Hive plug-in mods
Date Thu, 27 Oct 2016 23:32:56 GMT
PXF Hive plug-in mods


Project: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/commit/457d703a
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/tree/457d703a
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/diff/457d703a

Branch: refs/heads/develop
Commit: 457d703a3f5c057e241acf985fbc35da34f6a075
Parents: 0398a62
Author: Lisa Owen <lowen@pivotal.io>
Authored: Mon Sep 26 15:40:10 2016 -0700
Committer: Lisa Owen <lowen@pivotal.io>
Committed: Mon Sep 26 15:40:10 2016 -0700

----------------------------------------------------------------------
 pxf/HivePXF.html.md.erb | 141 ++++++++++++++++++++++---------------------
 1 file changed, 72 insertions(+), 69 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/457d703a/pxf/HivePXF.html.md.erb
----------------------------------------------------------------------
diff --git a/pxf/HivePXF.html.md.erb b/pxf/HivePXF.html.md.erb
index 55a0aeb..a1c26c3 100644
--- a/pxf/HivePXF.html.md.erb
+++ b/pxf/HivePXF.html.md.erb
@@ -61,14 +61,14 @@ The following table summarizes external mapping rules for Hive primitive
types.
 
 ### <a id="topic_b4v_g3n_25"></a>Complex Data Types
 
-Hive supports complex data types including array, struct, map, and union. PXF maps each of
these complex types to `text`.  While HAWQ does not natively support these types, you can
create HAWQ functions or application code to extract components of complex data types.
+Hive supports complex data types including array, struct, map, and union. PXF maps each of
these complex types to `text`.  While HAWQ does not natively support these types, you can
create HAWQ functions or application code to extract subcomponents of these complex data types.
 
-A complex data type example is provided later in this section.
+A complex data type example is provided later in this topic.
 
 
 ## <a id="hive_sampledataset"></a>Sample Data Set
 
-Examples used in this section will operate on a common data set. This simple data set models
a retail sales operation and includes fields with the following names and data types:
+Examples used in this topic will operate on a common data set. This simple data set models
a retail sales operation and includes fields with the following names and data types:
 
 - location - text
 - month - text
@@ -117,10 +117,10 @@ hive> CREATE TABLE sales_info (location string, month string,
 
 Notice:
 
-- The `STORED AS textfile` subclause instructs Hive to create the table in TextFile (the
default) format.  Hive textfile format supports comma-, tab-, and space-separated values,
as well as data specified in JSON notation.
+- The `STORED AS textfile` subclause instructs Hive to create the table in Textfile (the
default) format.  Hive Textfile format supports comma-, tab-, and space-separated values,
as well as data specified in JSON notation.
 - The `DELIMITED FIELDS TERMINATED BY` subclause identifies the field delimiter within a
data record (line). The `sales_info` table field delimiter is a comma (`,`).
 
-Load the `pxf_hive_datafile.txt` sample data file into the newly-created `sales_info` table:
+Load the `pxf_hive_datafile.txt` sample data file into the `sales_info` table you just created:
 
 ``` sql
 hive> LOAD DATA local INPATH '/tmp/pxf_hive_datafile.txt' 
@@ -137,6 +137,8 @@ In examples later in this section, you will access the `sales_info` Hive
table d
 
 ## <a id="topic_p2s_lvl_25"></a>Querying External Hive Data
 
+The PXF Hive plug-in supports several Hive-related profiles. These include `Hive`, `HiveText`,
and `HiveRC`.
+
 Use the following syntax to create a HAWQ external table representing Hive data:
 
 ``` sql
@@ -160,7 +162,7 @@ Hive-plug-in-specific keywords and values used in the `CREATE EXTERNAL
TABLE` ca
 | FORMAT (`HiveText` and `HiveRC` profiles) | The `FORMAT` clause must specify `TEXT`. The
`delimiter` must be specified a second time in `delim`. |
 
 
-## <a id="profile_hive"></a>Example: Hive Profile 
+### <a id="profile_hive"></a>Hive Profile 
 
 The `Hive` profile works with any Hive file format table.
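
The creation statement for this example sits outside the hunk below; for orientation, a minimal sketch of a `Hive`-profile external table definition, assuming the sample `sales_info` table in the `default` Hive database and PXF on `namenode:51200` (the `number_of_orders` column name is an assumption; `location`, `month`, and `total_sales` appear elsewhere in this topic):

``` sql
postgres=# CREATE EXTERNAL TABLE salesinfo_hiveprofile(location text, month text,
             number_of_orders int, total_sales float8)
           LOCATION ('pxf://namenode:51200/default.sales_info?profile=Hive')
           FORMAT 'CUSTOM' (formatter='pxfwritable_import');
```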
 
@@ -190,7 +192,7 @@ postgres=# select * from salesinfo_hiveprofile;
 
 ```
 
-## <a id="profile_hivetext"></a>Example: HiveText Profile
+### <a id="profile_hivetext"></a>HiveText Profile
 
 Use the `HiveText` profile to query text formats. The `HiveText` profile is more performant
than the `Hive` profile. 
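
The `HiveText` definition is likewise elided; per the keyword table above, the `FORMAT` clause must specify `TEXT` and the delimiter must appear a second time in `delim`. A sketch under the same column-name assumptions (passing the delimiter in the `LOCATION` string is also an assumption):

``` sql
postgres=# CREATE EXTERNAL TABLE salesinfo_hivetextprofile(location text, month text,
             number_of_orders int, total_sales float8)
           LOCATION ('pxf://namenode:51200/default.sales_info?profile=HiveText&delimiter=\,')
           FORMAT 'TEXT' (delimiter=E',');
```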
 
@@ -214,7 +216,7 @@ Notice:
 Query the external table:
 
 ``` sql
-postgres=# select * from salesinfo_hivetextprofile where location="Beijing"; 
+postgres=# SELECT * FROM salesinfo_hivetextprofile WHERE location='Beijing';
 ```
 
 ``` shell                     
@@ -225,9 +227,9 @@ postgres=# select * from salesinfo_hivetextprofile where location="Beijing";
 (2 rows)
 ```
 
-## <a id="profile_hiverc"></a>Example: HiveRC Profile
+### <a id="profile_hiverc"></a>HiveRC Profile
 
-### <a id="profile_hiverc_rcfiletbl"></a>RCFile Format Hive Table
+#### <a id="profile_hiverc_rcfiletbl"></a>RCFile Format Hive Table
 
 The RCFile Hive format is used for row columnar formatted data. Create a Hive table with
RCFile format:
 
@@ -253,7 +255,7 @@ A copy of the sample data set is now stored in RCFile format in `sales_info_rcfi
 ``` sql
 hive> SELECT * FROM sales_info_rcfile;
 ```
-### <a id="profile_hiverc_pxfquery"></a>PXF Query
+#### <a id="profile_hiverc_pxfquery"></a>PXF Query
 
 Use the `HiveRC` profile to query RCFile-formatted data in Hive tables. The `HiveRC` profile
is more performant than the `Hive` profile. 
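
The full definition is truncated in the hunk header below; it presumably mirrors the `HiveText` sketch above, pointed at the `sales_info_rcfile` table:

``` sql
postgres=# CREATE EXTERNAL TABLE salesinfo_hivercprofile(location text, month text,
             number_of_orders int, total_sales float8)
           LOCATION ('pxf://namenode:51200/default.sales_info_rcfile?profile=HiveRC&delimiter=\,')
           FORMAT 'TEXT' (delimiter=E',');
```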
 
@@ -272,7 +274,7 @@ postgres=# CREATE EXTERNAL TABLE salesinfo_hivercprofile(location text,
month te
 Query the external table:
 
 ``` sql
-postgres=# select location, total_sales from salesinfo_hivercprofile; 
+postgres=# SELECT location, total_sales FROM salesinfo_hivercprofile; 
 ```
 
 ``` shell                     
@@ -285,22 +287,39 @@ postgres=# select location, total_sales from salesinfo_hivercprofile;
  ...
 ```
 
-## <a id="profileperf"></a>Profile Performance Considerations
+### <a id="topic_dbb_nz3_ts"></a>Accessing Parquet-Format Hive Tables
+
+The PXF `Hive` profile supports both non-partitioned and partitioned Hive tables that use
the Parquet storage format in HDFS. Simply map the table columns using equivalent HAWQ data
types. For example, if a Hive table is created using:
+
+``` sql
+hive> CREATE TABLE hive_parquet_table (fname string, lname string, custid int, acctbalance
double) 
+        STORED AS parquet;
+```
+
+Define the HAWQ external table using:
+
+``` sql
+postgres=# CREATE EXTERNAL TABLE pxf_parquet_table (fname text, lname text, custid int, acctbalance
double precision) 
+    LOCATION ('pxf://namenode:51200/hive-db-name.hive_parquet_table?profile=Hive') 
+    FORMAT 'CUSTOM' (formatter='pxfwritable_import'); 
+```
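
Once defined, the external table is queried like any other HAWQ table, for example:

``` sql
postgres=# SELECT fname, acctbalance FROM pxf_parquet_table;
```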
+
+### <a id="profileperf"></a>Profile Performance Considerations
 
 The `HiveRC` and `HiveText` profiles are faster than the generic `Hive` profile.
 
 ?? MORE HERE. ??
 
-## <a id="complex_dt_example"></a>Example: Complex Data Types
+### <a id="complex_dt_example"></a>Complex Data Types Example
 
-Prepare a sample data set for use. This data set will employ the array and map complex types,
specifically an array of integers and a key/value pair map. 
+Prepare a sample data set for use. This data set will employ the array and map complex types,
specifically an array of integers and a string key/value pair map. 
 
 The data set includes fields with the following names and data types:
 
 - index - int
 - name - string
 - intarray - array of integers
-- propmap - map of string keys and string value pairs
+- propmap - map of string key and value pairs
 
 When specifying an array field in a Hive table, you must identify the terminator for each
item in the collection. Similarly, the map key termination character must also be specified.
 
@@ -332,20 +351,20 @@ $ HADOOP_USER_NAME=hdfs hive
 ```
 
 ``` sql
-CREATE TABLE table_complextypes( index int, name string, intarray ARRAY<INT>, propmap
MAP<string, string>) 
-  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
-  COLLECTION ITEMS TERMINATED BY '%' 
-  MAP KEYS TERMINATED BY ':' 
-  STORED AS TEXTFILE;
+hive> CREATE TABLE table_complextypes( index int, name string, intarray ARRAY<int>, propmap MAP<string, string>) 
+        ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
+        COLLECTION ITEMS TERMINATED BY '%' 
+        MAP KEYS TERMINATED BY ':' 
+        STORED AS TEXTFILE;
 ```
 
 Notice:
 
-- The `FIELDS TERMINATED BY` identifies a comma as the field terminator.
-- The `COLLECTION ITEMS TERMINATED BY` specifies the percent sign as the collection items
(array item, map key/value pair) terminator.
-- The `MAP KEYS TERMINATED BY` ientifies a colon as the terminator for map keys.
+- `FIELDS TERMINATED BY` identifies a comma as the field terminator.
+- The `COLLECTION ITEMS TERMINATED BY` subclause specifies the percent sign as the collection
items (array item, map key/value pair) terminator.
+- `MAP KEYS TERMINATED BY` identifies a colon as the terminator for map keys.
 
-Load the `pxf_hive_complex.txt` sample data file into the newly-created `table_complextypes`
table:
+Load the `pxf_hive_complex.txt` sample data file into the `table_complextypes` table you
just created:
 
 ``` sql
 hive> LOAD DATA local INPATH '/tmp/pxf_hive_complex.txt' INTO TABLE table_complextypes;
@@ -364,7 +383,7 @@ hive> SELECT * FROM table_complextypes;
 ...
 ```
 
-Use the PXF `Hive` profile to create a queryable HAWQ external table representing the Hive
table_complextypes:
+Use the PXF `Hive` profile to create a queryable HAWQ external table representing the Hive
`table_complextypes`:
 
 ``` sql
 postgres=# CREATE EXTERNAL TABLE complextypes_hiveprofile(index int, name text, intarray
text, propmap text)
@@ -372,12 +391,12 @@ postgres=# CREATE EXTERNAL TABLE complextypes_hiveprofile(index int,
name text,
              FORMAT 'CUSTOM' (formatter='pxfwritable_import');
 ```
 
-Notice that the integer array and map complex types are mapped to PXF type text.
+Notice that the integer array and map complex types are mapped to type text.
 
 Query the external table:
 
 ``` sql
-select * from complextypes_hiveprofile; 
+postgres=# SELECT * FROM complextypes_hiveprofile; 
 ```
 
 ``` shell                      
@@ -396,6 +415,7 @@ select * from complextypes_hiveprofile;
 (10 rows)
 ```
 
+`intarray` and `propmap` are each text strings.
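
As noted earlier, you can write HAWQ functions or application code to extract subcomponents of these text values. A sketch only, since the exact text rendering is not shown in this hunk (a bracketed, comma-separated array such as `[3,5,7]` is assumed):

``` sql
-- Split the text-mapped Hive array into a HAWQ text array.
-- The '[3,5,7]' rendering is an assumption; adjust btrim/delimiter to match.
postgres=# SELECT index, name,
             string_to_array(btrim(intarray, '[]'), ',') AS intarray_items
           FROM complextypes_hiveprofile;
```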
 
 ## <a id="hcatalog"></a>Using PXF and HCatalog to Query Hive
 
@@ -458,9 +478,11 @@ To obtain a description of a Hive table with HCatalog integration, you
can use t
     
     ``` sql
     postgres=# \d hcatalog.default.sales_info_rcfile;
-
+    ```
+    
+    ``` shell
     PXF Hive Table "default.sales_info_rcfile"
-              Column      |  Type  
+          Column      |  Type  
     ------------------+--------
      location         | text
      month            | text
@@ -472,12 +494,12 @@ To obtain a description of a Hive table with HCatalog integration, you
can use t
 
 **Note:** When using `\d` or `\d+` commands in the `psql` HAWQ client, `hcatalog` will not
be listed as a database. If you use other `psql` compatible clients, `hcatalog` will be listed
as a database with a size value of `-1` since `hcatalog` is not a real database in HAWQ.
 
-Alternatively, you can use the `pxf\_get\_item\_fields` user-defined function (UDF) to obtain
Hive table descriptions from other client interfaces or third-party applications. The UDF
takes a PXF profile and a table pattern string as its input parameters.  **Note:** The only
supported input profile at this time is `'Hive'`.
+Alternatively, you can use the `pxf_get_item_fields` user-defined function (UDF) to obtain
Hive table descriptions from other client interfaces or third-party applications. The UDF
takes a PXF profile and a table pattern string as its input parameters.  **Note:** The only
supported input profile at this time is `'Hive'`.
 
 - The following statement returns a description of a specific table. The description includes
path, itemname (table), fieldname, and fieldtype.
 
     ``` sql
-    postgres=# select * from pxf_get_item_fields('Hive','default.sales_info_rcfile');
+    postgres=# SELECT * FROM pxf_get_item_fields('Hive','default.sales_info_rcfile');
     ```
     
     ``` pre
@@ -492,32 +514,32 @@ Alternatively, you can use the `pxf\_get\_item\_fields` user-defined
function (U
 - The following statement returns table descriptions from the default database.
 
     ``` sql
-    postgres=# select * from pxf_get_item_fields('Hive','default.*');
+    postgres=# SELECT * FROM pxf_get_item_fields('Hive','default.*');
     ```
 
 - The following statement returns a description of the entire schema.
 
     ``` sql
-    postgres=# select * from pxf_get_item_fields('Hive', '*.*');
+    postgres=# SELECT * FROM pxf_get_item_fields('Hive', '*.*');
     ```
 
 ### <a id="topic_r5k_pst_25"></a>Limitations
 
 HCatalog integration has the following limitations:
 
--   HCatalog integration queries and describe commands do not support complex types; only
primitive types are supported. Use PXF external tables to query complex types in Hive. (See
[Hive Complex Types](#topic_b4v_g3n_25) for example.)
--   Even for primitive types, HCatalog metadata descriptions produced by `\d` and` \d+` are
converted to HAWQ types. For example, the Hive type `tinyint` is converted to HAWQ type `int2`.
(See [Data Types](../reference/HAWQDataTypes.html) for a list of data types in HAWQ.)
+-   HCatalog integration queries and describe commands do not support complex types; only
primitive types are supported. Use PXF external tables to query complex types in Hive. (See
[Complex Types Example](#complex_dt_example).)
+-   Even for primitive types, HCatalog metadata descriptions produced by `\d` and `\d+` are converted to HAWQ types. For example, the Hive type `tinyint` is converted to HAWQ type `int2`. (See [Data Type Mapping](#hive_primdatatypes).)
 -   HAWQ reserves the database name `hcatalog` for system use. You cannot connect to or alter
the system `hcatalog` database.
 
 ## <a id="partitionfiltering"></a>Partition Filtering
 
-The PXF Hive plug-in supports the Hive partitioning feature and directory structure. This
enables partition exclusion on selected HDFS files comprising the Hive table. To use the
partition filtering feature to reduce network traffic and I/O, run a PXF query using a WHERE
clause that refers to a specific partition in the partitioned Hive table.
+The PXF Hive plug-in supports the Hive partitioning feature and directory structure. This
enables partition exclusion on selected HDFS files comprising the Hive table. To use the
partition filtering feature to reduce network traffic and I/O, run a PXF query using a `WHERE`
clause that refers to a specific partition in the partitioned Hive table.
 
-To take advantage of PXF partition filtering push-down, the Hive and PXF partition field
names should be the same. Otherwise, PXF ignores Partition filtering and the filtering is
performed on the HAWQ side, impacting performance.
+To take advantage of PXF partition filtering push-down, the Hive and PXF partition field
names should be the same. Otherwise, PXF ignores partition filtering and the filtering is
performed on the HAWQ side, impacting performance.
 
 **Note:** The Hive plug-in only filters on partition columns, not on other table attributes.
 
-### <a id="example2"></a>Example
+### <a id="example2"></a>Create Partitioned Hive Table
 
 Create a Hive table `sales_part` with two partition columns, `delivery_state` and `delivery_city`:
 
@@ -530,13 +552,13 @@ hive> CREATE TABLE sales_part (name string, type string, supplier_key
int, price
 Load data into this Hive table and add some partitions:
 
 ``` sql
-INSERT INTO TABLE sales_part PARTITION(delivery_state = 'CALIFORNIA', delivery_city = 'Fresno')
VALUES ('block', 'widget', 33, 15.17);
-INSERT INTO TABLE sales_part PARTITION(delivery_state = 'CALIFORNIA', delivery_city = 'Sacramento')
VALUES ('cube', 'widget', 11, 1.17);
-INSERT INTO TABLE sales_part PARTITION(delivery_state = 'NEVADA', delivery_city = 'Reno')
VALUES ('dowel', 'widget', 51, 31.82);
-INSERT INTO TABLE sales_part PARTITION(delivery_state = 'NEVADA', delivery_city = 'Las Vegas')
VALUES ('px49', 'pipe', 52, 99.82);
+hive> INSERT INTO TABLE sales_part PARTITION(delivery_state = 'CALIFORNIA', delivery_city
= 'Fresno') VALUES ('block', 'widget', 33, 15.17);
+hive> INSERT INTO TABLE sales_part PARTITION(delivery_state = 'CALIFORNIA', delivery_city
= 'Sacramento') VALUES ('cube', 'widget', 11, 1.17);
+hive> INSERT INTO TABLE sales_part PARTITION(delivery_state = 'NEVADA', delivery_city
= 'Reno') VALUES ('dowel', 'widget', 51, 31.82);
+hive> INSERT INTO TABLE sales_part PARTITION(delivery_state = 'NEVADA', delivery_city
= 'Las Vegas') VALUES ('px49', 'pipe', 52, 99.82);
 ```
 
-The Hive storage directory structure appears as follows:
+The Hive storage directory structure for the `sales_part` table appears as follows:
 
 ``` pre
 $ sudo -u hdfs hdfs dfs -ls -R /apps/hive/warehouse/sales_part
@@ -546,7 +568,7 @@ $ sudo -u hdfs hdfs dfs -ls -R /apps/hive/warehouse/sales_part
 /apps/hive/warehouse/sales_part/delivery_state=NEVADA/delivery_city='Las Vegas'/
 ```
 
-To define a HAWQ PXF table that will read this Hive table and take advantage of partition
filter push-down, define the fields corresponding to the Hive partition fields at the end
of the attribute list. In HiveQL, a `select                         *` statement on a
partitioned table shows the partition fields at the end of the record.
+To define a HAWQ PXF table that will read this Hive table and take advantage of partition
filter push-down, define the fields corresponding to the Hive partition fields at the end
of the `CREATE EXTERNAL TABLE` attribute list. In HiveQL, a `SELECT *` statement on a
partitioned table shows the partition fields at the end of the record.
 
 ``` sql
 postgres=# CREATE EXTERNAL TABLE pxf_sales_part(
@@ -563,7 +585,7 @@ FORMAT 'custom' (FORMATTER='pxfwritable_import');
 postgres=# SELECT * FROM pxf_sales_part;
 ```
 
-### <a id="example3"></a>Example
+### <a id="example3"></a>Query Without Pushdown
 
 In the following example, the HAWQ query filters on the `delivery_city` partition `Sacramento`. The filter on `item_name` is not pushed down, since it is not a partition column; it is applied on the HAWQ side after all the data in the `Sacramento` partition has been transferred for processing.
 
@@ -571,7 +593,7 @@ In the following example, the HAWQ query filters the `delivery_city` partition
`
 postgres=# SELECT * FROM pxf_sales_part WHERE delivery_city = 'Sacramento' AND item_name
= 'cube';
 ```
 
-### <a id="example4"></a>Example
+### <a id="example4"></a>Query With Pushdown
 
 The following HAWQ query reads all the data under `delivery_state` partition `CALIFORNIA`,
regardless of the city.
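
The statement itself sits outside this hunk; it presumably reads:

``` sql
postgres=# SELECT * FROM pxf_sales_part WHERE delivery_state = 'CALIFORNIA';
```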
 
@@ -585,8 +607,6 @@ This topic describes a difference in query results between Hive and PXF
queries
 
 Similar to Hive, PXF represents a table's partitioning columns as columns that are appended
to the end of the table. However, PXF translates any column value in a default partition to
a NULL value. This means that a HAWQ query that includes an IS NULL filter on a partitioning
column can return different results than the same Hive query.
 
-### <a id="topic_g4r_4wh_1s"></a>Example
-
 Consider a Hive partitioned table that is created with the statement:
 
 ``` sql
@@ -603,32 +623,15 @@ The table is loaded with five rows that contain the following data:
 5.0    2013-12-06
 ```
 
-In this case, the insertion of row 4 creates a Hive default partition, because the partition
column "date" contains a null value.
+The insertion of row 4 creates a Hive default partition, because the partition column `xdate`
contains a null value.
 
-In Hive, any query that filters on the partition column omits data in the default partition.
For example the following query returns no rows:
+In Hive, any query that filters on the partition column omits data in the default partition.
For example, the following query returns no rows:
 
 ``` sql
-hive> select * from sales where xdate is null;
+hive> SELECT * FROM sales WHERE xdate IS NULL;
 ```
 
 However, if you map this table as a PXF external table in HAWQ, all default partition values
are translated into actual NULL values. In HAWQ, executing the same query against the PXF
table returns row 4 as the result, because the filter matches the NULL value.
 
 Keep this behavior in mind when executing IS NULL queries on Hive partitioned tables.
 
-## <a id="topic_dbb_nz3_ts"></a>Accessing Hive Tables in Parquet Format
-
-The PXF Hive profile supports both non-partitioned and partitioned Hive tables that use the
Parquet storage format in HDFS. Simply map the table columns using equivalent HAWQ data types.
For example, if a Hivw table is created using:
-
-``` sql
-hive> CREATE TABLE hive_parquet_table (fname string, lname string, custid int, acctbalance
double) 
-        STORED AS parquet;
-```
-
-Define the HAWQ external table using:
-
-``` sql
-postgres=# CREATE EXTERNAL TABLE pxf_parquet_table (fname text, lname text, custid int, acctbalance
double precision) 
-    LOCATION ('pxf://namenode:51200/hive-db-name.hive_parquet_table?profile=Hive') 
-    FORMAT 'CUSTOM' (formatter='pxfwritable_import'); 
-```
-

