hawq-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From dyozie <...@git.apache.org>
Subject [GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...
Date Thu, 27 Oct 2016 16:11:16 GMT
Github user dyozie commented on a diff in the pull request:

    --- Diff: pxf/HivePXF.html.md.erb ---
    @@ -2,121 +2,450 @@
     title: Accessing Hive Data
    -This topic describes how to access Hive data using PXF. You have several options for
querying data stored in Hive. You can create external tables in PXF and then query those tables,
or you can easily query Hive tables by using HAWQ and PXF's integration with HCatalog. HAWQ
accesses Hive table metadata stored in HCatalog.
    +Apache Hive is a distributed data warehousing infrastructure.  Hive facilitates managing
large data sets supporting multiple data formats, including comma-separated value (.csv),
RC, ORC, and parquet. The PXF Hive plug-in reads data stored in Hive, as well as HDFS or HBase.
    +This section describes how to use PXF to access Hive data. Options for querying data
stored in Hive include:
    +-  Creating an external table in PXF and querying that table
    +-  Querying Hive tables via PXF's integration with HCatalog
     ## <a id="installingthepxfhiveplugin"></a>Prerequisites
    -Check the following before using PXF to access Hive:
    +Before accessing Hive data with HAWQ and PXF, ensure that:
    --   The PXF HDFS plug-in is installed on all cluster nodes.
    +-   The PXF HDFS plug-in is installed on all cluster nodes. See [Installing PXF Plug-ins](InstallPXFPlugins.html)
for PXF plug-in installation information.
     -   The PXF Hive plug-in is installed on all cluster nodes.
     -   The Hive JAR files and conf directory are installed on all cluster nodes.
    --   Test PXF on HDFS before connecting to Hive or HBase.
    +-   You have tested PXF on HDFS.
     -   You are running the Hive Metastore service on a machine in your cluster. 
     -   You have set the `hive.metastore.uris` property in the `hive-site.xml` on the NameNode.
    +## <a id="topic_p2s_lvl_25"></a>Hive File Formats
    +Hive supports several file formats:
    +-   TextFile - flat file with data in comma-, tab-, or space-separated value format or
JSON notation
    +-   SequenceFile - flat file consisting of binary key/value pairs
    +-   RCFile - record columnar data consisting of binary key/value pairs; high row compression
    +-   ORCFile - optimized row columnar data with stripe, footer, and postscript sections;
reduces data size
    +-   Parquet - compressed columnar data representation
    +-   Avro - JSON-defined, schema-based data serialization format
    +Refer to [File Formats](https://cwiki.apache.org/confluence/display/Hive/FileFormats)
for detailed information about the file formats supported by Hive.
    +The PXF Hive plug-in supports the following profiles for accessing the Hive file formats
listed above. These include:
    +- `Hive`
    +- `HiveText`
    +- `HiveRC`
    +## <a id="topic_p2s_lvl_29"></a>Data Type Mapping
    +### <a id="hive_primdatatypes"></a>Primitive Data Types
    +To represent Hive data in HAWQ, map data values that use a primitive data type to HAWQ
columns of the same type.
    +The following table summarizes external mapping rules for Hive primitive types.
    +| Hive Data Type  | Hawq Data Type |
    +| boolean    | bool |
    +| int   | int4 |
    +| smallint   | int2 |
    +| tinyint   | int2 |
    +| bigint   | int8 |
    +| decimal  |  numeric  |
    +| float   | float4 |
    +| double   | float8 |
    +| string   | text |
    +| binary   | bytea |
    +| char   | bpchar |
    +| varchar   | varchar |
    +| timestamp   | timestamp |
    +| date   | date |
    +### <a id="topic_b4v_g3n_25"></a>Complex Data Types
    +Hive supports complex data types including array, struct, map, and union. PXF maps each
of these complex types to `text`.  While HAWQ does not natively support these types, you can
create HAWQ functions or application code to extract subcomponents of these complex data types.
    +An example using complex data types is provided later in this topic.
    +## <a id="hive_sampledataset"></a>Sample Data Set
    +Examples used in this topic will operate on a common data set. This simple data set models
a retail sales operation and includes fields with the following names and data types:
    +- location - text
    +- month - text
    +- number\_of\_orders - integer
    +- total\_sales - double
    +Prepare the sample data set for use:
    +1. First, create a text file:
    +    ```
    +    $ vi /tmp/pxf_hive_datafile.txt
    +    ```
    +2. Add the following data to `pxf_hive_datafile.txt`; notice the use of the comma `,`
to separate the four field values:
    +    ```
    +    Prague,Jan,101,4875.33
    +    Rome,Mar,87,1557.39
    +    Bangalore,May,317,8936.99
    +    Beijing,Jul,411,11600.67
    +    San Francisco,Sept,156,6846.34
    +    Paris,Nov,159,7134.56
    +    San Francisco,Jan,113,5397.89
    +    Prague,Dec,333,9894.77
    +    Bangalore,Jul,271,8320.55
    +    Beijing,Dec,100,4248.41
    +    ```
    +Make note of the path to `pxf_hive_datafile.txt`; you will use it in later exercises.
     ## <a id="hivecommandline"></a>Hive Command Line
    -To start the Hive command line and work directly on a Hive table:
    +The Hive command line is a subsystem similar to that of `psql`. To start the Hive command
     ``` shell
    -$ hive
    +$ HADOOP_USER_NAME=hdfs hive
    -Here is an example of how to create and load data into a sample Hive table from an
existing file.
    +The default Hive database is named `default`. 
    -``` sql
    -hive> CREATE TABLE test (name string, type string, supplier_key int, full_price double)
row format delimited fields terminated by ',';
    -hive> LOAD DATA local inpath '/local/path/data.txt' into table test;
    +### <a id="hivecommandline_createdb"></a>Example: Create a Hive Database
    -## <a id="topic_p2s_lvl_25"></a>Using PXF Tables to Query Hive
    +Create a Hive table to expose our sample data set.
    -Hive tables are defined in a specific way in PXF, regardless of the underlying file storage
format. The PXF Hive plug-ins automatically detect source tables in the following formats:
    +1. Create a Hive table named `sales_info` in the `default` database:
    --   Text based
    --   SequenceFile
    --   RCFile
    --   ORCFile
    --   Parquet
    --   Avro
    +    ``` sql
    +    hive> CREATE TABLE sales_info (location string, month string,
    +            number_of_orders int, total_sales double)
    +            STORED AS textfile;
    +    ```
    -The source table can also be a combination of these types. The PXF Hive plug-in uses
this information to query the data in runtime.
    +    Notice that:
    --   **[Syntax Example](../pxf/HivePXF.html#syntax2)**
    +    - The `STORED AS textfile` subclause instructs Hive to create the table in Textfile
(the default) format.  Hive Textfile format supports comma-, tab-, and space-separated values,
as well as data specified in JSON notation.
    +    - The `DELIMITED FIELDS TERMINATED BY` subclause identifies the field delimiter within
a data record (line). The `sales_info` table field delimiter is a comma (`,`).
    --   **[Hive Complex Types](../pxf/HivePXF.html#topic_b4v_g3n_25)**
    +2. Load the `pxf_hive_datafile.txt` sample data file into the `sales_info` table you
just created:
    -### <a id="syntax2"></a>Syntax Example
    +    ``` sql
    +    hive> LOAD DATA local INPATH '/tmp/pxf_hive_datafile.txt'
    +            INTO TABLE sales_info;
    +    ```
    +3. Perform a query on `sales_info` to verify the data was loaded successfully:
    --- End diff --
    Global edit to change "verify the data" -> "verify that the data"

If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.

View raw message