hawq-dev mailing list archives

From dyozie <...@git.apache.org>
Subject [GitHub] incubator-hawq-docs pull request #17: Updates for hawq register
Date Fri, 30 Sep 2016 18:55:50 GMT
Github user dyozie commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/17#discussion_r81397353
  
    --- Diff: reference/cli/admin_utilities/hawqregister.html.md.erb ---
    @@ -2,102 +2,83 @@
     title: hawq register
     ---
     
    -Loads and registers external parquet-formatted data in HDFS into a corresponding table in HAWQ.
    +Loads and registers AO- or Parquet-formatted data in HDFS into a corresponding table in HAWQ.
     
     ## <a id="topic1__section2"></a>Synopsis
     
     ``` pre
    -hawq register <databasename> <tablename> <hdfspath> 
    +Usage 1:
    +hawq register [<connection_options>] [-f <hdfsfilepath>] [-e <eof>] <tablename>
    +
    +Usage 2:
    +hawq register [<connection_options>] [-c <configfilepath>] [--force] <tablename>
    +
    +Connection Options:
          [-h <hostname>] 
          [-p <port>] 
          [-U <username>] 
          [-d <database>]
    -     [-t <tablename>] 
    +     
    +Misc. Options:
          [-f <filepath>] 
    +     [-e <eof>]
    +     [--force] 
          [-c <yml_config>]  
     hawq register help | -? 
     hawq register --version
     ```
     
     ## <a id="topic1__section3"></a>Prerequisites
     
    -The client machine where `hawq register` is executed must have the following:
    +The client machine where `hawq register` is executed must meet the following conditions:
     
     -   Network access to and from all hosts in your HAWQ cluster (master and segments) and the hosts where the data to be loaded is located.
    +-   The Hadoop client must be configured and the HDFS filepath specified.
     -   The files to be registered and the HAWQ table must be located in the same HDFS cluster.
     -   The target table DDL is configured with the correct data type mapping.
     
     ## <a id="topic1__section4"></a>Description
     
    -`hawq register` is a utility that loads and registers existing or external parquet data in HDFS into HAWQ, so that it can be directly ingested and accessed through HAWQ. Parquet data from the file or directory in the specified path is loaded into the appropriate HAWQ table directory in HDFS and the utility updates the corresponding HAWQ metadata for the files. 
    +`hawq register` is a utility that loads and registers existing data files or folders in HDFS into HAWQ internal tables, allowing HAWQ to read the data directly and to apply internal table processing, such as transaction support and high-performance access, without needing to load or copy the data. Data from the file or directory specified by \<hdfsfilepath\> is loaded into the appropriate HAWQ table directory in HDFS, and the utility updates the corresponding HAWQ metadata for the files. 
     
    -Only parquet tables can be loaded using the `hawq register` command. Metadata for the parquet file(s) and the destination table must be consistent. Different data types are used by HAWQ tables and parquet tables, so the data is mapped. You must verify that the structure of the parquet files and the HAWQ table are compatible before running `hawq register`. 
    +You can use `hawq register` to:
     
    -Note: only HAWQ or HIVE-generated parquet tables are currently supported.
    +-  Load and register external Parquet-formatted file data generated by an external system such as Hive or Spark.
    +-  Recover cluster data from a backup cluster.
     
    -###Limitations for Registering Hive Tables to HAWQ
    -The currently-supported data types for generating Hive tables into HAWQ tables are: boolean, int, smallint, tinyint, bigint, float, double, string, binary, char, and varchar.  
    +Two usage models are available.
     
    -The following HIVE data types cannot be converted to HAWQ equivalents: timestamp, decimal, array, struct, map, and union.   
    +### Usage Model 1: Register file data to an existing table
     
    +`hawq register [-h hostname] [-p port] [-U username] [-d databasename] [-f filepath] [-e eof] <tablename>`
     
    -## <a id="topic1__section5"></a>Options
    -
    -**General Options**
    -
    -<dt>-? (show help) </dt>  
    -<dd>Show help, then exit.
    -
    -<dt>-\\\-version  </dt> 
    -<dd>Show the version of this utility, then exit.</dd>
    -
    -
    -**Connection Options**
    -
    -<dt>-h \<hostname\> </dt>
    -<dd>Specifies the host name of the machine on which the HAWQ master database server is running. If not specified, reads from the environment variable `$PGHOST` or defaults to `localhost`.</dd>
    -
    -<dt> -p \<port\> </dt> 
    -<dd>Specifies the TCP port on which the HAWQ master database server is listening for connections. If not specified, reads from the environment variable `$PGPORT` or defaults to 5432.</dd>
    +Metadata for the Parquet file(s) and the destination table must be consistent. Different data types are used by HAWQ tables and Parquet files, so the data is mapped. Refer to the section [Data Type Mapping](hawqregister.html#topic1__section7) below. You must verify that the structure of the Parquet files and the HAWQ table are compatible before running `hawq register`. 
     
    -<dt>-U \<username\> </dt> 
    -<dd>The database role name to connect as. If not specified, reads from the environment variable `$PGUSER` or defaults to the current system user name.</dd>
    +#### Limitations
    +Only HAWQ or Hive-generated Parquet tables are supported.
    +Hash tables and partitioned tables are not supported in this usage model.
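    +
    +The following invocation is a minimal sketch (the file path and table name are illustrative); it registers a single Parquet file in HDFS into the existing table `parquet_table` in the database `postgres`:
    +
    +``` pre
    +$ hawq register -d postgres -f hdfs://localhost:8020/temp/hive.paq parquet_table
    +```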
     
    -<dt>-d  , --database \<databasename\>  </dt>
    -<dd>The database to register the parquet HDFS data into. The default is `postgres`<dd>
    +### Usage Model 2: Use information from a YAML configuration file to register data
      
    -<dt>-t , --tablename \<tablename\> </dt>
    -<dd>The HAWQ table that will store the parquet data. The table cannot use hash distribution: only tables using random distribution can be registered into HAWQ.</dd>
    -
    -<dt>-f , --filepath \<hdfspath\></dt>
    -<dd>The path of the file or directory in HDFS containing the files to be registered.</dd>
    -
    -<dt>-c , --config \<yml_config\> </dt> 
    -<dd>Registers a YAML-format configuration file into HAWQ.</dd>
    -
    -
    +`hawq register [-h hostname] [-p port] [-U username] [-d databasename] [-c configfile] [--force] <tablename>`
     
    -## <a id="topic1__section6"></a>Examples
    +Files generated by the `hawq extract` command are registered through use of metadata in a YAML configuration file. Both AO and Parquet tables can be registered. Tables need not exist in HAWQ before being registered.
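    +
    +For example, assuming a table named `sales` whose metadata was previously written to a YAML file by `hawq extract` (the file and table names are illustrative):
    +
    +``` pre
    +$ hawq extract -d postgres -o sales.yml sales
    +$ hawq register -d postgres -c sales.yml sales
    +```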
     
    -This example shows how to register a HIVE-generated parquet file in HDFS into the table `parquet_table` in HAWQ, which is in the database named `postgres`. The file path of the HIVE-generated file is `hdfs://localhost:8020/temp/hive.paq`.
    -
    -For the purposes of this example, assume that the location of the database is `hdfs://localhost:8020/hawq_default`, the tablespace id is 16385, the database id is 16387, the table filenode id is 77160, and the last file under the filenode is numbered 7.
    -
    -Enter:
    -
    -``` pre
    -$ hawq register postgres parquet_table hdfs://localhost:8020/temp/hive.paq
    -```
    +The register process behaves differently according to the following conditions: 
     
    -After running the `hawq register` command for the file location `hdfs://localhost:8020/temp/hive.paq`, the corresponding new location of the file in HDFS is: `hdfs://localhost:8020/hawq_default/16385/16387/77160/8`. The command then updates the metadata of the table `parquet_table` in HAWQ, which is contained in the table `pg_aoseg.pg_paqseg_77160`. pg\_aoseg is a fixed schema for row-oriented and parquet AO tables. For row-oriented tables, the table name prefix is pg\_aoseg; for parquet tables, the prefix is pg\_paqseg. 77160 is the relation id of the table.
    +-  If the table already exists, the files are appended to it.
    +-  If the table does not exist, it is created and registered into HAWQ. 
    +-  If the -\-force option is used, the data in existing catalog tables is erased and re-registered (see the example below).
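    +
    +For example, to erase and re-register the catalog information for the illustrative `sales` table from its YAML file:
    +
    +``` pre
    +$ hawq register -d postgres -c sales.yml --force sales
    +```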
     
    -To locate the table, you can either find the relation ID by looking up the catalog table pg\_class by running `select oid from pg_class where relname=$relname` or by finding the table name by using the command `select segrelid from pg_appendonly where relid = $relid` then running `select relname from pg_class where oid = segrelid`.
    +### Limitations for Registering Hive Tables to HAWQ
    +The data types currently supported for registering Hive tables into HAWQ tables are: boolean, int, smallint, tinyint, bigint, float, double, string, binary, char, and varchar.  
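    +
    +As a minimal sketch, a compatible HAWQ DDL for a Hive table with an `int` and a `string` column might look like the following (the table and column names are hypothetical; see the mapping tables below for the full set of equivalents):
    +
    +``` pre
    +$ psql -d postgres -c 'CREATE TABLE hive_sales (id int, region varchar) WITH (appendonly=true, orientation=parquet);'
    +```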
     
    -**Recommendation:** Before running ```hawq register```, create a copy of the parquet file to be registered, then run ```hawq register``` on the copy. This leaves the original file available for additional Hive queries or if a data mapping error is encountered.
    +The following Hive data types cannot be converted to HAWQ equivalents: timestamp, decimal, array, struct, map, and union.   
     
    -##Data Type Mapping<a id="topic1__section7"></a>
    +### Data Type Mapping<a id="topic1__section7"></a>
     
    -HAWQ and parquet tables and HIVE and HAWQ tables use different data types. Mapping must be used for compatibility. You are responsible for making sure your implementation is mapped to the appropriate data type before running `hawq register`. The tables below show equivalent data types, if available.
    --- End diff --
    
    See previous edit.


