hawq-dev mailing list archives

From ictmalili <...@git.apache.org>
Subject [GitHub] incubator-hawq-docs pull request #17: Updates for hawq register
Date Thu, 29 Sep 2016 01:46:40 GMT
Github user ictmalili commented on a diff in the pull request:

    --- Diff: datamgmt/load/g-register_files.html.md.erb ---
    @@ -0,0 +1,214 @@
    +title: Registering Files into HAWQ Internal Tables
    +The `hawq register` utility loads and registers HDFS data files or folders into HAWQ
internal tables. Files can be read directly, rather than having to be copied or loaded, resulting
in higher performance and more efficient transaction processing.
    +Data from the file or directory specified by \<hdfsfilepath\> is loaded into the appropriate HAWQ table directory in HDFS, and the utility updates the corresponding HAWQ metadata for the files. Either AO or Parquet-formatted files in HDFS can be loaded into a corresponding table in HAWQ.
    +You can use `hawq register` either to:
    +-  Load and register external Parquet-formatted file data generated by an external system
such as Hive or Spark.
    +-  Recover cluster data from a backup cluster for disaster recovery. 
    +Requirements for running `hawq register` on the client server are:
    +-   Network access to and from all hosts in your HAWQ cluster (master and segments) and
the hosts where the data to be loaded is located.
    +-   The Hadoop client must be configured and the HDFS filepath specified.
    +-   The files to be registered and the HAWQ table must be located in the same HDFS cluster.
    +-   The target table DDL is configured with the correct data type mapping.
    +##Registering Externally Generated HDFS File Data to an Existing Table<a id="topic1__section2"></a>
    +Files or folders in HDFS can be registered into an existing table, allowing them to be
managed as a HAWQ internal table. When registering files, you can optionally specify the maximum
amount of data to be loaded, in bytes, using the `--eof` option. If registering a folder,
the actual file sizes are used. 
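    +For instance, to cap registration of a single file at 1 MB (1048576 bytes), the `--eof` option could be combined with the file registration form shown later in this topic (the option placement here is illustrative):
    +``` pre
    +$ hawq register postgres --eof 1048576 -f hdfs://localhost:8020/temp/hive.paq parquet_table
    +```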
    +Only HAWQ or Hive-generated Parquet tables are supported. Partitioned tables are not
supported. Attempting to register these tables will result in an error.
    +Metadata for the Parquet file(s) and the destination table must be consistent. Different data types are used by HAWQ tables and Parquet files, so data must be mapped. You must verify that the structure of the Parquet files and the HAWQ table are compatible before running `hawq register`.
    +We recommend creating a copy of the Parquet file to be registered before running `hawq register`. You can then run `hawq register` on the copy, leaving the original file available for additional Hive queries or in case a data mapping error is encountered.
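    +For example, the copy could be made with the standard HDFS shell before registering (paths are illustrative):
    +``` pre
    +$ hdfs dfs -cp /temp/hive.paq /temp/hive_register.paq
    +$ hawq register postgres -f hdfs://localhost:8020/temp/hive_register.paq parquet_table
    +```
    +If the registration fails because of a data mapping error, the original `/temp/hive.paq` remains untouched and queryable from Hive.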
    +###Limitations for Registering Hive Tables to HAWQ
    +The data types currently supported when registering Hive tables into HAWQ tables are: boolean, int, smallint, tinyint, bigint, float, double, string, binary, char, and varchar.
    +The following Hive data types cannot be converted to HAWQ equivalents: timestamp, decimal, array, struct, map, and union.
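    +As an illustration of a compatible mapping (the table and column names are hypothetical), a Hive table defined with the supported types int, string, and double could be matched by a HAWQ Parquet table such as:
    +``` pre
    +CREATE TABLE parquet_table (id int, name varchar, score float8)
    +WITH (appendonly=true, orientation=parquet);
    +```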
    +###Example: Registering a Hive-Generated Parquet File
    +This example shows how to register a Hive-generated Parquet file in HDFS into the table `parquet_table` in HAWQ, which is in the database named `postgres`. The file path of the Hive-generated file is `hdfs://localhost:8020/temp/hive.paq`.
    +In this example, the location of the database is `hdfs://localhost:8020/hawq_default`,
the tablespace id is 16385, the database id is 16387, the table filenode id is 77160, and
the last file under the filenode is numbered 7.
    +``` pre
    +$ hawq register postgres -f hdfs://localhost:8020/temp/hive.paq parquet_table
    +```
    +After running the `hawq register` command for the file location `hdfs://localhost:8020/temp/hive.paq`, the corresponding new location of the file in HDFS is `hdfs://localhost:8020/hawq_default/16385/16387/77160/8`.

    +The command then updates the metadata of the table `parquet_table` in HAWQ, which is contained in the table `pg_aoseg.pg_paqseg_77160`. `pg_aoseg` is a fixed schema for row-oriented and Parquet AO tables. For row-oriented tables, the table name prefix is `pg_aoseg`; for Parquet tables, the table name prefix is `pg_paqseg`. 77160 is the relation id of the table.
    +To locate the table, either find the relation ID by looking up the catalog table pg\_class in SQL:
    +``` pre
    +SELECT oid FROM pg_class WHERE relname = $relname;
    +```
    +or find the table name by first running:
    +``` pre
    +SELECT segrelid FROM pg_appendonly WHERE relid = $relid;
    +```
    +then running:
    +``` pre
    +SELECT relname FROM pg_class WHERE oid = segrelid;
    +```
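    +For example, using the relation id 77160 from this topic, the two lookups can be combined into a single query:
    +``` pre
    +SELECT relname FROM pg_class
    +WHERE oid = (SELECT segrelid FROM pg_appendonly WHERE relid = 77160);
    +```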
    +##Registering Data Using Information from a .yml Configuration File<a id="topic1__section3"></a>
    +The `hawq register` command can register HDFS files using metadata loaded from a .yml configuration file, specified with the `--config <yml_config>` option. Both AO and Parquet tables can be registered. Tables need not exist in HAWQ before being registered. This function can be useful in disaster recovery, allowing information created by the `hawq extract` command to be used to re-create HAWQ tables.
    +You can also use a .yml configuration file to append HDFS files to an existing HAWQ table, or to create a table and register it into HAWQ.
    +For disaster recovery, tables can be re-registered using the HDFS files and a .yml file.
The clusters are assumed to have data periodically imported from Cluster A to Cluster B. 
    +Data is registered according to the following conditions: 
    +-  Existing tables have files appended to the existing HAWQ table.
    +-  If a table does not exist, it is created and registered into HAWQ. The catalog table
will be updated with the file size specified by the .yml file.
    +-  If the `--force` option is used, the data in existing catalog tables is erased and re-registered. All HDFS-related catalog contents in `pg_aoseg.pg_paqseg_$relid` are cleared. The original files on HDFS are retained.
    +-  If the `--repair` option is used, data is rolled back to a previous state, as specified in the .yml file. Any files generated after the checkpoint specified in the .yml file will be erased; both the files on HDFS and their metadata are removed.
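    +As a sketch of the disaster-recovery round trip (option spellings assumed from the utilities' reference pages; database and file names are illustrative), metadata extracted on Cluster A can be used to re-register the table on Cluster B:
    +``` pre
    +$ hawq extract -d postgres -o parquet_table.yml parquet_table
    +$ hawq register postgres --config parquet_table.yml parquet_table
    +```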
    --- End diff --
    If this document is specified for Release 2.0.1, I'd recommend we remove the `--repair` part of the description, since that option has been cut from Release 2.0.1.

