From: dyozie
To: dev@hawq.incubator.apache.org
Reply-To: dev@hawq.incubator.apache.org
Subject: [GitHub] incubator-hawq-docs pull request #17: Updates for hawq register
Date: Fri, 30 Sep 2016 18:55:45 +0000 (UTC)

Github user dyozie commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/17#discussion_r81383986

--- Diff: datamgmt/load/g-register_files.html.md.erb ---
@@ -0,0 +1,213 @@
+---
+title: Registering Files into HAWQ Internal Tables
+---
+
+The `hawq register` utility loads and registers HDFS data files or folders into HAWQ internal tables. Files can be read directly, rather than having to be copied or loaded, resulting in higher performance and more efficient transaction processing.
+
+Data from the specified HDFS file or directory is loaded into the appropriate HAWQ table directory in HDFS, and the utility updates the corresponding HAWQ metadata for the files. Either AO or Parquet-formatted files in HDFS can be loaded into a corresponding table in HAWQ.
+
+You can use `hawq register` either to:
+
+- Load and register external Parquet-formatted file data generated by an external system such as Hive or Spark.
+- Recover cluster data from a backup cluster for disaster recovery.
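As a minimal sketch of the first use case, registering a Hive-generated Parquet file into an existing table might look like the following shell command (the flag names and argument order shown are assumptions for illustration only; consult the `hawq register` reference page for the exact syntax in your HAWQ version):

```shell
# Hypothetical invocation: load and register an external Parquet file
# into the existing HAWQ table parquet_table in database postgres.
# The flags shown (-d, -f) are illustrative assumptions, not confirmed syntax.
hawq register -d postgres -f hdfs://localhost:8020/temp/hive.paq parquet_table
```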
+
+Requirements for running `hawq register` on the client server are:
+
+- Network access to and from all hosts in your HAWQ cluster (master and segments) and the hosts where the data to be loaded is located.
+- The Hadoop client configured and the hdfs filepath specified.
+- The files to be registered and the HAWQ table located in the same HDFS cluster.
+- The target table DDL configured with the correct data type mapping.
+
+## Registering Externally Generated HDFS File Data to an Existing Table
+
+Files or folders in HDFS can be registered into an existing table, allowing them to be managed as a HAWQ internal table. When registering files, you can optionally specify the maximum amount of data to be loaded, in bytes, using the `--eof` option. If registering a folder, the actual file sizes are used.
+
+Only HAWQ or Hive-generated Parquet tables are supported. Partitioned tables are not supported; attempting to register a partitioned table results in an error.
+
+Metadata for the Parquet file(s) and the destination table must be consistent. HAWQ tables and Parquet files use different data types, so data types must be mapped between them. Verify that the structure of the Parquet files and the HAWQ table are compatible before running `hawq register`.
+
+We recommend creating a copy of the Parquet file to be registered before running `hawq register`. You can then run `hawq register` on the copy, leaving the original file available for additional Hive queries or for recovery if a data mapping error is encountered.
+
+### Limitations for Registering Hive Tables to HAWQ
+
+The Hive data types currently supported for conversion into HAWQ tables are: boolean, int, smallint, tinyint, bigint, float, double, string, binary, char, and varchar.
+
+The following Hive data types cannot be converted to HAWQ equivalents: timestamp, decimal, array, struct, map, and union.
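To make the compatibility requirement concrete, a Hive source table restricted to the supported types above could pair with a HAWQ target table along these lines (an illustrative sketch; the HAWQ column type chosen for each Hive type, such as `varchar` for `string` and `float8` for `double`, is an assumption you should verify against your own data before registering):

```sql
-- Hive source table stored as Parquet, using only supported types
CREATE TABLE hive_sales (id int, amount double, label string)
STORED AS PARQUET;

-- Hypothetical matching HAWQ target table (append-only, Parquet orientation);
-- the column type mappings here are for illustration only
CREATE TABLE parquet_table (id int, amount float8, label varchar)
WITH (appendonly=true, orientation=parquet);
```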
+
+### Example: Registering a Hive-Generated Parquet File
+
+This example shows how to register a Hive-generated Parquet file in HDFS into the table `parquet_table` in HAWQ, which is in the database named `postgres`. The file path of the Hive-generated file is `hdfs://localhost:8020/temp/hive.paq`.
+
+In this example, the location of the database is `hdfs://localhost:8020/hawq_default`, the tablespace id is 16385, the database id is 16387, the table filenode id is 77160, and the last file under the filenode is numbered 7.

--- End diff --

For future work, it would be nice to provide commands for determining what these ID values will be before executing the register command.