From: yozie@apache.org
To: commits@hawq.incubator.apache.org
Date: Mon, 31 Oct 2016 22:13:14 -0000
Subject: [04/50] incubator-hawq-docs git commit: more rework of hdfs plug in page

more rework of hdfs plug in page

Project: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/commit/5a941a70
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/tree/5a941a70
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/diff/5a941a70

Branch: refs/heads/tutorial-proto
Commit: 5a941a70bda0e8466b5aa5dd2885840fce14c522
Parents: 2da7a92
Author: Lisa Owen
Authored: Tue Oct 18 09:57:09 2016 -0700
Committer: Lisa Owen
Committed: Tue Oct 18 09:57:09 2016 -0700

----------------------------------------------------------------------
 pxf/HDFSFileDataPXF.html.md.erb | 63 +++++++++++++++++++-----------------
 1 file changed, 33 insertions(+), 30 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/5a941a70/pxf/HDFSFileDataPXF.html.md.erb
----------------------------------------------------------------------
diff --git a/pxf/HDFSFileDataPXF.html.md.erb b/pxf/HDFSFileDataPXF.html.md.erb
index e49688e..2f87037 100644
--- a/pxf/HDFSFileDataPXF.html.md.erb
+++ b/pxf/HDFSFileDataPXF.html.md.erb
@@ -25,11 +25,8 @@ The PXF HDFS plug-in includes the following profiles to support the file formats
 - `HdfsTextSimple` - text files
 - `HdfsTextMulti` - text files with embedded line feeds
-- `SequenceWritable` - SequenceFile
 - `Avro` - Avro files
-
-## Data Type Mapping
-jjj
+- `SequenceWritable` - SequenceFile (write only?)
 
 ## HDFS Shell Commands
 
@@ -112,7 +109,7 @@ $ sudo -u hdfs hdfs dfs -put /tmp/pxf_hdfs_tm.txt /data/pxf_examples/
 ```
 
 You will use these HDFS files in later sections.
 
 ## Querying External HDFS Data
-The PXF HDFS plug-in supports several profiles. These include `HdfsTextSimple`, `HdfsTextMulti`, `SequenceWritable`, and `Avro`.
+The PXF HDFS plug-in supports several profiles. These include `HdfsTextSimple`, `HdfsTextMulti`, `Avro`, and `SequenceWritable`.
 
 Use the following syntax to create a HAWQ external table representing HDFS data:
 
@@ -134,7 +131,8 @@ HDFS-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](..
 | \<custom-option\> | \<custom-option\> is profile-specific. Profile-specific options are discussed in the relevant profile topic later in this section.|
 | FORMAT 'TEXT' | Use '`TEXT`' `FORMAT` with the `HdfsTextSimple` profile when \<path-to-hdfs-file\> references a plain text delimited file. |
 | FORMAT 'CSV' | Use '`CSV`' `FORMAT` with `HdfsTextSimple` and `HdfsTextMulti` profiles when \<path-to-hdfs-file\> references a comma-separated value file. |
-| FORMAT 'CUSTOM' | Use the `CUSTOM` `FORMAT` with `Avro` and `SequenceWritable` profiles. The '`CUSTOM`' `FORMAT` supports only the built-in `(formatter='pxfwritable_export')` \<formatting-property\>. |
+| FORMAT 'CUSTOM' | Use the `CUSTOM` `FORMAT` with the `Avro` profile. The `Avro` '`CUSTOM`' `FORMAT` supports only the built-in `(formatter='pxfwritable_import')` \<formatting-property\>. |
+| FORMAT 'CUSTOM' | Use the `CUSTOM` `FORMAT` with the `SequenceWritable` profile. The `SequenceWritable` '`CUSTOM`' `FORMAT` supports only the built-in `(formatter='pxfwritable_export')` \<formatting-property\>. |
 | \<formatting-properties\> | \<formatting-properties\> are profile-specific. Profile-specific formatting options are discussed in the relevant profile topic later in this section. |
 
 *Note*: When creating PXF external tables, you cannot use the `HEADER` option in your `FORMAT` specification.
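For orientation, a minimal sketch of the syntax these keywords describe, using the `HdfsTextSimple` profile. The NameNode host `namenode`, PXF port `51200`, file path, and column layout are all illustrative assumptions, not values taken from the commit:

``` sql
-- Sketch: readable external table over a comma-delimited text file in HDFS.
-- Host, port, path, and columns below are illustrative assumptions.
CREATE EXTERNAL TABLE pxf_hdfs_textsimple (
    location    text,
    month       text,
    num_orders  int,
    total_sales float8
)
LOCATION ('pxf://namenode:51200/data/pxf_examples/pxf_hdfs_simple.txt?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (delimiter=E',');
```

Per the table above, a `CSV` source would instead use `FORMAT 'CSV'`, and the `Avro` and `SequenceWritable` profiles require `FORMAT 'CUSTOM'` with their respective built-in formatters.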
@@ -215,30 +213,17 @@ gpadmin=# SELECT * FROM pxf_hdfs_textmulti;
 (5 rows)
 ```
 
-## SequenceWritable Profile
-
-Use the `SequenceWritable` profile when reading SequenceFile format files. Files of this type consist of binary key/value pairs. Sequence files are a common data transfer format between MapReduce jobs.
-
-The `SequenceWritable` profile supports the following \<custom-options\>:
-
-| Keyword | Value Description |
-|-------|-------------------------------------|
-| COMPRESSION_CODEC | The compression codec Java class name.|
-| COMPRESSION_TYPE | The compression type of the sequence file; supported values are `RECORD` (the default) or `BLOCK`. |
-| DATA-SCHEMA | The name of the writer serialization class. The jar file in which this class resides must be in the PXF class path. This option has no default value. |
-| THREAD-SAFE | Boolean value determining if a table query can run in multi-thread mode. Default value is `TRUE` - requests can run in multi-thread mode. When set to `FALSE`, requests will be handled in a single thread. |
-
-???? MORE HERE
-
-??? ADDRESS SERIALIZATION
-
 ## Avro Profile
 
-Avro files store metadata with the data. Avro files also allow specification of an independent schema used when reading the file.
+Apache Avro is a data serialization framework that serializes data in a compact binary format.
+
+Avro specifies that data types be defined in JSON. Avro format files have an independent schema, also defined in JSON. In Avro files, the schema is stored with the data.
 
 ### Data Type Mapping
 
-To represent Avro data in HAWQ, map data values that use a primitive data type to HAWQ columns of the same type.
+Avro supports both primitive and complex data types.
+
+To represent Avro primitive data types in HAWQ, map data values to HAWQ columns of the same type.
 
 Avro supports complex data types including arrays, maps, records, enumerations, and fixed types. Map top-level fields of these complex data types to the HAWQ `TEXT` type. While HAWQ does not natively support these types, you can create HAWQ functions or application code to extract or further process subcomponents of these complex data types.
 
@@ -246,7 +231,7 @@ The following table summarizes external mapping rules for Avro data.
 
-| Avro Data Type | PXF Type |
+| Avro Data Type | PXF/HAWQ Data Type |
 |-------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 | Primitive type (int, double, float, long, string, bytes, boolean) | Use the corresponding HAWQ built-in data type; see [Data Types](../reference/HAWQDataTypes.html). |
 | Complex type: Array, Map, Record, or Enum | TEXT, with delimiters inserted between collection items, mapped key-value pairs, and record data. |
 
@@ -255,13 +240,13 @@ The following table summarizes external mapping rules for Avro data.
 
 ### Avro-Specific Custom Options
 
-For complex types, the PXF Avro profile inserts default delimiters between collection items and values. You can use non-default delimiter characters by identifying values for specific Avro custom options in the `CREATE EXTERNAL TABLE` call.
+For complex types, the PXF `Avro` profile inserts default delimiters between collection items and values. You can use non-default delimiter characters by identifying values for specific `Avro` custom options in the `CREATE EXTERNAL TABLE` call.
 
 The Avro profile supports the following \<custom-options\>:
 
 | Option Name | Description |
 |---------------|--------------------|
-| COLLECTION_DELIM | The delimiter character(s) to place between entries in a top-level array, map, or record field when PXF maps a Avro complex data type to a text column. The default is a comma `,` character. |
+| COLLECTION_DELIM | The delimiter character(s) to place between entries in a top-level array, map, or record field when PXF maps an Avro complex data type to a text column. The default is a comma `,` character. |
 | MAPKEY_DELIM | The delimiter character(s) to place between the key and value of a map entry when PXF maps an Avro complex data type to a text column. The default is a colon `:` character. |
 | RECORDKEY_DELIM | The delimiter character(s) to place between the field name and value of a record entry when PXF maps an Avro complex data type to a text column. The default is a colon `:` character. |
 | SCHEMA-DATA | The data schema file used to create and read the HDFS file. This option has no default value. |
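To make the delimiter options concrete, a sketch of an Avro readable table that sets them explicitly. The host, port, path, and columns are hypothetical; the `CUSTOM` format uses the built-in `pxfwritable_import` formatter described in the keyword table earlier:

``` sql
-- Sketch: readable external table over an Avro file. Complex fields
-- (arrays, maps, records) surface as TEXT using the delimiters below.
-- Host, port, path, and columns are illustrative assumptions.
CREATE EXTERNAL TABLE pxf_hdfs_avro (
    id        bigint,
    username  text,
    followers text  -- Avro array field, rendered as delimited TEXT
)
LOCATION ('pxf://namenode:51200/data/pxf_examples/pxf_hdfs_avro.avro?PROFILE=Avro&COLLECTION_DELIM=,&MAPKEY_DELIM=:&RECORDKEY_DELIM=:')
FORMAT 'CUSTOM' (formatter='pxfwritable_import');
```

Downstream SQL can then split the delimited `followers` text, which is the pattern the `followers_view` query below relies on.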
@@ -363,6 +348,7 @@ The generated Avro binary data file is written to `/tmp/pxf_hdfs_avro.avro`. Cop
 
 ``` shell
 $ sudo -u hdfs hdfs dfs -put /tmp/pxf_hdfs_avro.avro /data/pxf_examples/
 ```
+### Querying Avro Data
 
 Create a queryable external table from this Avro file:
 
@@ -407,6 +393,23 @@ gpadmin=# SELECT username, address FROM followers_view WHERE followers @> '{john
 jim | {number:9,street:deer creek,city:palo alto}
 ```
 
+## SequenceWritable Profile
+
+Use the `SequenceWritable` profile when writing SequenceFile format files. Files of this type consist of binary key/value pairs. Sequence files are a common data transfer format between MapReduce jobs.
+
+The `SequenceWritable` profile supports the following \<custom-options\>:
+
+| Keyword | Value Description |
+|-------|-------------------------------------|
+| COMPRESSION_CODEC | The compression codec Java class name. If this option is not provided, no data compression is performed. |
+| COMPRESSION_TYPE | The compression type of the sequence file; supported values are `RECORD` (the default) or `BLOCK`. |
+| DATA-SCHEMA | The name of the writer serialization class. The jar file in which this class resides must be in the PXF class path. This option has no default value. |
+| THREAD-SAFE | Boolean value determining if a table query can run in multi-thread mode. Default value is `TRUE` - requests can run in multi-thread mode. When set to `FALSE`, requests will be handled in a single thread. |
+
+???? MORE HERE
+
+??? ADDRESS SERIALIZATION
+
 ## Reading the Record Key
 
@@ -414,7 +417,7 @@ Sequence file and other file formats that store rows in a key-value format can a
 
 The field type of `recordkey` must correspond to the key type, much as the other fields must match the HDFS data.
 
-`recordkey` can be of the following Hadoop types:
+`recordkey` can be any of the following Hadoop types:
 
 - BooleanWritable
 - ByteWritable
@@ -449,4 +452,4 @@ The opposite is true when a highly available HDFS cluster is reverted to a singl
 
 ## Advanced
 
-If you find that the pre-defined PXF HDFS profiles do not meet your needs, you may choose to create a custom HDFS profile from the existing HDFS Accessors and Resolvers. Refer to [Adding and Updating Profiles](ReadWritePXF.html#addingandupdatingprofiles) for information on creating a custom profile.
\ No newline at end of file
+If you find that the pre-defined PXF HDFS profiles do not meet your needs, you may choose to create a custom HDFS profile from the existing HDFS serialization and deserialization classes. Refer to [Adding and Updating Profiles](ReadWritePXF.html#addingandupdatingprofiles) for information on creating a custom profile.
\ No newline at end of file
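A closing sketch tying together the `SequenceWritable` options added above. The host, port, path, and especially the `DATA-SCHEMA` class `com.example.PxfExampleWritable` are hypothetical, and the `INSERT` reuses the hypothetical `pxf_hdfs_textsimple` table from the first sketch; whether the profile also supports reads is a question the commit itself leaves open ("write only?"):

``` sql
-- Sketch: writable external table that emits SequenceFile key/value data.
-- Host, port, path, and the DATA-SCHEMA class are illustrative assumptions;
-- the named class must reside in a jar on the PXF class path.
CREATE WRITABLE EXTERNAL TABLE pxf_hdfs_seq_write (
    location    text,
    month       text,
    num_orders  int,
    total_sales float8
)
LOCATION ('pxf://namenode:51200/data/pxf_examples/pxf_seq?PROFILE=SequenceWritable&DATA-SCHEMA=com.example.PxfExampleWritable&COMPRESSION_TYPE=BLOCK')
FORMAT 'CUSTOM' (formatter='pxfwritable_export');

-- Usage: rows inserted here are serialized to HDFS as SequenceFile records.
INSERT INTO pxf_hdfs_seq_write
    SELECT location, month, num_orders, total_sales FROM pxf_hdfs_textsimple;
```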