From: yozie@apache.org
To: commits@hawq.incubator.apache.org
Reply-To: dev@hawq.incubator.apache.org
Date: Wed, 26 Oct 2016 18:31:01 -0000
Message-Id: <87fbf97265564c4698136c23368d4eb4@git.apache.org>
Subject: [01/14] incubator-hawq-docs git commit: start restructuring HDFS plug-in page

Repository: incubator-hawq-docs
Updated Branches:
  refs/heads/develop f335de127 -> 5673447e0


start restructuring HDFS plug-in page


Project: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/commit/9ca27792
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/tree/9ca27792
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/diff/9ca27792

Branch: refs/heads/develop
Commit: 9ca277927bebd9c8d79bdf4619dfaf94a695c838
Parents: a819abd
Author: Lisa Owen
Authored: Fri Oct 14 15:29:22 2016 -0700
Committer: Lisa Owen
Committed:
Fri Oct 14 15:29:22 2016 -0700

----------------------------------------------------------------------
 pxf/HDFSFileDataPXF.html.md.erb | 622 +++++++++++++++++++++--------------
 1 file changed, 373 insertions(+), 249 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/9ca27792/pxf/HDFSFileDataPXF.html.md.erb
----------------------------------------------------------------------
diff --git a/pxf/HDFSFileDataPXF.html.md.erb b/pxf/HDFSFileDataPXF.html.md.erb
index 99c27ba..e1c621f 100644
--- a/pxf/HDFSFileDataPXF.html.md.erb
+++ b/pxf/HDFSFileDataPXF.html.md.erb
@@ -2,134 +2,403 @@ title: Accessing HDFS File Data
 ---
-## Prerequisites

HDFS is the primary distributed storage mechanism used by Apache Hadoop applications. The PXF HDFS plug-in reads file data stored in HDFS. The plug-in supports plain delimited and comma-separated value text files. The HDFS plug-in also supports the Avro and SequenceFile binary formats.

-Before working with HDFS file data using HAWQ and PXF, you should perform the following operations:

This section describes how to use PXF to access HDFS data, including how to create and query an external table from files in the HDFS data store.

-- Test PXF on HDFS before connecting to Hive or HBase.
-- Ensure that all HDFS users have read permissions to HDFS services and that write permissions have been limited to specific users.

## Prerequisites

-## Syntax

Before working with HDFS file data using HAWQ and PXF, ensure that:

-The syntax for creating an external HDFS file is as follows:

- The HDFS plug-in is installed on all cluster nodes.
- All HDFS users have read permissions to HDFS services and write permissions are restricted to specific users.

-``` sql
-CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name
-    ( column_name data_type [, ...] | LIKE other_table )
-LOCATION ('pxf://host[:port]/path-to-data?[&custom-option=value...]')
-      FORMAT '[TEXT | CSV | CUSTOM]' ();
-```

## HDFS File Formats

The PXF HDFS plug-in supports the following file formats:

- TextFile - comma-separated value or delimited format plain text file
- SequenceFile - flat file consisting of binary key/value pairs
- Avro - JSON-defined, schema-based data serialization format

The PXF HDFS plug-in includes the following profiles to support the file formats listed above:

- `HdfsTextSimple`
- `HdfsTextMulti`
- `SequenceWritable`
- `Avro`

## Data Type Mapping
jjj

## HDFS Shell Commands

HAWQ includes command-line tools that interact directly with HDFS. These tools support typical file system operations, including copying, listing, and changing file permissions.

The HDFS file system command is `hdfs dfs <options> [<file>]`. Invoked with no options, `hdfs dfs` lists the file system options supported by the tool.
+ +`hdfs dfs` options used in this section are listed in the table below: + +| Option | Description | +|-------|-------------------------------------| +| `-cat` | Display file contents | +| `-mkdir` | Create directory in HDFS | +| `-put` | Copy file from local file system to HDFS | + +Create an HDFS directory for PXF example data files: + +``` shell +$ sudo -u hdfs hdfs dfs -mkdir -p /data/pxf_examples ``` -where `` is: +Create a delimited plain text file: + +``` shell +$ vi /tmp/pxf_hdfs_ts.txt +``` + +Add the following data to `pxf_hdfs_ts.txt`: ``` pre - FRAGMENTER=fragmenter_class&ACCESSOR=accessor_class&RESOLVER=resolver_class] - | PROFILE=profile-name +Prague,Jan,101,4875.33 +Rome,Mar,87,1557.39 +Bangalore,May,317,8936.99 +Beijing,Jul,411,11600.67 +``` + +Notice the use of the comma `,` to separate field values. + +Add the data file to HDFS: + +``` shell +$ sudo -u hdfs hdfs dfs -put /tmp/pxf_hdfs_ts.txt /data/pxf_examples/ +``` + +Display the contents of `pxf_hdfs_ts.txt` stored in HDFS: + +``` shell +$ sudo -u hdfs hdfs dfs -cat /data/pxf_examples/pxf_hdfs_ts.txt +``` + +Create a second delimited plain text file: + +``` shell +$ vi /tmp/pxf_hdfs_tm.txt +``` + +Add the following data to `pxf_hdfs_tm.txt`: + +``` pre +"4627 Star Rd. +San Francisco, CA 94107":Sept:2017 +"113 Moon St. +San Diego, CA 92093":Jan:2018 +"51 Belt Ct. +Denver, CO 90123":Dec:2016 +"93114 Radial Rd. +Chicago, IL 60605":Jul:2017 +"7301 Brookview Ave. +Columbus, OH 43213":Dec:2018 +``` + +Notice the use of the colon `:` to separate field values. Also notice the quotes around the first/address field. This field includes an embedded line feed. + +Add the data file to HDFS: + +``` shell +$ sudo -u hdfs hdfs dfs -put /tmp/pxf_hdfs_tm.txt /data/pxf_examples/ ``` -**Note:** Omit the `FRAGMENTER` parameter for `READABLE` external tables. +You will use these HDFS files in later sections. + +## Querying External HDFS Data +The PXF HDFS plug-in supports several HDFS-related profiles. These include `HdfsTextSimple`, `HdfsTextMulti`, `SequenceWritable`, and `Avro`. -Use an SQL `SELECT` statement to read from an HDFS READABLE table: +Use the following syntax to create a HAWQ external table representing HDFS data:  ``` sql -SELECT ... FROM table_name; +CREATE EXTERNAL TABLE + ( [, ...] | LIKE ) +LOCATION ('pxf://[:]/ + ?PROFILE=HdfsTextSimple|HdfsTextMulti|Avro|SequenceWritable[&=[...]]') +FORMAT '[TEXT|CSV|CUSTOM]' (); ``` -Use an SQL `INSERT` statement to add data to an HDFS WRITABLE table: +HDFS-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](../reference/sql/CREATE-EXTERNAL-TABLE.html) call are described in the table below. + +**Note**: Some profile-specific options and properties may be discussed in the relevant profile section later in this topic. + +| Keyword | Value | +|-------|-------------------------------------| +| host | The HDFS NameNode. | +| \ | path to the file in the HDFS data store | +| PROFILE | The `PROFILE` keyword must specify one of the values `HdfsTextSimple`, `HdfsTextMulti`, `SequenceWritable` or `Avro`. | +| \ | \ is profile-specific. | +| FORMAT 'TEXT' | Use `TEXT` `FORMAT` with the `HdfsTextSimple` profile when \ references a plain text delimited file. | +| FORMAT 'CSV' | Use `CSV` `FORMAT` with `HdfsTextSimple` and `HdfsTextMulti` profiles when \ references a comma-separated value file. | +| FORMAT 'CUSTOM' | Use the`CUSTOM` `FORMAT` with `Avro` and `SequenceWritable` profiles. The `CUSTOM` format supports only the built-in `formatter='pxfwritable_export'` \. 
| +| \ | \ are profile-specific. | + +Note: When creating PXF external tables, you cannot use the `HEADER` option in your `FORMAT` specification. + +### HdfsTextSimple Profile + +Use the `HdfsTextSimple` profile when reading plain text delimited or csv files where each row is a single record. + +The following SQL call uses the PXF `HdfsTextSimple` profile to create a queryable HAWQ external table from the `pxf_hdfs_ts.txt` file you created and added to HDFS in an earlier section: ``` sql -INSERT INTO table_name ...; +gpadmin=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple(location text, month text, num_orders int, total_sales float8) + LOCATION ('pxf://namenode:51200/data/pxf_examples/pxf_hdfs_ts.txt?PROFILE=HdfsTextSimple') + FORMAT 'TEXT' (delimiter=E','); +gpadmin=# SELECT * FROM pxf_hdfs_textsimple; ``` -To read the data in the files or to write based on the existing format, use `FORMAT`, `PROFILE`, or one of the classes. +``` pre + location | month | num_orders | total_sales +---------------+-------+------------+------------- + Prague | Jan | 101 | 4875.33 + Rome | Mar | 87 | 1557.39 + Bangalore | May | 317 | 8936.99 + Beijing | Jul | 411 | 11600.67 +(4 rows) +``` -This topic describes the following: +Create a second external table from `pxf_hdfs_ts.txt`, this time using the `CSV` `FORMAT`: -- FORMAT clause -- Profile -- Accessor -- Resolver -- Avro +``` sql +gpadmin=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple_csv(location text, month text, num_orders int, total_sales float8) + LOCATION ('pxf://namenode:51200/data/pxf_examples/pxf_hdfs_ts.txt?PROFILE=HdfsTextSimple') + FORMAT 'CSV'; +gpadmin=# SELECT * FROM pxf_hdfs_textsimple_csv; +``` -**Note:** For more details about the API and classes, see [PXF External Tables and API](PXFExternalTableandAPIReference.html#pxfexternaltableandapireference). +When specifying `FORMAT 'CSV'` for a comma-separated value file, no `delimiter` formatter option is required, as comma is the default delimiter. -### FORMAT clause +### HdfsTextMulti Profile -Use one of the following formats to read data with any PXF connector: +Use the `HdfsTextMulti` profile when reading plain text files with delimited single- or multi- line records that include embedded (quoted) linefeed characters. -- `FORMAT 'TEXT'`: Use with plain delimited text files on HDFS. -- `FORMAT 'CSV'`: Use with comma-separated value files on HDFS. -- `FORMAT 'CUSTOM'`: Use with all other files, including Avro format and binary formats. Must always be used with the built-in formatter '`pxfwritable_import`' (for read) or '`pxfwritable_export`' (for write). +The following SQL call uses the PXF `HdfsTextMulti` profile to create a queryable HAWQ external table from the `pxf_hdfs_tm.txt` file you created and added to HDFS in an earlier section: -**Note:** When creating PXF external tables, you cannot use the `HEADER` option in your `FORMAT` specification. +``` sql +gpadmin=# CREATE EXTERNAL TABLE pxf_hdfs_textmulti(address text, month text, year int) + LOCATION ('pxf://sandbox.hortonworks.com:51200/data/pxf_examples/pxf_hdfs_tm.txt?PROFILE=HdfsTextMulti') + FORMAT 'CSV' (delimiter=E':'); +gpadmin=# SELECT * FROM pxf_hdfs_textmulti; +``` -### Profile +``` pre + address | month | year +--------------------------+-------+------ + 4627 Star Rd. | Sept | 2017 + San Francisco, CA 94107 + 113 Moon St. | Jan | 2018 + San Diego, CA 92093 + 51 Belt Ct. | Dec | 2016 + Denver, CO 90123 + 93114 Radial Rd. | Jul | 2017 + Chicago, IL 60605 + 7301 Brookview Ave. 
                          | Dec   | 2018
 Columbus, OH 43213
(5 rows)
```

### SequenceWritable Profile

Use the `SequenceWritable` profile when reading SequenceFile format files. Files of this type consist of binary key/value pairs. Sequence files are a common data transfer format between MapReduce jobs.

The `SequenceWritable` profile supports the following custom options and keywords:

| Keyword | Value Description |
|-------|-------------------------------------|
| COMPRESSION_CODEC | The compression codec Java class name. |
| COMPRESSION_TYPE | The compression type of the sequence file; supported values are `RECORD` (the default) or `BLOCK`. |
| \<path-to-hdfs-file\> | The path to the file in the HDFS data store. |

MORE HERE
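Until that section is filled in, the following is a rough, illustrative sketch of how a `SequenceWritable` table definition might look, based only on the keywords described above. The table name, host, port, and file path are placeholders, and real SequenceFile access may require additional profile-specific options (such as a data schema) that this draft does not yet document:

``` sql
-- Illustrative sketch only: writes rows to a SequenceFile-format file in HDFS.
-- Host, port, path, and table name are placeholders; additional profile-specific
-- options may be required in practice.
CREATE WRITABLE EXTERNAL TABLE pxf_seq_example (location text, month text, num_orders int, total_sales float8)
  LOCATION ('pxf://namenode:51200/data/pxf_examples/pxf_seq_example?PROFILE=SequenceWritable&COMPRESSION_TYPE=BLOCK')
  FORMAT 'CUSTOM' (formatter='pxfwritable_export');

-- Copy rows from the readable text table created earlier into the SequenceFile.
INSERT INTO pxf_seq_example SELECT * FROM pxf_hdfs_textsimple;
```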
### Avro Profile

Avro files store metadata with the data. Avro files also allow specification of an independent schema used when reading the file.

#### Data Type Mapping

To represent Avro data in HAWQ, map data values that use a primitive data type to HAWQ columns of the same type.

Avro supports complex data types including arrays, maps, records, enumerations, and fixed types. Map top-level fields of these complex data types to the HAWQ `TEXT` type. While HAWQ does not natively support these types, you can create HAWQ functions or application code to extract or further process subcomponents of these complex data types.

The following table summarizes external mapping rules for Avro data.

Table 2. Avro Data Type Mapping

| Avro Data Type | PXF Type |
|-------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|
| Primitive type (int, double, float, long, string, bytes, boolean) | Use the corresponding HAWQ built-in data type; see [Data Types](../reference/HAWQDataTypes.html). |
| Complex type: Array, Map, Record, or Enum | TEXT, with delimiters inserted between collection items, mapped key-value pairs, and record data. |
| Complex type: Fixed | BYTEA |
| Union | Follows the above conventions for primitive or complex data types, depending on the union; supports Null values. |

#### Avro-Specific Formatting Options

For complex types, the PXF Avro profile inserts default delimiters between collection items and values. You can use non-default delimiter characters by identifying values for specific Avro custom options in the `CREATE EXTERNAL TABLE` call.

The Avro profile supports the following custom options:

Table 3. Avro Formatting Options

| Option Name | Description |
|---------------|--------------------|
| COLLECTION_DELIM | The delimiter character(s) to place between entries in a top-level array, map, or record field when PXF maps an Avro complex data type to a text column. The default is a comma `,` character. |
| MAPKEY_DELIM | The delimiter character(s) to place between the key and value of a map entry when PXF maps an Avro complex data type to a text column. The default is a colon `:` character. |
| RECORDKEY_DELIM | The delimiter character(s) to place between the field name and value of a record entry when PXF maps an Avro complex data type to a text column. The default is a colon `:` character. |
| SCHEMA-DATA | The data schema file used to create and read the HDFS file. For an Avro file, this is typically a `.avsc` schema file. This option has no default value. |
| THREAD-SAFE | Determines whether the table query can run in multi-threaded mode. Allowed values are `TRUE` and `FALSE`. The default is `TRUE`, meaning requests can run in multi-threaded mode. When set to `FALSE`, requests are handled in a single thread. |

#### Avro Schemas

Avro schemas are defined using JSON, and are composed of the same primitive and complex types identified in the data mapping section above. Avro schema files typically have a `.avsc` suffix.

Fields in an Avro schema file are defined via an array of objects, each of which is specified by a name and a type. The field type is itself another schema object.

The examples in this section operate on Avro data with the following record schema:

- id - long
- username - string
- followers - array of string
- fmap - map of long
- address - record comprised of street number (int), street name (string), and city (string)
- relationship - enumerated type

Create an Avro schema file to represent the above schema:

``` shell
$ vi /tmp/avro_schema.avsc
```

Copy and paste the following text into `avro_schema.avsc`:

``` json
{
    "type" : "record",
    "name" : "example_schema",
    "namespace" : "com.example",
    "fields" : [ {
        "name" : "id",
        "type" : "long",
        "doc" : "Id of the user account"
    }, {
        "name" : "username",
        "type" : "string",
        "doc" : "Name of the user account"
    }, {
        "name" : "followers",
        "type" : {"type": "array", "items": "string"},
        "doc" : "Users followers"
    }, {
        "name": "fmap",
        "type": {"type": "map", "values": "long"}
    }, {
        "name": "relationship",
        "type": {
            "type": "enum",
            "name": "relationshipEnum",
            "symbols": ["MARRIED","LOVE","FRIEND","COLLEAGUE","STRANGER","ENEMY"]
        }
    }, {
        "name": "address",
        "type": {
            "type": "record",
            "name": "addressRecord",
            "fields": [
                {"name":"number", "type":"int"},
                {"name":"street", "type":"string"},
                {"name":"city", "type":"string"}]
        }
    } ],
    "doc" : "A basic schema for storing messages"
}
```

An Avro schema, together with its data, is fully self-describing.

#### Sample Avro Data (JSON)

Create a text file named `pxf_hdfs_avro.txt`:

``` shell
$ vi /tmp/pxf_hdfs_avro.txt
```

Enter the following data into `pxf_hdfs_avro.txt`:

``` pre
{"id":1, "username":"john","followers":["kate", "santosh"], "relationship": "FRIEND", "fmap": {"kate":10,"santosh":4}, "address":{"number":1, "street":"renaissance drive", "city":"san jose"}}

{"id":2, "username":"jim","followers":["john", "pam"], "relationship": "COLLEAGUE", "fmap": {"john":3,"pam":3}, "address":{"number":9, "street":"deer creek", "city":"palo alto"}}
```

The sample data uses a comma `,` to separate top-level records and a colon `:` to separate map key/values and record field name/values.

Convert the text file to Avro format. There are various ways to perform the conversion programmatically and via the command line.
In this example, we use the [Java Avro tools](http://avro.apache.org/releases.html); the jar file resides in the current directory:

``` shell
$ java -jar ./avro-tools-1.8.1.jar fromjson --schema-file /tmp/avro_schema.avsc /tmp/pxf_hdfs_avro.txt > /tmp/pxf_hdfs_avro.avro
```

The generated Avro binary data file is written to `/tmp/pxf_hdfs_avro.avro`. Copy this file to HDFS:

``` shell
$ sudo -u hdfs hdfs dfs -put /tmp/pxf_hdfs_avro.avro /data/pxf_examples/
```

Create a queryable external table from this Avro file:

- Map the top-level primitive fields, `id` (type long) and `username` (type string), to their equivalent HAWQ types (bigint and text).
- Map the remaining complex fields to type text.
- Explicitly set the record, map, and collection delimiters using the Avro profile custom options:

``` sql
gpadmin=# CREATE EXTERNAL TABLE pxf_hdfs_avro(id bigint, username text, followers text, fmap text, relationship text, address text)
            LOCATION ('pxf://sandbox.hortonworks.com:51200/data/pxf_examples/pxf_hdfs_avro.avro?PROFILE=Avro&COLLECTION_DELIM=,&MAPKEY_DELIM=:&RECORDKEY_DELIM=:')
          FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```

A simple query of the external table shows the components of the complex type data separated with delimiters:

``` sql
gpadmin=# SELECT * FROM pxf_hdfs_avro;
```

``` pre
 id | username |   followers    |        fmap         | relationship |                      address
----+----------+----------------+---------------------+--------------+---------------------------------------------------
  1 | john     | [kate,santosh] | {kate:10,santosh:4} | FRIEND       | {number:1,street:renaissance drive,city:san jose}
  2 | jim      | [john,pam]     | {pam:3,john:3}      | COLLEAGUE    | {number:9,street:deer creek,city:palo alto}
(2 rows)
```

Process the delimited components in the text columns as necessary for your application. For example, the following command uses the `string_to_array` function to convert entries in the `followers` field to a text array column in a new view. The view is then queried to filter rows based on whether a particular follower appears in the array:

``` sql
gpadmin=# CREATE VIEW followers_view AS
            SELECT username, address, string_to_array(substring(followers FROM 2 FOR (char_length(followers) - 2)), ',')::text[]
                   AS followers
            FROM pxf_hdfs_avro;

gpadmin=# SELECT username, address FROM followers_view WHERE followers @> '{john}';
```

``` pre
 username |                   address
----------+---------------------------------------------
 jim      | {number:9,street:deer creek,city:palo alto}
```
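The same technique extends to the other text columns. The following sketch is not part of the original page; it applies the same `substring`/`string_to_array` idiom to the `fmap` map column, and the view name is arbitrary:

``` sql
-- Illustrative sketch: split the delimited fmap text (for example '{kate:10,santosh:4}')
-- into a text array of key:value entries, then pick out the first entry.
gpadmin=# CREATE VIEW fmap_view AS
            SELECT username,
                   string_to_array(substring(fmap FROM 2 FOR (char_length(fmap) - 2)), ',')::text[]
                   AS fmap_entries
            FROM pxf_hdfs_avro;

gpadmin=# SELECT username, fmap_entries[1] AS first_entry FROM fmap_view;
```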
| File Type | Accessor | FORMAT clause | Comments |
|-----------|----------|---------------|----------|
| Plain Text delimited | org.apache.hawq.pxf.plugins.hdfs.LineBreakAccessor | FORMAT 'TEXT' (*format param list*) | Read + Write. You cannot use the HEADER option. |
| Plain Text CSV | org.apache.hawq.pxf.plugins.hdfs.LineBreakAccessor | FORMAT 'CSV' (*format param list*) | LineBreakAccessor is parallel and faster. Use if each logical data row is a physical data line. Read + Write. You cannot use the HEADER option. |
| Plain Text CSV | org.apache.hawq.pxf.plugins.hdfs.QuotedLineBreakAccessor | FORMAT 'CSV' (*format param list*) | QuotedLineBreakAccessor is slower and non-parallel. Use if the data includes embedded (quoted) linefeed characters. Read Only. You cannot use the HEADER option. |
| SequenceFile | org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor | FORMAT 'CUSTOM' (formatter='pxfwritable_import') | Read + Write (use formatter='pxfwritable_export' for write) |
| AvroFile | org.apache.hawq.pxf.plugins.hdfs.AvroFileAccessor | FORMAT 'CUSTOM' (formatter='pxfwritable_import') | Read Only |
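As an illustration of the class-level plug-in components listed in this table, the following sketch names the Accessor and Resolver classes directly instead of a profile. The host, port, path, and table name are placeholders, and the Resolver class shown (`StringPassResolver`) is an assumption that does not appear in the table above:

``` sql
-- Illustrative sketch of the class-based (non-profile) syntax. For a readable
-- table the FRAGMENTER parameter is omitted, per the note earlier in this page.
-- The RESOLVER class shown is an assumption; consult the Resolver section for
-- the class appropriate to your data.
CREATE EXTERNAL TABLE pxf_hdfs_classes_example (location text, month text, num_orders int, total_sales float8)
  LOCATION ('pxf://namenode:51200/data/pxf_examples/pxf_hdfs_ts.txt?ACCESSOR=org.apache.hawq.pxf.plugins.hdfs.LineBreakAccessor&RESOLVER=org.apache.hawq.pxf.plugins.hdfs.StringPassResolver')
  FORMAT 'TEXT' (delimiter=E',');
```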
### Resolver

@@ -274,17 +543,19 @@ The class file must follow the following requirements:

-## Accessing Data on a High Availability HDFS Cluster

## Accessing HDFS Data in a High Availability HDFS Cluster

-To access data on a High Availability HDFS cluster, change the authority in the URI in the LOCATION. Use *HA\_nameservice* instead of *name\_node\_host:51200*.

To access data in a High Availability HDFS cluster, change the \<host\>[:\<port\>] provided in the URI LOCATION clause. Use *HA\_nameservice* rather than *name\_node\_host:51200*.

``` sql
-CREATE [READABLE|WRITABLE] EXTERNAL TABLE <table_name> (<column_name> <data_type> [, ...] | LIKE <other_table>)
-LOCATION ('pxf://<HA-nameservice>/<path-to-data>?Profile=<profile>[&<custom-option>=<value>...]')
-FORMAT '[TEXT | CSV | CUSTOM]' (<formatting-properties>);
CREATE EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<HA-nameservice>/<path-to-hdfs-file>
    ?PROFILE=HdfsTextSimple|HdfsTextMulti|Avro|SequenceWritable[&<custom-option>=<value>[...]]')
FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>);
```

-The opposite is true when a highly available HDFS cluster is reverted to a single namenode configuration. In that case, any table definition that has the nameservice specified should use the <NN host>:<NN rest port> syntax.

The opposite is true when a highly available HDFS cluster is reverted to a single NameNode configuration. In that case, any table definition that has the \<HA-nameservice\> specified should use the \<NN host\>[:\<NN rest port\>] syntax.

## Using a Record Key with Key-Value File Formats

@@ -356,152 +627,5 @@ CREATE EXTERNAL TABLE babies_1940_2 (name text, birthday text, weight float)
 SELECT * FROM babies_1940_2;
 ```

-## Working with Avro Files
-
-Avro files combine their data with a schema, and can support complex data types such as arrays, maps, records, enumerations, and fixed types. When you create a PXF external table to represent Avro data, map top-level fields in the schema that use a primitive data type to HAWQ columns of the same type. Map top-level fields that use a complex data type to a TEXT column in the external table. The PXF Avro profile automatically separates components of a complex type by inserting delimiters in the text column. You can then use functions or application code to further process components of the complex data.
-
-The following table summarizes external table mapping rules for Avro data.
-
-Table 2. Avro Data Type Mapping
-
-| Avro Data Type | PXF Type |
-|-------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|
-| Primitive type (int, double, float, long, string, bytes, boolean) | Corresponding HAWQ data type. See [Data Types](../reference/HAWQDataTypes.html). |
-| Complex type: Array, Map, Record, or Enum | TEXT, with default delimiters inserted between collection items, mapped key-value pairs, and record data. |
-| Complex type: Fixed | BYTEA |
-| Union | Follows the above conventions for primitive or complex data types, depending on the union. Null values are supported in Unions. |
-
-For complex types, the PXF Avro profile inserts default delimiters between collection items and values. You can use non-default delimiter characters by including the `COLLECTION_DELIM`, `MAPKEY_DELIM`, and/or `RECORDKEY_DELIM` optional parameters on the Avro profile. See [Additional PXF Options](#additionaloptions__table_skq_kpz_4p) for a description of the parameters.
- -### Example - -The following example uses the Avro schema shown in [Sample Avro Schema](#topic_tr3_dpg_ts__section_m2p_ztg_ts) and the associated data file shown in [Sample Avro Data (JSON)](#topic_tr3_dpg_ts__section_spk_15g_ts). - -#### Sample Avro Schema - -``` json -{ - "type" : "record", - "name" : "example_schema", - "namespace" : "com.example", - "fields" : [ { - "name" : "id", - "type" : "long", - "doc" : "Id of the user account" - }, { - "name" : "username", - "type" : "string", - "doc" : "Name of the user account" - }, { - "name" : "followers", - "type" : {"type": "array", "items": "string"}, - "doc" : "Users followers" - }, { - "name": "rank", - "type": ["null", "int"], - "default": null - }, { - "name": "fmap", - "type": {"type": "map", "values": "long"} - }, { - "name": "address", - "type": { - "type": "record", - "name": "addressRecord", - "fields": [ - {"name":"number", "type":"int"}, - {"name":"street", "type":"string"}, - {"name":"city", "type":"string"}] - } - }, { - "name": "relationship", - "type": { - "type": "enum", - "name": "relationshipEnum", - "symbols": ["MARRIED","LOVE","FRIEND","COLLEAGUE","STRANGER","ENEMY"] - } - }, { - "name" : "md5", - "type": { - "type" : "fixed", - "name" : "md5Fixed", - "size" : 4 - } - } ], - "doc:" : "A basic schema for storing messages" -} -``` - -#### Sample Avro Data (JSON) - -``` pre -{"id":1, "username":"john","followers":["kate", "santosh"], "rank":null, "relationship": "FRIEND", "fmap": {"kate":10,"santosh":4}, -"address":{"street":"renaissance drive", "number":1,"city":"san jose"}, "md5":\u3F00\u007A\u0073\u0074} - -{"id":2, "username":"jim","followers":["john", "pam"], "rank":3, "relationship": "COLLEAGUE", "fmap": {"john":3,"pam":3}, -"address":{"street":"deer creek", "number":9,"city":"palo alto"}, "md5":\u0010\u0021\u0003\u0004} -``` - -To map this Avro file to an external table, the top-level primitive fields ("id" of type long and "username" of type string) are mapped to their equivalent HAWQ types (bigint and text). The remaining complex fields are mapped to text columns: - -``` sql -gpadmin=# CREATE EXTERNAL TABLE avro_complex - (id bigint, - username text, - followers text, - rank int, - fmap text, - address text, - relationship text, - md5 bytea) -LOCATION ('pxf://namehost:51200/tmp/avro_complex?PROFILE=Avro') -FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'); -``` - -The above command uses default delimiters for separating components of the complex types. This command is equivalent to the one above, but it explicitly sets the delimiters using the Avro profile parameters: - -``` sql -gpadmin=# CREATE EXTERNAL TABLE avro_complex - (id bigint, - username text, - followers text, - rank int, - fmap text, - address text, - relationship text, - md5 bytea) -LOCATION ('pxf://localhost:51200/tmp/avro_complex?PROFILE=Avro&COLLECTION_DELIM=,&MAPKEY_DELIM=:&RECORDKEY_DELIM=:') -FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'); -``` - -A simple query of the external table shows the components of the complex type data separated with delimiters: - -``` sql -gpadmin=# select * from avro_complex; -``` - -``` pre -id | username | followers | rank | fmap | address | relationship | md5 -1| john | [kate,santosh] | | {kate:10,santosh:4} | {number:1,street:renaissance drive,city:san jose} | FRIEND | ?zst -2| jim | [john,pam] | 3 | {john:3,pam:3} | {number:9,street:deer creek,city:palo alto} | COLLEAGUE | \020!\003\004 -``` - -You can process the delimited components in the text columns as necessary for your application. 
For example, the following command uses the `string_to_array` function to convert entries in the "followers" field to a text array column in a new view. The view is then queried to filter rows based on whether a particular follower appears in the array:
-
-``` sql
-gpadmin=# create view followers_view as
-  select username, address, string_to_array(substring(followers from 2 for (char_length(followers) - 2)), ',')::text[]
-         as followers
-  from avro_complex;
-
-gpadmin=# select username, address
-from followers_view
-where 'john' = ANY(followers);
-```
-
-``` pre
- username |                   address
-----------+---------------------------------------------
- jim      | {number:9,street:deer creek,city:palo alto}
-```

## Advanced

If you find that the pre-defined PXF HDFS profiles do not meet your needs, you may choose to create a custom HDFS profile from the existing HDFS Accessors and Resolvers. Refer to [XX]() for information on creating a custom profile.
\ No newline at end of file
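Assuming such a custom profile has been defined (the profile name below, `MyCustomTextProfile`, is hypothetical, and the definition mechanism itself is described on the referenced profiles page rather than here), it would be referenced from `CREATE EXTERNAL TABLE` in the same way as the built-in profiles:

``` sql
-- Hypothetical: assumes a custom profile named MyCustomTextProfile has already
-- been assembled from existing HDFS Accessor and Resolver classes.
CREATE EXTERNAL TABLE pxf_custom_profile_example (location text, month text, num_orders int, total_sales float8)
  LOCATION ('pxf://namenode:51200/data/pxf_examples/pxf_hdfs_ts.txt?PROFILE=MyCustomTextProfile')
  FORMAT 'TEXT' (delimiter=E',');
```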