hawq-commits mailing list archives

From yo...@apache.org
Subject [01/14] incubator-hawq-docs git commit: start restructuring HDFS plug-in page
Date Wed, 26 Oct 2016 18:31:01 GMT
Repository: incubator-hawq-docs
Updated Branches:
  refs/heads/develop f335de127 -> 5673447e0


start restructuring HDFS plug-in page


Project: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/commit/9ca27792
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/tree/9ca27792
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/diff/9ca27792

Branch: refs/heads/develop
Commit: 9ca277927bebd9c8d79bdf4619dfaf94a695c838
Parents: a819abd
Author: Lisa Owen <lowen@pivotal.io>
Authored: Fri Oct 14 15:29:22 2016 -0700
Committer: Lisa Owen <lowen@pivotal.io>
Committed: Fri Oct 14 15:29:22 2016 -0700

----------------------------------------------------------------------
 pxf/HDFSFileDataPXF.html.md.erb | 622 +++++++++++++++++++++--------------
 1 file changed, 373 insertions(+), 249 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/9ca27792/pxf/HDFSFileDataPXF.html.md.erb
----------------------------------------------------------------------
diff --git a/pxf/HDFSFileDataPXF.html.md.erb b/pxf/HDFSFileDataPXF.html.md.erb
index 99c27ba..e1c621f 100644
--- a/pxf/HDFSFileDataPXF.html.md.erb
+++ b/pxf/HDFSFileDataPXF.html.md.erb
@@ -2,134 +2,403 @@
 title: Accessing HDFS File Data
 ---
 
-## <a id="installingthepxfhdfsplugin"></a>Prerequisites
+HDFS is the primary distributed storage mechanism used by Apache Hadoop applications. The PXF HDFS plug-in reads file data stored in HDFS. The plug-in supports plain delimited and comma-separated-value text files, as well as the Avro and SequenceFile binary formats.
 
-Before working with HDFS file data using HAWQ and PXF, you should perform the following operations:
+This section describes how to use PXF to access HDFS data, including how to create and query
an external table from files in the HDFS data store.
 
--   Test PXF on HDFS before connecting to Hive or HBase.
--   Ensure that all HDFS users have read permissions to HDFS services and that write permissions
have been limited to specific users.
+## <a id="hdfsplugin_prereq"></a>Prerequisites
 
-## <a id="syntax1"></a>Syntax
+Before working with HDFS file data using HAWQ and PXF, ensure that:
 
-The syntax for creating an external HDFS file is as follows: 
+-   The HDFS plug-in is installed on all cluster nodes.
+-   All HDFS users have read permissions to HDFS services, and write permissions are restricted to specific users.
 
-``` sql
-CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name 
-    ( column_name data_type [, ...] | LIKE other_table )
-LOCATION ('pxf://host[:port]/path-to-data?<pxf parameters>[&custom-option=value...]')
-      FORMAT '[TEXT | CSV | CUSTOM]' (<formatting_properties>);
+## <a id="hdfsplugin_fileformats"></a>HDFS File Formats
+
+The PXF HDFS plug-in supports the following file formats:
+
+- TextFile - comma-separated value or delimited format plain text file
+- SequenceFile - flat file consisting of binary key/value pairs
+- Avro - JSON-defined, schema-based data serialization format
+
+The PXF HDFS plug-in includes the following profiles to support the file formats listed above:
+
+- `HdfsTextSimple`
+- `HdfsTextMulti`
+- `SequenceWritable`
+- `Avro`
+
+## <a id="hdfsplugin_datatypemap"></a>Data Type Mapping
+Data type mapping considerations are profile-specific. When creating an external table, declare HAWQ column types that correspond to the field values in the HDFS data; the Avro profile section later in this topic describes the mapping rules for Avro data.
+
+
+## <a id="hdfsplugin_cmdline"></a>HDFS Shell Commands
+HAWQ includes command-line tools that interact directly with HDFS. These tools support typical file system operations, including copying files, listing directory contents, and changing file permissions.
+
+The HDFS file system command is `hdfs dfs <options> [<file>]`. Invoked with no
options, `hdfs dfs` lists the file system options supported by the tool.
+
+`hdfs dfs` options used in this section are listed in the table below:
+
+| Option  | Description |
+|-------|-------------------------------------|
+| `-cat`    | Display file contents |
+| `-mkdir`    | Create directory in HDFS |
+| `-put`    | Copy file from local file system to HDFS |
+
+Create an HDFS directory for PXF example data files:
+
+``` shell
+$ sudo -u hdfs hdfs dfs -mkdir -p /data/pxf_examples
 ```
 
-where `<pxf parameters>` is:
+Create a delimited plain text file:
+
+``` shell
+$ vi /tmp/pxf_hdfs_ts.txt
+```
+
+Add the following data to `pxf_hdfs_ts.txt`:
 
 ``` pre
-   FRAGMENTER=fragmenter_class&ACCESSOR=accessor_class&RESOLVER=resolver_class]
- | PROFILE=profile-name
+Prague,Jan,101,4875.33
+Rome,Mar,87,1557.39
+Bangalore,May,317,8936.99
+Beijing,Jul,411,11600.67
+```
+
+Notice the use of the comma `,` to separate field values.
+
+Add the data file to HDFS:
+
+``` shell
+$ sudo -u hdfs hdfs dfs -put /tmp/pxf_hdfs_ts.txt /data/pxf_examples/
+```
+
+Display the contents of `pxf_hdfs_ts.txt` stored in HDFS:
+
+``` shell
+$ sudo -u hdfs hdfs dfs -cat /data/pxf_examples/pxf_hdfs_ts.txt
+```
+
+Create a second delimited plain text file:
+
+``` shell
+$ vi /tmp/pxf_hdfs_tm.txt
+```
+
+Add the following data to `pxf_hdfs_tm.txt`:
+
+``` pre
+"4627 Star Rd.
+San Francisco, CA  94107":Sept:2017
+"113 Moon St.
+San Diego, CA  92093":Jan:2018
+"51 Belt Ct.
+Denver, CO  90123":Dec:2016
+"93114 Radial Rd.
+Chicago, IL  60605":Jul:2017
+"7301 Brookview Ave.
+Columbus, OH  43213":Dec:2018
+```
+
+Notice the use of the colon `:` to separate field values, and the quotes around the first (address) field. This field includes an embedded line feed.
+
+Add the data file to HDFS:
+
+``` shell
+$ sudo -u hdfs hdfs dfs -put /tmp/pxf_hdfs_tm.txt /data/pxf_examples/
 ```
 
-**Note:** Omit the `FRAGMENTER` parameter for `READABLE` external tables.
+You will use these HDFS files in later sections.
+
+## <a id="hdfsplugin_queryextdata"></a>Querying External HDFS Data
+The PXF HDFS plug-in supports several HDFS-related profiles. These include `HdfsTextSimple`,
`HdfsTextMulti`, `SequenceWritable`, and `Avro`.
 
-Use an SQL `SELECT` statement to read from an HDFS READABLE table:
+Use the following syntax to create a HAWQ external table representing HDFS data: 
 
 ``` sql
-SELECT ... FROM table_name;
+CREATE EXTERNAL TABLE <table_name> 
+    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
+LOCATION ('pxf://<host>[:<port>]/<path-to-hdfs-file>
+    ?PROFILE=HdfsTextSimple|HdfsTextMulti|Avro|SequenceWritable[&<custom-option>=<value>[...]]')
+FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>);
 ```
 
-Use an SQL `INSERT` statement to add data to an HDFS WRITABLE table:
+HDFS-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](../reference/sql/CREATE-EXTERNAL-TABLE.html)
call are described in the table below.
+
+**Note**: Profile-specific options and formatting properties are described in the relevant profile sections later in this topic.
+
+| Keyword  | Value |
+|-------|-------------------------------------|
+| \<host\>[:\<port\>]    | The HDFS NameNode and port. |
+| \<path-to-hdfs-file\>    | The path to the file in the HDFS data store. |
+| PROFILE    | The `PROFILE` keyword must specify one of the values `HdfsTextSimple`, `HdfsTextMulti`, `SequenceWritable`, or `Avro`. |
+| \<custom-option\>  | \<custom-option\> is profile-specific. |
+| FORMAT 'TEXT' | Use the `TEXT` `FORMAT` with the `HdfsTextSimple` profile when \<path-to-hdfs-file\> references a plain text delimited file. |
+| FORMAT 'CSV' | Use the `CSV` `FORMAT` with the `HdfsTextSimple` and `HdfsTextMulti` profiles when \<path-to-hdfs-file\> references a comma-separated value file. |
+| FORMAT 'CUSTOM' | Use the `CUSTOM` `FORMAT` with the `Avro` and `SequenceWritable` profiles. The `CUSTOM` format supports only the built-in `formatter='pxfwritable_import'` (read) and `formatter='pxfwritable_export'` (write) \<formatting-property\>. |
+| \<formatting-properties\> | \<formatting-properties\> are profile-specific. |
+
+**Note**: When creating PXF external tables, you cannot use the `HEADER` option in your `FORMAT` specification.
+
+### <a id="profile_hdfstextsimple"></a>HdfsTextSimple Profile
+
+Use the `HdfsTextSimple` profile when reading plain text delimited or comma-separated value (CSV) files where each row is a single record.
+
+The following SQL call uses the PXF `HdfsTextSimple` profile to create a queryable HAWQ external
table from the `pxf_hdfs_ts.txt` file you created and added to HDFS in an earlier section:
 
 ``` sql
-INSERT INTO table_name ...;
+gpadmin=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple(location text, month text, num_orders int, total_sales float8)
+            LOCATION ('pxf://namenode:51200/data/pxf_examples/pxf_hdfs_ts.txt?PROFILE=HdfsTextSimple')
+          FORMAT 'TEXT' (delimiter=E',');
+gpadmin=# SELECT * FROM pxf_hdfs_textsimple;
 ```
 
-To read the data in the files or to write based on the existing format, use `FORMAT`, `PROFILE`,
or one of the classes.
+``` pre
+   location    | month | num_orders | total_sales 
+---------------+-------+------------+-------------
+ Prague        | Jan   |        101 |     4875.33
+ Rome          | Mar   |         87 |     1557.39
+ Bangalore     | May   |        317 |     8936.99
+ Beijing       | Jul   |        411 |    11600.67
+(4 rows)
+```
 
-This topic describes the following:
+Create a second external table from `pxf_hdfs_ts.txt`, this time using the `CSV` `FORMAT`:
 
--   FORMAT clause
--   Profile
--   Accessor
--   Resolver
--   Avro
+``` sql
+gpadmin=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple_csv(location text, month text, num_orders int, total_sales float8)
+            LOCATION ('pxf://namenode:51200/data/pxf_examples/pxf_hdfs_ts.txt?PROFILE=HdfsTextSimple')
+          FORMAT 'CSV';
+gpadmin=# SELECT * FROM pxf_hdfs_textsimple_csv;
+```
 
-**Note:** For more details about the API and classes, see [PXF External Tables and API](PXFExternalTableandAPIReference.html#pxfexternaltableandapireference).
+When specifying `FORMAT 'CSV'` for a comma-separated value file, no `delimiter` formatter
option is required, as comma is the default delimiter.
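+Because this table reads the same `pxf_hdfs_ts.txt` data, the query should return the same four rows as the `TEXT` example above:
+
+``` pre
+   location    | month | num_orders | total_sales 
+---------------+-------+------------+-------------
+ Prague        | Jan   |        101 |     4875.33
+ Rome          | Mar   |         87 |     1557.39
+ Bangalore     | May   |        317 |     8936.99
+ Beijing       | Jul   |        411 |    11600.67
+(4 rows)
+```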
 
-### <a id="formatclause"></a>FORMAT clause
+### <a id="profile_hdfstextmulti"></a>HdfsTextMulti Profile
 
-Use one of the following formats to read data with any PXF connector:
+Use the `HdfsTextMulti` profile when reading plain text files with delimited single- or multi-line records that include embedded (quoted) linefeed characters.
 
--   `FORMAT 'TEXT'`: Use with plain delimited text files on HDFS.
--   `FORMAT 'CSV'`: Use with comma-separated value files on HDFS.
--   `FORMAT 'CUSTOM'`: Use with all other files, including Avro format and binary formats.
Must always be used with the built-in formatter '`pxfwritable_import`' (for read) or '`pxfwritable_export`'
(for write).
+The following SQL call uses the PXF `HdfsTextMulti` profile to create a queryable HAWQ external
table from the `pxf_hdfs_tm.txt` file you created and added to HDFS in an earlier section:
 
-**Note:** When creating PXF external tables, you cannot use the `HEADER` option in your `FORMAT`
specification.
+``` sql
+gpadmin=# CREATE EXTERNAL TABLE pxf_hdfs_textmulti(address text, month text, year int)
+            LOCATION ('pxf://sandbox.hortonworks.com:51200/data/pxf_examples/pxf_hdfs_tm.txt?PROFILE=HdfsTextMulti')
+          FORMAT 'CSV' (delimiter=E':');
+gpadmin=# SELECT * FROM pxf_hdfs_textmulti;
+```
 
-### <a id="topic_ab2_sxy_bv"></a>Profile
+``` pre
+         address          | month | year 
+--------------------------+-------+------
+ 4627 Star Rd.            | Sept  | 2017
+ San Francisco, CA  94107           
+ 113 Moon St.             | Jan   | 2018
+ San Diego, CA  92093               
+ 51 Belt Ct.              | Dec   | 2016
+ Denver, CO  90123                  
+ 93114 Radial Rd.         | Jul   | 2017
+ Chicago, IL  60605                 
+ 7301 Brookview Ave.      | Dec   | 2018
+ Columbus, OH  43213                
+(5 rows)
+```
 
-For plain or comma-separated text files in HDFS use either the `HdfsTextSimple` or `HdfsTextMulti`
Profile, or the classname org.apache.hawq.pxf.plugins.hdfs.*HdfsDataFragmenter*. Use the `Avro`
profile for Avro files. See [Using Profiles to Read and Write Data](ReadWritePXF.html#readingandwritingdatawithpxf)
for more information.
+### <a id="profile_hdfsseqwritable"></a>SequenceWritable Profile 
 
-**Note:** For read tables, you must include a Profile or a Fragmenter in the table definition.
+Use the `SequenceWritable` profile when reading SequenceFile format files. Files of this
type consist of binary key/value pairs. Sequence files are a common data transfer format between
MapReduce jobs. 
 
-### <a id="accessor"></a>Accessor
+The `SequenceWritable` profile supports the following \<custom-options\> or \<formatting-properties\>:
 
-The choice of an Accessor depends on the HDFS data file type. 
+| Keyword  | Value Description |
+|-------|-------------------------------------|
+| COMPRESSION_CODEC    | The Java class name of the compression codec. |
+| COMPRESSION_TYPE    | The compression type of the sequence file; supported values are `RECORD` (the default) or `BLOCK`. |
 
-**Note:** You must include either a Profile or an Accessor in the table definition.
+Use the `CUSTOM` `FORMAT` with the `SequenceWritable` profile; writable tables specify `formatter='pxfwritable_export'`, and readable tables specify `formatter='pxfwritable_import'`.
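+A minimal sketch of a writable external table definition using this profile follows. The host name, port, and HDFS path are hypothetical, and additional profile-specific options (for example, a serialization schema) may be required in your environment:
+
+``` sql
+gpadmin=# CREATE WRITABLE EXTERNAL TABLE pxf_hdfs_seqwrit (location text, month text, num_orders int, total_sales float8)
+            LOCATION ('pxf://namenode:51200/data/pxf_examples/pxf_seqfile?PROFILE=SequenceWritable&COMPRESSION_TYPE=BLOCK')
+          FORMAT 'CUSTOM' (formatter='pxfwritable_export');
+```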
+
+### <a id="profile_hdfsavro"></a>Avro Profile
+
+Avro files store metadata with the data. Avro also allows you to specify an independent schema to use when reading the file.
+
+#### <a id="profile_hdfsavrodatamap"></a>Data Type Mapping
+
+To represent Avro data in HAWQ, map data values that use a primitive data type to HAWQ columns
of the same type. 
+
+Avro supports complex data types including arrays, maps, records, enumerations, and fixed
types. Map top-level fields of these complex data types to the HAWQ `TEXT` type. While HAWQ
does not natively support these types, you can create HAWQ functions or application code to
extract or further process subcomponents of these complex data types.
+
+The following table summarizes external table mapping rules for Avro data.
+
+<caption><span class="tablecap">Table 1. Avro Data Type Mapping</span></caption>
+
+<a id="topic_oy3_qwm_ss__table_j4s_h1n_ss"></a>
+
+| Avro Data Type | PXF Type |
+|----------------|----------|
+| Primitive type (int, double, float, long, string, bytes, boolean) | Use the corresponding HAWQ built-in data type; see [Data Types](../reference/HAWQDataTypes.html). |
+| Complex type: Array, Map, Record, or Enum | TEXT, with delimiters inserted between collection items, mapped key-value pairs, and record data. |
+| Complex type: Fixed | BYTEA |
+| Union | Follows the above conventions for primitive or complex data types, depending on the union; supports Null values. |
+
+#### <a id="profile_hdfsavroptipns"></a>Avro-Specific Formatting Options
+
+For complex types, the PXF Avro profile inserts default delimiters between collection items and values. You can specify non-default delimiter characters by setting the corresponding Avro custom options in the `CREATE EXTERNAL TABLE` call.
+
+The Avro profile supports the following custom options:
+
+<caption><span class="tablecap">Table 2. Avro Formatting Options</span></caption>
+
+| Option Name   | Description |
+|---------------|-------------|
+| COLLECTION_DELIM | The delimiter character(s) to place between entries in a top-level array, map, or record field when PXF maps an Avro complex data type to a text column. The default is a comma `,` character. |
+| MAPKEY_DELIM | The delimiter character(s) to place between the key and value of a map entry when PXF maps an Avro complex data type to a text column. The default is a colon `:` character. |
+| RECORDKEY_DELIM | The delimiter character(s) to place between the field name and value of a record entry when PXF maps an Avro complex data type to a text column. The default is a colon `:` character. |
+| SCHEMA-DATA | The data schema file used to create and read the HDFS file; for an Avro file, this is an `.avsc` schema file. This option has no default value. |
+| THREAD-SAFE | Determines whether the table query can run in multi-threaded mode. Allowed values are `TRUE` and `FALSE`; the default is `TRUE` (requests run in multi-threaded mode). When set to `FALSE`, requests are handled in a single thread. |
+
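+As a hypothetical sketch (the host and schema file path shown are assumptions), you could identify an independent schema file when creating the external table via the `SCHEMA-DATA` option:
+
+``` sql
+gpadmin=# CREATE EXTERNAL TABLE pxf_hdfs_avro_schema (id bigint, username text)
+            LOCATION ('pxf://namenode:51200/data/pxf_examples/pxf_hdfs_avro.avro?PROFILE=Avro&SCHEMA-DATA=/data/pxf_examples/avro_schema.avsc')
+          FORMAT 'CUSTOM' (formatter='pxfwritable_import');
+```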
+#### <a id="topic_tr3_dpg_ts__section_m2p_ztg_ts"></a>Avro Schemas
+
+Avro schemas are defined using JSON, and composed of the same primitive and complex types identified in the data type mapping section above. Avro schema files typically have a `.avsc` suffix.
+
+Fields in an Avro schema file are defined via an array of objects, each of which is specified by a name and a type. A field's type can itself be another schema object, allowing nested complex types.
+The examples in this section operate on Avro data with the following record schema:
+
+- id - long
+- username - string
+- followers - array of string
+- fmap - map of long
+- address - record comprised of street number (int), street name (string), and city (string)
+- relationship - enumerated type
+
+Create an Avro schema file to represent the above schema:
+
+``` shell
+$ vi /tmp/avro_schema.avsc
+```
+
+Copy and paste the following text into `avro_schema.avsc`:
+
+``` json
+{
+  "type" : "record",
+  "name" : "example_schema",
+  "namespace" : "com.example",
+  "fields" : [ {
+    "name" : "id",
+    "type" : "long",
+    "doc" : "Id of the user account"
+  }, {
+    "name" : "username",
+    "type" : "string",
+    "doc" : "Name of the user account"
+  }, {
+    "name" : "followers",
+    "type" : {"type": "array", "items": "string"},
+    "doc" : "Users followers"
+  }, {
+    "name": "fmap",
+    "type": {"type": "map", "values": "long"}
+  }, {
+    "name": "relationship",
+    "type": {
+        "type": "enum",
+        "name": "relationshipEnum",
+        "symbols": ["MARRIED","LOVE","FRIEND","COLLEAGUE","STRANGER","ENEMY"]
+    }
+  }, {
+    "name": "address",
+    "type": {
+        "type": "record",
+        "name": "addressRecord",
+        "fields": [
+            {"name":"number", "type":"int"},
+            {"name":"street", "type":"string"},
+            {"name":"city", "type":"string"}]
+    }
+  } ],
+  "doc" : "A basic schema for storing messages"
+}
+```
+
+An Avro schema, together with its data, is fully self-describing.  
+
+#### <a id="topic_tr3_dpg_ts__section_spk_15g_ts"></a>Sample Avro Data (JSON)
+
+Create a text file named `pxf_hdfs_avro.txt`:
+
+``` shell
+$ vi /tmp/pxf_hdfs_avro.txt
+```
+
+Enter the following data into `pxf_hdfs_avro.txt`:
+
+``` pre
+{"id":1, "username":"john","followers":["kate", "santosh"], "relationship": "FRIEND", "fmap": {"kate":10,"santosh":4}, "address":{"number":1, "street":"renaissance drive", "city":"san jose"}}
+
+{"id":2, "username":"jim","followers":["john", "pam"], "relationship": "COLLEAGUE", "fmap": {"john":3,"pam":3}, "address":{"number":9, "street":"deer creek", "city":"palo alto"}}
+
+The sample data uses a comma `,` to separate top-level field values and a colon `:` to separate map key/values and record field name/values.
+
+Convert the text file to Avro format. There are various ways to perform the conversion, both programmatically and via the command line. This example uses the [Java Avro tools](http://avro.apache.org/releases.html) and assumes that the jar file resides in the current directory:
+
+``` shell
+$ java -jar ./avro-tools-1.8.1.jar fromjson --schema-file /tmp/avro_schema.avsc /tmp/pxf_hdfs_avro.txt > /tmp/pxf_hdfs_avro.avro
+```
+
+The generated Avro binary data file is written to `/tmp/pxf_hdfs_avro.avro`. Copy this file
to HDFS:
+
+``` shell
+$ sudo -u hdfs hdfs dfs -put /tmp/pxf_hdfs_avro.avro /data/pxf_examples/
+```
+
+Create a queryable external table from this Avro file:
+
+-  Map the top-level primitive fields, `id` (type long) and `username` (type string), to
their equivalent HAWQ types (bigint and text). 
+-  Map the remaining complex fields to type text.
+-  Explicitly set the record, map, and collection delimiters using the Avro profile custom
options:
+
+``` sql
+gpadmin=# CREATE EXTERNAL TABLE pxf_hdfs_avro(id bigint, username text, followers text, fmap text, relationship text, address text)
+            LOCATION ('pxf://sandbox.hortonworks.com:51200/data/pxf_examples/pxf_hdfs_avro.avro?PROFILE=Avro&COLLECTION_DELIM=,&MAPKEY_DELIM=:&RECORDKEY_DELIM=:')
+          FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
+```
+
+A simple query of the external table shows the components of the complex type data separated
with delimiters:
+
+``` sql
+gpadmin=# SELECT * FROM pxf_hdfs_avro;
+```
+
+``` pre
+ id | username |   followers    |        fmap         | relationship |                      address                      
+----+----------+----------------+---------------------+--------------+---------------------------------------------------
+  1 | john     | [kate,santosh] | {kate:10,santosh:4} | FRIEND       | {number:1,street:renaissance drive,city:san jose}
+  2 | jim      | [john,pam]     | {pam:3,john:3}      | COLLEAGUE    | {number:9,street:deer creek,city:palo alto}
+(2 rows)
+```
+
+Process the delimited components in the text columns as necessary for your application. For
example, the following command uses the `string_to_array` function to convert entries in the
`followers` field to a text array column in a new view. The view is then queried to filter
rows based on whether a particular follower appears in the array:
+
+``` sql
+gpadmin=# CREATE VIEW followers_view AS
+  SELECT username, address, string_to_array(substring(followers FROM 2 FOR (char_length(followers) - 2)), ',')::text[]
+    AS followers
+  FROM pxf_hdfs_avro;
+
+gpadmin=# SELECT username, address FROM followers_view WHERE followers @> '{john}';
+```
+
+``` pre
+ username |                   address                   
+----------+---------------------------------------------
+ jim      | {number:9,street:deer creek,city:palo alto}
+(1 row)
+```
+
 
-<table>
-<colgroup>
-<col width="25%" />
-<col width="25%" />
-<col width="25%" />
-<col width="25%" />
-</colgroup>
-<thead>
-<tr class="header">
-<th>File Type</th>
-<th>Accessor</th>
-<th>FORMAT clause</th>
-<th>Comments</th>
-</tr>
-</thead>
-<tbody>
-<tr class="odd">
-<td>Plain Text delimited</td>
-<td>org.apache.hawq.pxf.plugins. hdfs.LineBreakAccessor</td>
-<td>FORMAT 'TEXT' (<em>format param list</em>)</td>
-<td> Read + Write
-<p>You cannot use the <code class="ph codeph">HEADER</code> option.</p></td>
-</tr>
-<tr class="even">
-<td>Plain Text CSV </td>
-<td>org.apache.hawq.pxf.plugins. hdfs.LineBreakAccessor</td>
-<td>FORMAT 'CSV' (<em>format param list</em>) </td>
-<td><p>LineBreakAccessor is parallel and faster.</p>
-<p>Use if each logical data row is a physical data line.</p>
-<p>Read + Write </p>
-<p>You cannot use the <code class="ph codeph">HEADER</code> option.</p></td>
-</tr>
-<tr class="odd">
-<td>Plain Text CSV </td>
-<td>org.apache.hawq.pxf.plugins. hdfs.QuotedLineBreakAccessor</td>
-<td>FORMAT 'CSV' (<em>format param list</em>) </td>
-<td><p>QuotedLineBreakAccessor is slower and non-parallel.</p>
-<p>Use if the data includes embedded (quoted) linefeed characters.</p>
-<p>Read Only </p>
-<p>You cannot use the <code class="ph codeph">HEADER</code> option.</p></td>
-</tr>
-<tr class="even">
-<td>SequenceFile</td>
-<td>org.apache.hawq.pxf.plugins. hdfs.SequenceFileAccessor</td>
-<td>FORMAT 'CUSTOM' (formatter='pxfwritable_import')</td>
-<td> Read + Write (use formatter='pxfwritable_export' for write)</td>
-</tr>
-<tr class="odd">
-<td>AvroFile</td>
-<td>org.apache.hawq.pxf.plugins. hdfs.AvroFileAccessor</td>
-<td>FORMAT 'CUSTOM' (formatter='pxfwritable_import')</td>
-<td> Read Only</td>
-</tr>
-</tbody>
-</table>
 
 ### <a id="resolver"></a>Resolver
 
@@ -274,17 +543,19 @@ The class file must follow the following requirements:
 </tbody>
 </table>
 
-## <a id="accessingdataonahighavailabilityhdfscluster"></a>Accessing Data on
a High Availability HDFS Cluster
+## <a id="accessingdataonahighavailabilityhdfscluster"></a>Accessing HDFS Data
in a High Availability HDFS Cluster
 
-To access data on a High Availability HDFS cluster, change the authority in the URI in
the LOCATION. Use *HA\_nameservice* instead of *name\_node\_host:51200*.
+To access data in a High Availability HDFS cluster, change the \<host\> provided in the URI LOCATION clause: specify *\<HA-nameservice\>* rather than *\<host\>[:\<port\>]*.
 
 ``` sql
-CREATE [READABLE|WRITABLE] EXTERNAL TABLE <tbl name> (<attr list>)
-LOCATION ('pxf://<HA nameservice>/<path to file or directory>?Profile=profile[&<additional
options>=<value>]')
-FORMAT '[TEXT | CSV | CUSTOM]' (<formatting properties>);
+CREATE EXTERNAL TABLE <table_name> 
+    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
+LOCATION ('pxf://<HA-nameservice>/<path-to-hdfs-file>
+    ?PROFILE=HdfsTextSimple|HdfsTextMulti|Avro|SequenceWritable[&<custom-option>=<value>[...]]')
+FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>);
 ```
 
-The opposite is true when a highly available HDFS cluster is reverted to a single namenode
configuration. In that case, any table definition that has the nameservice specified should
use the &lt;NN host&gt;:&lt;NN rest port&gt; syntax. 
+The opposite is true when a highly available HDFS cluster is reverted to a single namenode
configuration. In that case, any table definition that has the \<HA-nameservice\> specified
should use the \<host\>[:\<port\>] syntax. 
 
 ## <a id="recordkeyinkey-valuefileformats"></a>Using a Record Key with Key-Value
File Formats
 
@@ -356,152 +627,5 @@ CREATE EXTERNAL TABLE babies_1940_2 (name text, birthday text, weight
float)
 SELECT * FROM babies_1940_2; 
 ```
 
-## <a id="topic_oy3_qwm_ss"></a>Working with Avro Files
-
-Avro files combine their data with a schema, and can support complex data types such as arrays,
maps, records, enumerations, and fixed types. When you create a PXF external table to represent
Avro data, map top-level fields in the schema that use a primitive data type to HAWQ columns
of the same type. Map top-level fields that use a complex data type to a TEXT column in the
external table. The PXF Avro profile automatically separates components of a complex type
by inserting delimiters in the text column. You can then use functions or application code
to further process components of the complex data.
-
-The following table summarizes external table mapping rules for Avro data.
-
-<caption><span class="tablecap">Table 2. Avro Data Type Mapping</span></caption>
-
-<a id="topic_oy3_qwm_ss__table_j4s_h1n_ss"></a>
-
-| Avro Data Type                                                    | PXF Type          
                                                                                         
                                                                                       |
-|-------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| Primitive type (int, double, float, long, string, bytes, boolean) | Corresponding HAWQ
data type. See [Data Types](../reference/HAWQDataTypes.html). |
-| Complex type: Array, Map, Record, or Enum                         | TEXT, with default
delimiters inserted between collection items, mapped key-value pairs, and record data.   
                                                                                       |
-| Complex type: Fixed                                               | BYTEA             
                                                                                         
                                                                                       |
-| Union                                                             | Follows the above conventions
for primitive or complex data types, depending on the union. Null values are supported in
Unions.                                                                     |
-
-For complex types, the PXF Avro profile inserts default delimiters between collection items
and values. You can use non-default delimiter characters by including the `COLLECTION_DELIM`,
`MAPKEY_DELIM`, and/or `RECORDKEY_DELIM` optional parameters on the Avro profile. See [Additional
PXF Options](#additionaloptions__table_skq_kpz_4p) for a description of the parameters.
-
-### <a id="topic_tr3_dpg_ts"></a>Example
-
-The following example uses the Avro schema shown in [Sample Avro Schema](#topic_tr3_dpg_ts__section_m2p_ztg_ts)
and the associated data file shown in [Sample Avro Data (JSON)](#topic_tr3_dpg_ts__section_spk_15g_ts).
-
-#### <a id="topic_tr3_dpg_ts__section_m2p_ztg_ts"></a>Sample Avro Schema
-
-``` json
-{
-  "type" : "record",
-  "name" : "example_schema",
-  "namespace" : "com.example",
-  "fields" : [ {
-    "name" : "id",
-    "type" : "long",
-    "doc" : "Id of the user account"
-  }, {
-    "name" : "username",
-    "type" : "string",
-    "doc" : "Name of the user account"
-  }, {
-    "name" : "followers",
-    "type" : {"type": "array", "items": "string"},
-    "doc" : "Users followers"
-  }, {
-    "name": "rank",
-    "type": ["null", "int"],
-    "default": null
-  }, {
-    "name": "fmap",
-    "type": {"type": "map", "values": "long"}
-  }, {
-    "name": "address",
-    "type": {
-        "type": "record",
-        "name": "addressRecord",
-        "fields": [
-            {"name":"number", "type":"int"},
-            {"name":"street", "type":"string"},
-            {"name":"city", "type":"string"}]
-    }
-  }, {
-   "name": "relationship",
-    "type": {
-        "type": "enum",
-        "name": "relationshipEnum",
-        "symbols": ["MARRIED","LOVE","FRIEND","COLLEAGUE","STRANGER","ENEMY"]
-    }
-  }, {
-    "name" : "md5",
-    "type": {
-        "type" : "fixed",
-        "name" : "md5Fixed",
-        "size" : 4
-    }
-  } ],
-  "doc:" : "A basic schema for storing messages"
-}
-```
-
-#### <a id="topic_tr3_dpg_ts__section_spk_15g_ts"></a>Sample Avro Data (JSON)
-
-``` pre
-{"id":1, "username":"john","followers":["kate", "santosh"], "rank":null, "relationship":
"FRIEND", "fmap": {"kate":10,"santosh":4},
-"address":{"street":"renaissance drive", "number":1,"city":"san jose"}, "md5":\u3F00\u007A\u0073\u0074}
-
-{"id":2, "username":"jim","followers":["john", "pam"], "rank":3, "relationship": "COLLEAGUE",
"fmap": {"john":3,"pam":3}, 
-"address":{"street":"deer creek", "number":9,"city":"palo alto"}, "md5":\u0010\u0021\u0003\u0004}
-```
-
-To map this Avro file to an external table, the top-level primitive fields ("id" of type
long and "username" of type string) are mapped to their equivalent HAWQ types (bigint and
text). The remaining complex fields are mapped to text columns:
-
-``` sql
-gpadmin=# CREATE EXTERNAL TABLE avro_complex 
-  (id bigint, 
-  username text, 
-  followers text, 
-  rank int, 
-  fmap text, 
-  address text, 
-  relationship text,
-  md5 bytea) 
-LOCATION ('pxf://namehost:51200/tmp/avro_complex?PROFILE=Avro')
-FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
-```
-
-The above command uses default delimiters for separating components of the complex types.
This command is equivalent to the one above, but it explicitly sets the delimiters using the
Avro profile parameters:
-
-``` sql
-gpadmin=# CREATE EXTERNAL TABLE avro_complex 
-  (id bigint, 
-  username text, 
-  followers text, 
-  rank int, 
-  fmap text, 
-  address text, 
-  relationship text,
-  md5 bytea) 
-LOCATION ('pxf://localhost:51200/tmp/avro_complex?PROFILE=Avro&COLLECTION_DELIM=,&MAPKEY_DELIM=:&RECORDKEY_DELIM=:')
-FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
-```
-
-A simple query of the external table shows the components of the complex type data separated
with delimiters:
-
-``` sql
-gpadmin=# select * from avro_complex;
-```
-
-``` pre
-id | username |  followers  |    rank     |  fmap   |    address  |  relationship  |  md5
-1| john | [kate,santosh] |   | {kate:10,santosh:4} | {number:1,street:renaissance drive,city:san
jose} | FRIEND | ?zst
-2| jim | [john,pam] | 3 | {john:3,pam:3} | {number:9,street:deer creek,city:palo alto} |
COLLEAGUE | \020!\003\004
-```
-
-You can process the delimited components in the text columns as necessary for your application.
For example, the following command uses the `string_to_array` function to convert entries
in the "followers" field to a text array column in a new view. The view is then queried to
filter rows based on whether a particular follower appears in the array:
-
-``` sql
-gpadmin=# create view followers_view as 
-  select username, address, string_to_array(substring(followers from 2 for (char_length(followers)
- 2)), ',')::text[] 
-    as followers 
-  from avro_complex;
-
-gpadmin=# select username, address 
-from followers_view 
-where john = ANY(followers);
-```
-
-``` pre
-username | address
-jim | {number:9,street:deer creek,city:palo alto}
-```
+## <a id="hdfs_advanced"></a>Advanced
+If you find that the pre-defined PXF HDFS profiles do not meet your needs, you can create a custom HDFS profile from the existing HDFS accessors and resolvers. Refer to [XX]() for information on creating a custom profile.
\ No newline at end of file

