hawq-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From dyozie <...@git.apache.org>
Subject [GitHub] incubator-hawq-docs pull request #33: HAWQ-1107 - enhance PXF HDFS plugin do...
Date Tue, 25 Oct 2016 21:20:24 GMT
Github user dyozie commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/33#discussion_r85003214
  
    --- Diff: pxf/HDFSFileDataPXF.html.md.erb ---
    @@ -2,388 +2,282 @@
     title: Accessing HDFS File Data
     ---
     
    -## <a id="installingthepxfhdfsplugin"></a>Prerequisites
    +HDFS is the primary distributed storage mechanism used by Apache Hadoop applications.
The PXF HDFS plug-in reads file data stored in HDFS.  The plug-in supports plain delimited
and comma-separated-value format text files.  The HDFS plug-in also supports the Avro binary
format.
     
    -Before working with HDFS file data using HAWQ and PXF, you should perform the following
operations:
    +This section describes how to use PXF to access HDFS data, including how to create and
query an external table from files in the HDFS data store.
     
    --   Test PXF on HDFS before connecting to Hive or HBase.
    --   Ensure that all HDFS users have read permissions to HDFS services and that write
permissions have been limited to specific users.
    +## <a id="hdfsplugin_prereq"></a>Prerequisites
     
    -## <a id="syntax1"></a>Syntax
    +Before working with HDFS file data using HAWQ and PXF, ensure that:
     
    -The syntax for creating an external HDFS file is as follows: 
    +-   The HDFS plug-in is installed on all cluster nodes.
    +-   All HDFS users have read permissions to HDFS services and that write permissions
have been restricted to specific users.
     
    -``` sql
    -CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name 
    -    ( column_name data_type [, ...] | LIKE other_table )
    -LOCATION ('pxf://host[:port]/path-to-data?<pxf parameters>[&custom-option=value...]')
    -      FORMAT '[TEXT | CSV | CUSTOM]' (<formatting_properties>);
    -```
    +## <a id="hdfsplugin_fileformats"></a>HDFS File Formats
     
    -where `<pxf parameters>` is:
    +The PXF HDFS plug-in supports reading the following file formats:
     
    -``` pre
    -   FRAGMENTER=fragmenter_class&ACCESSOR=accessor_class&RESOLVER=resolver_class]
    - | PROFILE=profile-name
    -```
    +- Text File - comma-separated value (.csv) or delimited format plain text file
    +- Avro - JSON-defined, schema-based data serialization format
     
    -**Note:** Omit the `FRAGMENTER` parameter for `READABLE` external tables.
    +The PXF HDFS plug-in includes the following profiles to support the file formats listed
above:
     
    -Use an SQL `SELECT` statement to read from an HDFS READABLE table:
    +- `HdfsTextSimple` - text files
    +- `HdfsTextMulti` - text files with embedded line feeds
    +- `Avro` - Avro files
     
    -``` sql
    -SELECT ... FROM table_name;
    -```
     
    -Use an SQL `INSERT` statement to add data to an HDFS WRITABLE table:
    +## <a id="hdfsplugin_cmdline"></a>HDFS Shell Commands
    +Hadoop includes command-line tools that interact directly with HDFS.  These tools support
typical file system operations including copying and listing files, changing file permissions,
etc. 
     
    -``` sql
    -INSERT INTO table_name ...;
    -```
    +The HDFS file system command is `hdfs dfs <options> [<file>]`. Invoked with
no options, `hdfs dfs` lists the file system options supported by the tool.
    +
    +`hdfs dfs` options used in this section are identified in the table below:
    +
    +| Option  | Description |
    +|-------|-------------------------------------|
    +| `-cat`    | Display file contents. |
    +| `-mkdir`    | Create directory in HDFS. |
    +| `-put`    | Copy file from local file system to HDFS. |
    +
    +### <a id="hdfsplugin_cmdline_create"></a>Create Data Files
    +
    +Perform the following steps to create data files used in subsequent exercises:
    +
    +1. Create an HDFS directory for PXF example data files:
    +
    +    ``` shell
    +     $ sudo -u hdfs hdfs dfs -mkdir -p /data/pxf_examples
    +    ```
    +
    +2. Create a delimited plain text file:
    +
    +    ``` shell
    +    $ vi /tmp/pxf_hdfs_simple.txt
    +    ```
    +
    +3. Copy and paste the following data into `pxf_hdfs_simple.txt`:
    +
    +    ``` pre
    +    Prague,Jan,101,4875.33
    +    Rome,Mar,87,1557.39
    +    Bangalore,May,317,8936.99
    +    Beijing,Jul,411,11600.67
    +    ```
    +
    +    Notice the use of the comma `,` to separate the four data fields.
    +
    +4. Add the data file to HDFS:
    +
    +    ``` shell
    +    $ sudo -u hdfs hdfs dfs -put /tmp/pxf_hdfs_simple.txt /data/pxf_examples/
    +    ```
    +
    +5. Display the contents of the `pxf_hdfs_simple.txt` file stored in HDFS:
    +
    +    ``` shell
    +    $ sudo -u hdfs hdfs dfs -cat /data/pxf_examples/pxf_hdfs_simple.txt
    +    ```
    +
    +6. Create a second delimited plain text file:
    +
    +    ``` shell
    +    $ vi /tmp/pxf_hdfs_multi.txt
    +    ```
     
    -To read the data in the files or to write based on the existing format, use `FORMAT`,
`PROFILE`, or one of the classes.
    -
    -This topic describes the following:
    -
    --   FORMAT clause
    --   Profile
    --   Accessor
    --   Resolver
    --   Avro
    -
    -**Note:** For more details about the API and classes, see [PXF External Tables and API](PXFExternalTableandAPIReference.html#pxfexternaltableandapireference).
    -
    -### <a id="formatclause"></a>FORMAT clause
    -
    -Use one of the following formats to read data with any PXF connector:
    -
    --   `FORMAT 'TEXT'`: Use with plain delimited text files on HDFS.
    --   `FORMAT 'CSV'`: Use with comma-separated value files on HDFS.
    --   `FORMAT 'CUSTOM'`: Use with all other files, including Avro format and binary formats.
Must always be used with the built-in formatter '`pxfwritable_import`' (for read) or '`pxfwritable_export`'
(for write).
    -
    -**Note:** When creating PXF external tables, you cannot use the `HEADER` option in your
`FORMAT` specification.
    -
    -### <a id="topic_ab2_sxy_bv"></a>Profile
    -
    -For plain or comma-separated text files in HDFS use either the `HdfsTextSimple` or `HdfsTextMulti`
Profile, or the classname org.apache.hawq.pxf.plugins.hdfs.*HdfsDataFragmenter*. Use the `Avro`
profile for Avro files. See [Using Profiles to Read and Write Data](ReadWritePXF.html#readingandwritingdatawithpxf)
for more information.
    -
    -**Note:** For read tables, you must include a Profile or a Fragmenter in the table definition.
    -
    -### <a id="accessor"></a>Accessor
    -
    -The choice of an Accessor depends on the HDFS data file type. 
    -
    -**Note:** You must include either a Profile or an Accessor in the table definition.
    -
    -<table>
    -<colgroup>
    -<col width="25%" />
    -<col width="25%" />
    -<col width="25%" />
    -<col width="25%" />
    -</colgroup>
    -<thead>
    -<tr class="header">
    -<th>File Type</th>
    -<th>Accessor</th>
    -<th>FORMAT clause</th>
    -<th>Comments</th>
    -</tr>
    -</thead>
    -<tbody>
    -<tr class="odd">
    -<td>Plain Text delimited</td>
    -<td>org.apache.hawq.pxf.plugins. hdfs.LineBreakAccessor</td>
    -<td>FORMAT 'TEXT' (<em>format param list</em>)</td>
    -<td> Read + Write
    -<p>You cannot use the <code class="ph codeph">HEADER</code> option.</p></td>
    -</tr>
    -<tr class="even">
    -<td>Plain Text CSV </td>
    -<td>org.apache.hawq.pxf.plugins. hdfs.LineBreakAccessor</td>
    -<td>FORMAT 'CSV' (<em>format param list</em>) </td>
    -<td><p>LineBreakAccessor is parallel and faster.</p>
    -<p>Use if each logical data row is a physical data line.</p>
    -<p>Read + Write </p>
    -<p>You cannot use the <code class="ph codeph">HEADER</code> option.</p></td>
    -</tr>
    -<tr class="odd">
    -<td>Plain Text CSV </td>
    -<td>org.apache.hawq.pxf.plugins. hdfs.QuotedLineBreakAccessor</td>
    -<td>FORMAT 'CSV' (<em>format param list</em>) </td>
    -<td><p>QuotedLineBreakAccessor is slower and non-parallel.</p>
    -<p>Use if the data includes embedded (quoted) linefeed characters.</p>
    -<p>Read Only </p>
    -<p>You cannot use the <code class="ph codeph">HEADER</code> option.</p></td>
    -</tr>
    -<tr class="even">
    -<td>SequenceFile</td>
    -<td>org.apache.hawq.pxf.plugins. hdfs.SequenceFileAccessor</td>
    -<td>FORMAT 'CUSTOM' (formatter='pxfwritable_import')</td>
    -<td> Read + Write (use formatter='pxfwritable_export' for write)</td>
    -</tr>
    -<tr class="odd">
    -<td>AvroFile</td>
    -<td>org.apache.hawq.pxf.plugins. hdfs.AvroFileAccessor</td>
    -<td>FORMAT 'CUSTOM' (formatter='pxfwritable_import')</td>
    -<td> Read Only</td>
    -</tr>
    -</tbody>
    -</table>
    -
    -### <a id="resolver"></a>Resolver
    -
    -Choose the Resolver format if data records are serialized in the HDFS file. 
    -
    -**Note:** You must include a Profile or a Resolver in the table definition.
    -
    -<table>
    -<colgroup>
    -<col width="33%" />
    -<col width="33%" />
    -<col width="33%" />
    -</colgroup>
    -<thead>
    -<tr class="header">
    -<th>Record Serialization</th>
    -<th>Resolver</th>
    -<th>Comments</th>
    -</tr>
    -</thead>
    -<tbody>
    -<tr class="odd">
    -<td>Avro</td>
    -<td>org.apache.hawq.pxf.plugins. hdfs.AvroResolver</td>
    -<td><ul>
    -<li>Avro files include the record schema, Avro serialization can be used in other
file types (e.g, Sequence File). </li>
    -<li>For Avro serialized records outside of an Avro file, include a schema file
name (.avsc) in the url under the optional <code class="ph codeph">Schema-Data </code>option.</li>
    -<li>Deserialize Only (Read) .</li>
    -</ul></td>
    -</tr>
    -<tr class="even">
    -<td>Java Writable</td>
    -<td>org.apache.hawq.pxf.plugins. hdfs.WritableResolver</td>
    -<td><ul>
    -<li>Include the name of the Java class that uses Writable serialization in the
URL under the optional <code class="ph codeph">Schema-Data.</code></li>
    -<li>The class file must exist in the public stage directory (or in Hadoop's class
path).</li>
    -<li>Deserialize and Serialize (Read + Write). </li>
    -<li>See <a href="#customizedwritableschemafileguidelines">Customized Writable
Schema File Guidelines</a>.</li>
    -</ul></td>
    -</tr>
    -<tr class="odd">
    -<td>None (plain text)</td>
    -<td>org.apache.hawq.pxf.plugins. hdfs.StringPassResolver</td>
    -<td><ul>
    -<li>Does not serialize plain text records. The database parses plain records.
Passes records as they are.</li>
    -<li>Deserialize and Serialize (Read + Write).</li>
    -</ul></td>
    -</tr>
    -</tbody>
    -</table>
    -
    -#### <a id="customizedwritableschemafileguidelines"></a>Schema File Guidelines
for WritableResolver
    -
    -When using a WritableResolver, a schema file needs to be defined. The file needs to be
a Java class file and must be on the class path of PXF.
    -
    -The class file must follow the following requirements:
    -
    -1.  Must implement org.apache.hadoop.io.Writable interface.
    -2.  WritableResolver uses reflection to recreate the schema and populate its fields (for
both read and write). Then it uses the Writable interface functions to read/write. Therefore,
fields must be public, to enable access to them. Private fields will be ignored.
    -3.  Fields are accessed and populated in the order in which they are declared in the
class file.
    -4.  Supported field types:
    -    -   boolean
    -    -   byte array
    -    -   double
    -    -   float
    -    -   int
    -    -   long
    -    -   short
    -    -   string
    -
    -    Arrays of any of the above types are supported, but the constructor must define the
array size so the reflection will work.
    -
    -### <a id="additionaloptions"></a>Additional Options
    -
    -<a id="additionaloptions__table_skq_kpz_4p"></a>
    -
    -<table>
    -<caption><span class="tablecap">Table 1. Additional PXF Options</span></caption>
    -<colgroup>
    -<col width="50%" />
    -<col width="50%" />
    -</colgroup>
    -<thead>
    -<tr class="header">
    -<th>Option Name</th>
    -<th>Description</th>
    -</tr>
    -</thead>
    -<tbody>
    -<tr class="odd">
    -<td>COLLECTION_DELIM</td>
    -<td>(Avro or Hive profiles only.) The delimiter character(s) to place between entries
in a top-level array, map, or record field when PXF maps a Hive or Avro complex data type
to a text column. The default is a &quot;,&quot; character.</td>
    -</tr>
    -<tr class="even">
    -<td>COMPRESSION_CODEC</td>
    -<td><ul>
    -<li>Useful for WRITABLE PXF tables.</li>
    -<li>Specifies the compression codec class name for compressing the written data.
The class must implement the org.apache.hadoop.io.compress.CompressionCodec interface.</li>
    -<li> Some valid values are org.apache.hadoop.io.compress.DefaultCodec org.apache.hadoop.io.compress.GzipCodec
org.apache.hadoop.io.compress.BZip2Codec.</li>
    -<li>Note: org.apache.hadoop.io.compress.BZip2Codec runs in a single thread and
can be slow.</li>
    -<li>This option has no default value. </li>
    -<li>When the option is not defined, no compression will be done.</li>
    -</ul></td>
    -</tr>
    -<tr class="odd">
    -<td>COMPRESSION_TYPE</td>
    -<td><ul>
    -<li>Useful WRITABLE PXF tables with SequenceFileAccessor.</li>
    -<li>Ignored when COMPRESSION_CODEC is not defined.</li>
    -<li>Specifies the compression type for sequence file.</li>
    -<li>Valid options are: 
    -<ul>
    -<li>RECORD - only the value part of each row is compressed.</li>
    -<li>BLOCK - both keys and values are collected in 'blocks' separately and compressed.</li>
    -</ul></li>
    -<li>Default value: RECORD.</li>
    -</ul></td>
    -</tr>
    -<tr class="even">
    -<td>MAPKEY_DELIM</td>
    -<td>(Avro or Hive profiles only.) The delimiter character(s) to place between the
key and value of a map entry when PXF maps a Hive or Avro complex data type to a text colum.
The default is a &quot;:&quot; character.</td>
    -</tr>
    -<tr class="odd">
    -<td>RECORDKEY_DELIM</td>
    -<td>(Avro profile only.) The delimiter character(s) to place between the field
name and value of a record entry when PXF maps an Avro complex data type to a text colum.
The default is a &quot;:&quot; character.</td>
    -</tr>
    -<tr class="even">
    -<td>SCHEMA-DATA</td>
    -<td>The data schema file used to create and read the HDFS file. For example, you
could create an avsc (for Avro), or a Java class (for Writable Serialization) file. Make sure
that you have added any JAR files containing the schema to <code class="ph codeph">pxf-public.classpath</code>.
    -<p>This option has no default value.</p></td>
    -</tr>
    -<tr class="odd">
    -<td>THREAD-SAFE</td>
    -<td>Determines if the table query can run in multithread mode or not. When set
to FALSE, requests will be handled in a single thread.
    -<p>Should be set when a plug-in or other elements that are not thread safe are
used (e.g. compression codec).</p>
    -<p>Allowed values: TRUE, FALSE. Default value is TRUE - requests can run in multithread
mode.</p></td>
    -</tr>
    -<tr class="even">
    -<td> &lt;custom&gt;</td>
    -<td>Any option added to the pxf URI string will be accepted and passed, along with
its value, to the Fragmenter, Accessor, and Resolver implementations.</td>
    -</tr>
    -</tbody>
    -</table>
    -
    -## <a id="accessingdataonahighavailabilityhdfscluster"></a>Accessing Data
on a High Availability HDFS Cluster
    -
    -To access data on a High Availability HDFS cluster, change the authority in the URI
in the LOCATION. Use *HA\_nameservice* instead of *name\_node\_host:51200*.
    +7. Copy/paste the following data into `pxf_hdfs_multi.txt`:
    +
    +    ``` pre
    +    "4627 Star Rd.
    +    San Francisco, CA  94107":Sept:2017
    +    "113 Moon St.
    +    San Diego, CA  92093":Jan:2018
    +    "51 Belt Ct.
    +    Denver, CO  90123":Dec:2016
    +    "93114 Radial Rd.
    +    Chicago, IL  60605":Jul:2017
    +    "7301 Brookview Ave.
    +    Columbus, OH  43213":Dec:2018
    +    ```
    +
    +    Notice the use of the colon `:` to separate the three fields. Also notice the quotes
around the first (address) field. This field includes an embedded line feed.
    +
    +8. Add the data file to HDFS:
    +
    +    ``` shell
    +    $ sudo -u hdfs hdfs dfs -put /tmp/pxf_hdfs_multi.txt /data/pxf_examples/
    +    ```
    +
    +You will use these HDFS files in later sections.
    +
    +## <a id="hdfsplugin_queryextdata"></a>Querying External HDFS Data
    +The PXF HDFS plug-in supports the `HdfsTextSimple`, `HdfsTextMulti`, and `Avro` profiles.
    +
    +Use the following syntax to create a HAWQ external table representing HDFS data: 
     
     ``` sql
    -CREATE [READABLE|WRITABLE] EXTERNAL TABLE <tbl name> (<attr list>)
    -LOCATION ('pxf://<HA nameservice>/<path to file or directory>?Profile=profile[&<additional
options>=<value>]')
    -FORMAT '[TEXT | CSV | CUSTOM]' (<formatting properties>);
    +CREATE EXTERNAL TABLE <table_name> 
    +    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
    +LOCATION ('pxf://<host>[:<port>]/<path-to-hdfs-file>
    +    ?PROFILE=HdfsTextSimple|HdfsTextMulti|Avro[&<custom-option>=<value>[...]]')
    +FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>);
     ```
     
    -The opposite is true when a highly available HDFS cluster is reverted to a single namenode
configuration. In that case, any table definition that has the nameservice specified should
use the &lt;NN host&gt;:&lt;NN rest port&gt; syntax. 
    +HDFS-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](../reference/sql/CREATE-EXTERNAL-TABLE.html)
call are described in the table below.
    +
    +| Keyword  | Value |
    +|-------|-------------------------------------|
    +| \<host\>[:\<port\>]    | The HDFS NameNode and port. |
    +| \<path-to-hdfs-file\>    | The path to the file in the HDFS data store. |
    +| PROFILE    | The `PROFILE` keyword must specify one of the values `HdfsTextSimple`,
`HdfsTextMulti`, or `Avro`. |
    +| \<custom-option\>  | \<custom-option\> is profile-specific. Profile-specific
options are discussed in the relevant profile topic later in this section.|
    +| FORMAT 'TEXT' | Use '`TEXT`' `FORMAT` with the `HdfsTextSimple` profile when \<path-to-hdfs-file\>
references a plain text delimited file.  |
    +| FORMAT 'CSV' | Use '`CSV`' `FORMAT` with `HdfsTextSimple` and `HdfsTextMulti` profiles
when \<path-to-hdfs-file\> references a comma-separated value file.  |
    +| FORMAT 'CUSTOM' | Use the`CUSTOM` `FORMAT` with  the `Avro` profile. The `Avro` '`CUSTOM`'
`FORMAT` supports only the built-in `(formatter='pxfwritable_import')` \<formatting-property\>
|
    + \<formatting-properties\>    | \<formatting-properties\> are profile-specific.
Profile-specific formatting options are discussed in the relevant profile topic later in this
section. |
    +
    +*Note*: When creating PXF external tables, you cannot use the `HEADER` option in your
`FORMAT` specification.
     
    -## <a id="recordkeyinkey-valuefileformats"></a>Using a Record Key with Key-Value
File Formats
    +## <a id="profile_hdfstextsimple"></a>HdfsTextSimple Profile
     
    -For sequence file and other file formats that store rows in a key-value format, the key
value can be accessed through HAWQ by using the saved keyword '`recordkey`' as a field name.
    +Use the `HdfsTextSimple` profile when reading plain text delimited or .csv files where
each row is a single record.
     
    -The field type must correspond to the key type, much as the other fields must match the
HDFS data. 
    +\<formatting-properties\> supported by the `HdfsTextSimple` profile include:
     
    -WritableResolver supports read and write of recordkey, which can be of the following
Writable Hadoop types:
    +| Keyword  | Value |
    +|-------|-------------------------------------|
    +| delimiter    | The delimiter character in the file. Default value is a comma `,`.|
     
    --   BooleanWritable
    --   ByteWritable
    --   DoubleWritable
    --   FloatWritable
    --   IntWritable
    --   LongWritable
    --   Text
     
    -If the `recordkey` field is not defined, the key is ignored in read, and a default value
(segment id as LongWritable) is written in write.
    +The following SQL call uses the PXF `HdfsTextSimple` profile to create a queryable HAWQ
external table from the `pxf_hdfs_simple.txt` file you created and added to HDFS in an earlier
section:
    +
    +``` sql
    +gpadmin=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple(location text, month text, num_orders
int, total_sales float8)
    +            LOCATION ('pxf://namenode:51200/data/pxf_examples/pxf_hdfs_simple.txt?PROFILE=HdfsTextSimple')
    +          FORMAT 'TEXT' (delimiter=E',');
    +gpadmin=# SELECT * FROM pxf_hdfs_textsimple;          
    +```
     
    -### <a id="example1"></a>Example
    +``` pre
    +   location    | month | num_orders | total_sales 
    +---------------+-------+------------+-------------
    + Prague        | Jan   |        101 |     4875.33
    + Rome          | Mar   |         87 |     1557.39
    + Bangalore     | May   |        317 |     8936.99
    + Beijing       | Jul   |        411 |    11600.67
    +(4 rows)
    +```
     
    -A data schema `Babies.class` contains three fields: (name text, birthday text, weight
float). An external table must include these three fields, and can either include or ignore
the recordkey.
    +Create a second external table from `pxf_hdfs_simple.txt`, this time using the `CSV`
`FORMAT`:
     
     ``` sql
    --- writable table with recordkey
    -CREATE WRITABLE EXTERNAL TABLE babies_registry (recordkey int, name text, birthday text,
weight float)
    -  LOCATION ('pxf://namenode_host:51200/babies_1940s'
    -            '?ACCESSOR=org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor'
    -            '&RESOLVER=org.apache.hawq.pxf.plugins.hdfs.WritableResolver'
    -            '&DATA-SCHEMA=Babies')
    -  FORMAT 'CUSTOM' (formatter='pxfwritable_export');
    -INSERT INTO babies_registry VALUES (123456, "James Paul McCartney", "June 18, 1942",
3.800);
    -
    --- writable table without recordkey
    -CREATE WRITABLE EXTERNAL TABLE babies_registry2 (name text, birthday text, weight float)
    -  LOCATION ('pxf://namenode_host:51200/babies_1940s'
    -            '?ACCESSOR=org.apache.hawq.pxf.plugins.SequenceFileAccessor'
    -            '&RESOLVER=org.apache.hawq.pxf.plugins.WritableResolver'
    -            '&DATA-SCHEMA=Babies')
    -  FORMAT 'CUSTOM' (formatter='pxfwritable_export');
    -
    --- this record's key will have some default value
    -INSERT INTO babies_registry VALUES ("Richard Starkey", "July 7, 1940", 4.0);
    +gpadmin=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple_csv(location text, month text, num_orders
int, total_sales float8)
    +            LOCATION ('pxf://namenode:51200/data/pxf_examples/pxf_hdfs_simple.txt?PROFILE=HdfsTextSimple')
    +          FORMAT 'CSV';
    +gpadmin=# SELECT * FROM pxf_hdfs_textsimple_csv;          
     ```
     
    -The same goes for reading data from an existing file with a key-value format, e.g. a
Sequence file.
    +Notice: When specifying `FORMAT 'CSV'` for a comma-separated value file, no `delimiter`
formatter option is required, as comma is the default.
    +
    +## <a id="profile_hdfstextmulti"></a>HdfsTextMulti Profile
    +
    +Use the `HdfsTextMulti` profile when reading plain text files with delimited single-
or multi- line records that include embedded (quoted) linefeed characters.
    +
    +\<formatting-properties\> supported by the `HdfsTextMulti` profile include:
    +
    +| Keyword  | Value |
    +|-------|-------------------------------------|
    +| delimiter    | The delimiter character in the file. |
    +
    +The following SQL call uses the PXF `HdfsTextMulti` profile to create a queryable HAWQ
external table from the `pxf_hdfs_multi.txt` file you created and added to HDFS in an earlier
section:
     
     ``` sql
    --- readable table with recordkey
    -CREATE EXTERNAL TABLE babies_1940 (recordkey int, name text, birthday text, weight float)
    -  LOCATION ('pxf://namenode_host:51200/babies_1940s'
    -            '?FRAGMENTER=org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter'
    -            '&ACCESSOR=org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor'
    -            '&RESOLVER=org.apache.hawq.pxf.plugins.hdfs.WritableResolver'
    -            '&DATA-SCHEMA=Babies')
    -  FORMAT 'CUSTOM' (formatter='pxfwritable_import');
    --- retrieve each record's key
    -SELECT * FROM babies_1940;
    -
    --- readable table without recordkey
    -CREATE EXTERNAL TABLE babies_1940_2 (name text, birthday text, weight float)
    -  LOCATION ('pxf://namenode_host:51200/babies_1940s'
    -            '?FRAGMENTER=org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter'
    -            '&ACCESSOR=org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor'
    -            '&RESOLVER=org.apache.hawq.pxf.plugins.hdfs.WritableResolver'
    -            '&DATA-SCHEMA=Babies') 
    -  FORMAT 'CUSTOM' (formatter='pxfwritable_import');
    --- ignores the records' key
    -SELECT * FROM babies_1940_2; 
    +gpadmin=# CREATE EXTERNAL TABLE pxf_hdfs_textmulti(address text, month text, year int)
    +            LOCATION ('pxf://namenode:51200/data/pxf_examples/pxf_hdfs_multi.txt?PROFILE=HdfsTextMulti')
    +          FORMAT 'CSV' (delimiter=E':');
    +gpadmin=# SELECT * FROM pxf_hdfs_textmulti;
    +```
    +
    +``` pre
    +         address          | month | year 
    +--------------------------+-------+------
    + 4627 Star Rd.            | Sept  | 2017
    + San Francisco, CA  94107           
    + 113 Moon St.             | Jan   | 2018
    + San Diego, CA  92093               
    + 51 Belt Ct.              | Dec   | 2016
    + Denver, CO  90123                  
    + 93114 Radial Rd.         | Jul   | 2017
    + Chicago, IL  60605                 
    + 7301 Brookview Ave.      | Dec   | 2018
    + Columbus, OH  43213                
    +(5 rows)
     ```
     
    -## <a id="topic_oy3_qwm_ss"></a>Working with Avro Files
    +## <a id="profile_hdfsavro"></a>Avro Profile
    +
    +Apache Avro is a data serialization framework where the data is serialized in a compact
binary format. 
    +
    +Avro specifies data types be defined in JSON. Avro format files have an independent schema,
also defined in JSON. In Avro files, the schema is stored with the data. An Avro schema, together
with its data, is fully self-describing.
     
    -Avro files combine their data with a schema, and can support complex data types such
as arrays, maps, records, enumerations, and fixed types. When you create a PXF external table
to represent Avro data, map top-level fields in the schema that use a primitive data type
to HAWQ columns of the same type. Map top-level fields that use a complex data type to a TEXT
column in the external table. The PXF Avro profile automatically separates components of a
complex type by inserting delimiters in the text column. You can then use functions or application
code to further process components of the complex data.
    +### <a id="profile_hdfsavrodatamap"></a>Data Type Mapping
     
    -The following table summarizes external table mapping rules for Avro data.
    +Avro supports both primitive and complex data types. 
     
    -<caption><span class="tablecap">Table 2. Avro Data Type Mapping</span></caption>
    +To represent Avro primitive data types in HAWQ, map data values to HAWQ columns of the
same type. 
    +
    +Avro supports complex data types including arrays, maps, records, enumerations, and fixed
types. Map top-level fields of these complex data types to the HAWQ `TEXT` type. While HAWQ
does not natively support these types, you can create HAWQ functions or application code to
extract or further process subcomponents of these complex data types.
    +
    +The following table summarizes external mapping rules for Avro data.
     
     <a id="topic_oy3_qwm_ss__table_j4s_h1n_ss"></a>
     
    -| Avro Data Type                                                    | PXF Type      
                                                                                         
                                                                                         
 |
    +| Avro Data Type                                                    | PXF/HAWQ Data Type
                                                                                         
                                                                                         
       |
     |-------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
    -| Primitive type (int, double, float, long, string, bytes, boolean) | Corresponding HAWQ
data type. See [Data Types](../reference/HAWQDataTypes.html). |
    -| Complex type: Array, Map, Record, or Enum                         | TEXT, with default
delimiters inserted between collection items, mapped key-value pairs, and record data.   
                                                                                       |
    +| Primitive type (int, double, float, long, string, bytes, boolean) | Use the corresponding
HAWQ built-in data type; see [Data Types](../reference/HAWQDataTypes.html). |
    +| Complex type: Array, Map, Record, or Enum                         | TEXT, with delimiters
inserted between collection items, mapped key-value pairs, and record data.              
                                                                            |
     | Complex type: Fixed                                               | BYTEA         
                                                                                         
                                                                                         
 |
    -| Union                                                             | Follows the above
conventions for primitive or complex data types, depending on the union. Null values are supported
in Unions.                                                                     |
    +| Union                                                             | Follows the above
conventions for primitive or complex data types, depending on the union; supports Null values.
                                                                    |
    +
    +### <a id="profile_hdfsavroptipns"></a>Avro-Specific Custom Options
    +
    +For complex types, the PXF `Avro` profile inserts default delimiters between collection
items and values. You can use non-default delimiter characters by identifying values for specific
`Avro` custom options in the `CREATE EXTERNAL TABLE` call. 
    +
    +The `Avro` profile supports the following \<custom-options\>:
    +
    +| Option Name   | Description       
    +|---------------|--------------------|                                              
                                         
    +| COLLECTION_DELIM | The delimiter character(s) to place between entries in a top-level
array, map, or record field when PXF maps an Avro complex data type to a text column. The
default is a comma `,` character. |
    +| MAPKEY_DELIM | The delimiter character(s) to place between the key and value of a map
entry when PXF maps an Avro complex data type to a text column. The default is a colon `:`
character. |
    +| RECORDKEY_DELIM | The delimiter character(s) to place between the field name and value
of a record entry when PXF maps an Avro complex data type to a text column. The default is
a colon `:` character. |
    +
     
    -For complex types, the PXF Avro profile inserts default delimiters between collection
items and values. You can use non-default delimiter characters by including the `COLLECTION_DELIM`,
`MAPKEY_DELIM`, and/or `RECORDKEY_DELIM` optional parameters on the Avro profile. See [Additional
PXF Options](#additionaloptions__table_skq_kpz_4p) for a description of the parameters.
    +### <a id="topic_tr3_dpg_ts__section_m2p_ztg_ts"></a>Avro Schemas
     
    -### <a id="topic_tr3_dpg_ts"></a>Example
    +Avro schemas are defined using JSON, and composed of the same primitive and complex types
identified in the data mapping section above. Avro schema files typically have a `.avsc` suffix.
    --- End diff --
    
    The earlier section said an Avro schema is stored along with its data, rather than in
a separate file...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

Mime
View raw message