hawq-dev mailing list archives

From dyozie <...@git.apache.org>
Subject [GitHub] incubator-hawq-docs pull request #33: HAWQ-1107 - enhance PXF HDFS plugin do...
Date Tue, 25 Oct 2016 21:20:01 GMT
Github user dyozie commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/33#discussion_r85003579
  
    --- Diff: pxf/HDFSFileDataPXF.html.md.erb ---
    @@ -415,93 +312,101 @@ The following example uses the Avro schema shown in [Sample Avro Schema](#topic_
                 {"name":"street", "type":"string"},
                 {"name":"city", "type":"string"}]
         }
    -  }, {
    -   "name": "relationship",
    -    "type": {
    -        "type": "enum",
    -        "name": "relationshipEnum",
    -        "symbols": ["MARRIED","LOVE","FRIEND","COLLEAGUE","STRANGER","ENEMY"]
    -    }
    -  }, {
    -    "name" : "md5",
    -    "type": {
    -        "type" : "fixed",
    -        "name" : "md5Fixed",
    -        "size" : 4
    -    }
       } ],
       "doc:" : "A basic schema for storing messages"
     }
     ```
     
    -#### <a id="topic_tr3_dpg_ts__section_spk_15g_ts"></a>Sample Avro Data (JSON)
    +### <a id="topic_tr3_dpg_ts__section_spk_15g_ts"></a>Sample Avro Data (JSON)
    +
    +Create a text file named `pxf_hdfs_avro.txt`:
    +
    +``` shell
    +$ vi /tmp/pxf_hdfs_avro.txt
    +```
    +
    +Enter the following data into `pxf_hdfs_avro.txt`:
     
     ``` pre
    -{"id":1, "username":"john","followers":["kate", "santosh"], "rank":null, "relationship":
"FRIEND", "fmap": {"kate":10,"santosh":4},
    -"address":{"street":"renaissance drive", "number":1,"city":"san jose"}, "md5":\u3F00\u007A\u0073\u0074}
    +{"id":1, "username":"john","followers":["kate", "santosh"], "relationship": "FRIEND",
"fmap": {"kate":10,"santosh":4}, "address":{"number":1, "street":"renaissance drive", "city":"san
jose"}}
    +    
    +{"id":2, "username":"jim","followers":["john", "pam"], "relationship": "COLLEAGUE", "fmap":
{"john":3,"pam":3}, "address":{"number":9, "street":"deer creek", "city":"palo alto"}}
    +```
    +
    +The sample data uses a comma `,` to separate top-level records and a colon `:` to separate map key/values and record field name/values.
     
    -{"id":2, "username":"jim","followers":["john", "pam"], "rank":3, "relationship": "COLLEAGUE",
"fmap": {"john":3,"pam":3}, 
    -"address":{"street":"deer creek", "number":9,"city":"palo alto"}, "md5":\u0010\u0021\u0003\u0004}
    +Convert the text file to Avro format. There are various ways to perform the conversion, both programmatically and from the command line. This example uses the [Java Avro tools](http://avro.apache.org/releases.html) and assumes the jar file resides in the current directory:
    +
    +``` shell
    +$ java -jar ./avro-tools-1.8.1.jar fromjson --schema-file /tmp/avro_schema.avsc /tmp/pxf_hdfs_avro.txt > /tmp/pxf_hdfs_avro.avro
     ```
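    +
    +Optionally, verify the conversion by dumping the binary file back to JSON with the same Avro tools jar (a quick sanity check; not required for this example):
    +
    +``` shell
    +$ java -jar ./avro-tools-1.8.1.jar tojson /tmp/pxf_hdfs_avro.avro
    +```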
     
    -To map this Avro file to an external table, the top-level primitive fields ("id" of type long and "username" of type string) are mapped to their equivalent HAWQ types (bigint and text). The remaining complex fields are mapped to text columns:
    +The generated Avro binary data file is written to `/tmp/pxf_hdfs_avro.avro`. Copy this file to HDFS:
     
    -``` sql
    -gpadmin=# CREATE EXTERNAL TABLE avro_complex 
    -  (id bigint, 
    -  username text, 
    -  followers text, 
    -  rank int, 
    -  fmap text, 
    -  address text, 
    -  relationship text,
    -  md5 bytea) 
    -LOCATION ('pxf://namehost:51200/tmp/avro_complex?PROFILE=Avro')
    -FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
    +``` shell
    +$ sudo -u hdfs hdfs dfs -put /tmp/pxf_hdfs_avro.avro /data/pxf_examples/
     ```
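    +
    +If you wish, confirm that the file is now in HDFS; this check simply lists the target path and assumes the same `hdfs` client configuration as the copy command above:
    +
    +``` shell
    +$ sudo -u hdfs hdfs dfs -ls /data/pxf_examples/pxf_hdfs_avro.avro
    +```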
    +### <a id="topic_avro_querydata"></a>Querying Avro Data
    +
    +Create a queryable external table from this Avro file:
     
    -The above command uses default delimiters for separating components of the complex types. This command is equivalent to the one above, but it explicitly sets the delimiters using the Avro profile parameters:
    +-  Map the top-level primitive fields, `id` (type long) and `username` (type string), to their equivalent HAWQ types (bigint and text).
    +-  Map the remaining complex fields to type text.
    +-  Explicitly set the record, map, and collection delimiters using the Avro profile custom options:
     
     ``` sql
    -gpadmin=# CREATE EXTERNAL TABLE avro_complex 
    -  (id bigint, 
    -  username text, 
    -  followers text, 
    -  rank int, 
    -  fmap text, 
    -  address text, 
    -  relationship text,
    -  md5 bytea) 
    -LOCATION ('pxf://localhost:51200/tmp/avro_complex?PROFILE=Avro&COLLECTION_DELIM=,&MAPKEY_DELIM=:&RECORDKEY_DELIM=:')
    -FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
    +gpadmin=# CREATE EXTERNAL TABLE pxf_hdfs_avro(id bigint, username text, followers text, fmap text, relationship text, address text)
    +            LOCATION ('pxf://namenode:51200/data/pxf_examples/pxf_hdfs_avro.avro?PROFILE=Avro&COLLECTION_DELIM=,&MAPKEY_DELIM=:&RECORDKEY_DELIM=:')
    +          FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
     ```
     
     A simple query of the external table shows the components of the complex type data separated with delimiters:
     
     ``` sql
    -gpadmin=# select * from avro_complex;
    +gpadmin=# SELECT * FROM pxf_hdfs_avro;
     ```
     
     ``` pre
    -id | username |  followers  |    rank     |  fmap   |    address  |  relationship  |  md5
    -1| john | [kate,santosh] |   | {kate:10,santosh:4} | {number:1,street:renaissance drive,city:san jose} | FRIEND | ?zst
    -2| jim | [john,pam] | 3 | {john:3,pam:3} | {number:9,street:deer creek,city:palo alto} | COLLEAGUE | \020!\003\004
    + id | username |   followers    |        fmap         | relationship |                      address
    +----+----------+----------------+---------------------+--------------+---------------------------------------------------
    +  1 | john     | [kate,santosh] | {kate:10,santosh:4} | FRIEND       | {number:1,street:renaissance drive,city:san jose}
    +  2 | jim      | [john,pam]     | {pam:3,john:3}      | COLLEAGUE    | {number:9,street:deer creek,city:palo alto}
    +(2 rows)
     ```
     
    -You can process the delimited components in the text columns as necessary for your application. For example, the following command uses the `string_to_array` function to convert entries in the "followers" field to a text array column in a new view. The view is then queried to filter rows based on whether a particular follower appears in the array:
    +Process the delimited components in the text columns as necessary for your application. For example, the following command uses the HAWQ internal `string_to_array` function to convert entries in the `followers` field to a text array column in a new view.
     
     ``` sql
    -gpadmin=# create view followers_view as 
    -  select username, address, string_to_array(substring(followers from 2 for (char_length(followers) - 2)), ',')::text[]
    -    as followers 
    -  from avro_complex;
    -
    -gpadmin=# select username, address 
    -from followers_view 
    -where john = ANY(followers);
    +gpadmin=# CREATE VIEW followers_view AS 
    +  SELECT username, address, string_to_array(substring(followers FROM 2 FOR (char_length(followers) - 2)), ',')::text[]
    +    AS followers 
    +  FROM pxf_hdfs_avro;
    +```
    +
    +Query the view to filter rows based on whether a particular follower appears in the array:
    +
    +``` sql
    +gpadmin=# SELECT username, address FROM followers_view WHERE followers @> '{john}';
     ```
     
     ``` pre
    -username | address
    -jim | {number:9,street:deer creek,city:palo alto}
    + username |                   address                   
    +----------+---------------------------------------------
    + jim      | {number:9,street:deer creek,city:palo alto}
    +```
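    +
    +Equivalently, you can filter with the `= ANY` array comparison rather than the `@>` containment operator; this is a minimal alternative assuming the same `followers_view` definition:
    +
    +``` sql
    +gpadmin=# SELECT username, address FROM followers_view WHERE 'john' = ANY(followers);
    +```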
    +
    +## <a id="accessdataonahavhdfscluster"></a>Accessing HDFS Data in a High
Availability HDFS Cluster
    +
    +To access external HDFS data in a High Availability HDFS cluster, change the `CREATE EXTERNAL TABLE` `LOCATION` clause to use \<HA-nameservice\> rather than \<host\>[:\<port\>].
    +
    +``` sql
    +gpadmin=# CREATE EXTERNAL TABLE <table_name> ( <column_name> <data_type> [, ...] | LIKE <other_table> )
    +            LOCATION ('pxf://<HA-nameservice>/<path-to-hdfs-file>?PROFILE=HdfsTextSimple|HdfsTextMulti|Avro|SequenceWritable[&<custom-option>=<value>[...]]')
    +         FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>);
     ```
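    +
    +For example, the following definition re-creates the Avro external table from the previous section against a hypothetical HA nameservice named `mycluster` (an assumed name for illustration; substitute the nameservice configured for your cluster):
    +
    +``` sql
    +gpadmin=# CREATE EXTERNAL TABLE pxf_hdfs_avro_ha(id bigint, username text, followers text, fmap text, relationship text, address text)
    +            LOCATION ('pxf://mycluster/data/pxf_examples/pxf_hdfs_avro.avro?PROFILE=Avro')
    +          FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
    +```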
    +
    +Conversely, if a highly available HDFS cluster is reverted to a single NameNode configuration, update any table definition that specified \<HA-nameservice\> to use the \<host\>[:\<port\>] syntax instead.
    +
    +
    +## <a id="hdfs_advanced"></a>Advanced
    --- End diff --
    
    This should probably be mentioned earlier, when we list the built-in profiles.


