From: billgraham@apache.org
To: commits@pig.apache.org
Reply-To: dev@pig.apache.org
Subject: svn commit: r1421117 - in /pig/trunk: CHANGES.txt src/docs/src/documentation/content/xdocs/func.xml src/org/apache/pig/backend/hadoop/hbase/HBaseStorage.java
Date: Thu, 13 Dec 2012 08:13:30 -0000
Message-Id: <20121213081330.E77C62388962@eris.apache.org>

Author: billgraham
Date: Thu Dec 13 08:13:29 2012
New Revision: 1421117

URL: http://svn.apache.org/viewvc?rev=1421117&view=rev
Log:
PIG-2341: Need better documentation on Pig/HBase integration (jthakrar and
billgraham via billgraham)

Modified:
    pig/trunk/CHANGES.txt
    pig/trunk/src/docs/src/documentation/content/xdocs/func.xml
    pig/trunk/src/org/apache/pig/backend/hadoop/hbase/HBaseStorage.java

Modified: pig/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/pig/trunk/CHANGES.txt?rev=1421117&r1=1421116&r2=1421117&view=diff
==============================================================================
--- pig/trunk/CHANGES.txt (original)
+++ pig/trunk/CHANGES.txt Thu Dec 13 08:13:29 2012
@@ -24,6 +24,8 @@ INCOMPATIBLE CHANGES
 
 IMPROVEMENTS
 
+PIG-2341: Need better documentation on Pig/HBase integration (jthakrar and billgraham via billgraham)
+
 PIG-3075: Allow AvroStorage STORE Operations To Use Schema Specified By URI (nwhite via cheolsoo)
 
 PIG-3062: Change HBaseStorage to permit overriding pushProjection (billgraham)

Modified: pig/trunk/src/docs/src/documentation/content/xdocs/func.xml
URL: http://svn.apache.org/viewvc/pig/trunk/src/docs/src/documentation/content/xdocs/func.xml?rev=1421117&r1=1421116&r2=1421117&view=diff
==============================================================================
--- pig/trunk/src/docs/src/documentation/content/xdocs/func.xml (original)
+++ pig/trunk/src/docs/src/documentation/content/xdocs/func.xml Thu Dec 13 08:13:29 2012
@@ -1568,8 +1568,137 @@ a = load '1.txt' as (a0:{t:(m:map[int],d
 A = LOAD 'data' USING TextLoader();
+HBaseStorage
+
+Loads and stores data from an HBase table.
+
+Syntax
+
+    HBaseStorage('columns', ['options'])
+
+Terms
+
+columns
+
+    A list of qualified HBase columns to read data from or store data to.
+    The column family name and column qualifier are separated by a colon (:).
+    Only the columns used in the Pig script need to be specified. Columns are
+    specified in one of three different ways, as described below.
+
+    * Explicitly specify a column family and column qualifier (e.g.,
+      user_info:id). This will produce a scalar in the resultant tuple.
+    * Specify a column family and a portion of the column qualifier name as a
+      prefix followed by an asterisk (e.g., user_info:address_*). This
+      approach is used to read one or more columns from the same column
+      family with a matching descriptor prefix. It will produce a Pig map in
+      the resultant tuple, with column descriptor names as keys and cell
+      values as values. Note that combining this style of prefix with a long
+      list of fully qualified column descriptor names could cause performance
+      degradation on the HBase scan.
+    * Specify all the columns of a column family using the column family name
+      followed by an asterisk (e.g., user_info:*). This will produce a Pig
+      map in the resultant tuple with column descriptors as keys.
+
+'options'
+
+    A string that contains space-separated options
+    ('-optionA=valueA -optionB=valueB -optionC=valueC').
+
+    Currently supported options are:
+
+    * -loadKey=(true|false)  Load the row key as the first value in every
+      tuple returned from HBase (default=false).
+    * -gt=minKeyVal  Return rows with a rowKey greater than minKeyVal.
+    * -lt=maxKeyVal  Return rows with a rowKey less than maxKeyVal.
+    * -gte=minKeyVal  Return rows with a rowKey greater than or equal to
+      minKeyVal.
+    * -lte=maxKeyVal  Return rows with a rowKey less than or equal to
+      maxKeyVal.
+    * -limit=numRowsPerRegion  Maximum number of rows to retrieve per region.
+    * -caching=numRows  Number of rows to cache (faster scans, more memory).
+    * -delim=delimiter  Column delimiter in the columns list (default is
+      whitespace).
+    * -ignoreWhitespace=(true|false)  When delim is set to something other
+      than whitespace, ignore spaces when parsing the columns list
+      (default=true).
+    * -caster=(HBaseBinaryConverter|Utf8StorageConverter)  Class name of the
+      Caster to use to convert values (default=Utf8StorageConverter). The
+      default caster can be overridden with the pig.hbase.caster config
+      parameter. Casters must implement LoadStoreCaster.
+    * -noWAL=(true|false)  During storage, disables the write-ahead log for
+      faster loading into HBase (default=false). To be used with extreme
+      caution, since this could result in data loss
+      (see http://hbase.apache.org/book.html#perf.hbase.client.putwal).
+    * -minTimestamp=timestamp  Return cell values that have a creation
+      timestamp greater than or equal to this value.
+    * -maxTimestamp=timestamp  Return cell values that have a creation
+      timestamp less than this value.
+    * -timestamp=timestamp  Return cell values that have a creation timestamp
+      equal to this value.
+
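Several of the options above are often combined in a single load. The sketch below is illustrative only; the table name, row-key bounds, and column names are hypothetical:

```pig
-- Scan only rows whose key falls in [user_100, user_500), cache 500 rows
-- per scanner RPC, and emit the row key as the first field of each tuple.
recent = LOAD 'hbase://SomeTableName'
         USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
             'info:first_name info:last_name',
             '-loadKey=true -gte=user_100 -lt=user_500 -caching=500')
         AS (id:bytearray, first_name:chararray, last_name:chararray);
```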
+Usage
+
+HBaseStorage stores and loads data from HBase. The function takes two
+arguments. The first argument is a space-separated list of columns. The
+second, optional argument is a space-separated list of options. Column syntax
+and available options are listed above.
+
+Load Example
+
+In this example HBaseStorage is used with the LOAD function with an explicit
+schema:
+
+    raw = LOAD 'hbase://SomeTableName'
+          USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
+              'info:first_name info:last_name tags:work_* info:*',
+              '-loadKey=true -limit=5')
+          AS (id:bytearray, first_name:chararray, last_name:chararray,
+              tags_map:map[], info_map:map[]);
+
+The datatypes of the columns are declared with the "AS" clause. The
+first_name and last_name columns are specified as fully qualified column
+names with a chararray datatype. The third specification, tags:work_*,
+requests the set of columns in the tags column family that begin with
+"work_". There can be zero, one, or more columns of that type in the HBase
+table. The type is specified as tags_map:map[]. This indicates that the set
+of column values returned will be accessed as a map, where the key is the
+column name and the value is the cell value of the column. The fourth column
+specification is also a map of column descriptors to cell values.
+
+When the type of the column is specified as a map in the "AS" clause, the
+map keys are the column descriptor names and their datatype is chararray.
+The datatype of the column values can be declared explicitly, as shown in
+the examples below:
+
+    * tags_map[chararray] - In this case, the column values are all declared
+      to be of type chararray.
+    * tags_map[int] - In this case, the column values are all declared to be
+      of type int.
+
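As an illustration of the typed-map declaration above (the table and column names are hypothetical), a prefix match can be loaded with its cell values cast to int:

```pig
-- Load all tags:score_* columns as a single map whose values are ints;
-- the row key is included because of -loadKey=true.
scores = LOAD 'hbase://SomeTableName'
         USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
             'tags:score_*', '-loadKey=true')
         AS (id:bytearray, score_map:map[int]);
```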
+Store Example
+
+In this example HBaseStorage is used to store a relation into HBase:
+
+    A = LOAD 'hdfs_users' AS (id:bytearray, first_name:chararray,
+        last_name:chararray);
+    STORE A INTO 'hbase://users_table'
+        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
+            'info:first_name info:last_name');
+
+In the example above, relation A is loaded from HDFS and stored in HBase.
+Note that the schema of relation A is a tuple of size 3, but only two column
+descriptor names are passed to the HBaseStorage constructor. This is because
+the first entry in the tuple is used as the HBase rowKey.
+
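Since STORE always treats the first field of each tuple as the row key, a relation whose key is not the first field must be reordered before storing. A minimal sketch, assuming a hypothetical field order:

```pig
-- Hypothetical input whose key column (id) is last; reorder with FOREACH
-- so the intended row key comes first before storing into HBase.
B = LOAD 'hdfs_users' AS (first_name:chararray, last_name:chararray, id:bytearray);
C = FOREACH B GENERATE id, first_name, last_name;
STORE C INTO 'hbase://users_table'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:first_name info:last_name');
```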
Modified: pig/trunk/src/org/apache/pig/backend/hadoop/hbase/HBaseStorage.java
URL: http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/backend/hadoop/hbase/HBaseStorage.java?rev=1421117&r1=1421116&r2=1421117&view=diff
==============================================================================
--- pig/trunk/src/org/apache/pig/backend/hadoop/hbase/HBaseStorage.java (original)
+++ pig/trunk/src/org/apache/pig/backend/hadoop/hbase/HBaseStorage.java Thu Dec 13 08:13:29 2012
@@ -124,8 +124,7 @@ import com.google.common.collect.Lists;
  * {@code
  * copy = STORE raw INTO 'hbase://SampleTableCopy'
  *       USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
- *       'info:first_name info:last_name friends:* info:*')
- *       AS (info:first_name info:last_name buddies:* info:*);
+ *       'info:first_name info:last_name friends:* info:*');
  * }
  * Note that STORE will expect the first value in the tuple to be the row key.
  * Scalar values need to map to an explicit column descriptor and maps need to