pig-commits mailing list archives

From billgra...@apache.org
Subject svn commit: r1421121 - in /pig/branches/branch-0.11: CHANGES.txt src/docs/src/documentation/content/xdocs/func.xml src/org/apache/pig/backend/hadoop/hbase/HBaseStorage.java
Date Thu, 13 Dec 2012 08:16:04 GMT
Author: billgraham
Date: Thu Dec 13 08:16:03 2012
New Revision: 1421121

URL: http://svn.apache.org/viewvc?rev=1421121&view=rev
PIG-2341: Need better documentation on Pig/HBase integration (jthakrar and billgraham via


Modified: pig/branches/branch-0.11/CHANGES.txt
URL: http://svn.apache.org/viewvc/pig/branches/branch-0.11/CHANGES.txt?rev=1421121&r1=1421120&r2=1421121&view=diff
--- pig/branches/branch-0.11/CHANGES.txt (original)
+++ pig/branches/branch-0.11/CHANGES.txt Thu Dec 13 08:16:03 2012
@@ -30,6 +30,8 @@ PIG-1891 Enable StoreFunc to make intell
+PIG-2341: Need better documentation on Pig/HBase integration (jthakrar and billgraham via
 PIG-3044: Trigger POPartialAgg compaction under GC pressure (dvryaboy)
 PIG-2907: Publish pig jars for Hadoop2/23 to maven (rohini)

Modified: pig/branches/branch-0.11/src/docs/src/documentation/content/xdocs/func.xml
URL: http://svn.apache.org/viewvc/pig/branches/branch-0.11/src/docs/src/documentation/content/xdocs/func.xml?rev=1421121&r1=1421120&r2=1421121&view=diff
--- pig/branches/branch-0.11/src/docs/src/documentation/content/xdocs/func.xml (original)
+++ pig/branches/branch-0.11/src/docs/src/documentation/content/xdocs/func.xml Thu Dec 13 08:16:03 2012
@@ -1509,8 +1509,137 @@ a = load '1.txt' as (a0:{t:(m:map[int],d
 A = LOAD 'data' USING TextLoader();
-   </section></section></section>
+   </section></section>
+  <!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
+   <section id="HBaseStorage">
+   <title>HBaseStorage</title>
+   <p>Loads and stores data from an HBase table.</p>
+   <section>
+   <title>Syntax</title>
+   <table>
+       <tr>
+            <td>
+               <p>HBaseStorage('columns', ['options'])</p>
+            </td>
+         </tr>
+   </table>
+   </section>
+   <section>
+   <title>Terms</title>
+   <table>
+       <tr>
+            <td>
+               <p>columns</p>
+            </td>
+            <td>
+               <p>A list of qualified HBase columns to read data from or store data to.
+                  The column family name and column qualifier are separated by a colon (:).
+                  Only the columns used in the Pig script need to be specified. Columns are
+                  specified in one of three different ways as described below.</p>
+               <ul>
+               <li>Explicitly specify a column family and column qualifier (e.g., user_info:id).
+                   This will produce a scalar in the resultant tuple.</li>
+               <li>Specify a column family and a portion of a column qualifier name as a prefix
+                   followed by an asterisk (i.e., user_info:address_*). This approach is used
+                   to read one or more columns from the same column family with a matching
+                   descriptor prefix. The datatype for this field will be a map of column
+                   descriptor name to field value. Note that combining this style of prefix
+                   with a long list of fully qualified column descriptor names could cause
+                   performance degradation on the HBase scan. This will produce a Pig map in
+                   the resultant tuple with column descriptors as keys.</li>
+               <li>Specify all the columns of a column family using the column family name
+                   followed by an asterisk (i.e., user_info:*). This will produce a Pig map
+                   in the tuple with column descriptors as keys.</li>
+               </ul>
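+               <p>As an illustrative sketch (the table and column names here are
+                  hypothetical), all three styles can be combined in a single columns
+                  argument:</p>
+<source>
+-- scalar column, prefixed columns, and a whole column family in one load
+raw = LOAD 'hbase://SomeTableName'
+      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
+      'user_info:id user_info:address_* user_info:*') AS
+      (id:bytearray, address_map:map[], info_map:map[]);
+</source>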
+            </td>
+         </tr>
+       <tr>
+            <td>
+               <p>'options'</p>
+            </td>
+            <td>
+               <p>A string that contains space-separated options (&lsquo;-optionA=valueA
+                  -optionB=valueB -optionC=valueC&rsquo;)</p>
+               <p>Currently supported options are:</p>
+               <ul>
+                <li>-loadKey=(true|false) Load the row key as the first value in every tuple
+                    returned from HBase (default=false)</li>
+                <li>-gt=minKeyVal Return rows with a rowKey greater than minKeyVal</li>
+                <li>-lt=maxKeyVal Return rows with a rowKey less than maxKeyVal</li>
+                <li>-gte=minKeyVal Return rows with a rowKey greater than or equal to minKeyVal</li>
+                <li>-lte=maxKeyVal Return rows with a rowKey less than or equal to maxKeyVal</li>
+                <li>-limit=numRowsPerRegion Max number of rows to retrieve per region</li>
+                <li>-caching=numRows Number of rows to cache (faster scans, more memory)</li>
+                <li>-delim=delimiter Column delimiter in columns list (default is whitespace)</li>
+                <li>-ignoreWhitespace=(true|false) When delim is set to something other than
+                    whitespace, ignore spaces when parsing column list (default=true)</li>
+                <li>-caster=(HBaseBinaryConverter|Utf8StorageConverter) Class name of Caster
+                    to use to convert values (default=Utf8StorageConverter). The default caster
+                    can be overridden with the pig.hbase.caster config param. Casters must
+                    implement LoadStoreCaster.</li>
+                <li>-noWAL=(true|false) During storage, sets the write ahead log to false
+                    for faster loading into HBase (default=false). To be used with extreme
+                    caution, since this could result in data loss (see <a href="http://hbase.apache.org/book.html#perf.hbase.client.putwal">http://hbase.apache.org/book.html#perf.hbase.client.putwal</a>).</li>
+                <li>-minTimestamp=timestamp Return cell values that have a creation timestamp
+                    greater or equal to this value</li>
+                <li>-maxTimestamp=timestamp Return cell values that have a creation timestamp
+                    less than this value</li>
+                <li>-timestamp=timestamp Return cell values that have a creation timestamp
+                    equal to this value</li>
+               </ul>
+            </td>
+         </tr>
+   </table>
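+   <p>As a sketch of how several options combine (the table name and key values here are
+       hypothetical), the following load scans a bounded row-key range with a larger
+       scanner cache:</p>
+<source>
+rows = LOAD 'hbase://users_table'
+       USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
+       'info:first_name info:last_name',
+       '-loadKey=true -gte=user_100 -lt=user_200 -caching=500') AS
+       (id:bytearray, first_name:chararray, last_name:chararray);
+</source>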
+   </section>
+   <section>
+   <title>Usage</title>
+   <p>HBaseStorage stores and loads data from HBase. The function takes two arguments.
+       The first argument is a space-separated list of columns. The second optional argument
+       is a space-separated list of options. Column syntax and available options are listed
+       above.</p>
+   </section>
+   <section>
+   <title>Load Example</title>
+   <p>In this example HBaseStorage is used with the LOAD function with an explicit schema.</p>
+<source>
+raw = LOAD 'hbase://SomeTableName'
+      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
+      'info:first_name info:last_name tags:work_* info:*', '-loadKey=true -limit=5') AS
+      (id:bytearray, first_name:chararray, last_name:chararray, tags_map:map[], info_map:map[]);
+</source>
+   <p>The datatypes of the columns are declared with the "AS" clause. The first_name and
+       last_name columns are specified as fully qualified column names with a chararray
+       datatype. The specification of tags:work_* requests a set of columns in the tags
+       column family that begin with "work_". There can be zero, one or more columns of
+       that type in the HBase table. The type is specified as tags_map:map[]. This indicates
+       that the set of column values will be accessed as a map, where the key is the column
+       name and the value is the cell value of the column. The fourth column specification
+       is also a map of column descriptors to cell values.</p>
+   <p>When the type of the column is specified as a map in the "AS" clause, the map keys
+       are the column descriptor names and the data type is chararray. The datatype of the
+       column values can be declared explicitly as shown in the examples below:</p>
+   <ul>
+   <li>tags_map[chararray] - In this case, the column values are all declared to be of type
+       chararray</li>
+   <li>tags_map[int] - In this case, the column values are all declared to be of type int</li>
+   </ul>
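+   <p>As an illustrative sketch (the column names here are hypothetical), a typed map can
+       then be dereferenced with the # operator to pull out individual column values:</p>
+<source>
+raw = LOAD 'hbase://SomeTableName'
+      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
+      'tags:work_*', '-loadKey=true') AS (id:bytearray, tags_map:map[chararray]);
+-- look up a single column value by its descriptor name
+work_emails = FOREACH raw GENERATE id, tags_map#'work_email' AS work_email;
+</source>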
+   </section>
+   <section>
+   <title>Store Example</title>
+   <p>In this example HBaseStorage is used to store a relation into HBase.</p>
+<source>
+A = LOAD 'hdfs_users' AS (id:bytearray, first_name:chararray, last_name:chararray);
+STORE A INTO 'hbase://users_table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
+    'info:first_name info:last_name');
+</source>
+   <p>In the example above relation A is loaded from HDFS and stored in HBase. Note that
+       the schema of relation A is a tuple of size 3, but only two column descriptor names
+       are passed to the HBaseStorage constructor. This is because the first entry in the
+       tuple is used as the HBase rowKey.</p>
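+   <p>If the row key is not already the first field, a FOREACH can reorder the tuple before
+       the STORE (the field names here are hypothetical):</p>
+<source>
+B = LOAD 'hdfs_users' AS (first_name:chararray, last_name:chararray, id:bytearray);
+C = FOREACH B GENERATE id, first_name, last_name; -- row key must come first
+STORE C INTO 'hbase://users_table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
+    'info:first_name info:last_name');
+</source>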
+   </section>
+   </section>
 <!-- ======================================================== -->  
 <!-- ======================================================== -->  

Modified: pig/branches/branch-0.11/src/org/apache/pig/backend/hadoop/hbase/HBaseStorage.java
URL: http://svn.apache.org/viewvc/pig/branches/branch-0.11/src/org/apache/pig/backend/hadoop/hbase/HBaseStorage.java?rev=1421121&r1=1421120&r2=1421121&view=diff
--- pig/branches/branch-0.11/src/org/apache/pig/backend/hadoop/hbase/HBaseStorage.java (original)
+++ pig/branches/branch-0.11/src/org/apache/pig/backend/hadoop/hbase/HBaseStorage.java Thu Dec 13 08:16:03 2012
@@ -125,8 +125,7 @@ import com.google.common.collect.Lists;
  * <pre>{@code
  * copy = STORE raw INTO 'hbase://SampleTableCopy'
  *       USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
- *       'info:first_name info:last_name friends:* info:*')
- *       AS (info:first_name info:last_name buddies:* info:*);
+ *       'info:first_name info:last_name friends:* info:*');
  * }</pre>
  * Note that STORE will expect the first value in the tuple to be the row key.
  * Scalars values need to map to an explicit column descriptor and maps need to
