hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/HBaseIntegration" by JohnSichi
Date Thu, 04 Mar 2010 22:19:18 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/HBaseIntegration" page has been changed by JohnSichi.
http://wiki.apache.org/hadoop/Hive/HBaseIntegration?action=diff&rev1=10&rev2=11

--------------------------------------------------

  within HBase.  If not specified, then the Hive and HBase table names
  will be identical.
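+ 
+ For illustration, a table declaration along these lines is assumed for the walkthrough below (a sketch only: the column mapping {{{cf1:val}}} matches the shell output shown next, but supplying {{{hbase.table.name}}} via {{{TBLPROPERTIES}}} is an assumption of this sketch):
+ 
+ {{{
+ CREATE TABLE hbase_table_1(key int, value string)
+ STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
+ WITH SERDEPROPERTIES (
+ "hbase.columns.mapping" = "cf1:val"
+ )
+ TBLPROPERTIES ("hbase.table.name" = "xyz");
+ }}}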
  
- After executing the command above, you should be able to see the new table in the HBase shell:
+ After executing the command above, you should be able to see the new (empty) table in the HBase shell:
  
  {{{
  $ hbase shell
@@ -64, +64 @@

  hbase(main):001:0> list
  xyz
  1 row(s) in 0.0530 seconds
- hbase(main):002:0> describe 'xyz'
+ hbase(main):002:0> describe "xyz"
  DESCRIPTION                                                             ENABLED
   {NAME => 'xyz', FAMILIES => [{NAME => 'cf1', COMPRESSION => 'NONE', VE true
   RSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY =>
    'false', BLOCKCACHE => 'true'}]}
  1 row(s) in 0.0220 seconds
+ hbase(main):003:0> scan "xyz"
+ ROW                          COLUMN+CELL
+ 0 row(s) in 0.0060 seconds
  }}}
  
+ Notice that even though a column name "val" is specified in the mapping, only the column family name "cf1" appears in the DESCRIBE output in the HBase shell.  This is because in HBase, only column families (not columns) are known in the table-level metadata; column names within a column family are only present at the per-row level.
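+ 
+ For contrast, the Hive metastore does record individual column names, so both Hive columns appear on the Hive side (assuming the table was declared with columns {{{key}}} and {{{value}}} as in the sketch above; output omitted):
+ 
+ {{{
+ hive> describe hbase_table_1;
+ }}}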
+ 
+ Here's how to move data from Hive into the HBase table:
+ 
+ {{{
+ INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM pokes WHERE foo=98;
+ }}}
+ 
+ Use HBase shell to verify that the data actually got loaded:
+ 
+ {{{
+ hbase(main):009:0> scan "xyz"
+ ROW                          COLUMN+CELL
+  98                          column=cf1:val, timestamp=1267737987733, value=val_98
+ 1 row(s) in 0.0110 seconds
+ }}}
+ 
+ And then query it back via Hive:
+ 
+ {{{
+ hive> select * from hbase_table_1;
+ Total MapReduce jobs = 1
+ Launching Job 1 out of 1
+ ...
+ OK
+ 98	val_98
+ Time taken: 4.582 seconds
+ }}}
+ 
- If instead you want to give Hive access to an existing HBase table,
+ If you want to give Hive access to an existing HBase table,
  use CREATE EXTERNAL TABLE:
  
  {{{
@@ -93, +125 @@

  The column mapping support currently available is somewhat
  cumbersome and restrictive:
  
-  * the first column in the Hive table automatically becomes the key in the HBase table
+  * the first column in the Hive table always maps to the key in the HBase table
-  * for each subsequent Hive column, the table creator must specify a corresponding entry in the comma-delimited {{{hbase.columns.mapping}}} string (so for a Hive table with n columns, the string should have n-1 entries)
+  * for each subsequent Hive column, the table creator must specify a corresponding entry in the comma-delimited {{{hbase.columns.mapping}}} string (so for a Hive table with n columns, the string should have n-1 entries); whitespace should '''not''' be used between entries, since it will be interpreted as part of the column name, which is almost certainly not what you want (see the brief sketch after this list)
   * a mapping entry is of the form {{{column-family-name:[column-name]}}}
   * if no column-name is given, then the Hive column will map to all columns in the corresponding HBase column family, and the Hive MAP datatype must be used to allow access to these (possibly sparse) columns
+  * there is currently no way to access the HBase timestamp attribute, and queries always access data with the latest timestamp.
+  * since HBase does not associate datatype information with columns, the serde converts everything to string representation before storing it in HBase; there is currently no way to plug in a custom serde per column
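+ 
+ As a brief sketch of the rules above (the table and column names here are hypothetical): a Hive table with three columns needs exactly two mapping entries, written with no whitespace between them:
+ 
+ {{{
+ CREATE TABLE hbase_example(key int, c1 string, c2 string)
+ STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
+ WITH SERDEPROPERTIES (
+ "hbase.columns.mapping" = "cf1:c1,cf1:c2"
+ );
+ }}}
+ 
+ Writing the mapping as {{{"cf1:c1, cf1:c2"}}} instead would make the stray space part of the second entry, which would then not match the intended column.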
  
- TBD: details on how primitive and map values are serialized
+ The next few sections provide detailed examples of the kinds of column mappings currently possible.
+ 
+ == Multiple Columns and Families ==
+ 
+ Here's an example with a key plus three other Hive columns and two HBase
+ column families: two of the Hive columns ({{{value1}}} and {{{value2}}})
+ map to one column family ({{{a}}}, with HBase column names {{{b}}} and
+ {{{c}}}), and the third ({{{value3}}}) maps to a single column ({{{e}}})
+ in its own column family ({{{d}}}).
+ 
+ {{{
+ CREATE TABLE hbase_table_1(key int, value1 string, value2 int, value3 int) 
+ STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
+ WITH SERDEPROPERTIES (
+ "hbase.columns.mapping" = "a:b,a:c,d:e"
+ );
+ INSERT OVERWRITE TABLE hbase_table_1 SELECT foo, bar, foo+1, foo+2 
+ FROM pokes WHERE foo=98 OR foo=100;
+ }}}
+ 
+ Here's how this looks in HBase:
+ 
+ {{{
+ hbase(main):014:0> describe "hbase_table_1"
+ DESCRIPTION                                                             ENABLED
+  {NAME => 'hbase_table_1', FAMILIES => [{NAME => 'a', COMPRESSION => 'N true
+  ONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_M
+  EMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'd', COMPRESSION =>
+  'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN
+  _MEMORY => 'false', BLOCKCACHE => 'true'}]}
+ 1 row(s) in 0.0170 seconds
+ hbase(main):015:0> scan "hbase_table_1"
+ ROW                          COLUMN+CELL
+  100                         column=a:b, timestamp=1267740457648, value=val_100
+  100                         column=a:c, timestamp=1267740457648, value=101
+  100                         column=d:e, timestamp=1267740457648, value=102
+  98                          column=a:b, timestamp=1267740457648, value=val_98
+  98                          column=a:c, timestamp=1267740457648, value=99
+  98                          column=d:e, timestamp=1267740457648, value=100
+ 2 row(s) in 0.0240 seconds
+ }}}
+ 
+ And when queried back into Hive:
+ 
+ {{{
+ hive> select * from hbase_table_1;
+ Total MapReduce jobs = 1
+ Launching Job 1 out of 1
+ ...
+ OK
+ 100	val_100	101	102
+ 98	val_98	99	100
+ Time taken: 4.054 seconds
+ }}}
+ 
+ == Hive MAP to HBase Column Family ==
+ 
+ Here's how a Hive MAP datatype can be used to access an entire column
+ family.  Each row can have a different set of columns, where the
+ column names correspond to the map keys and the column values
+ correspond to the map values.
+ 
+ {{{
+ CREATE TABLE hbase_table_1(key int, value map<string,int>) 
+ STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
+ WITH SERDEPROPERTIES (
+ "hbase.columns.mapping" = "cf:"
+ );
+ INSERT OVERWRITE TABLE hbase_table_1 SELECT foo, map(bar, foo) FROM pokes 
+ WHERE foo=98 OR foo=100;
+ }}}
+ 
+ Here's how this looks in HBase (with different column names in different rows):
+ 
+ {{{
+ hbase(main):012:0> scan "hbase_table_1"
+ ROW                          COLUMN+CELL
+  100                         column=cf:val_100, timestamp=1267739509194, value=100
+  98                          column=cf:val_98, timestamp=1267739509194, value=98
+ 2 row(s) in 0.0080 seconds
+ }}}
+ 
+ And when queried back into Hive:
+ 
+ {{{
+ hive> select * from hbase_table_1;
+ Total MapReduce jobs = 1
+ Launching Job 1 out of 1
+ ...
+ OK
+ 100	{"val_100":100}
+ 98	{"val_98":98}
+ Time taken: 3.808 seconds
+ }}}
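+ 
+ Individual entries in the map (and hence individual HBase columns within the family) can then be addressed with ordinary Hive map indexing, for example (a sketch; output omitted):
+ 
+ {{{
+ hive> select key, value['val_98'] from hbase_table_1;
+ }}}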
+ 
+ Note that the key of the MAP must have datatype string, since it is
+ used for naming the HBase column, so the following table definition will fail:
+ 
+ {{{
+ CREATE TABLE hbase_table_1(key int, value map<int,int>) 
+ STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
+ WITH SERDEPROPERTIES (
+ "hbase.columns.mapping" = "cf:"
+ );
+ }}}
+ 
+ == Illegal:  Hive Primitive to HBase Column Family ==
+ 
+ Table definitions such as the following should be avoided because a
+ Hive column mapped to a column family must have MAP type:
+ 
+ {{{
+ CREATE TABLE hbase_table_1(key int, value string) 
+ STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
+ WITH SERDEPROPERTIES (
+ "hbase.columns.mapping" = "cf:"
+ );
+ }}}
+ 
+ Currently, the CREATE TABLE will succeed, but attempts to insert data
+ will fail with this internal error:
+ 
+ {{{
+ java.lang.RuntimeException: org.apache.hadoop.hive.serde2.lazy.objectinspector.primitive.LazyStringObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.MapObjectInspector
+ }}}
+ 
+ An improvement would be to catch this at CREATE TABLE time and reject
+ it as invalid.
  
  = Potential Followups =
+ 
+ There are a number of areas where Hive/HBase integration could definitely use more love:
  
   * more flexible column mapping (HIVE-806)
   * default column mapping in cases where no mapping spec is given
   * filter/projection pushdown
-  * implement virtual partitions corresponding to HBase timestamps
+  * expose timestamp attribute, possibly also with support for treating it as a partition key
   * allow per-table hbase.master configuration
   * run profiler and minimize any per-row overhead in column mapping
   * user defined routines for lookups and data loads via HBase client API (HIVE-758 and HIVE-791)
-  * support a fast-path mode in which no map/reduce is used for simple queries (go through HBase client API instead?)
+  * support a fast-path mode in which no map/reduce is used for simple queries and loads (go through HBase client API instead)
   * logging is very noisy, with a lot of spurious exceptions; investigate these and either fix their cause or squelch them
  
  = Build =
@@ -133, +297 @@

  Positive QL tests are under {{{hbase-handler/src/test/queries}}}.
  These use an HBase+Zookeeper mini-cluster for hosting the fixture
  tables in-process, so no free-standing HBase installation is needed in order to run them.
- Run them like this:
+ The QL tests can be executed via ant like this:
  
  {{{
  ant test -Dtestcase=TestHBaseCliDriver -Dqfile=hbase_queries.q
