hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/HBaseIntegration" by JohnSichi
Date Thu, 04 Mar 2010 22:48:29 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/HBaseIntegration" page has been changed by JohnSichi.
http://wiki.apache.org/hadoop/Hive/HBaseIntegration?action=diff&rev1=11&rev2=12

--------------------------------------------------

  
  Notice that even though a column name "val" is specified in the mapping, only the column
family name "cf1" appears in the DESCRIBE output in the HBase shell.  This is because in HBase,
only column families (not columns) are known in the table-level metadata; column names within
a column family are only present at the per-row level.
  
- Here's how to move data from Hive into the HBase table:
+ Here's how to move data from Hive into the HBase table (see [[Hive/GettingStarted]] for
how to create the example table {{{pokes}}} in Hive first):
  
  {{{
  INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM pokes WHERE foo=98;
@@ -262, +262 @@

  An improvement would be to catch this at CREATE TABLE time and reject
  it as invalid.
  
+ = Key Uniqueness =
+ 
+ One subtle difference between HBase tables and Hive tables is that HBase tables have a unique
key, whereas Hive tables do not.  When multiple rows with the same key are inserted into HBase,
only one of them is stored (the choice is arbitrary, so do not rely on HBase to pick the right
one).  This is in contrast to Hive, which is happy to store multiple rows with the same key
and different values.
+ 
+ For example, the pokes table contains rows with duplicate keys.  If it is copied into another
Hive table, the duplicates are preserved:
+ 
+ {{{
+ CREATE TABLE pokes2(foo INT, bar STRING);
+ INSERT OVERWRITE TABLE pokes2 SELECT * FROM pokes;
+ -- this will return 3
+ SELECT COUNT(1) FROM pokes WHERE foo=498;
+ -- this will also return 3
+ SELECT COUNT(1) FROM pokes2 WHERE foo=498;
+ }}}
+ 
+ But in HBase, the duplicates are silently eliminated:
+ 
+ {{{
+ CREATE TABLE pokes3(foo INT, bar STRING)
+ STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
+ WITH SERDEPROPERTIES (
+ "hbase.columns.mapping" = "cf:bar"
+ );
+ INSERT OVERWRITE TABLE pokes3 SELECT * FROM pokes;
+ -- this will return 1 instead of 3
+ SELECT COUNT(1) FROM pokes3 WHERE foo=498;
+ }}}
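
Since the row HBase keeps for a duplicated key is arbitrary, it may help to inspect the duplicates first and then choose a survivor explicitly on the Hive side. The following is only a sketch against the example {{{pokes}}} table. The alias {{{dup_counts}}}, the column name {{{cnt}}}, and the use of {{{MAX(bar)}}} as the tie-breaking rule are illustrative assumptions, not part of the original page.

```sql
-- List keys that occur more than once in the source table.
-- A subquery is used rather than HAVING, on the assumption that
-- HAVING may not be available in the Hive version in use.
SELECT foo, cnt
FROM (
  SELECT foo, COUNT(1) AS cnt
  FROM pokes
  GROUP BY foo
) dup_counts
WHERE cnt > 1;

-- Collapse duplicates deterministically before loading into HBase,
-- so that exactly one row per key is written.
-- MAX(bar) is an arbitrary example rule; substitute whatever
-- rule identifies the "right" row for your data.
INSERT OVERWRITE TABLE pokes3
SELECT foo, MAX(bar)
FROM pokes
GROUP BY foo;
```

With an aggregation like this, the choice of surviving value is made by the query rather than left to the order in which rows happen to reach HBase.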
+ 
  = Potential Followups =
  
  There are a number of areas where Hive/HBase integration could definitely use more love:
