hadoop-hive-dev mailing list archives

From "Schubert Zhang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-806) Hive with HBase as data store to support MapReduce and direct query
Date Wed, 09 Sep 2009 03:15:57 GMT

    [ https://issues.apache.org/jira/browse/HIVE-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752877#action_12752877

Schubert Zhang commented on HIVE-806:

@Zheng, we are designing and coding now, and we had a talk with Samuel a few days ago. Because
this work is part of one of our ongoing projects, I am sorry that updates will not be very quick.
I describe some of our considerations below, and will update when we complete our implementation
and verification.

1. A new HBaseInputFormat.

The current TableInputFormat always scans the whole HBase HTable, which is usually unnecessary,
especially when we know one or more row ranges.
A new HBaseInputFormat will be implemented to provide more parameters to control the behavior
of the HTable scan, e.g.:
(1) row ranges (one or more startRow and endRow pairs)
(2) column list (sometimes we need not read all columns; HBase is a column-oriented store)
(3) filter tree (predicate pushdown: filter rows/columns at the region server)
(4) maybe some computing on the region server (optional)
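
To illustrate item (1), here is a minimal Python sketch of how a row-range-aware input format might decide which regions to turn into splits instead of scanning every region. The region boundaries, the function names `overlaps` and `regions_for_ranges`, and the use of an empty string as an open bound are assumptions for illustration only; this is not the HBaseInputFormat code itself.

```python
def overlaps(region, row_range):
    """Return True if two half-open key intervals [start, end) intersect.

    Assumption: an empty string as the end key means "unbounded", and an
    empty string as the start key sorts before every other key.
    """
    r_start, r_end = region
    q_start, q_end = row_range
    starts_before_region_ends = r_end == "" or q_start < r_end
    ends_after_region_starts = q_end == "" or q_end > r_start
    return starts_before_region_ends and ends_after_region_starts


def regions_for_ranges(regions, row_ranges):
    """Keep only the regions touched by at least one requested row range."""
    return [r for r in regions if any(overlaps(r, q) for q in row_ranges)]


# Hypothetical table with four regions split at "g", "n" and "t".
regions = [("", "g"), ("g", "n"), ("n", "t"), ("t", "")]
selected = regions_for_ranges(regions, [("a", "c"), ("p", "q")])
print(selected)
```

With two narrow row ranges, only two of the four hypothetical regions need a split; a range of `("", "")` would select them all, which corresponds to the current whole-table scan.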

2. SerDe

We use a more flexible SerDe for engineering practice.
(1) We will support the MAP data type to map to HBase's (sparse) column family:column qualifiers.
This is a rigid mapping between the Hive table schema and the HTable schema, and sometimes it is not
so effective for structured data.
(2) Use a nested SerDe to implement the codec of the rowkey and columns. Usually, the rowkey
in an HTable is a combination of more than one Hive column. We also support storing a list of
columns in one HTable column family without using HBase's column qualifier feature; instead, the
columns in the column family are self-coded (for example, with a comma delimiter).
      RowSerDe { RowKeySerDe,  ColumnSerDe}

This is an example of the above SerDe design.

CREATE TABLE t1(rowkey1 int, rowkey2 string, value1 string, value2 int, value3 long, valuer
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.hbase.HBaseRowSerDe'

"rowkey.serde.class"="org.apache.hadoop.hive.serde2.hbase.SimpleRowkeySerDe" //this will be a built-in SerDe for the rowkey
"rowkey.columns"="rowkey2,rowkey1"  //the rowkey in HTable is a combination of two hive-columns
"rowkey.column.lengths"="12,2"             //the lengths of the two hive-columns in the rowkey
"rowkey.column.delimiter"=","                 //the delimiter in the rowkey (optional; omit it if no delimiter is defined)

"column.families"="cf1:(value1,value2); cf2:(value3,value4)"  //there are two column families in HTable; cf1 and cf2 have two columns each
cf2:org.apache.hadoop.hive.serde2.hbase.ColumnSerDe1" //cf1 and cf2 can use different SerDe


(Note: we have completed and verified the above code.)
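
As a rough illustration of the nested design, the following Python sketch parses a "column.families" value of the form shown above and encodes one row into a rowkey plus one self-coded value per column family. The helper names, the sample row, and the choice to right-pad each rowkey part with spaces up to its configured length are assumptions; the real SerDe classes are Java and may encode differently.

```python
import re


def parse_column_families(spec):
    """Parse 'cf1:(a,b); cf2:(c,d)' into {'cf1': ['a', 'b'], 'cf2': ['c', 'd']}."""
    families = {}
    for part in spec.split(";"):
        match = re.fullmatch(r"\s*(\w+):\((.*)\)\s*", part)
        if match is None:
            raise ValueError(f"unrecognized family spec: {part!r}")
        families[match.group(1)] = [c.strip() for c in match.group(2).split(",")]
    return families


def serialize_rowkey(row, columns, lengths, delimiter=","):
    # Assumption: each key part is right-padded with spaces to its fixed length.
    parts = [str(row[c]).ljust(n) for c, n in zip(columns, lengths)]
    return delimiter.join(parts)


def serialize_row(row, rowkey_columns, rowkey_lengths, families):
    """Play the role of RowSerDe: delegate to a rowkey codec and a column codec."""
    rowkey = serialize_rowkey(row, rowkey_columns, rowkey_lengths)
    # Column codec: one comma-delimited value per family, no column qualifiers.
    cells = {cf: ",".join(str(row[c]) for c in cols) for cf, cols in families.items()}
    return rowkey, cells


families = parse_column_families("cf1:(value1,value2); cf2:(value3,value4)")
# Hypothetical sample row; the values are made up for the example.
row = {"rowkey1": 7, "rowkey2": "user42",
       "value1": "a", "value2": 1, "value3": 2, "value4": "b"}
rowkey, cells = serialize_row(row, ["rowkey2", "rowkey1"], [12, 2], families)
print(repr(rowkey), cells)
```

Keeping the rowkey codec and the column codec as separate functions mirrors the RowSerDe { RowKeySerDe, ColumnSerDe } split, so each column family could be given a different codec.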

We shall also support the rigid mapping (MAP) like HIVE-705, e.g.

CREATE TABLE hbase_table_1(rowkey1 int, rowkey2 string, value1 string, value2 int, abcd MAP<string, string>)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.hbase.HBaseRowSerDe'


"column.families"="cf1:(value1,value2); cf2:=abcd"
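
For the "cf2:=abcd" style of mapping, a small Python sketch of the idea follows: each key of the Hive MAP column becomes a column qualifier in the family, and reading collects the family's cells back into a map. The function names and the (family, qualifier, value) tuple layout are assumptions for illustration.

```python
def map_to_cells(family, mapping):
    """Flatten a MAP column into (family, qualifier, value) cells."""
    return [(family, qualifier, value) for qualifier, value in sorted(mapping.items())]


def cells_to_map(family, cells):
    """Rebuild the MAP column from the cells that belong to one family."""
    return {q: v for f, q, v in cells if f == family}


# Hypothetical MAP value for the abcd column.
abcd = {"color": "red", "size": "L"}
cells = map_to_cells("cf2", abcd)
restored = cells_to_map("cf2", cells)
print(cells, restored)
```

Because rows may carry different keys, this mapping fits HBase's sparse column families, at the cost of giving every value in the family the same Hive type.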


3. To support direct query (scan or get) from HBase HTable

Some straightforward queries targeting an HTable need not use MapReduce; we can directly scan
or get rows from the HTable, since an HTable is a globally indexed store. We can use some features
of HBase to improve the performance:
(1) rowkey or rowkey ranges
(2) column list
(3) filter tree (predicate pushdown)
(4) .....

(Note: we have completed and verified the above code.)
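
One way to picture the choice between a direct query and MapReduce is the Python sketch below. The function `choose_access_path` and its return values are hypothetical and only show the decision: an exact rowkey allows a get, a rowkey range allows a bounded scan, and anything else falls back to a MapReduce job.

```python
def choose_access_path(rowkey_equals=None, rowkey_range=None):
    """Pick the cheapest access path that the rowkey predicate allows."""
    if rowkey_equals is not None:
        # Exact rowkey known: a single get is enough.
        return ("get", rowkey_equals)
    if rowkey_range is not None:
        # Rowkey range known: scan only [start, stop).
        start, stop = rowkey_range
        return ("scan", start, stop)
    # No usable rowkey predicate: run a MapReduce job over the table.
    return ("mapreduce",)


print(choose_access_path(rowkey_equals="user42"))
print(choose_access_path(rowkey_range=("user10", "user20")))
print(choose_access_path())
```

A column list and a filter tree would narrow what each path returns, but in this sketch they do not change which path is chosen.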

4. other...

> Hive with HBase as data store to support MapReduce and direct query
> -------------------------------------------------------------------
>                 Key: HIVE-806
>                 URL: https://issues.apache.org/jira/browse/HIVE-806
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Schubert Zhang
> Currently Hive uses only HDFS as the underlying data store; it can query and analyze files in HDFS via MapReduce.
> But in some engineering cases, our data are stored/organised/indexed in HBase or other data stores. This jira-issue will implement Hive using HBase as a data store. Besides supporting MapReduce on HBase, we will support direct query on HBase.
> This is a sibling jira-issue of HIVE-705 (Let Hive can analyse hbase's tables, https://issues.apache.org/jira/browse/HIVE-705). Because this implementation and its use cases differ somewhat from HIVE-705, this jira-issue was created to avoid confusion. It is possible to combine the two issues in the future.
> Initial developers: Kula Liao, Stephen Xie, Tao Jiang and Schubert Zhang.
