hadoop-hive-dev mailing list archives

From Matt Pestritto <m...@pestritto.com>
Subject Re: [jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
Date Mon, 24 Aug 2009 02:43:54 GMT
Hi All.  I see a lot of good work being done on HBase/Hive integration
especially around how to express hbase metadata in hive and how to load data
from/to hbase/hive.

Has any thought been put into how to use HBase data as lookup data in a
query, rather than loading all of the data as in a normal hive query ?

My use case is as follows:  I have a table < users > with 50m users.  I have
a 5gb daily clickstream file that only touches 150k of those users on a daily
basis.  It would be much more efficient if I didn't have to load all of the
data in HBase to a hive table and write a traditional hive query but just do
150k lookups in the map ( or reduce ) phase of the MR job.  If the hbase
lookups were done in realtime it would be much faster than sourcing the
original user table with 50m rows.
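
A minimal sketch of the lookup-style join described above, written in plain
Python rather than as a real MR job. The `lookup` callable is a hypothetical
stand-in for an HBase point read by row key, and the names `make_cached_lookup`,
`map_clicks`, `USERS` and `CLICKS` are illustrative only; nothing here is taken
from the HIVE-705 patch.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple

# Hypothetical stand-in for an HBase point read (row key -> user record).
# A real job would issue a get against the users table here instead.
UserLookup = Callable[[str], Optional[Dict[str, str]]]


def make_cached_lookup(lookup: UserLookup) -> UserLookup:
    """Wrap a lookup so each distinct user id is fetched at most once per task."""
    cache: Dict[str, Optional[Dict[str, str]]] = {}

    def cached(user_id: str) -> Optional[Dict[str, str]]:
        if user_id not in cache:
            cache[user_id] = lookup(user_id)
        return cache[user_id]

    return cached


def map_clicks(
    clicks: Iterable[Tuple[str, str]], lookup: UserLookup
) -> Iterator[Tuple[str, str, Dict[str, str]]]:
    """Map phase: enrich each click with its user record; skip unknown users."""
    for user_id, url in clicks:
        user = lookup(user_id)
        if user is not None:
            yield user_id, url, user


# Toy data standing in for the 50m-row users table and the daily clickstream.
USERS = {"u1": {"country": "US"}, "u2": {"country": "DE"}}
CLICKS = [("u1", "/home"), ("u1", "/cart"), ("u3", "/home"), ("u2", "/faq")]

calls: List[str] = []  # records which ids actually reached the backing store


def counting_lookup(user_id: str) -> Optional[Dict[str, str]]:
    calls.append(user_id)
    return USERS.get(user_id)


joined = list(map_clicks(CLICKS, make_cached_lookup(counting_lookup)))
```

The point of the cache is that the number of reads against the user store
scales with the distinct users seen in the clickstream (150k in my case), not
with the size of the users table.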

Thoughts ?


On Sun, Aug 23, 2009 at 8:20 AM, Samuel Guo (JIRA) <jira@apache.org> wrote:

>    [
> https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746592#action_12746592]
> Samuel Guo commented on HIVE-705:
> ---------------------------------
> Attach a new patch.
> 1) move the related hbase code to the contrib package, as hbase is just an
> optional storage for hive, not a necessary one.
> I have tried to avoid modifying the hive original code and just add a hbase
> serde to connect hive with hbase. But the hbase storage model is quite
> different from the file storage model. For example, a loadwork is used to
> rename/copy files from the temp dir to the target table's dir if a query's
> target is a hive table. But for a hbased hive table, we can't rename a table
> now. So it's hard to make a hbased hive table follow the logic of a normal
> file-based hive table.  So I added some code (HiveFormatUtils) to distinguish
> a file-based table from a not-file-based table.
> 2) fix some bugs in the draft patch, such as "select *" returning nothing.
> ----------------------------------------------------------------------------------------------
> How to use the hbase as hive's storage?
> 1) remember to add the contrib jar and the hbase jar to hive's auxPath,
> so m/r can distribute the necessary hbase-related jars to the whole hadoop
> m/r cluster.
> > $HIVE_HOME/bin/hive -auxPath ${contrib_jar},${hbase_jar}
> 2) modify the configuration to add the following configuration parameters.
> "hbase.master" : pointer to the hbase's master.
> "hive.othermetadata.handlers" :
> "org.apache.hadoop.hive.contrib.hbase.HiveHBaseTableInputFormat:org.apache.hadoop.hive.contrib.hbase.HBaseMetadataHandler"
> "hive.othermetadata.handlers" collects the metadata handlers to handle the
> other metadata operations in the not-file-based hive tables. Take hbase as
> an example. HBaseMetadataHandler will create the necessary hbase table and
> its column families when we create a hbased hive table from hive's client. It
> also drops the hbase table when we drop the hive table.
> The metastore reads the registered handlers map from the configuration file
> during initialization. The registered handlers map is formatted as
> "table_format_classname:table_metadata_handler_classname,table_format_classname:table_metadata_handler_classname,...".
> 3) enjoy "hive over hbase"!
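
To illustrate the handlers map format quoted above, here is a small sketch that
splits such a string into a format-class to handler-class mapping. It only
demonstrates the string layout from the comment; `parse_handlers` is a
hypothetical name and this is not the parsing code in the patch.

```python
from typing import Dict


def parse_handlers(value: str) -> Dict[str, str]:
    """Parse 'format_class:handler_class,format_class:handler_class,...'."""
    handlers: Dict[str, str] = {}
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate empty segments such as a trailing comma
        fmt, sep, handler = entry.partition(":")
        fmt, handler = fmt.strip(), handler.strip()
        if not sep or not fmt or not handler:
            raise ValueError(f"malformed handler entry: {entry!r}")
        handlers[fmt] = handler
    return handlers


# The single pair given in the comment for "hive.othermetadata.handlers".
conf_value = (
    "org.apache.hadoop.hive.contrib.hbase.HiveHBaseTableInputFormat:"
    "org.apache.hadoop.hive.contrib.hbase.HBaseMetadataHandler"
)
handlers = parse_handlers(conf_value)
```
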
> ------------------------------------------------------------------------
> Other problems.
> 1) Altering a hbased hive table is not supported now. :(
> renaming a table in hbase is not supported now, so I just do not support
> rename operation. ( maybe if we rename a hive table, we do not need to
> rename the base hbase table.)
> adding/replacing columns.
> Now we need to specify the schema mapping in the SerDe properties
> explicitly. If we want to add columns, we need to call 'alter' twice:
> once to change the serde properties and once to change the hive columns.
> Whichever we change first, the alter will fail now, because we validate the
> schema mapping during SerDe initialization. One of the hbase serde
> validations checks that the counts of hive columns and hbase mapping
> columns match. If we first change the hive columns, the number of hive
> columns will be more than the number of hbase mapping columns, and the HBase
> SerDe initialization will fail this alter operation.  (maybe we need to
> remove the validation code from HBaseSerDe initialization and do it in
> another place?)
> 2) more flexible schema mapping?
> As Schubert mentioned before, a more flexible schema mapping will be useful
> for users. This feature will be added later.
> Comments are welcome~
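
The column-count problem under "adding/replacing columns" can be shown with a
toy version of the check. This is a sketch under assumptions: `validate_mapping`
and `try_alter` are hypothetical names, and the "family:qualifier" mapping
strings are only illustrative, not the mapping syntax used by the patch.

```python
from typing import List


def validate_mapping(hive_columns: List[str], hbase_columns: List[str]) -> None:
    """Toy version of the SerDe check: both sides must have the same count."""
    if len(hive_columns) != len(hbase_columns):
        raise ValueError(
            f"{len(hive_columns)} hive columns but "
            f"{len(hbase_columns)} hbase mapping columns"
        )


def try_alter(new_hive: List[str], new_hbase: List[str]) -> bool:
    """Return True if the altered table would pass validation."""
    try:
        validate_mapping(new_hive, new_hbase)
        return True
    except ValueError:
        return False


hive_cols = ["id", "name"]
hbase_cols = ["info:id", "info:name"]  # illustrative mapping entries

# Changing only one side per 'alter' always leaves the counts unequal.
hive_first = try_alter(hive_cols + ["email"], hbase_cols)
serde_first = try_alter(hive_cols, hbase_cols + ["info:email"])

# Validating only after both sides have changed would pass.
both = try_alter(hive_cols + ["email"], hbase_cols + ["info:email"])
```

Either order of the two 'alter' calls fails at the first step, which is why
moving the validation out of SerDe initialization, as suggested above, looks
like the way out.
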
> > Let Hive can analyse hbase's tables
> > -----------------------------------
> >
> >                 Key: HIVE-705
> >                 URL: https://issues.apache.org/jira/browse/HIVE-705
> >             Project: Hadoop Hive
> >          Issue Type: New Feature
> >            Reporter: Samuel Guo
> >         Attachments: hbase-0.19.3-test.jar, hbase-0.19.3.jar,
> HIVE-705_draft.patch, HIVE-705_revision806905.patch
> >
> >
> > Add a serde over the hbase's tables, so that hive can analyse the data
> stored in hbase easily.
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.