hive-user mailing list archives

From <ashok.sa...@wipro.com>
Subject RE: Performance: hive+hbase integration query against the row_key
Date Wed, 12 Sep 2012 03:25:32 GMT
After loading the data into Hive tables, the files get automatically deleted from HDFS.
How do I stop that?

Thanks
Ashok

-----Original Message-----
From: Alan Gates [mailto:gates@hortonworks.com] 
Sent: 12 September 2012 06:51
To: user@hive.apache.org
Subject: Re: Performance: hive+hbase integration query against the row_key

 
On Sep 11, 2012, at 7:00 AM, bharath vissapragada wrote:

> Hey,
> 
> Hive does all kinds of parsing, metadata lookups, query tree building and so on before
> executing the query. Not sure if all of this was included in those 36 seconds!
> 
> Also, what hive does is build a scan object with ranges based on predicates (and
> mappers too) on the key column, not a direct "get" call as in the hbase shell. This might
> incur some overhead too!

Since Hive does this in a MapReduce job it definitely incurs overhead.  It does not run directly
against HBase as you might wish it did here.

Alan.
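
On the query-plan question in the quoted message below: Hive's EXPLAIN statement prints
the plan for a query without running it. A minimal sketch against the table from this
thread (plan output differs between Hive versions, so none is shown here):

```sql
-- Print the stages Hive would run for the row-key lookup; the query itself is not executed.
EXPLAIN
SELECT * FROM hive_hbase_test WHERE key = 'blabla';

-- EXTENDED prints additional detail about the operators in the plan.
EXPLAIN EXTENDED
SELECT * FROM hive_hbase_test WHERE key = 'blabla';
```

If the plan lists a map-reduce stage, the job startup cost described above applies even
when the key predicate narrows what is read from HBase.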

> 
> On Tue, Sep 11, 2012 at 7:10 PM, Shengjie Min <kelvin.msj@gmail.com> wrote:
> Hi,
> 
> I am trying to get hive working on top of my hbase table following the guide below:
> https://cwiki.apache.org/Hive/hbaseintegration.html
> 
> CREATE EXTERNAL TABLE hive_hbase_test (key string, a string, b string, c string)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping"=":key,cf:a,cf:b,cf:c")
> TBLPROPERTIES ("hbase.table.name"="test");
> 
> this hive table creation makes my mapping roughly look like this:
> 
> hive_hbase_test  VS   test
> Hive key  -   hbase row_key
> Hive column a -  hbase cf:a
> Hive column b  -  hbase cf:b
> Hive column c  -  hbase cf:c
> 
> From my understanding of how HBaseStorageHandler works, it's supposed to take advantage
> of the hbase row_key index as much as possible. So I would expect:
> 
> 1. If you do a hive query against the row key like "select * from hive_hbase_test where
> key='blabla'", this would utilize the hbase row_key index, which gives you a very quick,
> nearly real-time response just like hbase does.
> 
> 2. Of course, if you do a hive query against a column like "select * from hive_hbase_test
> where a='blabla'", it queries against a specific column, so it probably uses mapred
> because there is nothing on the hbase side that can be utilized.
> 
> From my test, query 1 doesn't seem fast at all, still taking ages, so 
> select * from hive_hbase_test where key='blabla'   36secs
> vs
> get 'test', 'blabla'      less than 1 sec
> still shows a huge difference.
> 
> Has anybody tried this before? Is there any way I can do some sort of query plan analysis
> against a hive query? Or am I not mapping the hive table against the hbase table correctly?
> 
> -- 
> All the best,
> Shengjie Min
> 
> 
> 
> 
> -- 
> Regards,
> Bharath .V
> w:http://researchweb.iiit.ac.in/~bharath.v


