hive-user mailing list archives

From Amey Barve <>
Subject Re: Pro and Cons of using HBase table as an external table in HIVE
Date Fri, 09 Jun 2017 09:50:03 GMT
Hi Michael,

"If there is predicate pushdown, then you will be faster, assuming that the
query triggers an implied range scan"
---> Does this bring results faster than plain Hive querying over the ORC or
Text file formats?

In other words, is querying over plain Hive (ORC or Text) *always* faster
than querying through a HiveStorageHandler?


On 9 June 2017 at 15:08, Michael Segel <> wrote:

> The pro is that you have the ability to update a table without having to
> worry about duplication of the row.  Tez is doing some form of compaction
> for you that already exists in HBase.
> The cons:
> 1) It’s slower. Reads from HBase have more overhead with them than just
> reading a file.  Read Lars George’s book on what takes place when you do a
> read.
> 2) HBase is not a relational store. (You have to think about what that
> implies)
> 3) You need to query against your row key for best performance, otherwise
> it will always be a complete table scan.
> HBase was designed to give you fast access for direct get() and limited
> range scans.  Otherwise you have to perform full table scans.  This means
> that unless you’re able to do a range scan, your full table scan will be
> slower than if you did this on a flat file set.  Again, the reason why you
> would want to use HBase is if your data set is mutable.
> You also have to trigger a range scan when you write your hive query and
> you have to make sure that you’re querying off your row key.
> HBase was designed as a <key,value> store. Plain and simple.  If you don’t
> use the key, you have to do a full table scan. So even though you are
> partitioning on row key, you never use your partitions.  However in Hive or
> Spark, you can create an alternative partition pattern.  (e.g. your key is
> the transaction_id, yet you partition on month/year portion of the
> transaction_date)
> You can speed things up a little by using an inverted table as a secondary
> index. However this assumes that you want to use joins. If you have a
> single base table with no joins then you can limit your range scans based
> on making sure you are querying against the row key.  Note: This will mean
> that you have limited querying capabilities.
> And yes, I’ve done this before but can’t share it with you.
> P.S.
> I haven’t tried Hive queries where you have what would be the equivalent
> of a get() .
> In earlier versions of hive, the issue would be “SELECT * FROM foo where
> rowkey=BAR”  would still do a full table scan because of the lack of
> predicate pushdown.
> This may have been fixed in later releases of hive. That would be your
> test case.   If there is predicate pushdown, then you will be faster,
> assuming that the query triggers an implied range scan.
> This would be a simple thing. However keep in mind that you’re going to
> generate a map/reduce job (unless using a query engine like Tez) where you
> wouldn’t if you just wrote your code in Java.
> > On Jun 7, 2017, at 5:13 AM, Ramasubramanian Narayanan <> wrote:
> >
> > Hi,
> >
> > Can you please let us know the pros and cons of using an HBase table as
> > an external table in Hive?
> >
> > Will there be any performance degradation when using Hive over HBase
> > instead of using a direct Hive table?
> >
> > The table that I am planning to use in HBase will be a master table like
> > account or customer. I want to achieve a Slowly Changing Dimension. Please
> > throw some light on that too if you have done any such implementations.
> >
> > Thanks and Regards,
> > Rams
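
The trade-off described in the quoted reply (cheap get() and bounded range scans on the row key, full scans for everything else, and updates that overwrite a row instead of duplicating it) can be illustrated with a toy model. This is only a sketch in plain Python: ToySortedStore, its methods, and the sample keys are hypothetical names made up for illustration, not the HBase or Hive API, and counting rows touched is a rough stand-in for read cost, not a benchmark.

```python
from bisect import bisect_left, insort


class ToySortedStore:
    """Toy stand-in for a row-key-sorted store; NOT the HBase API."""

    def __init__(self):
        self._keys = []      # row keys kept in sorted order
        self._rows = {}      # row key -> dict of columns
        self.rows_read = 0   # number of rows touched by reads

    def put(self, key, columns):
        # An existing key is overwritten, so an update never duplicates a row.
        if key not in self._rows:
            insort(self._keys, key)
        self._rows[key] = columns

    def get(self, key):
        # Direct lookup by row key: one row touched.
        self.rows_read += 1
        return self._rows.get(key)

    def range_scan(self, start, stop):
        # Binary-search to the start key, then read only keys in [start, stop).
        i = bisect_left(self._keys, start)
        out = []
        while i < len(self._keys) and self._keys[i] < stop:
            key = self._keys[i]
            self.rows_read += 1
            out.append((key, self._rows[key]))
            i += 1
        return out

    def full_scan(self, predicate):
        # A filter on a non-key column has to look at every row.
        out = []
        for key in self._keys:
            self.rows_read += 1
            if predicate(self._rows[key]):
                out.append((key, self._rows[key]))
        return out


store = ToySortedStore()
for n in range(1000):
    store.put(f"cust{n:04d}", {"city": "Pune" if n % 100 == 0 else "Other"})
store.put("cust0042", {"city": "Mumbai"})  # update in place, still 1000 rows

store.rows_read = 0
by_key = store.range_scan("cust0040", "cust0050")
range_cost = store.rows_read

store.rows_read = 0
by_value = store.full_scan(lambda row: row["city"] == "Pune")
full_cost = store.rows_read

print(range_cost, full_cost)  # prints "10 1000"
```

Both queries return ten rows here, but the key-bounded scan touches ten rows while the filter on a non-key column touches all thousand. That is the gap predicate pushdown on the row key is meant to close, and it says nothing about how either path compares with reading ORC files directly.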
