hive-user mailing list archives

From Michael Segel <msegel_had...@hotmail.com>
Subject Fwd: Pro and Cons of using HBase table as an external table in HIVE
Date Fri, 09 Jun 2017 12:46:12 GMT
Sorry. Need to send via right email address.

Begin forwarded message:

From: Michael Segel <msegel@segel.com>
Subject: Re: Pro and Cons of using HBase table as an external table in HIVE
Date: June 9, 2017 at 7:37:22 AM CDT
To: user@hive.apache.org

Hey Edward,

Yes, that’s the gist of it.
However… if you can exclude data… your query in HBase could be faster.
Having said that…

I should have included hardware in the equation… Also, data locality could come into
play… But that really would confuse the issue and the OP even more. ;-)


-Mike

On Jun 9, 2017, at 7:14 AM, Edward Capriolo <edlinuxguru@gmail.com> wrote:

Think about it like this: one system is scanning a local ORC file; the other is using an HBase
scanner (over the network) and scanning the data in sstable format.

On Fri, Jun 9, 2017 at 5:50 AM, Amey Barve <ameybarve15@gmail.com> wrote:
Hi Michael,

"If there is predicate pushdown, then you will be faster, assuming that the query triggers
an implied range scan"
---> Does this bring results faster than plain Hive querying over ORC / Text file formats?

In other words, is querying over plain Hive (ORC or Text) always faster than through HiveStorageHandler?

Regards,
Amey

On 9 June 2017 at 15:08, Michael Segel <msegel_hadoop@hotmail.com> wrote:
The pro is that you have the ability to update a table without having to worry about duplication
of the row.  Tez is doing some form of compaction for you that already exists in HBase.

The cons:

1) It’s slower. Reads from HBase have more overhead than just reading a file.  Read
Lars George’s book on what takes place when you do a read.

2) HBase is not a relational store. (You have to think about what that implies)

3) You need to query against your row key for best performance, otherwise it will always be
a complete table scan.

HBase was designed to give you fast access for direct get() and limited range scans.  Otherwise
you have to perform full table scans.  This means that unless you’re able to do a range
scan, your full table scan will be slower than if you did this on a flat file set.  Again,
the reason you would want to use HBase is that your data set is mutable.
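
To make the three access patterns concrete, here is a minimal sketch in plain Python. It is a toy
model of a table sorted by row key, not the HBase client API; the class name SortedKeyStore, its
methods, and the account data are all made up for illustration.

```python
import bisect


class SortedKeyStore:
    """Toy stand-in for a table sorted by row key (NOT the HBase client API)."""

    def __init__(self, rows):
        # rows: dict of row_key -> value; keys are kept in sorted order.
        self._keys = sorted(rows)
        self._rows = dict(rows)

    def get(self, key):
        # Direct lookup by row key: touches a single row.
        return self._rows.get(key)

    def range_scan(self, start, stop):
        # Binary-search to the first key >= start, then read up to stop (exclusive).
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]

    def full_scan(self, predicate):
        # A filter on anything other than the key has to visit every row.
        touched = 0
        hits = []
        for k in self._keys:
            touched += 1
            if predicate(self._rows[k]):
                hits.append((k, self._rows[k]))
        return hits, touched


# Hypothetical data: 1000 accounts with zero-padded keys so they sort correctly.
store = SortedKeyStore({f"acct{i:04d}": {"balance": i * 10} for i in range(1000)})

one_row = store.get("acct0042")
in_range = store.range_scan("acct0100", "acct0110")
by_value, rows_touched = store.full_scan(lambda v: v["balance"] == 420)
```

The get() and the range scan only touch the rows they return, while the filter on balance has to
walk all 1000 rows to find a single match.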

You also have to trigger a range scan when you write your Hive query, and you have to make sure
that you’re querying off your row key.

HBase was designed as a <key,value> store. Plain and simple.  If you don’t use the
key, you have to do a full table scan. So even though you are partitioning on row key, you
never use your partitions.  However, in Hive or Spark, you can create an alternative partition
pattern.  (e.g., your key is the transaction_id, yet you partition on the month/year portion of
the transaction_date)
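
As a rough illustration of the alternative partition pattern, the sketch below groups rows by the
(year, month) of transaction_date even though each row is identified by transaction_id. It is
plain Python standing in for what a partitioned Hive or Spark table does; the sample rows and the
helper rows_for_month are invented for the example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactions; transaction_id is the key.
transactions = [
    {"transaction_id": "t001", "transaction_date": date(2017, 5, 30), "amount": 10.0},
    {"transaction_id": "t002", "transaction_date": date(2017, 6, 1), "amount": 25.0},
    {"transaction_id": "t003", "transaction_date": date(2017, 6, 9), "amount": 40.0},
    {"transaction_id": "t004", "transaction_date": date(2016, 6, 9), "amount": 5.0},
]

# Partition on the (year, month) portion of transaction_date, not on the key.
partitions = defaultdict(list)
for row in transactions:
    d = row["transaction_date"]
    partitions[(d.year, d.month)].append(row)


def rows_for_month(year, month):
    # Only the matching partition is read; every other partition is skipped.
    return partitions.get((year, month), [])


june_2017 = rows_for_month(2017, 6)
```

A query that filters on month and year reads one partition instead of the whole data set, which is
the pruning you cannot get when the only layout is by row key.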

You can speed things up a little by using an inverted table as a secondary index. However,
this assumes that you want to use joins. If you have a single base table with no joins, then
you can limit your range scans by making sure you are querying against the row key.
Note: This will mean that you have limited querying capabilities.
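
The inverted-table idea can be sketched as follows. This is a toy in-memory version in Python, not
anything taken from an actual implementation; base_table, build_inverted_table and lookup_by_city
are hypothetical names, and the customer rows are made up.

```python
from collections import defaultdict

# Base table keyed by row key (hypothetical customer data).
base_table = {
    "cust001": {"name": "Alice", "city": "Chicago"},
    "cust002": {"name": "Bob", "city": "Austin"},
    "cust003": {"name": "Carol", "city": "Chicago"},
}


def build_inverted_table(table, column):
    # Index table: column value -> sorted list of base-table row keys.
    index = defaultdict(list)
    for row_key, row in table.items():
        index[row[column]].append(row_key)
    return {value: sorted(keys) for value, keys in index.items()}


city_index = build_inverted_table(base_table, "city")


def lookup_by_city(city):
    # Step 1: key lookup on the index table.
    # Step 2: direct gets on the base table for each row key found.
    return [(k, base_table[k]) for k in city_index.get(city, [])]


chicago = lookup_by_city("Chicago")
```

The lookup is effectively a join between the index table and the base table, and the index has to
be kept in step with every write to the base table.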

And yes, I’ve done this before but can’t share it with you.

HTH

P.S.
I haven’t tried Hive queries where you have what would be the equivalent of a get().

In earlier versions of Hive, the issue would be that “SELECT * FROM foo WHERE rowkey=BAR”
would still do a full table scan because of the lack of predicate pushdown.
This may have been fixed in later releases of Hive. That would be your test case.  If there
is predicate pushdown, then you will be faster, assuming that the query triggers an implied
range scan.
This would be a simple thing. However, keep in mind that you’re going to generate a map/reduce
job (unless using a query engine like Tez) where you wouldn’t if you just wrote your code
in Java.
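
Here is a small sketch of what predicate pushdown changes, under the assumption that the store can
answer a row-key predicate with a point lookup. It is a toy model in Python and does not reflect
how Hive or its HBase storage handler is actually implemented; ToyStore and both query functions
are hypothetical.

```python
class ToyStore:
    """Stand-in for a keyed store; counts rows handed back to the query engine."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.rows_returned = 0

    def scan_all(self):
        # Stream every row to the caller.
        for key, value in self.rows.items():
            self.rows_returned += 1
            yield key, value

    def get(self, key):
        # Point lookup by row key.
        if key in self.rows:
            self.rows_returned += 1
            return key, self.rows[key]
        return None


def query_without_pushdown(store, rowkey):
    # The engine pulls every row and applies WHERE rowkey = ... itself.
    return [(k, v) for k, v in store.scan_all() if k == rowkey]


def query_with_pushdown(store, rowkey):
    # The engine hands the key predicate to the store, which does a point lookup.
    hit = store.get(rowkey)
    return [hit] if hit is not None else []


rows = {f"row{i:03d}": i for i in range(500)}

no_pushdown = ToyStore(rows)
result_a = query_without_pushdown(no_pushdown, "row123")

pushdown = ToyStore(rows)
result_b = query_with_pushdown(pushdown, "row123")
```

Both queries return the same single row, but the first one moves all 500 rows out of the store to
find it, while the second moves one.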




> On Jun 7, 2017, at 5:13 AM, Ramasubramanian Narayanan <ramasubramanian.narayanan@gmail.com> wrote:
>
> Hi,
>
> Can you please let us know the pros and cons of using an HBase table as an external table
> in Hive?
>
> Will there be any performance degradation when using Hive over HBase instead of using a
> direct Hive table?
>
> The table that I am planning to use in HBase will be a master table like account or customer.
> We want to implement a Slowly Changing Dimension. Please throw some light on that too if you
> have done any such implementations.
>
> Thanks and Regards,
> Rams




