hive-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Hive footprint
Date Wed, 20 Apr 2016 15:02:07 GMT
A caveat here.

An OLTP database such as Oracle or SAP ASE uses indexes for point queries,
in other words when the search is done via an index scan. In that case the
search is very fast, because typically only a few blocks need to be read:
the index scan yields RowID pointers to the underlying data blocks, and the
records are fetched from disk through those.

When an OLAP-type read is required there is less need for an index, as the
optimiser does a serial scan and the work is pretty efficient. As a rule of
thumb (if I am correct), if the Oracle CBO decides that the result set will
be more than 4% of the underlying rows, it will favour a table scan.
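
To make the seek-versus-scan trade-off concrete, here is a rough sketch in Python. It is not how the Oracle CBO actually costs a plan; the rows-per-block and index-height figures are made-up assumptions, and the 4% cut-off is only the rule of thumb quoted above.

```python
import math

# All three constants are illustrative assumptions, not Oracle internals.
ROWS_PER_BLOCK = 100    # hypothetical number of rows stored in one data block
INDEX_HEIGHT = 3        # hypothetical depth of the B-tree index
SCAN_THRESHOLD = 0.04   # the 4% rule of thumb from the text


def index_scan_blocks(matching_rows: int) -> int:
    """Worst case: walk the index once, then read one data block per RowID."""
    return INDEX_HEIGHT + matching_rows


def full_scan_blocks(total_rows: int) -> int:
    """A serial scan reads every block of the table exactly once."""
    return math.ceil(total_rows / ROWS_PER_BLOCK)


def choose_access_path(total_rows: int, matching_rows: int) -> str:
    """Apply the selectivity rule of thumb: above the threshold, scan the table."""
    selectivity = matching_rows / total_rows
    return "table scan" if selectivity > SCAN_THRESHOLD else "index scan"


if __name__ == "__main__":
    total = 1_000_000
    for matching in (1, 1_000, 100_000):
        print(
            matching,
            choose_access_path(total, matching),
            index_scan_blocks(matching),
            full_scan_blocks(total),
        )
```

With these assumed numbers a point query touches a handful of blocks through the index, whereas a query returning 10% of the rows would, in the worst case, touch more blocks via RowID lookups than the table contains, which is why a serial scan wins for OLAP-style reads.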

The issues with Hive are twofold (excluding the storage index in ORC tables):

1) Hive does not take advantage of indexes (indexes in the conventional
sense) at the moment. Yes, you can even create bitmap indexes on FACT tables
in Hive, but they are not used by the optimiser yet:
0: jdbc:hive2://rhes564:10010/default> show index on sales;
INFO  : OK
+-------------------+----------+------------+-----------------------------------------+----------+---------+
| idx_name          | tab_name | col_names  | idx_tab_name                            | idx_type | comment |
+-------------------+----------+------------+-----------------------------------------+----------+---------+
| sales_cust_bix    | sales    | cust_id    | oraclehadoop__sales_sales_cust_bix__    | bitmap   |         |
| sales_channel_bix | sales    | channel_id | oraclehadoop__sales_sales_channel_bix__ | bitmap   |         |
| sales_prod_bix    | sales    | prod_id    | oraclehadoop__sales_sales_prod_bix__    | bitmap   |         |
| sales_promo_bix   | sales    | promo_id   | oraclehadoop__sales_sales_promo_bix__   | bitmap   |         |
| sales_time_bix    | sales    | time_id    | oraclehadoop__sales_sales_time_bix__    | bitmap   |         |
+-------------------+----------+------------+-----------------------------------------+----------+---------+

2) The blocks of a Hive table are not stored sequentially. The actual issue
here is that HDFS lacks the ability to co-locate blocks, so a table scan in
the sense of a conventional RDBMS does not really exist. However, I believe
there are plans to make indexes available to the CBO in Hive, in which case
indexes will speed up the queries. Alan Gates may have more on this.
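
For context on the ORC storage index set aside above: ORC keeps min/max statistics per stripe and row group, so a reader can skip chunks that cannot contain the value being sought. The following is only a toy model of that idea in Python, not ORC's reader code; the `Stripe` class, `make_stripe` and `lookup` are hypothetical names, and the data is synthetic.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Stripe:
    """Toy stand-in for an ORC stripe: rows plus min/max statistics."""
    min_id: int
    max_id: int
    rows: List[int]


def make_stripe(rows: Iterable[int]) -> Stripe:
    data = list(rows)
    return Stripe(min(data), max(data), data)


def lookup(stripes: List[Stripe], wanted: int) -> Tuple[List[int], int]:
    """Return matching rows and how many stripes actually had to be read."""
    hits: List[int] = []
    stripes_read = 0
    for stripe in stripes:
        # Skip the stripe when its statistics rule out the value.
        if not (stripe.min_id <= wanted <= stripe.max_id):
            continue
        stripes_read += 1
        hits.extend(r for r in stripe.rows if r == wanted)
    return hits, stripes_read


# Data sorted on the key: each stripe covers a narrow, disjoint range.
sorted_stripes = [make_stripe(range(i, i + 1000)) for i in range(0, 10_000, 1000)]

# Same keys spread across stripes: every stripe spans nearly the whole range.
scattered_stripes = [make_stripe(range(i, 10_000, 10)) for i in range(10)]

if __name__ == "__main__":
    print(lookup(sorted_stripes, 4242))     # one stripe read
    print(lookup(scattered_stripes, 4242))  # all ten stripes read
```

In this sketch the statistics only help when the data is sorted (or otherwise clustered) on the search column; with scattered keys every stripe still has to be read, which is the needle-in-a-haystack situation described further down the thread.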

HTH,




Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com



On 20 April 2016 at 13:07, Sabarish Sasidharan <sabarish.spk@gmail.com>
wrote:

> HBase is very good for direct key-based lookups, and for scans over a
> range of keys (data is sorted by key).
>
> Hive, by contrast, is not good for seeks (the needle-in-a-haystack
> problem). You can optimize with ORC, stripes, sorting etc., but it
> remains a needle-in-a-haystack problem.
>
> Apache Kylin takes a different approach. It maintains the cubes in HBase
> but routes ad hoc queries to Hive. So that's one way to see them as
> complementary technologies, each solving the problems relevant to its
> space in an efficient manner.
>
> Regards
> Sab
>
> On Wed, Apr 20, 2016 at 11:20 AM, Amey Barve <ameybarve15@gmail.com>
> wrote:
>
>> Thanks Peyman,
>>
>> Is running and evaluating TPC-H queries with HBaseStorageHandler vs
>> Hive's Text format comparable?
>> What is the standard set of queries generally used for performance
>> comparison? What queries did you use above?
>>
>> Regards,
>> Amey
>>
>>
>>
>> On Tue, Apr 19, 2016 at 7:28 PM, Peyman Mohajerian <mohajeri@gmail.com>
>> wrote:
>>
>>> Hi Amey,
>>>
>>> It is about seek vs scan. HBase is great when a rowkey or a range of
>>> rowkeys is part of the where clause: then you do a seek, and reading
>>> ORC/Parquet off HDFS would not do better in the absence of an index.
>>> However, for a Data Warehouse that is generally not what you do. You
>>> mostly scan, e.g. when doing an aggregation you aren't looking for
>>> particular records. In this case the IO throughput (generally)
>>> dominates, because you have to read lots of data. Reading large blocks
>>> of data and using header info (predicate push-down) in ORC or Parquet
>>> will then be faster than reading lots of HFiles in HBase. Of course
>>> compaction in HBase can turn the files into larger chunks, but still
>>> 'typically' it will be slower.
>>> I should strongly emphasize that making statements about what is faster
>>> or not is very dangerous; there could be many exceptions depending on
>>> the type of query and other factors. When I did this test I was using
>>> map/reduce, and with newer engines queries will be faster. Also, caching
>>> in HBase is critical: if all your data is cached, you have lots of
>>> memory, and the system isn't busy handling compaction and lots of new
>>> writes, then your read performance will improve in all cases. Always do
>>> your own POC and use your own data to test.
>>>
>>> Thanks,
>>> Peyman
>>>
>>>
>>>
>>> On Tue, Apr 19, 2016 at 2:26 AM, Amey Barve <ameybarve15@gmail.com>
>>> wrote:
>>>
>>>> Hi Peyman,
>>>>
>>>> You say: "you can use Hive storage handler to read data from HBase the
>>>> performance would be lower than reading from HDFS directly for analytic."
>>>> Why is it so? Is it slow as compared to ORC, Parquet, and even Text
>>>> file format?
>>>>
>>>> Regards,
>>>> Amey
>>>>
>>>> On Tue, Apr 19, 2016 at 4:32 AM, Peyman Mohajerian <mohajeri@gmail.com>
>>>> wrote:
>>>>
>>>>> HBase can handle high read/write throughput, e.g. IoT use cases. It is
>>>>> not an analytic engine. Even though you can use the Hive storage handler
>>>>> to read data from HBase, the performance would be lower than reading
>>>>> from HDFS directly for analytics. But HBase has an index, the rowkey,
>>>>> and you can add a secondary index, usually with Elasticsearch or other
>>>>> means. You can also run Phoenix over HBase to do analytics, but again
>>>>> only if your data collection/use case mandates HBase, e.g. small
>>>>> amounts of data from millions of devices. It is common to copy data from
>>>>> HBase to HDFS (even though HBase is sitting on top of HDFS) as
>>>>> ORC/Parquet for very large analytics. But again, you do have the choice
>>>>> of using Phoenix or Hive to run analytics over HBase if you don't want
>>>>> to pay the cost of copying the data.
>>>>> HBase can only be part of a DW solution in a limited way, e.g. as an
>>>>> index to data in HDFS, for partition discovery, etc. Pretty soon it will
>>>>> be the metadata store for Hive (optional, instead of an RDBMS). HBase
>>>>> can sit on the edge of the DW to collect fast-landing data.
>>>>> I don't see any competition between Hive and HBase; they work together,
>>>>> and I don't see a modern DW having a monolithic engine:
>>>>> Tez+Spark+MPP+...
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Apr 18, 2016 at 3:51 PM, Marcin Tustin <mtustin@handybook.com>
>>>>> wrote:
>>>>>
>>>>>> We use a Hive with ORC setup now. Queries may take thousands of
>>>>>> seconds with joins, and potentially tens of seconds with selects on
>>>>>> very large tables.
>>>>>>
>>>>>> My understanding is that the goal of HBase is to provide much lower
>>>>>> latency for queries. Obviously, this comes at the cost of not being
>>>>>> able to perform joins. I don't actually use HBase, so I hesitate to
>>>>>> say more about it.
>>>>>>
>>>>>> On Mon, Apr 18, 2016 at 6:48 PM, Mich Talebzadeh <
>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Marcin.
>>>>>>>
>>>>>>> What is the definition of low latency here? Are you referring to the
>>>>>>> performance of SQL against HBase tables compared to Hive? As I
>>>>>>> understand it, HBase is a columnar database. Would it be possible to
>>>>>>> use Hive against ORC to achieve the same?
>>>>>>>
>>>>>>> On 18 April 2016 at 23:43, Marcin Tustin <mtustin@handybook.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> HBase has a different use case - it's for low-latency querying of
>>>>>>>> big tables. If you combined it with Hive, you might have something
>>>>>>>> nice for certain queries, but I wouldn't think of them as direct
>>>>>>>> competitors.
>>>>>>>>
>>>>>>>> On Mon, Apr 18, 2016 at 6:34 PM, Mich Talebzadeh <
>>>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I notice that Impala is rarely mentioned these days. I may be
>>>>>>>>> missing something. However, I gather it is coming to an end now, as
>>>>>>>>> I don't recall many use cases for it (or customers asking for it).
>>>>>>>>> In contrast, Hive has held its ground with the new addition of Spark
>>>>>>>>> and Tez as execution engines, support for ACID and ORC, and new
>>>>>>>>> stuff in Hive 2. In addition, provided a good choice is made for its
>>>>>>>>> metastore, it scales well.
>>>>>>>>>
>>>>>>>>> If Hive had the (organic) ability to support local variables and
>>>>>>>>> stored procedures then it would be a top-notch Data Warehouse. Given
>>>>>>>>> its metastore, I don't see any technical reason why it cannot
>>>>>>>>> support these constructs.
>>>>>>>>>
>>>>>>>>> I was recently asked to comment on migration from commercial DWs to
>>>>>>>>> Big Data (primarily for TCO reasons) and really could not recall any
>>>>>>>>> better candidate than Hive. Is HBase a viable alternative? Obviously,
>>>>>>>>> whatever one decides, there is still HDFS, a good engine for Hive
>>>>>>>>> (sounds like many prefer Tez, although I am a Spark fan) and the
>>>>>>>>> ubiquitous YARN.
>>>>>>>>>
>>>>>>>>> Let me know your thoughts.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Want to work at Handy? Check out our culture deck and open roles
>>>>>>>> <http://www.handy.com/careers>
>>>>>>>> Latest news <http://www.handy.com/press> at Handy
>>>>>>>> Handy just raised $50m
>>>>>>>> <http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/>
>>>>>>>> led by Fidelity
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
