hive-user mailing list archives

From Sabarish Sasidharan <sabarish....@gmail.com>
Subject Re: Hive footprint
Date Wed, 20 Apr 2016 12:07:06 GMT
HBase is very good for direct key-based lookups, and for scans over a range
of keys (data is sorted by key).

Hive, on the other hand, is not good for seeks (the needle-in-a-haystack
problem). You can optimize with ORC, stripes, sorting, etc., but it is still
a needle-in-a-haystack problem.
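
To make the seek vs. scan distinction concrete, here is a minimal Python sketch. It is only a toy model, not how HBase or Hive are implemented: the row keys, values, and function names (`seek`, `range_scan`, `full_scan`) are hypothetical, and an in-memory sorted list stands in for key-sorted storage.

```python
import bisect

# Hypothetical toy data: (rowkey, value) pairs kept sorted by rowkey,
# loosely mimicking a store that lays rows out in key order.
rows = sorted((f"user{i:05d}", i * 10) for i in range(1000))
keys = [k for k, _ in rows]


def seek(key):
    """Point lookup via binary search on the sorted keys: O(log n)."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return rows[i][1]
    return None


def range_scan(start, stop):
    """Rows with start <= key < stop; only that slice is touched."""
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, stop)
    return rows[lo:hi]


def full_scan(predicate):
    """Needle in a haystack: every row is examined, O(n)."""
    return [(k, v) for k, v in rows if predicate(k, v)]
```

A lookup such as `seek("user00042")` or `range_scan("user00010", "user00020")` touches only a handful of entries, while `full_scan(lambda k, v: v == 420)` has to visit every row to find the same record. Columnar formats can skip some blocks using their statistics, but a filter on a non-key column still behaves much more like the last case.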

Apache Kylin takes a different approach. It maintains the cubes in HBase
but routes ad hoc queries to Hive. So that's one way to see them as
complementary technologies, each solving the problems relevant to its space
efficiently.

Regards
Sab

On Wed, Apr 20, 2016 at 11:20 AM, Amey Barve <ameybarve15@gmail.com> wrote:

> Thanks Peyman,
>
> Is running and evaluating TPCH queries with HBaseStorageHandler vs Hive's
> Text format comparable?
> What is the standard set of queries generally used for performance
> comparison? Which queries did you use above?
>
> Regards,
> Amey
>
>
>
> On Tue, Apr 19, 2016 at 7:28 PM, Peyman Mohajerian <mohajeri@gmail.com>
> wrote:
>
>> Hi Amey,
>>
>> It is about seek vs. scan. HBase is great when a rowkey or a range of
>> rowkeys is part of the WHERE clause: you do a seek, and reading ORC/Parquet
>> off HDFS would not do better in the absence of an index. For a data
>> warehouse, however, that is generally not what you do. You mostly scan; e.g.
>> when aggregating you aren't looking for particular records. In that case IO
>> throughput (generally) dominates because you have to read lots of data, so
>> reading large blocks and using header info (predicate push-down) in ORC or
>> Parquet will be faster than reading lots of HFiles in HBase. Of course,
>> compaction in HBase can merge the files into larger chunks, but 'typically'
>> it will still be slower.
>> I should strongly emphasize that making statements about what is faster or
>> not is very dangerous; there can be many exceptions depending on the type
>> of query and other factors. When I did this test I was using MapReduce, and
>> with newer engines queries will be faster. Caching in HBase is also
>> critical: if all your data is cached, you have lots of memory, and the
>> system isn't busy handling compaction and lots of new writes, then your read
>> performance will improve in all cases. Always do your own POC and use your
>> own data to test.
>>
>> Thanks,
>> Peyman
>>
>>
>>
>> On Tue, Apr 19, 2016 at 2:26 AM, Amey Barve <ameybarve15@gmail.com>
>> wrote:
>>
>>> Hi Peyman,
>>>
>>> You say: "you can use Hive storage handler to read data from HBase the
>>> performance would be lower than reading from HDFS directly for analytic."
>>> Why is it so? Is it slow as compared to ORC, Parquet, and even Text file
>>> format?
>>>
>>> Regards,
>>> Amey
>>>
>>> On Tue, Apr 19, 2016 at 4:32 AM, Peyman Mohajerian <mohajeri@gmail.com>
>>> wrote:
>>>
>>>> HBase can handle high read/write throughput, e.g. IoT use cases. It is
>>>> not an analytic engine; even though you can use the Hive storage handler
>>>> to read data from HBase, the performance would be lower than reading from
>>>> HDFS directly for analytics. But HBase has an index (the rowkey), and you
>>>> can add a secondary index, usually with Elasticsearch or other means. You
>>>> can also run Phoenix over HBase to do analytics, but again only if your
>>>> data collection/use case mandates HBase, e.g. small amounts of data from
>>>> millions of devices. It is common to copy data from HBase to HDFS (even
>>>> though HBase sits on top of HDFS) as ORC/Parquet for very large analytics.
>>>> But again, you do have the choice of using Phoenix or Hive to run
>>>> analytics over HBase if you don't want to pay the cost of copying data.
>>>> HBase can only be part of a DW solution in a limited way, e.g. as an index
>>>> to data in HDFS, partition discovery, etc. Pretty soon it will be the
>>>> metadata store for Hive (an optional alternative to an RDBMS). HBase can
>>>> sit on the edge of the DW to collect fast-landing data.
>>>> I don't see any competition between Hive and HBase; they work together,
>>>> and I don't see a modern DW having a monolithic engine: Tez+Spark+MPP+...
>>>>
>>>>
>>>>
>>>> On Mon, Apr 18, 2016 at 3:51 PM, Marcin Tustin <mtustin@handybook.com>
>>>> wrote:
>>>>
>>>>> We use a hive with ORC setup now. Queries may take thousands of
>>>>> seconds with joins, and potentially tens of seconds with selects on very
>>>>> large tables.
>>>>>
>>>>> My understanding is that the goal of hbase is to provide much lower
>>>>> latency for queries. Obviously, this comes at the cost of not being able
>>>>> to perform joins. I don't actually use hbase, so I hesitate to say more
>>>>> about it.
>>>>>
>>>>> On Mon, Apr 18, 2016 at 6:48 PM, Mich Talebzadeh <
>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>
>>>>>> Thanks Marcin.
>>>>>>
>>>>>> What is the definition of low latency here? Are you referring to the
>>>>>> performance of SQL against HBase tables compared to Hive? As I understand
>>>>>> it, HBase is a columnar database. Would it be possible to use Hive
>>>>>> against ORC to achieve the same?
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 18 April 2016 at 23:43, Marcin Tustin <mtustin@handybook.com>
>>>>>> wrote:
>>>>>>
>>>>>>> HBase has a different use case - it's for low-latency querying of
>>>>>>> big tables. If you combined it with Hive, you might have something nice
>>>>>>> for certain queries, but I wouldn't think of them as direct competitors.
>>>>>>>
>>>>>>> On Mon, Apr 18, 2016 at 6:34 PM, Mich Talebzadeh <
>>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I notice that Impala is rarely mentioned these days. I may be
>>>>>>>> missing something. However, I gather it is coming to an end now, as I
>>>>>>>> don't recall many use cases for it (or customers asking for it). In
>>>>>>>> contrast, Hive has held its ground with the addition of Spark and Tez
>>>>>>>> as execution engines, support for ACID and ORC, and new stuff in
>>>>>>>> Hive 2. In addition, provided a good choice is made for its metastore,
>>>>>>>> it scales well.
>>>>>>>> If Hive had the (organic) ability to support local variables and
>>>>>>>> stored procedures, it would be a top-notch data warehouse. Given its
>>>>>>>> metastore, I don't see any technical reason why it cannot support these
>>>>>>>> constructs.
>>>>>>>>
>>>>>>>> I was recently asked to comment on migration from commercial DWs to
>>>>>>>> Big Data (primarily for TCO reasons) and really could not recall any
>>>>>>>> better candidate than Hive. Is HBase a viable alternative? Obviously,
>>>>>>>> whatever one decides, there is still HDFS, a good engine for Hive
>>>>>>>> (sounds like many prefer Tez, although I am a Spark fan) and the
>>>>>>>> ubiquitous YARN.
>>>>>>>>
>>>>>>>> Let me know your thoughts.
>>>>>>>>
>>>>>>>>
>>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>> Want to work at Handy? Check out our culture deck and open roles
>>>>>>> <http://www.handy.com/careers>
>>>>>>> Latest news <http://www.handy.com/press> at Handy
>>>>>>> Handy just raised $50m
>>>>>>> <http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/>
>>>>>>> led by Fidelity
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
