hive-user mailing list archives

From Michael Roessler <michael.roess...@keyevent.com>
Subject Re: Hive for a large datawarehouse
Date Tue, 25 Jan 2011 22:41:06 GMT
Our experience with Hive, and my personal opinion, is that while it is a
remarkable achievement to enable simple select and simple "group by"
SQL-type statements at scale, HQL remains too rudimentary to support many
of the select and ETL-type SQL statements common in many warehouses. Most
database vendors, as well as Postgresql, offer SQL extensions (SQL99,
SQL2003), in-line views etc. that enable much more sophisticated
aggregations within the warehouse. HQL does have user-defined functions to
help overcome this limitation, yet in my experience the breadth of
functionality available through UDFs still has a long way to go to match
basic SQL.

If your data will not require this type of functionality then Hive may be an
excellent choice. In my experience, Hive provides very good query
performance relative to other Hadoop-based options for the simplest of
queries. It is also simple, which is a tremendously positive trait, in my
book at least.

I believe that Hive shines when used in combination with other available
open-source, Hadoop-based tools. In conjunction with those tools, Hive can
help to provide access to sophisticated aggregations and analytics that I
find to be on par with sophisticated traditional (non-Hadoop) data
warehousing technologies, while exceeding most of those traditional
solutions in scale.

In my experience, the following combination works well at the data size you
mentioned: Hive for the very simple aggregations and for access to large
data sets, other Hadoop-based tools for the sophisticated aggregations and
analytics, and perhaps something like Postgresql for more rapid access to
indexed, highly aggregated and therefore smaller data sets.


On Tue, Jan 25, 2011 at 11:49 AM, Sheetal Dolas <sheetal_dolas@yahoo.com> wrote:

> Hi,
>
> We are exploring hive for a very large data warehouse (Up to 2 PB data
> size) and would like to get some information
>
> 1. What are your experiences on using hive for large data warehouses
> 2. What is biggest hive implementation that you have seen
> 3. How is the query performance with petabytes of data
> 4. Details on configurations that you have used/seen (such as CPU numbers
> and capacity, Disk sizes, cost per node etc)
>
> Any help on this will enable us to make a better decision.
>
> Thanks for your help,
> Sheetal
>
>
