From Ashish Thusoo <>
Subject RE: Hive Performance
Date Mon, 09 Nov 2009 19:53:07 GMT
Hive also has a number of optimizations that deal with skewed data. The optimizer
is rule-based, and the user has to hint the query, similar to what is done in an RDBMS. We have
mostly done our performance work on the benchmark published in the SIGMOD paper.
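
To make the skew point concrete, here is a minimal Python sketch of one generic mitigation, two-stage aggregation. It is an illustration under stated assumptions, not Hive's implementation: the function name two_stage_count and its parameters are hypothetical, and the reducers are simulated in-process.

```python
import random
from collections import Counter


def two_stage_count(keys, num_reducers=4, seed=0):
    """Count rows per key in two stages so that a single hot key
    is not funnelled into one reducer (simulated in-process)."""
    rng = random.Random(seed)

    # Stage 1: distribute rows randomly (not by key) and let each
    # simulated reducer build partial counts for whatever it receives.
    partials = [Counter() for _ in range(num_reducers)]
    for key in keys:
        partials[rng.randrange(num_reducers)][key] += 1

    # Stage 2: merge the partial counts by key. The input here is at most
    # one entry per (reducer, key) pair, so the hot key is cheap to merge.
    final = Counter()
    for partial in partials:
        final.update(partial)

    # Also return the per-reducer row load from stage 1 for inspection.
    loads = [sum(p.values()) for p in partials]
    return final, loads
```

With a heavily skewed input such as 1000 copies of one key plus a few others, the stage 1 load is spread across all simulated reducers instead of landing on one, while the final counts stay the same as a plain group-by.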


-----Original Message-----
From: Edward Capriolo [] 
Sent: Saturday, November 07, 2009 11:19 AM
Subject: Re: Hive Performance

A friend and I were discussing Pig vs. Hive in general yesterday. On the surface, Hive is an
SQL-like language and Pig is its own language, 'Pig Latin'; however, in the end I think they both
end up doing column projections, joins, etc. It is a similar operation happening
on the same cluster, so I expect the performance will eventually be similar.
Now, Pig offering more SQL support is a large undertaking.

While Pig looks very versatile, I recently emulated the example on Cloudera's blog for GeoIP
locating traffic in Pig. I did this in Hive with an external Perl script using map/transform.
(It did not take a page-long Pig program.) I also think the Hive UDF framework can be used
in place of many Piggybank functions. Also, unless I am missing something, a UDF is native Java.
Seems like Piggybank functions are going to be piping/streaming output. I can't see that performing
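
For readers unfamiliar with the map/transform approach, the sketch below shows the general shape of such an external script. It is written in Python rather than the Perl used above, and it is only an illustration: the lookup table GEO_TABLE, the helper names locate, transform and main, and the assumption that the IP address is the first column are all hypothetical. It relies on the streaming convention that rows arrive on stdin as tab-separated lines and are written back to stdout the same way.

```python
import sys

# Hypothetical prefix-to-location table for illustration only; a real
# script would consult an actual GeoIP database instead.
GEO_TABLE = {"10.": "LAN", "192.168.": "LAN"}


def locate(ip):
    """Return a location label for an IP address, or UNKNOWN if no prefix matches."""
    for prefix, location in GEO_TABLE.items():
        if ip.startswith(prefix):
            return location
    return "UNKNOWN"


def transform(lines):
    """Append a location column to each tab-separated input row.

    Assumes (hypothetically) that the first column holds the IP address.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield "\t".join(fields + [locate(fields[0])])


def main(stream=sys.stdin, out=sys.stdout):
    """Read rows from stream and write the transformed rows to out."""
    for row in transform(stream):
        out.write(row + "\n")
```

To use it as a streaming script, call main() at the bottom of the file. The per-row logic lives in transform so it can be exercised without a cluster.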

To backtrack: if Pig adds SQL, will we need Hive? If Hive adds something like T-SQL, will we
need Pig?

On 11/7/09, Rob Stewart <> wrote:
> Hi there. I'm in the process of writing a paper, and as part of it I aim 
> to write (yet another) comparative study on various interfaces with Hadoop.
> This will almost certainly include Pig and Hive, probably MapReduce, 
> and maybe JAQL.
> I have read the papers published on the Hive JIRA (Pig vs. Hive vs. 
> MapReduce for 2 queries, an aggregation, and a join). I would, however, 
> like to hear a bit from the Hive community.
> 1. Do you guys (the Hive developers) have a standardized benchmarking 
> tool to use prior to each Hive release? I am thinking of something 
> similar to PigMix, used by the Pig developers. In case you don't know, 
> PigMix is a set of 12 designed queries, implemented in Pig and Java 
> Hadoop, and comparisons are made on execution time. Does the Hive community have something
> 2. The Pig wiki points out some unique features of Pig that allow 
> optimal execution performance. For instance, they have methods to 
> optimize queries on skewed data (by taking samples of the data for 
> reduce key allocations). Is there something about the implementation of 
> Hive that gives it some functionality not found in other interfaces? 
> And better still, would there be some Hive implementation that could work 
> as a proof of concept to show any optimized features of Hive?
> 3. One section suggested for investigation within the Pig development 
> team is to create a SQL-like language that could be compiled down 
> through Pig to MR jobs. If such a project were to achieve parity with 
> Hive's SQL-like interface, where would the distinction be between Pig and Hive?
> Certainly, from a user's perspective, there would be very little difference.
> If the only difference turns out to be the execution performance 
> achieved by one interface over another, where would this put the 
> inferior interface (be that either Pig or Hive) in terms of its 
> relevance in the Hadoop software stack?
> Many thanks,
> Rob Stewart

