From: Ashish Thusoo
To: "hive-dev@hadoop.apache.org"
Date: Mon, 9 Nov 2009 11:53:07 -0800
Subject: RE: Hive Performance

There are a bunch of optimizations that deal with skewed data in Hive as well. The optimizer is rule based and the user has to hint the query, similar to what is done in an RDBMS. We have mostly done our performance work on the benchmark published in the SIGMOD paper.

Ashish

-----Original Message-----
From: Edward Capriolo [mailto:edlinuxguru@gmail.com]
Sent: Saturday, November 07, 2009 11:19 AM
To: hive-dev@hadoop.apache.org
Subject: Re: Hive Performance

A friend and I were discussing Pig vs Hive in general yesterday. On the surface Hive is an SQL-like language, while Pig is its own language, 'Pig Latin'. However, in the end I think they both end up doing column projections, joins, etc. It is a similar operation happening on the same cluster, so I expect the performance will eventually be similar. Now, Pig offering more SQL support is a large undertaking.

While Pig looks very versatile, I recently emulated the example on Cloudera's blog for GeoIP locating traffic in Pig. I did this in Hive with an external Perl script using map/transform. (It did not take a page-long Pig program.) I also think the Hive UDF framework can be used in place of many Piggybank functions. Also, unless I am missing something, a UDF is native Java. It seems like Piggybank functions are going to be piping/streaming output; I can't see that performing better.

To backtrack: if Pig adds SQL, will we need Hive? If Hive adds something like T-SQL, will we need Pig?

On 11/7/09, Rob Stewart wrote:
> Hi there. I'm in the process of writing a paper, and as part of it I aim
> to write (yet another) comparative study on various interfaces with Hadoop.
>
> This will almost certainly include Pig and Hive, probably MapReduce,
> and maybe JAQL.
>
> I have read the papers published on the Hive JIRA (Pig vs Hive vs
> MapReduce for 2 queries, an aggregation, and a join). I am, however,
> wanting to know a bit from the Hive community.
>
> 1. Do you guys (the Hive developers) have a standardized benchmarking
> tool to use prior to each Hive release? I am thinking of something
> similar to PigMix, used by the Pig developers. In case you don't know,
> PigMix is a set of 12 designed queries, implemented in Pig and Java
> Hadoop, and comparisons are made on execution time. Does the Hive
> community have something similar?
>
> 2. The Pig wiki points out some unique features of Pig that allow
> optimal execution performance. For instance, they have a method to
> optimize queries on skewed data (by taking samples of the data for
> reduce key allocations). Is there something about the implementation of
> Hive that gives it some functionality not found in other interfaces?
> And better still, would there be some Hive implementation that could work
> as a proof of concept to show any optimized features of Hive?
>
> 3. One section suggested for investigation within the Pig development
> team is to create a SQL-like language that could be compiled down
> through Pig to MR jobs. If such a project was to achieve parity with
> Hive's SQL-like interface, where would the distinction be between Pig
> and Hive? Certainly, from a user's perspective, there would be very
> little difference.
> If the only difference turns out to be the execution performance
> achieved by one interface over another, where would this put the
> inferior interface (be that either Pig or Hive) in terms of its
> relevance in the Hadoop software stack?
>
> Many thanks,
>
> Rob Stewart
>