pig-dev mailing list archives

From Jie Li <ji...@cs.duke.edu>
Subject Re: Running TPC-H on Pig
Date Fri, 02 Dec 2011 21:27:58 GMT
Yeah sure. We are just about to post them.

Jie

On Tue, Nov 29, 2011 at 8:18 PM, Jonathan Coveney <jcoveney@gmail.com> wrote:

> If you want some feedback on how to make the scripts faster, feel free
> to post them.
>
> 2011/11/29 Jie Li <jieli@cs.duke.edu>
>
> > Did you mean the two update functions of TPC-H? I think we can leave them
> > out, as Hive did, since Hadoop is usually not used for updates.
> >
> > Jie
> >
> > On Tue, Nov 29, 2011 at 2:42 PM, Santhosh Srinivasan <sms@yahoo-inc.com> wrote:
> >
> > > Please do. The association with TPC-H might be tricky, as it mandates
> > > concurrent data modification. Nevertheless, the benchmark will be very
> > > useful, as you point out.
> > >
> > > -----Original Message-----
> > > From: Jie Li [mailto:jieli@cs.duke.edu]
> > > Sent: Tuesday, November 29, 2011 11:38 AM
> > > To: dev@pig.apache.org
> > > Subject: Running TPC-H on Pig
> > >
> > > Hello everyone,
> > >
> > > As people are usually more concerned about performance, we need more
> > > benchmarks to identify the bottlenecks in Pig's performance. For a class
> > > project we developed a whole set of Pig scripts for TPC-H. Though Pig was
> > > not designed for this RDBMS benchmark, it does support most of the
> > > relational operators, like join and aggregation, which can be optimized
> > > based on this benchmark. Besides that, we can also demonstrate how to
> > > write efficient Pig scripts by making full use of Pig Latin's features.
> > >
> > > Here is what we did:
> > > 1) Wrote correct Pig scripts for TPC-H, verifying them on 1GB of data.
> > > 2) Demonstrated the flexibility of Pig Latin by using the COGROUP
> > > operator to implement join.
> > > 3) Showed how to optimize the join by slightly reordering it or using
> > > replicated join. We think Pig should be able to apply more heuristic
> > > optimizations to the join, such as evaluating the smaller join first,
> > > using replicated join for small tables, and putting the larger table on
> > > the right side of the hash join.
> > > 4) Identified the poor performance of aggregation. Pig doesn't yet
> > > support hash-based aggregation, so it's extremely slow for aggregation.
> > > Good to know that Pig is just about to support it :)
> > >
> > > As TPC-H is well known, a good benchmark result can help change people's
> > > impression that Pig is slow. Actually, we compared Pig and Hive and found
> > > that Pig is not necessarily slower than Hive. I wonder if we can create a
> > > JIRA for this project.
> > >
> > > Thanks,
> > > Jie Li
> > > PhD Candidate of Computer Science
> > > Duke University
> > >
> > >
> >
>
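
Items 2) and 3) of the original message mention implementing a join with COGROUP and using a replicated join for small tables. The scripts themselves are not in this thread, so the following is only a minimal Python sketch of what those two strategies do conceptually, not the authors' Pig code. The function names and the toy `orders`/`customers` rows are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict
from itertools import product


def cogroup(left, right, key_left, key_right):
    """Group two relations by key, yielding key -> (left bag, right bag)."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[key_left(row)][0].append(row)
    for row in right:
        groups[key_right(row)][1].append(row)
    return groups


def join_via_cogroup(left, right, key_left, key_right):
    """Inner join built on cogroup: flatten each pair of bags.

    The cross product of an empty bag with anything is empty, so keys
    that appear on only one side contribute no output rows.
    """
    out = []
    for lbag, rbag in cogroup(left, right, key_left, key_right).values():
        out.extend(l + r for l, r in product(lbag, rbag))
    return out


def replicated_join(large, small, key_large, key_small):
    """Map-side join: hash the small relation in memory, stream the large one."""
    table = defaultdict(list)
    for row in small:
        table[key_small(row)].append(row)
    return [l + s for l in large for s in table.get(key_large(l), [])]


# Hypothetical toy data: (orderkey, custkey) and (custkey, name).
orders = [(10, 1), (11, 1), (12, 3)]
customers = [(1, "alice"), (2, "bob")]

by_cogroup = join_via_cogroup(orders, customers, lambda o: o[1], lambda c: c[0])
by_replication = replicated_join(orders, customers, lambda o: o[1], lambda c: c[0])
print(sorted(by_cogroup))
print(by_replication)
```

Both functions produce the same joined rows. The difference is that the cogroup version has to bring both relations together by key first, while the replicated version only needs the small relation to fit in memory, which is why it suits small tables.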
