pig-dev mailing list archives

From Renato Marroquín Mogrovejo <renatoj.marroq...@gmail.com>
Subject Re: Running TPC-H on Pig
Date Fri, 02 Dec 2011 21:59:42 GMT
My bad, I was talking about TPC-DS (:
I used TPC-DS to test Pig joins, but I didn't actually think of
comparing it with Hive because Hive already has ongoing projects for
its cost-based optimizer, and I thought it wouldn't be a fair
comparison. But I guess your work is related to the Starfish system,
right?
Anyway, I hope to see your benchmark.

Renato M.


2011/12/2 Jie Li <jieli@cs.duke.edu>:
> TPC-E is for transaction processing, so why is it better for evaluating
> Hadoop-related systems?
>
> We are benchmarking the whole queries. We found that some simple heuristics
> work very well so far. No doubt statistics would help make an even
> better query plan.
>
> Jie
>
> On Wed, Nov 30, 2011 at 12:18 AM, Renato Marroquín Mogrovejo <
> renatoj.marroquin@gmail.com> wrote:
>
>> Hey,
>>
>> Why didn't you use TPC-E? And what exactly are you guys
>> benchmarking, i.e. specific components of both systems or the whole queries?
>> Because Hive is already able to use some basic statistics but Pig isn't, and
>> at least until HCat is ready it won't be able to take full advantage of
>> them.
>>
>> Renato M.
>> On Nov 29, 2011 8:18 PM, "Jonathan Coveney" <jcoveney@gmail.com> wrote:
>>
>> > If you want some feedback on how to make the scripts faster, feel free
>> > to post them.
>> >
>> > 2011/11/29 Jie Li <jieli@cs.duke.edu>
>> >
>> > > Did you mean the two update functions of TPC-H? I think we can leave them
>> > > out as Hive did, since Hadoop is usually not used for updates.
>> > >
>> > > Jie
>> > >
>> > > On Tue, Nov 29, 2011 at 2:42 PM, Santhosh Srinivasan <sms@yahoo-inc.com> wrote:
>> > >
>> > > > Please do. The association with TPC-H might be tricky as it mandates
>> > > > concurrent data modification. Nevertheless, the benchmark will be very
>> > > > useful, as you point out.
>> > > >
>> > > > -----Original Message-----
>> > > > From: Jie Li [mailto:jieli@cs.duke.edu]
>> > > > Sent: Tuesday, November 29, 2011 11:38 AM
>> > > > To: dev@pig.apache.org
>> > > > Subject: Running TPC-H on Pig
>> > > >
>> > > > Hello everyone,
>> > > >
>> > > > As people are usually more concerned about performance, we need more
>> > > > benchmarks to identify the bottlenecks in Pig's performance. For a class
>> > > > project we developed a whole set of Pig scripts for TPC-H. Though Pig was not
>> > > > designed for this RDBMS benchmark, it does support most of the relational
>> > > > operators, like join and aggregation, which can be optimized based on this
>> > > > benchmark. Besides that, we can also demonstrate how to write efficient Pig
>> > > > scripts by making full use of Pig Latin's features.
>> > > >
>> > > > Here is what we did:
>> > > > 1) wrote correct Pig scripts for TPC-H, verifying them on 1GB of data.
>> > > > 2) demonstrated the flexibility of Pig Latin by using the COGROUP operator
>> > > > to implement join.
>> > > > 3) showed how to optimize the join by slightly reordering or using
>> > > > replicated join. We think Pig should be able to have more heuristic
>> > > > optimizations for joins, such as evaluating the smaller join first,
>> > > > using replicated join for small tables, and putting the larger table on
>> > > > the right side of the hash join.
>> > > > 4) identified the poor performance of aggregation. Pig doesn't yet support
>> > > > hash-based aggregation, so it's extremely slow for aggregation. Good to
>> > > > know that Pig is just about to support it :)
>> > > >
>> > > > As TPC-H is well known, a good benchmark result can help change people's
>> > > > impression that Pig is slow. Actually, we compared Pig and Hive and found
>> > > > that Pig is not necessarily slower than Hive. I wonder if we can create a
>> > > > JIRA for this project.
>> > > >
>> > > > Thanks,
>> > > > Jie Li
>> > > > PhD Candidate of Computer Science
>> > > > Duke University
>> > > >
>> > > >
>> > >
>> >
>>
>
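
For readers of the archive, the two join strategies named in the original post (a join expressed through COGROUP, and a replicated join) can be illustrated with a small stand-in. This is only a sketch in Python, not Pig Latin and not the scripts from the thread: the helper names (`cogroup`, `join_via_cogroup`, `replicated_join`) and the toy rows are hypothetical, and it models only the logic of each strategy, not how Pig executes it on Hadoop.

```python
from collections import defaultdict

def cogroup(left, right, key_left, key_right):
    """Group both inputs by key: key -> (bag of left rows, bag of right rows)."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[key_left(row)][0].append(row)
    for row in right:
        groups[key_right(row)][1].append(row)
    return groups

def join_via_cogroup(left, right, key_left, key_right):
    """Inner join built on cogroup: per key, cross the two bags.
    A key with an empty bag on either side produces no output rows."""
    out = []
    for lrows, rrows in cogroup(left, right, key_left, key_right).values():
        for l in lrows:
            for r in rrows:
                out.append(l + r)
    return out

def replicated_join(large, small, key_large, key_small):
    """Inner join that holds the small input in an in-memory hash table
    and streams the large input past it (no grouping of the large side)."""
    table = defaultdict(list)
    for row in small:
        table[key_small(row)].append(row)
    out = []
    for row in large:
        for match in table.get(key_large(row), []):
            out.append(row + match)
    return out

# Hypothetical toy data, loosely shaped like TPC-H orders and lineitems.
orders = [(1, "cust_a"), (2, "cust_b"), (3, "cust_a")]   # (orderkey, customer)
lineitems = [(1, 10.0), (1, 5.0), (3, 7.5), (4, 1.0)]    # (orderkey, price)

first_field = lambda row: row[0]
a = join_via_cogroup(lineitems, orders, first_field, first_field)
b = replicated_join(lineitems, orders, first_field, first_field)
```

Both functions return the same rows; the difference is that the replicated variant never groups or shuffles the large input, which is why it suits joins where one table is small enough to fit in memory. In Pig Latin itself the corresponding constructs are `COGROUP` followed by `FLATTEN`, and `JOIN ... USING 'replicated'`.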
