hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-200) Pig Performance Benchmarks
Date Thu, 04 Dec 2008 22:52:46 GMT

     [ https://issues.apache.org/jira/browse/PIG-200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-200:

    Attachment: perf.patch

The attached patch takes a different approach to providing a set of benchmarks for
Pig.  It contains 14 queries designed to cover a range of the ways users
use Pig.  It also includes implementations of the same queries in Java MapReduce code,
so that developers can compare Pig performance against MapReduce performance.  See http://wiki.apache.org/pig/PigMix
for information on how the queries were chosen, how the data is constructed, and results from
an initial run of Pig 0.1.0 versus the code that will soon become Pig 0.2.0.

This attachment is not ready for inclusion in the code.  It has several issues.

# The library used to generate the Zipf distributions in the data is under the GNU public
license, and thus cannot be included.  The library can be obtained at http://www.eli.sdsu.edu/java-SDSU/
# The data generation script is single-threaded because the Zipf distribution generator is.
 This means that generating 10m rows of data (about 15G) takes ~48 hours.  I'd like to be able
to generate larger data sets, but first I need to find a parallel Zipf distribution generator
that has a compatible license (or write one, which I don't really want to do).
# There are places in the code (particularly the MapReduce code) where path names etc. are
hard-coded to locations in my test setup.  These need to be generalized.
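
For illustration only, here is a minimal sketch of how a Zipf generator could be made parallel: each worker gets its own seed and draws its share of the rows independently by inverse-CDF sampling. This is not the SDSU library and not what the patch does. The names build_cdf, zipf_chunk and generate, and all parameters, are hypothetical, and using consecutive integers as per-worker seeds is a simplification.

```python
import bisect
import itertools
import random


def build_cdf(n_values, exponent=1.0):
    """Cumulative probabilities for ranks 1..n_values, with P(rank k) ~ 1 / k**exponent."""
    weights = [1.0 / (k ** exponent) for k in range(1, n_values + 1)]
    total = sum(weights)
    cdf = list(itertools.accumulate(w / total for w in weights))
    cdf[-1] = 1.0  # guard against floating-point drift in the last bucket
    return cdf


def zipf_chunk(task):
    """Draw one worker's share of ranks from its own seeded generator.

    `task` is a (seed, count, n_values, exponent) tuple so that the function
    takes a single argument and can be handed to any map-style function.
    """
    seed, count, n_values, exponent = task
    rng = random.Random(seed)  # independent stream per worker (assumption: distinct seeds suffice)
    cdf = build_cdf(n_values, exponent)
    # Inverse-CDF sampling: the rank is the first bucket whose cumulative
    # probability exceeds the uniform draw.
    return [bisect.bisect_right(cdf, rng.random()) + 1 for _ in range(count)]


def generate(total_rows, workers, n_values, exponent=1.0, base_seed=0, map_fn=map):
    """Split total_rows across workers and concatenate the resulting chunks."""
    base, extra = divmod(total_rows, workers)
    tasks = [
        (base_seed + i, base + (1 if i < extra else 0), n_values, exponent)
        for i in range(workers)
    ]
    chunks = map_fn(zipf_chunk, tasks)
    return [rank for chunk in chunks for rank in chunk]
```

With the default map_fn the chunks are produced one after another in a single process. Because every chunk depends only on its own task tuple, passing the map method of a multiprocessing.Pool as map_fn (from a script guarded by if __name__ == "__main__":) should spread the work across processes and yield the same ranks for the same seeds.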

> Pig Performance Benchmarks
> --------------------------
>                 Key: PIG-200
>                 URL: https://issues.apache.org/jira/browse/PIG-200
>             Project: Pig
>          Issue Type: Task
>            Reporter: Amir Youssefi
>         Attachments: generate_data.pl, perf.patch
> To benchmark Pig performance, we need a TPC-H-like large data set plus a script
collection. This would be used to compare different Pig releases, and Pig against other systems (e.g.
Pig + Hadoop vs. Hadoop only).
> Here is the wiki for small tests: http://wiki.apache.org/pig/PigPerformance
> I am currently running long-running Pig scripts over data sets on the order of tens of
TBs. The next step is hundreds of TBs.
> We need an open large data set (open source scripts which generate the data set)
and detailed scripts for important operations such as ORDER, AGGREGATION etc.
> We can call those the Pig Workouts: Cardio (short processing), Marathon (long-running
scripts) and Triathlon (mix).
> I will update this JIRA with more details of current activities soon.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
