hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-7) Optimize execution of algebraic functions
Date Thu, 29 Nov 2007 20:25:43 GMT

     [ https://issues.apache.org/jira/browse/PIG-7?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-7:

    Patch Info: [Patch Available]

Attaching a patch that implements use of the combiner for algebraic functions in limited
situations. The combiner is only applied when all functions to be evaluated in a given
generate line are algebraic and when there is one and only one relation being grouped
(i.e., it is not applied in cogroup situations).
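
The eligibility rule above can be sketched in a few lines of Python. This is only an
illustration of the rule as stated, not the code in the patch: the function name, its
parameters, and the ALGEBRAIC set (taken from the functions used in the test script and
assumed to be algebraic) are all hypothetical.

```python
# Hypothetical sketch of the eligibility rule; not Pig's actual planner code.
# ALGEBRAIC is an assumed set, based on the functions used in the test script.
ALGEBRAIC = {"COUNT", "SUM", "AVG", "MIN", "MAX"}


def can_use_combiner(generate_functions, grouped_relations):
    """Return True when every function in the generate line is algebraic
    and exactly one relation is being grouped (so no cogroup)."""
    all_algebraic = all(f.upper() in ALGEBRAIC for f in generate_functions)
    single_relation = grouped_relations == 1
    return all_algebraic and single_relation
```

Under these assumptions, a generate line that mixes in even one non-algebraic function,
or a cogroup of two relations, would fall back to the plan without the combiner.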

Initial, very simple performance tests show a reduction in run time of ~40% (13 min -> 7.5 min
for 4 GB on 10 machines) with the following script:
a = load '/user/pig/tests/data/perf/studenttab200M';
b = group a by $0;
c = foreach b generate group, COUNT($1), SUM($1.$2), AVG($1.$2), MIN($1.$1), MAX($1.$2);
store c into 'bla';
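
To illustrate why the combiner helps for a script like the one above, here is a minimal
Python sketch of how an algebraic function such as AVG can be split into stages. The
stage names (initial, intermediate, final) and the run() driver are assumptions made for
illustration; they are not taken from the patch and do not model Hadoop itself.

```python
from collections import defaultdict


def initial(value):
    # Map side: turn one input value into a partial (sum, count) pair.
    return (value, 1)


def intermediate(partials):
    # Combiner: merge partial pairs into one pair. It must be safe to
    # apply zero or more times, so input and output have the same shape.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return (total, count)


def final(partials):
    # Reduce side: merge the remaining partials and produce the average.
    total, count = intermediate(partials)
    return total / count


def run(splits, use_combiner=True):
    """splits: one list of (key, value) records per simulated map task."""
    shuffled = defaultdict(list)
    for split in splits:
        per_key = defaultdict(list)
        for key, value in split:
            per_key[key].append(initial(value))
        for key, partials in per_key.items():
            if use_combiner:
                # One pair per key per map task crosses the "network".
                shuffled[key].append(intermediate(partials))
            else:
                # Every record crosses the "network".
                shuffled[key].extend(partials)
    return {key: final(partials) for key, partials in shuffled.items()}


splits = [[("a", 1.0), ("a", 3.0), ("b", 10.0)], [("a", 5.0), ("b", 20.0)]]
```

Calling run(splits) and run(splits, use_combiner=False) gives the same averages per key;
the difference is that with the combiner each map task ships one partial pair per key
instead of one per record, which is where the saving would come from.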

> Optimize execution of algebraic functions
> -----------------------------------------
>                 Key: PIG-7
>                 URL: https://issues.apache.org/jira/browse/PIG-7
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>            Assignee: Alan Gates
>         Attachments: combiner.patch
> Algebraic functions are functions that can be computed incrementally, like COUNT(X), SUM(X), etc.
> They can be computed efficiently by doing the first-level computation using the Hadoop combiner.
> This can give a significant (2-3x) speedup for many aggregation queries.
> Several users asked us for this feature, so it is pretty high priority.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
