hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-979) Acummulator Interface for UDFs
Date Fri, 25 Sep 2009 23:22:16 GMT

    [ https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759804#action_12759804 ]

Alan Gates commented on PIG-979:

Consider a Pig script like the following:

A = load 'bla';
B = group A by $0;
C = foreach B {
    D = order A by $1;
    generate CUMMULATIVE_SUM(D);
};

Because the UDF needs to see this data in order, it cannot be implemented using Pig's
Algebraic interface.  But it does not need to see the entire contents of the bag at once.

One way to address this is to add an Accumulator interface that UDFs could implement.

interface Accumulator<T> {
    /**
     * Pass tuples to the UDF.  The passed in bag will contain only records from one
     * key.  It may not contain all the records for one key.  This function will
     * be called repeatedly until all records from one key are provided
     * to the UDF.
     * @param b 1 or more tuples, all sharing the same key.
     */
    void accumulate(Bag b);

    /**
     * Called when all records from a key have been passed to accumulate.
     * @return the value for the UDF for this key.
     */
    T getValue();
}

In cases where all UDFs in a given foreach implement this Accumulator interface, Pig could
choose to use this method to push records to the UDFs.  It would then not need to read all
records from the Reduce iterator and cache them in memory or on disk.
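
To make the push model concrete, here is a minimal, self-contained sketch rather than
actual Pig code.  It assumes a plain List<Long> in place of Pig's bag type, and the names
AccumulatorSketch, CumulativeSum and pushInBatches are hypothetical, introduced only for
illustration.  The driver stands in for Pig reading already-ordered values for one key
from the Reduce iterator and handing them to the UDF a few at a time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class AccumulatorSketch {

    // Mirrors the proposed interface; List<Long> stands in for a bag (assumption).
    interface Accumulator<T> {
        void accumulate(List<Long> batch);
        T getValue();
    }

    // Hypothetical UDF: running totals over values that arrive already ordered.
    // The totals are collected in a plain list purely for illustration.
    static class CumulativeSum implements Accumulator<List<Long>> {
        private long total = 0;
        private final List<Long> runningTotals = new ArrayList<>();

        @Override
        public void accumulate(List<Long> batch) {
            for (long v : batch) {
                total += v;
                runningTotals.add(total);
            }
        }

        @Override
        public List<Long> getValue() {
            return runningTotals;
        }
    }

    // Hypothetical driver: pushes the values for one key to the UDF in
    // fixed-size batches instead of materializing the whole input first.
    static <T> T pushInBatches(Iterator<Long> values, int batchSize, Accumulator<T> udf) {
        List<Long> batch = new ArrayList<>(batchSize);
        while (values.hasNext()) {
            batch.add(values.next());
            if (batch.size() == batchSize) {
                udf.accumulate(batch);
                batch = new ArrayList<>(batchSize);
            }
        }
        if (!batch.isEmpty()) {
            udf.accumulate(batch); // final, possibly smaller batch
        }
        return udf.getValue();
    }

    public static void main(String[] args) {
        Iterator<Long> ordered = Arrays.asList(1L, 2L, 3L, 4L, 5L).iterator();
        List<Long> sums = pushInBatches(ordered, 2, new CumulativeSum());
        System.out.println(sums); // prints [1, 3, 6, 10, 15]
    }
}
```

In this sketch only one batch of input is held at a time, and the UDF carries its state
(the running total) across calls to accumulate.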

Before we commit to adding this new level of complexity to the language, we should
performance test it.  Given that we recently made a change aimed at addressing Pig's problem
of dying during large non-algebraic group bys (see PIG-975), this needs to perform
significantly better than that change to justify adding it.

> Acummulator Interface for UDFs
> ------------------------------
>                 Key: PIG-979
>                 URL: https://issues.apache.org/jira/browse/PIG-979
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Alan Gates
>            Assignee: Ying He
> Add an accumulator interface for UDFs that would allow them to take a set number of records
> at a time instead of the entire bag.

