hadoop-pig-dev mailing list archives

From "David Ciemiewicz (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-807) PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop values iterator)
Date Thu, 28 May 2009 23:49:45 GMT

    [ https://issues.apache.org/jira/browse/PIG-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714231#action_12714231 ]

David Ciemiewicz commented on PIG-807:

I wonder if there is also a need for some additional classes of functions that go along
with ReadOnce / Streaming applications:
Accumulating Functions that operate on ordered data and output a tuple for each and every
tuple read.

For instance, cumulative sums, rank, dense rank, and cumulative proportions could all be written
as Accumulating Functions that operate on streams.

From my Perl example above, cumulative sum would be a function that does:

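sub accumulate {
        my $self = shift;
        my $value = shift;

        # Add the new value to the running total kept on the object.
        $self->{'sum'} += $value;

        # Return the cumulative sum so far, once per input value.
        return $self->{'sum'};
}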

These kinds of functions would be different from the SUM, COUNT, MIN, MAX, .. Accumulating

I think that any design or redesign of Pig to support ReadOnce data should also consider
these kinds of cumulative-sum-type functions.
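
To make the idea concrete in Java terms, here is a minimal sketch of a function that emits one output per input while making a single forward pass. The names StreamingFunction, CumulativeSum and StreamingDemo are hypothetical and are not part of Pig. Collecting the outputs in a list is only for the demo.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical interface (not a Pig API): returns one output for every input value.
interface StreamingFunction<I, O> {
    O accumulate(I value);
}

// Cumulative sum: keeps a running total and returns it after each value.
class CumulativeSum implements StreamingFunction<Double, Double> {
    private double sum = 0.0;

    @Override
    public Double accumulate(Double value) {
        sum += value;
        return sum;
    }
}

class StreamingDemo {
    // Makes a single forward pass over the input and never rewinds it.
    static <I, O> List<O> apply(StreamingFunction<I, O> fn, Iterator<I> input) {
        List<O> out = new ArrayList<>();
        while (input.hasNext()) {
            out.add(fn.accumulate(input.next()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Double> values = Arrays.asList(1.0, 2.0, 3.0, 4.0);
        // Prints [1.0, 3.0, 6.0, 10.0]
        System.out.println(apply(new CumulativeSum(), values.iterator()));
    }
}
```

Rank, dense rank and cumulative proportions would fit the same shape, each keeping a small amount of state between calls.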

> PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop values iterator)
> ------------------------------------------------------------------------------------------------
>                 Key: PIG-807
>                 URL: https://issues.apache.org/jira/browse/PIG-807
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.2.1
>            Reporter: Pradeep Kamath
>             Fix For: 0.3.0
> Currently all bags resulting from a group or cogroup are materialized as bags containing
> all of the contents. The issue with this is that if a particular key has many corresponding
> values, all these values get stuffed into a bag which may run out of memory and hence spill,
> causing a slowdown in performance and sometimes memory exceptions. In many cases, the UDFs which
> use these bags coming out of a group or cogroup only need to iterate over the bag in a unidirectional,
> read-once manner. This can be implemented by having the bag implement its iterator by simply
> iterating over the underlying Hadoop iterator provided in the reduce. This kind of bag is
> also needed in http://issues.apache.org/jira/browse/PIG-802, so the code can be reused for
> this issue too. The other part of this issue is to have some way for the UDFs to communicate
> to Pig that any input bags that they need are "read once" bags. This can be achieved by having
> an interface, say "UsesReadOnceBags", which serves as a tag to indicate the intent to
> Pig. Pig can then rewire its execution plan to use ReadOnceBags if feasible.
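
The two parts of the proposal quoted above can be sketched roughly as follows. This is an illustration under stated assumptions, not Pig's actual implementation: only the name UsesReadOnceBags comes from the issue text, while ReadOnceBag, CountValues and ReadOnceDemo are hypothetical. A plain java.util.Iterator stands in for the Hadoop values iterator.

```java
import java.util.Arrays;
import java.util.Iterator;

// Tag interface named in the issue: no methods, it only signals intent.
interface UsesReadOnceBags {}

// Hypothetical bag that streams from an underlying iterator instead of
// materializing all values in memory.
class ReadOnceBag<T> implements Iterable<T> {
    private final Iterator<T> source;
    private boolean consumed = false;

    ReadOnceBag(Iterator<T> source) {
        this.source = source;
    }

    @Override
    public Iterator<T> iterator() {
        // A second pass is impossible because the values were never stored.
        if (consumed) {
            throw new IllegalStateException("read-once bag already iterated");
        }
        consumed = true;
        return source;
    }
}

// Hypothetical UDF that needs only one forward pass over its input bag.
class CountValues implements UsesReadOnceBags {
    long exec(Iterable<?> bag) {
        long n = 0;
        for (Object ignored : bag) {
            n++;
        }
        return n;
    }
}

class ReadOnceDemo {
    public static void main(String[] args) {
        CountValues udf = new CountValues();
        // A planner could test for the tag before choosing a bag implementation.
        System.out.println(udf instanceof UsesReadOnceBags);
        ReadOnceBag<String> bag =
                new ReadOnceBag<>(Arrays.asList("a", "b", "c").iterator());
        System.out.println(udf.exec(bag));
    }
}
```

Failing loudly on a second call to iterator() is one possible design choice. It surfaces UDFs that declare the tag but actually need to re-read their input.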

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.