pig-dev mailing list archives

From "Ying He (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-979) Acummulator Interface for UDFs
Date Mon, 09 Nov 2009 23:20:32 GMT

    [ https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775184#action_12775184 ]

Ying He commented on PIG-979:

Alan, thanks for the feedback.

1. A test case already exists that mixes an accumulator UDF with a regular UDF; it is in

2. The optimizer can't be applied when inner is set on POPackage. If inner is set for an
input, POPackage checks whether the bag for that input is NULL, and if it is, POPackage
returns NULL. That check can only be done after all the tuples have been retrieved and put
into a bag.

3 & 4. Will fix those.

5. This needs performance testing.

6. The reducer gets results from POPackage and passes them to the root, which is POForEach,
to process. From POForEach's perspective, it gets a tuple with bags in it from POPackage.
POForEach then retrieves tuples off the iterator and passes them to the UDFs in multiple
cycles. Because only POPackage knows how to read tuples out of the iterator and put them in
the proper bags, AccumulativeTupleBuffer and AccumulativeBag were created to communicate
between POPackage and POForEach. Every time POForEach calls getNextBatch() on
AccumulativeTupleBuffer, it in effect calls an inner class of POPackage to retrieve tuples
out of the iterator.
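
To illustrate the pull-based flow described in point 6, here is a minimal, self-contained
sketch. It does not use the Pig classes: BatchBuffer, SumAccumulator, nextBatch(),
hasNextBatch() and sumInBatches() are hypothetical stand-ins for AccumulativeTupleBuffer, an
accumulator UDF and the POForEach loop, and the values are plain longs rather than tuples.
It only shows the consumer asking the buffer for one batch at a time while the buffer owns
the underlying iterator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class AccumulatorSketch {

    /** Hypothetical stand-in for the buffer: owns the iterator and hands out batches. */
    static final class BatchBuffer {
        private final Iterator<Long> source;
        private final int batchSize;

        BatchBuffer(Iterator<Long> source, int batchSize) {
            if (batchSize <= 0) {
                throw new IllegalArgumentException("batchSize must be positive");
            }
            this.source = source;
            this.batchSize = batchSize;
        }

        boolean hasNextBatch() {
            return source.hasNext();
        }

        /** Reads at most batchSize values off the iterator; never the whole input. */
        List<Long> nextBatch() {
            List<Long> batch = new ArrayList<>(batchSize);
            while (batch.size() < batchSize && source.hasNext()) {
                batch.add(source.next());
            }
            return batch;
        }
    }

    /** Hypothetical accumulator-style UDF: keeps a running result across batches. */
    static final class SumAccumulator {
        private long sum;

        void accumulate(List<Long> batch) {
            for (long v : batch) {
                sum += v;
            }
        }

        long getValue() {
            return sum;
        }
    }

    /** Plays the consumer's role: pulls batches and feeds them to the UDF in cycles. */
    static long sumInBatches(Iterator<Long> values, int batchSize) {
        BatchBuffer buffer = new BatchBuffer(values, batchSize);
        SumAccumulator udf = new SumAccumulator();
        while (buffer.hasNextBatch()) {
            udf.accumulate(buffer.nextBatch());
        }
        return udf.getValue();
    }

    public static void main(String[] args) {
        Iterator<Long> values = Arrays.asList(1L, 2L, 3L, 4L, 5L).iterator();
        System.out.println(sumInBatches(values, 2)); // prints 15
    }
}
```

In this sketch the producer side is called only once (when the buffer is constructed), and
all further reading is driven by the consumer, which mirrors the reason given for not making
POPackage block.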

POPackage cannot be the one to block the reading of tuples, because it is called only once
from the reducer. I also thought of changing the reducer to call POPackage multiple times,
once for each batch of data, but then it becomes tricky to maintain the correct state of
the operators, and all operators in the reducer plan would have to support partial data,
which is not necessary.

> Acummulator Interface for UDFs
> ------------------------------
>                 Key: PIG-979
>                 URL: https://issues.apache.org/jira/browse/PIG-979
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Alan Gates
>            Assignee: Ying He
>         Attachments: PIG-979.patch
> Add an accumulator interface for UDFs that would allow them to take a set number of records
> at a time instead of the entire bag.

