datafu-dev mailing list archives

From "Eyal Allweil (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DATAFU-116) Make SetIntersect and SetDifference implement Accumulator
Date Tue, 08 Mar 2016 18:18:40 GMT

    [ https://issues.apache.org/jira/browse/DATAFU-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185409#comment-15185409
] 

Eyal Allweil commented on DATAFU-116:
-------------------------------------

As far as I can tell, when the accumulator is used, Pig passes _pig.accumulative.batchsize_
tuples from each bag until all the tuples are exhausted. I think an implementation that iterates
over the bags and only keeps some of the tuples in between batches is possible - hopefully
very few, but the worst case is all of them, which is no worse than the current implementation.
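
To make the idea above concrete, here is a minimal sketch of a batch-wise intersection of two sorted bags. It is plain Java and deliberately does not use Pig's Accumulator interface or DataFu's SetIntersect code; the class name BatchedSortedIntersect and its methods are hypothetical, and it assumes each call receives the next sorted slice of each bag and that elements within a bag are distinct. Only the elements that are larger than everything seen so far in the other bag are carried over to the next batch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: not DataFu's SetIntersect and not Pig's Accumulator interface.
// Assumes two bags of distinct integers, each delivered in ascending order, batch by batch.
class BatchedSortedIntersect {
    // Elements received so far that have neither matched nor been ruled out.
    private final Deque<Integer> pendingA = new ArrayDeque<>();
    private final Deque<Integer> pendingB = new ArrayDeque<>();
    private final List<Integer> result = new ArrayList<>();

    // Called once per batch; each list is the next sorted slice of its bag.
    void accumulate(List<Integer> batchA, List<Integer> batchB) {
        pendingA.addAll(batchA);
        pendingB.addAll(batchB);
        // Standard sorted merge: the smaller head can never match a later
        // element of the other bag, so it can be dropped immediately.
        while (!pendingA.isEmpty() && !pendingB.isEmpty()) {
            int cmp = pendingA.peekFirst().compareTo(pendingB.peekFirst());
            if (cmp == 0) {
                result.add(pendingA.pollFirst());
                pendingB.pollFirst();
            } else if (cmp < 0) {
                pendingA.pollFirst();
            } else {
                pendingB.pollFirst();
            }
        }
        // At most one queue is non-empty here; its contents are kept
        // until the other bag catches up in a later batch.
    }

    // Number of elements currently held between batches.
    int pendingCount() {
        return pendingA.size() + pendingB.size();
    }

    List<Integer> getValue() {
        return result;
    }

    public static void main(String[] args) {
        BatchedSortedIntersect acc = new BatchedSortedIntersect();
        acc.accumulate(Arrays.asList(1, 3, 5), Arrays.asList(5, 6, 7));
        acc.accumulate(Arrays.asList(7, 9), Arrays.asList(8, 9));
        System.out.println(acc.getValue());
    }
}
```

In this sketch the worst case arises when one bag runs entirely ahead of the other, so that all of its elements stay pending; that matches the "no worse than the current implementation" bound, while interleaved inputs keep the pending queues short.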

I'm assuming Pig passes batches in this way based on the code in [POPackage|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPackage.java]
and from looking through all the documentation I could find on accumulators. If I'm wrong
about this, an accumulator implementation isn't worthwhile.

> Make SetIntersect and SetDifference implement Accumulator
> ---------------------------------------------------------
>
>                 Key: DATAFU-116
>                 URL: https://issues.apache.org/jira/browse/DATAFU-116
>             Project: DataFu
>          Issue Type: Improvement
>    Affects Versions: 1.3.0
>            Reporter: Eyal Allweil
>
> SetIntersect and SetDifference accept only sorted bags, and the output is always smaller
> than the inputs. Therefore an accumulator implementation should be possible and it will improve
> memory usage (somewhat) and allow Pig to optimize loops with these operations better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
