datafu-dev mailing list archives

From "Matthew Hayes (JIRA)" <>
Subject [jira] [Commented] (DATAFU-116) Make SetIntersect and SetDifference implement Accumulator
Date Wed, 09 Mar 2016 17:56:40 GMT


Matthew Hayes commented on DATAFU-116:

bq. but the worst case is all of them, which is no worse than the current implementation.

I think it would actually be worse than the current implementation. Pig does not keep the
entire input bags in memory. I'm not an expert on Pig internals, but I believe that as you iterate
through the members of a DataBag, it loads the data from disk in chunks. Without this,
it wouldn't be possible to operate on bags larger than what fits in memory.
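
To illustrate the memory point, here is a minimal sketch of a merge-style intersection over two sorted inputs. It is not DataFu's SetIntersect code: the class name SortedIntersectSketch is hypothetical, and plain Java iterators stand in for iterating over sorted bags. It assumes both inputs are ascending and free of duplicates.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch, not DataFu's SetIntersect: plain Java iterators stand in
// for iterating over two sorted Pig bags.
final class SortedIntersectSketch {
    // Merge-style intersection of two ascending, duplicate-free sequences.
    // Only the current element of each input is held while scanning.
    static <T extends Comparable<T>> List<T> intersect(Iterator<T> left, Iterator<T> right) {
        List<T> out = new ArrayList<>();
        if (!left.hasNext() || !right.hasNext()) {
            return out; // one side is empty, so the intersection is empty
        }
        T a = left.next();
        T b = right.next();
        while (true) {
            int cmp = a.compareTo(b);
            if (cmp == 0) {
                out.add(a); // present on both sides
                if (!left.hasNext() || !right.hasNext()) break;
                a = left.next();
                b = right.next();
            } else if (cmp < 0) {
                if (!left.hasNext()) break;
                a = left.next(); // advance the side with the smaller element
            } else {
                if (!right.hasNext()) break;
                b = right.next();
            }
        }
        return out;
    }
}
```

Because each input is consumed through an iterator, the scan itself never needs a whole input in memory. The result list here is collected only to keep the sketch self-contained. A version that had to buffer input elements between calls would, in the worst case, hold all of them in memory at once.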

> Make SetIntersect and SetDifference implement Accumulator
> ---------------------------------------------------------
>                 Key: DATAFU-116
>                 URL:
>             Project: DataFu
>          Issue Type: Improvement
>    Affects Versions: 1.3.0
>            Reporter: Eyal Allweil
> SetIntersect and SetDifference accept only sorted bags, and the output is always smaller
> than the inputs. Therefore an accumulator implementation should be possible and it will improve
> memory usage (somewhat) and allow Pig to optimize loops with these operations better.

This message was sent by Atlassian JIRA
