Date: Tue, 8 Mar 2016 18:18:40 +0000 (UTC)
From: "Eyal Allweil (JIRA)"
To: dev@datafu.incubator.apache.org
Subject: [jira] [Commented] (DATAFU-116) Make SetIntersect and SetDifference implement Accumulator

    [ https://issues.apache.org/jira/browse/DATAFU-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185409#comment-15185409 ]

Eyal Allweil commented on DATAFU-116:
-------------------------------------

As far as I can tell, when the accumulator is used, Pig passes _pig.accumulative.batchsize_ tuples from each bag until all the tuples are exhausted. I think an implementation that iterates over the bags and only keeps some of the tuples in between batches is possible - hopefully very few, but the worst case is all of them, which is no worse than the current implementation.

I'm assuming Pig passes batches in this way based on the code in [POPackage|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPackage.java] and from looking through all the documentation I could find on accumulators. If I'm wrong, it does mean that an accumulator implementation isn't worthwhile.

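
The batch-wise idea in the comment above can be sketched roughly as follows. This is only an illustration under the same assumption the comment makes (each call delivers the next sorted chunk of every bag, and an exhausted bag delivers an empty chunk). It uses plain Java collections of distinct integers rather than Pig's Tuple and DataBag types, and the class name StreamingIntersectSketch and its methods are hypothetical, not DataFu or Pig code. The method names only loosely follow the accumulate/getValue shape of Pig's Accumulator interface.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: intersect several sorted inputs that arrive in batches,
// keeping only the elements that cannot be decided yet between batches.
class StreamingIntersectSketch {
    // One queue of undecided elements per input bag.
    private final List<Deque<Integer>> pending = new ArrayList<>();
    private final List<Integer> output = new ArrayList<>();

    StreamingIntersectSketch(int inputs) {
        for (int i = 0; i < inputs; i++) {
            pending.add(new ArrayDeque<>());
        }
    }

    // Assumption: batches.get(i) is the next sorted chunk of input i
    // (empty once that input is exhausted).
    void accumulate(List<List<Integer>> batches) {
        for (int i = 0; i < pending.size(); i++) {
            pending.get(i).addAll(batches.get(i));
        }
        drain();
    }

    // Emit or discard everything that can already be decided.
    private void drain() {
        while (true) {
            // The largest head is the smallest value that could still be common.
            int max = Integer.MIN_VALUE;
            for (Deque<Integer> q : pending) {
                if (q.isEmpty()) {
                    return; // need more data from this input before deciding
                }
                max = Math.max(max, q.peekFirst());
            }
            boolean allEqual = true;
            for (Deque<Integer> q : pending) {
                // Anything below max is missing from at least one input: drop it.
                while (!q.isEmpty() && q.peekFirst() < max) {
                    q.pollFirst();
                }
                if (q.isEmpty()) {
                    return;
                }
                if (q.peekFirst() != max) {
                    allEqual = false;
                }
            }
            if (allEqual) {
                output.add(max);
                for (Deque<Integer> q : pending) {
                    q.pollFirst();
                }
            }
        }
    }

    // Number of elements currently held between batches.
    int pendingSize() {
        int n = 0;
        for (Deque<Integer> q : pending) {
            n += q.size();
        }
        return n;
    }

    List<Integer> getValue() {
        return output;
    }

    public static void main(String[] args) {
        StreamingIntersectSketch s = new StreamingIntersectSketch(2);
        // Two sorted inputs, delivered two elements at a time.
        s.accumulate(Arrays.asList(Arrays.asList(1, 3), Arrays.asList(3, 4)));
        s.accumulate(Arrays.asList(Arrays.asList(5, 7), Arrays.asList(5, 6)));
        s.accumulate(Arrays.asList(Arrays.asList(9, 11), Arrays.asList(11, 12)));
        System.out.println(s.getValue()); // prints [3, 5, 11]
    }
}
```

In this sketch an element stays in memory only until every other input has either matched it or moved past it. If one input runs far ahead of the others (for example, all of its values are larger than anything the other inputs have delivered so far), its queue grows to hold the whole bag, which corresponds to the worst case mentioned in the comment.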
> Make SetIntersect and SetDifference implement Accumulator
> ---------------------------------------------------------
>
>                 Key: DATAFU-116
>                 URL: https://issues.apache.org/jira/browse/DATAFU-116
>             Project: DataFu
>          Issue Type: Improvement
>    Affects Versions: 1.3.0
>            Reporter: Eyal Allweil
>
> SetIntersect and SetDifference accept only sorted bags, and the output is always smaller than the inputs. Therefore an accumulator implementation should be possible and it will improve memory usage (somewhat) and allow Pig to optimize loops with these operations better.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)