Date: Thu, 7 Apr 2016 00:03:25 +0000 (UTC)
From: "Chandni Singh (JIRA)"
To: dev@apex.incubator.apache.org
Subject: [jira] [Updated] (APEXMALHAR-2026) Spill-able Datastructures

     [ https://issues.apache.org/jira/browse/APEXMALHAR-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chandni Singh updated APEXMALHAR-2026:
--------------------------------------
    Summary: Spill-able Datastructures  (was: Spooled Datastructures)

> Spill-able Datastructures
> -------------------------
>
>                 Key: APEXMALHAR-2026
>                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2026
>             Project: Apache Apex Malhar
>          Issue Type: New Feature
>            Reporter: Timothy Farkas
>            Assignee: Timothy Farkas
>              Labels: roadmap
>
> Add libraries for spooling data structures to a key value store. There are several customer use cases which require spooled data structures.
> 1 - Some operators, such as AbstractFileInputOperator, have ever-growing state. Eventually the state grows larger than the memory allocated to the operator, which causes the operator to fail perpetually. If the operator's data structures are spooled, the in-memory part of that state stays bounded and the operator no longer runs out of memory because of it.
> 2 - Some users have requested the ability to maintain a map as well as a list of keys over which to iterate. Most key value stores don't provide this functionality. With spooled data structures it can be provided by maintaining a spooled map and an iterable set of keys.
> 3 - Some users have requested building graph databases within Apex.
> This would require implementing a spooled graph data structure.
> 4 - Another use case for spooled data structures is database operators. Database operators need to write data to a database, but sometimes the database is down. In this case most database operators fail repeatedly until the database comes back up. To avoid constant failures, the database operator needs to write data to a queue while the database is down and, once the database is back up, take data from the queue and write it to the database. During a database outage this queue can grow larger than the total amount of memory available to the operator, so the queue should be spooled in order to prevent the operator from failing.
> 5 - Any operator which needs to maintain a large data structure in memory currently needs to have that data serialized and written out to HDFS with every checkpoint. This is costly when the data structure is large. If the data structure is spooled, then only the changes to the data structure are written out to HDFS instead of the entire data structure.
> 6 - Building an Apex-native database for aggregations also requires indices. These indices need to take the form of spooled data structures.
> 7 - In the future, any operator which needs to maintain a data structure larger than the memory available to it will need to spool the data structure.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
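
To make use cases 1, 2 and 5 above concrete, here is a minimal sketch of a spooled map. It is an illustration only, not the Malhar design or API: the class name `SpillableMap`, its methods, and the `max_in_memory` parameter are hypothetical, and a plain dict stands in for the key value store. It keeps a bounded number of entries in memory, spills the least recently used ones to the store, tracks all keys for iteration, and at a checkpoint writes only the entries that changed.

```python
from collections import OrderedDict


class SpillableMap:
    """Hypothetical sketch: a map with a bounded in-memory cache over a key value store."""

    def __init__(self, store, max_in_memory):
        self._store = store          # stand-in key value store (any dict-like object)
        self._cache = OrderedDict()  # in-memory entries, least recently used first
        self._dirty = set()          # cached keys not yet written to the store
        self._keys = set()           # every key ever put; kept in memory in this sketch
        self._max = max_in_memory

    def put(self, key, value):
        self._cache[key] = value
        self._cache.move_to_end(key)
        self._dirty.add(key)
        self._keys.add(key)
        self._evict()

    def get(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)
            return self._cache[key]
        value = self._store[key]     # raises KeyError if the key was never put
        self._cache[key] = value
        self._evict()
        return value

    def keys(self):
        # Use case 2: iterate over all keys, whether cached or spilled.
        return iter(self._keys)

    def checkpoint(self):
        # Use case 5: write only the entries changed since the last write.
        written = len(self._dirty)
        for key in self._dirty:
            self._store[key] = self._cache[key]
        self._dirty.clear()
        return written

    def _evict(self):
        # Use case 1: spill least recently used entries once the cache is full.
        while len(self._cache) > self._max:
            old_key, old_value = self._cache.popitem(last=False)
            if old_key in self._dirty:
                self._store[old_key] = old_value
                self._dirty.discard(old_key)
```

In this sketch the set of keys itself stays in memory; a fuller implementation would presumably spool that set as well, and would replace the dict with a real key value store.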