Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 01A7B200B84 for ; Tue, 20 Sep 2016 20:57:26 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id F41BE160AC5; Tue, 20 Sep 2016 18:57:25 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 4498B160AC0 for ; Tue, 20 Sep 2016 20:57:25 +0200 (CEST) Received: (qmail 69673 invoked by uid 500); 20 Sep 2016 18:57:24 -0000 Mailing-List: contact dev-help@apex.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@apex.apache.org Delivered-To: mailing list dev@apex.apache.org Received: (qmail 69662 invoked by uid 99); 20 Sep 2016 18:57:24 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 20 Sep 2016 18:57:24 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id 0F5D7C021E for ; Tue, 20 Sep 2016 18:57:24 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -4.344 X-Spam-Level: X-Spam-Status: No, score=-4.344 tagged_above=-999 required=6.31 tests=[KAM_ASCII_DIVIDERS=0.8, KAM_LAZY_DOMAIN_SECURITY=1, RCVD_IN_DNSWL_HI=-5, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, RP_MATCHES_RCVD=-1.124] autolearn=disabled Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id GbYT3ZzIcf5B for ; Tue, 20 Sep 2016 18:57:23 +0000 (UTC) Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with SMTP id 424C260CE1 for ; Tue, 20 Sep 2016 18:57:22 +0000 (UTC) Received: (qmail 66512 invoked by uid 99); 20 Sep 2016 18:57:20 -0000 Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 20 Sep 2016 18:57:20 +0000 Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id 8E05D2C2A6A for ; Tue, 20 Sep 2016 18:57:20 +0000 (UTC) Date: Tue, 20 Sep 2016 18:57:20 +0000 (UTC) From: "David Yan (JIRA)" To: dev@apex.incubator.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Updated] (APEXMALHAR-2244) Optimize WindowedStorage and Spillable data structures for time series MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Tue, 20 Sep 2016 18:57:26 -0000 [ https://issues.apache.org/jira/browse/APEXMALHAR-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Yan updated APEXMALHAR-2244: ---------------------------------- Description: The spillable data structures currently does not make any assumption about the key that is used in Managed State, and as a result, it uses ManagedStateImpl to interface with Managed State and uses time buckets that are based on the apex window id. But for WindowedStorage used by WindowedOperator, the key to the storage is a window, which is event time based. Using the default ManagedStateImpl would be very inefficient for event time based keys, since it would write data that would belong to the same window to different time buckets. On a high level, the below summarizes roughly what needs to be done: 1. a way to tell the spillable data structures to use the ManagedTimeUnifiedStateImpl 2. a way to tell the spillable data structures how to extract the timestamp from the key. Note that in the case of WindowedOperator, the timestamp should be the end timestamp of the window (beginTimeMillis + durationMillis), not the begin timestamp. 3. a way to tell the spillable data structures how to assign the time bucket given that timestamp 4. only purge a time bucket when all keys that belong to that time bucket are removed and the apex window id of the first window in which the keys are all removed has been committed was: The spillable data structures currently does not make any assumption about the key that is used in Managed State, and as a result, it uses ManagedStateImpl to interface with Managed State. But for WindowedStorage used by WindowedOperator, the key to the storage is a window, which is event time based. Using the default ManagedStateImpl would be wrong for event time based keys, since ManagedStateImpl appears to purge data based on the apex window id (process time based). In a high level, the below summarizes roughly what needs to be done: 1. a way to tell the spillable data structures to use the ManagedTimeUnifiedStateImpl 2. a way to tell the spillable data structures how to extract the timestamp from the key. Note that in the case of WindowedOperator, the timestamp should be the end timestamp of the window (beginTimeMillis + durationMillis), not the begin timestamp. 3. a way to tell the spillable data structures how to assign the time bucket given that timestamp 4. only purge a time bucket when all keys that belong to that time bucket are removed and the apex window id of the first window in which the keys are all removed has been committed > Optimize WindowedStorage and Spillable data structures for time series > ---------------------------------------------------------------------- > > Key: APEXMALHAR-2244 > URL: https://issues.apache.org/jira/browse/APEXMALHAR-2244 > Project: Apache Apex Malhar > Issue Type: Sub-task > Reporter: David Yan > Assignee: Siyuan Hua > > The spillable data structures currently does not make any assumption about the key that is used in Managed State, and as a result, it uses ManagedStateImpl to interface with Managed State and uses time buckets that are based on the apex window id. But for WindowedStorage used by WindowedOperator, the key to the storage is a window, which is event time based. Using the default ManagedStateImpl would be very inefficient for event time based keys, since it would write data that would belong to the same window to different time buckets. > On a high level, the below summarizes roughly what needs to be done: > 1. a way to tell the spillable data structures to use the ManagedTimeUnifiedStateImpl > 2. a way to tell the spillable data structures how to extract the timestamp from the key. Note that in the case of WindowedOperator, the timestamp should be the end timestamp of the window (beginTimeMillis + durationMillis), not the begin timestamp. > 3. a way to tell the spillable data structures how to assign the time bucket given that timestamp > 4. only purge a time bucket when all keys that belong to that time bucket are removed and the apex window id of the first window in which the keys are all removed has been committed -- This message was sent by Atlassian JIRA (v6.3.4#6332)