apex-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "David Yan (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (APEXMALHAR-2244) Optimize WindowedStorage and Spillable data structures for time series
Date Tue, 20 Sep 2016 18:57:20 GMT

     [ https://issues.apache.org/jira/browse/APEXMALHAR-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

David Yan updated APEXMALHAR-2244:
----------------------------------
    Description: 
The spillable data structures currently does not make any assumption about the key that is
used in Managed State, and as a result, it uses ManagedStateImpl to interface with Managed
State and uses time buckets that are based on the apex window id. But for WindowedStorage
used by WindowedOperator, the key to the storage is a window, which is event time based. Using
the default ManagedStateImpl would be very inefficient for event time based keys, since it
would write data that would belong to the same window to different time buckets.

On a high level, the below summarizes roughly what needs to be done:

1. a way to tell the spillable data structures to use the ManagedTimeUnifiedStateImpl
2. a way to tell the spillable data structures how to extract the timestamp from the key.
Note that in the case of WindowedOperator, the timestamp should be the end timestamp of the
window (beginTimeMillis + durationMillis), not the begin timestamp.
3. a way to tell the spillable data structures how to assign the time bucket given that timestamp
4. only purge a time bucket when all keys that belong to that time bucket are removed and
the apex window id of the first window in which the keys are all removed has been committed



  was:
The spillable data structures currently does not make any assumption about the key that is
used in Managed State, and as a result, it uses ManagedStateImpl to interface with Managed
State. But for WindowedStorage used by WindowedOperator, the key to the storage is a window,
which is event time based. Using the default ManagedStateImpl would be wrong for event time
based keys, since ManagedStateImpl appears to purge data based on the apex window id (process
time based).

In a high level, the below summarizes roughly what needs to be done:

1. a way to tell the spillable data structures to use the ManagedTimeUnifiedStateImpl
2. a way to tell the spillable data structures how to extract the timestamp from the key.
Note that in the case of WindowedOperator, the timestamp should be the end timestamp of the
window (beginTimeMillis + durationMillis), not the begin timestamp.
3. a way to tell the spillable data structures how to assign the time bucket given that timestamp
4. only purge a time bucket when all keys that belong to that time bucket are removed and
the apex window id of the first window in which the keys are all removed has been committed




> Optimize WindowedStorage and Spillable data structures for time series
> ----------------------------------------------------------------------
>
>                 Key: APEXMALHAR-2244
>                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2244
>             Project: Apache Apex Malhar
>          Issue Type: Sub-task
>            Reporter: David Yan
>            Assignee: Siyuan Hua
>
> The spillable data structures currently does not make any assumption about the key that
is used in Managed State, and as a result, it uses ManagedStateImpl to interface with Managed
State and uses time buckets that are based on the apex window id. But for WindowedStorage
used by WindowedOperator, the key to the storage is a window, which is event time based. Using
the default ManagedStateImpl would be very inefficient for event time based keys, since it
would write data that would belong to the same window to different time buckets.
> On a high level, the below summarizes roughly what needs to be done:
> 1. a way to tell the spillable data structures to use the ManagedTimeUnifiedStateImpl
> 2. a way to tell the spillable data structures how to extract the timestamp from the
key. Note that in the case of WindowedOperator, the timestamp should be the end timestamp
of the window (beginTimeMillis + durationMillis), not the begin timestamp.
> 3. a way to tell the spillable data structures how to assign the time bucket given that
timestamp
> 4. only purge a time bucket when all keys that belong to that time bucket are removed
and the apex window id of the first window in which the keys are all removed has been committed



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message