apex-dev mailing list archives

From "Chandni Singh (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (APEXMALHAR-2223) Managed state should parallelize WAL writes
Date Tue, 06 Sep 2016 21:05:21 GMT

    [ https://issues.apache.org/jira/browse/APEXMALHAR-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468563#comment-15468563 ]

Chandni Singh commented on APEXMALHAR-2223:

A possible approach to address this:
- Have a property in ManagedState called {code}writeBufferThreshold{code}. When a bucket's size
crosses this threshold, the bucket becomes eligible for writing.
- Eligible buckets are written to the WAL at the end of every application window, in the
{code}endWindow(){code} callback.
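
The two points above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the ManagedState implementation: the class ThresholdWalSketch, the WalEntry record, the in-memory list standing in for the WAL, and the size estimate (key length plus value length in characters) are all hypothetical. Only the names writeBufferThreshold and endWindow() come from the proposal.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for managed state; it does not use the Malhar API. */
class ThresholdWalSketch {

    /** Hypothetical WAL record: one bucket's buffered data, tagged with a window. */
    static final class WalEntry {
        final long windowId;
        final long bucketId;
        final Map<String, String> data;

        WalEntry(long windowId, long bucketId, Map<String, String> data) {
            this.windowId = windowId;
            this.bucketId = bucketId;
            this.data = data;
        }
    }

    // Assumed unit: characters of key + value; a real implementation would likely use bytes.
    private final long writeBufferThreshold;
    private final Map<Long, Map<String, String>> buffers = new HashMap<>();
    private final Map<Long, Long> bufferSizes = new HashMap<>();
    // In-memory list used in place of a real write-ahead log.
    private final List<WalEntry> wal = new ArrayList<>();

    ThresholdWalSketch(long writeBufferThreshold) {
        this.writeBufferThreshold = writeBufferThreshold;
    }

    /** Buffers a write in memory; rewriting a key replaces its old value. */
    void put(long bucketId, String key, String value) {
        Map<String, String> bucket = buffers.computeIfAbsent(bucketId, id -> new HashMap<>());
        String old = bucket.put(key, value);
        long delta = (old == null)
            ? key.length() + value.length()
            : value.length() - old.length();
        bufferSizes.merge(bucketId, delta, Long::sum);
    }

    /** At the end of a window, writes only the buckets whose size crossed the threshold. */
    void endWindow(long windowId) {
        Iterator<Map.Entry<Long, Map<String, String>>> it = buffers.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, Map<String, String>> entry = it.next();
            long bucketId = entry.getKey();
            if (bufferSizes.get(bucketId) > writeBufferThreshold) {
                wal.add(new WalEntry(windowId, bucketId, entry.getValue()));
                it.remove();
                bufferSizes.remove(bucketId);
            }
        }
    }

    List<WalEntry> wal() {
        return wal;
    }

    int bufferedBucketCount() {
        return buffers.size();
    }
}
```

In this sketch, a key that is rewritten while its bucket is still buffered reaches the WAL only once, and each WAL entry still carries the id of the window in which it was written.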

This approach requires fewer changes, because data is still divided into windows when written
to the WAL.

> Managed state should parallelize WAL writes
> -------------------------------------------
>                 Key: APEXMALHAR-2223
>                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2223
>             Project: Apache Apex Malhar
>          Issue Type: Improvement
>    Affects Versions: 3.4.0
>            Reporter: Thomas Weise
>            Assignee: Chandni Singh
> Currently, data is accumulated in memory and written to the WAL on checkpoint only. This
causes a write spike on checkpoint and does not utilize the HDFS write pipeline. The other
extreme is writing to the WAL as soon as data arrives and flushing only in beforeCheckpoint.
The downside of this is that when the same key is written many times, all duplicates will
be in the WAL. We need to find a balanced approach that the user can potentially fine-tune.

This message was sent by Atlassian JIRA