apex-dev mailing list archives

From "Timothy Farkas (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (APEXMALHAR-2026) Spooled Datastructures
Date Tue, 05 Apr 2016 00:11:25 GMT

    [ https://issues.apache.org/jira/browse/APEXMALHAR-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225350#comment-15225350 ]

Timothy Farkas commented on APEXMALHAR-2026:

The use case of a map of lists could be implemented as follows.

The Key Structure would be the following:

low byte index -> high byte index

| Key Prefix | Serialized Key Object | Array Batch Index | 

 - Key Prefix: An identifier for the map of lists.
 - Serialized Key Object: The serialized form of the map key.
 - Array Batch Index: An index identifying a batch of elements in the list.
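
The key layout above can be sketched as a small helper. This is only an illustration: the class name SpooledKeySketch, the method batchKey, and the choice of a 4-byte big-endian batch index are assumptions, since the proposal does not fix the width or byte order of the index.

```java
import java.nio.ByteBuffer;

public class SpooledKeySketch {
    /**
     * Builds | Key Prefix | Serialized Key Object | Array Batch Index |.
     * The 4-byte big-endian batch index is an assumption of this sketch.
     */
    public static byte[] batchKey(byte[] keyPrefix, byte[] serializedKey, int batchIndex) {
        ByteBuffer buffer = ByteBuffer.allocate(keyPrefix.length + serializedKey.length + Integer.BYTES);
        buffer.put(keyPrefix);      // lowest byte indices
        buffer.put(serializedKey);
        buffer.putInt(batchIndex);  // highest byte indices; ByteBuffer defaults to big-endian
        return buffer.array();
    }
}
```

With a big-endian, non-negative index, the batches of one list end up adjacent and in order in any store that sorts keys bytewise, which is one reason the index might be placed last.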

Each Map of Lists will have a handle stored in memory which will hold the key prefix.

--Doing a get:

When a get is done, null is returned if there is no value for the key. Otherwise a spooled
list object is returned whose key prefix is | Key Prefix | Serialized Key Object |

The spooled list implementation is relatively straightforward.
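
A minimal sketch of such a spooled list follows, assuming an in-memory HashMap as a stand-in for the key value store. The class name SpooledListSketch, the batch size of 4, the String element type, and keeping the element count in memory are all assumptions made for illustration; none of them come from the proposal.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: a list whose elements live in fixed-size batches in a key value store. */
public class SpooledListSketch {
    static final int BATCH_SIZE = 4; // assumed batch size, not from the proposal

    private final Map<ByteBuffer, List<String>> store; // in-memory stand-in for the real store
    private final byte[] listPrefix;                   // | Key Prefix | Serialized Key Object |
    private int size;                                  // element count, kept in memory in this sketch

    public SpooledListSketch(Map<ByteBuffer, List<String>> store, byte[] listPrefix) {
        this.store = store;
        this.listPrefix = listPrefix.clone();
    }

    // Store key for one batch: the list prefix followed by a 4-byte batch index.
    private ByteBuffer batchKey(int batchIndex) {
        ByteBuffer key = ByteBuffer.allocate(listPrefix.length + Integer.BYTES);
        key.put(listPrefix).putInt(batchIndex);
        key.flip(); // make the written bytes the buffer's content for equals/hashCode
        return key;
    }

    // Appends to the last batch, creating a new batch when the previous one is full.
    public void add(String element) {
        List<String> batch = store.computeIfAbsent(batchKey(size / BATCH_SIZE), k -> new ArrayList<>());
        batch.add(element);
        size++;
    }

    // Reads one element by loading only the batch that contains it.
    public String get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return store.get(batchKey(index / BATCH_SIZE)).get(index % BATCH_SIZE);
    }

    public int size() {
        return size;
    }
}
```

Only the batch containing the requested index is touched on a read, so the full list never has to be resident in memory at once.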

--Doing a put:

In order to do a put you must first do a get on the list that you want to add to and then
add to the list as you would any other list.

If the list does not exist you can simply put a list into the map and its contents will be


These maps and lists would implement the existing Java Map and List interfaces. One deviation
from the standard behavior is that when a list is put into the map, it is copied.
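
The get and put semantics described above, including the copy on put, can be illustrated with a simplified sketch. The class name MapOfListsSketch, the String keys and elements, and the in-memory HashMap standing in for the key value store are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the map-of-lists semantics; not the proposed implementation. */
public class MapOfListsSketch {
    // In-memory stand-in for the key value store: one entry per list.
    private final Map<String, List<String>> store = new HashMap<>();
    private final String keyPrefix; // identifies this map of lists within the store

    public MapOfListsSketch(String keyPrefix) {
        this.keyPrefix = keyPrefix;
    }

    /** Returns null when the key has no list, otherwise the list held by the store. */
    public List<String> get(String key) {
        return store.get(keyPrefix + key);
    }

    /** Copies the contents of the given list into the store instead of keeping a reference. */
    public void put(String key, List<String> list) {
        store.put(keyPrefix + key, new ArrayList<>(list));
    }
}
```

Because put copies, later changes to the caller's original list are not reflected in the map; to append to a stored list, the caller does a get and adds to the returned list.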

I would like to take up this implementation.

> Spooled Datastructures
> ----------------------
>                 Key: APEXMALHAR-2026
>                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2026
>             Project: Apache Apex Malhar
>          Issue Type: New Feature
>            Reporter: Timothy Farkas
>            Assignee: Timothy Farkas
>              Labels: roadmap
> Add libraries for spooling datastructures to a key value store. There are several customer
use cases which require spooled data structures.
> 1 - Some operators like AbstractFileInputOperator have ever growing state. This is an
issue because eventually the state of the operator will grow larger than the memory allocated
to the operator, which will cause the operator to perpetually fail. However if the operator's
datastructures are spooled then the operator will never run out of memory.
> 2 - Some users have requested the ability to maintain a map as well as a list of
keys over which to iterate. Most key value stores don't provide this functionality. However,
with spooled datastructures this functionality can be provided by maintaining a spooled map
and an iterable set of keys.
> 3 - Some users have requested building graph databases within APEX. This would require
implementing a spooled graph data structure.
> 4 - Another use case for spooled data structures is database operators. Database operators
need to write data to a database, but sometimes the database is down. In this case most of
the database operators repeatedly fail until the database comes back up. In order to avoid
constant failures the database operator needs to write data to a queue when the database
is down, then when the database is up the operator needs to take data from the queue and write
it to the database. In the case of a database failure this queue will grow larger than the
total amount of memory available to the operator, so the queue should be spooled in order
to prevent the operator from failing.
> 5 - Any operator which needs to maintain a large data structure in memory currently needs
to have that data serialized and written out to HDFS with every checkpoint. This is costly
when the data structure is large. If the data structure is spooled, then only the changes
to the data structure are written out to HDFS instead of the entire data structure.
> 6 - Also building an Apex Native database for aggregations requires indices. These indices
need to take the form of spooled data structures.
> 7 - In the future any operator which needs to maintain a data structure larger than the
memory available to it will need to spool the data structure.

This message was sent by Atlassian JIRA
