apex-dev mailing list archives

From Pramod Immaneni <pra...@datatorrent.com>
Subject Re: Fault-tolerant cache backed by a store
Date Tue, 20 Oct 2015 15:31:55 GMT
This is a much-needed component, Chandni.

The API for the cache will be important, as users will be able to plug in
different implementations in the future, such as those based on popular
distributed in-memory caches. Ehcache is a popular cache mechanism and API
that comes to mind. It comes bundled with a non-distributed implementation,
but there are commercial distributed implementations of it as well, like
BigMemory.

Given our needs for fault tolerance, we may not be able to adopt the Ehcache
API as is, but an extension of it might work. We would still provide a
default implementation, but going with a well-recognized API will make it
easier to develop other implementations in the future based on popular
caches already available. We will need to investigate whether we can use the
API as is or with relatively straightforward extensions, which would be a
positive for adopting it. If the API turns out to deviate significantly from
what we need, then that would be a negative.
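
To make the extension idea concrete, here is a rough sketch (the names are
made up for illustration; this is not the Ehcache API or anything that
exists in Malhar) of the kind of hooks a fault-tolerant implementation
would need on top of a plain get/put interface:

    // Hypothetical sketch only -- not the actual Ehcache API or an existing
    // Malhar class.
    public interface Cache<K, V>
    {
      V get(K key);
      void put(K key, V value);
      void remove(K key);
    }

    public interface FaultTolerantCache<K, V> extends Cache<K, V>
    {
      // Persist only the entries modified since the last call; invoked once
      // per application window so just the delta is written out.
      void checkpointDelta(long windowId) throws java.io.IOException;

      // Lazily restore in-memory state after an operator failure.
      void recover(long windowId) throws java.io.IOException;
    }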

Also, it would be great if we could support an iterator to scan all the
keys, lazy loading them as needed, since this need comes up from time to
time in different scenarios such as change data capture calculations.
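
Something along these lines (again just a hypothetical sketch) would cover
the scan case:

    // Hypothetical addition to the FaultTolerantCache sketch above: iterate
    // over all entries, pulling blocks in from the backing store lazily as
    // the scan advances instead of loading everything up front.
    java.util.Iterator<java.util.Map.Entry<K, V>> scan();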

Thanks.

On Mon, Oct 19, 2015 at 9:10 PM, Chandni Singh <chandni@datatorrent.com>
wrote:

> Hi All,
>
> While working on making the Join operator fault-tolerant, we realized the
> need for a fault-tolerant cache in the Malhar library.
>
> This cache is useful for any operator which is stateful and stores
> key/value pairs for a very long period (more than an hour).
>
> The problem with just having a non-transient HashMap for the cache is that
> over time this state grows so large that checkpointing it becomes very
> costly and leads to bigger issues.
>
> In order to address this, we need to checkpoint the state incrementally,
> i.e., save only the difference in state at every application window.
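>
> A rough illustration of the idea (all names here are hypothetical, not a
> final API):
>
>     import java.util.HashMap;
>     import java.util.Map;
>
>     // Sketch: track only the per-window delta instead of checkpointing the
>     // whole map.
>     public class DeltaTrackingCache<K, V>
>     {
>       // full in-memory state; not part of what gets checkpointed
>       private final Map<K, V> cache = new HashMap<K, V>();
>       // entries modified since the last window boundary
>       private final Map<K, V> windowDelta = new HashMap<K, V>();
>
>       public void put(K key, V value)
>       {
>         cache.put(key, value);
>         windowDelta.put(key, value);
>       }
>
>       public V get(K key)
>       {
>         return cache.get(key);
>       }
>
>       // Called at every application window boundary: persist only the diff.
>       public void endWindow(long windowId, DeltaStore<K, V> store)
>       {
>         store.writeDelta(windowId, windowDelta);
>         windowDelta.clear();
>       }
>
>       // Hypothetical backing-store abstraction (e.g. files on HDFS).
>       public interface DeltaStore<K, V>
>       {
>         void writeDelta(long windowId, Map<K, V> delta);
>       }
>     }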
>
> This brings forward the following broad requirements for the cache:
> 1. The cache needs to have a maximum size and be backed by a filesystem.
>
> 2. When this threshold is reached, adding more data should evict older
> entries from memory.
>
> 3. To minimize cache misses, a block of data is loaded in memory.
>
> 4. The block or bucket to which a key belongs is provided by the user (the
> operator in this case), since the knowledge of which keys are close
> together (which can potentially reduce future misses) lies with the user,
> not with the cache.
>
> 5. Keys are loaded lazily in case of operator failure.
>
> 6. To offset the cost of loading a block of keys when there is a miss,
> loading can be done asynchronously with a callback that indicates when the
> key is available. This allows the operator to process other keys which are
> in memory.
>
> 7. Data that is spilled over needs to be purged when it is no longer
> needed.
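>
> A rough interface sketch capturing the requirements above (all names are
> hypothetical; nothing here exists in Malhar today):
>
>     public interface SpillableCache<K, V>
>     {
>       // 1 & 2: bounded in-memory size, backed by the filesystem; once the
>       // threshold is crossed, older entries are evicted from memory.
>       void setMaxMemorySize(long bytes);
>
>       // 3 & 4: the user supplies the bucket a key belongs to, so a whole
>       // block of "close" keys can be loaded together on a miss.
>       void setBucketKeyProvider(BucketKeyProvider<K> provider);
>
>       // 5 & 6: asynchronous, lazy load of a missing key; the callback fires
>       // once the key's bucket has been read in, so the operator can keep
>       // processing keys that are already in memory.
>       void getAsync(K key, Callback<K, V> callback);
>
>       // 7: purge spilled data that is no longer needed.
>       void purgeBucketsBefore(long bucketKey);
>
>       interface BucketKeyProvider<K>
>       {
>         long getBucketKey(K key);
>       }
>
>       interface Callback<K, V>
>       {
>         void keyAvailable(K key, V value);
>       }
>     }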
>
>
> In the past we solved this problem with BucketManager, which is not in open
> source now, and there were also some limitations with the bucket API - the
> biggest one being that it doesn't allow saving multiple values for a key.
>
> My plan is to create a solution similar to BucketManager in Malhar with an
> improved API, and to save the data on HDFS in TFile format, which provides
> better performance when saving key/value pairs.
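>
> For reference, writing a bucket out as a TFile is fairly simple. A minimal
> sketch (the block size and compression choice are just illustrative):
>
>     import java.io.IOException;
>     import java.util.Map;
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FSDataOutputStream;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
>     import org.apache.hadoop.io.file.tfile.TFile;
>
>     // Spill a bucket of key/value pairs to a TFile. The map must iterate in
>     // sorted key order because TFile requires sorted appends when a
>     // comparator ("memcmp") is configured.
>     public static void spillBucket(Map<byte[], byte[]> sortedBucket, Path path,
>         Configuration conf) throws IOException
>     {
>       FileSystem fs = FileSystem.get(conf);
>       FSDataOutputStream out = fs.create(path);
>       try {
>         TFile.Writer writer =
>             new TFile.Writer(out, 64 * 1024, TFile.COMPRESSION_GZ, "memcmp", conf);
>         for (Map.Entry<byte[], byte[]> e : sortedBucket.entrySet()) {
>           writer.append(e.getKey(), e.getValue());
>         }
>         writer.close();
>       } finally {
>         out.close();
>       }
>     }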
>
> Thanks,
> Chandni
>
