apex-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chandni Singh <chan...@datatorrent.com>
Subject Re: Fault-tolerant cache backed by a store
Date Sat, 14 Nov 2015 03:32:35 GMT
Let me know if anyone want to collaborate with me on this.

Thanks,
Chandni

On Tue, Nov 10, 2015 at 6:18 PM, Chandni Singh <chandni@datatorrent.com>
wrote:

> Have added some more details about a Bucket in the document. Have a look.
>
> On Sun, Nov 8, 2015 at 10:37 PM, Chandni Singh <chandni@datatorrent.com>
> wrote:
>
>> Forgot to attach the link.
>>
>> https://docs.google.com/document/d/1gRWN9ufKSZSZD0N-pthlhpC9TZ8KwJ6hJlAX6nxl5f8/edit#heading=h.wlc0p58uzygb
>>
>>
>> On Sun, Nov 8, 2015 at 10:36 PM, Chandni Singh <chandni@datatorrent.com>
>> wrote:
>>
>>> Hi,
>>> This contains the overview of large state management.
>>> Some parts need more description which I am working on but please free
>>> to go through it and any feedback is appreciated.
>>>
>>> Thanks,
>>> Chandni
>>>
>>>
>>> On Tue, Oct 20, 2015 at 8:31 AM, Pramod Immaneni <pramod@datatorrent.com
>>> > wrote:
>>>
>>>> This is a much needed component Chandni.
>>>>
>>>> The API for the cache will be important as users will be able to plugin
>>>> different implementations in future like those based off of popular
>>>> distributed in-memory caches. Ehcache is a popular cache mechanism and
>>>> API
>>>> that comes to bind. It comes bundled with a non-distributed
>>>> implementation
>>>> but there are commercial distributed implementations of it as well like
>>>> BigMemory.
>>>>
>>>> Given our needs for fault tolerance we may not be able to adopt the
>>>> ehcache
>>>> API as is but an extension of it might work. We would still provide a
>>>> default implementation but going off of a well recognized API will
>>>> facilitate development of other implementations in future based off of
>>>> popular implementations already available. We will need to investigate
>>>> if
>>>> we can use the API as is or with relatively straightforward extensions
>>>> which will be a positive for using it. But if the API turns out to be
>>>> significantly deviating from what we need then that would be a negative.
>>>>
>>>> Also it would be great if we could support an iterator to scan all the
>>>> keys, lazy loading as needed, since this need comes up from time to
>>>> time in
>>>> different scenarios such as change data capture calculations.
>>>>
>>>> Thanks.
>>>>
>>>> On Mon, Oct 19, 2015 at 9:10 PM, Chandni Singh <chandni@datatorrent.com
>>>> >
>>>> wrote:
>>>>
>>>> > Hi All,
>>>> >
>>>> > While working on making the Join operator fault-tolerant, we realized
>>>> the
>>>> > need of a fault-tolerant Cache in Malhar library.
>>>> >
>>>> > This cache is useful for any operator which is state-full and stores
>>>> > key/values for a very long period (more than an hour).
>>>> >
>>>> > The problem with just having a non-transient HashMap for the cache is
>>>> that
>>>> > over a period of time this state will become so large that
>>>> checkpointing it
>>>> > will be very costly and will cause bigger issues.
>>>> >
>>>> > In order to address this we need to checkpoint the state iteratively,
>>>> i.e.,
>>>> > save the difference in state at every application window.
>>>> >
>>>> > This brings forward the following broad requirements for the cache:
>>>> > 1. The cache needs to have a max size and is backed by a filesystem.
>>>> >
>>>> > 2. When this threshold is reached, then adding more data to it should
>>>> evict
>>>> > older entries from memory.
>>>> >
>>>> > 3. To minimize cache misses, a block of data is loaded in memory.
>>>> >
>>>> > 4. A block or bucket to which a key belongs is provided by the user
>>>> > (operator in this case) as the information about closeness in keys
>>>> (that
>>>> > can potentially reduce future misses) is not known to the cache but
>>>> to the
>>>> > user.
>>>> >
>>>> > 5. lazy load the keys in case of operator failure
>>>> >
>>>> > 6. To offset the cost of loading a block of keys when there is a miss,
>>>> > loading can be done asynchronously with a callback that indicates
>>>> when the
>>>> > key is available. This allows the operator to process other keys
>>>> which are
>>>> > in memory.
>>>> >
>>>> > 7. data that is spilled over needs to be purged when it is not needed
>>>> > anymore.
>>>> >
>>>> >
>>>> > In past we solved this problem with BucketManager which is not in open
>>>> > source now and also there were some limitations with the bucket api
-
>>>> the
>>>> > biggest one is that it doesn't allow to save multiple values for a
>>>> key.
>>>> >
>>>> > My plan is to create a similar solution as BucketManager in Malhar
>>>> with
>>>> > improved api.
>>>> > Also save the data on hdfs in TFile which provides better performance
>>>> when
>>>> > saving key/values.
>>>> >
>>>> > Thanks,
>>>> > Chandni
>>>> >
>>>>
>>>
>>>
>>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message