apex-dev mailing list archives

From Chandni Singh <chan...@datatorrent.com>
Subject Re: Fault-tolerant cache backed by a store
Date Sat, 28 Nov 2015 19:53:45 GMT
Let's say a bucket is further divided into time-buckets, with a separate file per time-bucket.
Purging is easier here: we just delete a time-bucket file.

Cons of this approach:
1. There can be multiple small files.
2. A tuple with key K1 and time t1 may belong to time-bucket TB1, but later
a tuple with key K1 and time t2 may belong to another time-bucket TB2.
This makes a query for K1 difficult: we need to scan all the time
buckets. Since keys are sorted within each file, this can be made more efficient.
3. Multiple entries for a key (NOT in the same time-bucket).
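A sketch of the lookup that con #2 implies, assuming each time-bucket keeps its (key, value) entries sorted by key so the per-bucket probe can binary-search (all names here are illustrative, not Apex APIs):

```python
import bisect

def get(time_buckets, key):
    """Scan time-buckets newest-first; within each bucket, entries are
    (key, value) pairs sorted by key, so bisect finds the key in O(log n)."""
    for entries in reversed(time_buckets):
        keys = [k for k, _ in entries]
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return entries[i][1]  # newest time-bucket wins for duplicate keys
    return None
```

In the worst case every time-bucket is probed, which is exactly the cost the con describes; the sort inside each bucket only cheapens each probe.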

I think this is what we discussed. While explaining this approach, I realized
that it can create a large number of small files, which cannot be ignored.


On Sat, Nov 28, 2015 at 10:29 AM, Thomas Weise <thomas@datatorrent.com> wrote:

> Chandni,
> The approach that we discussed was to organize buckets (not keys) by time,
> which allows you to discard the entire file on purge.
> In addition, when you use a block indexed file format, there is no need to
> load the entire file to retrieve a single key.
> Thomas
> On Sat, Nov 28, 2015 at 9:42 AM, Chandni Singh <chandni@datatorrent.com>
> wrote:
> > Another approach is to treat an entry in bucket data file as:
> > <time><key><value>
> >
> > Time can be extracted from the tuple (or the windowId can be used).
> > With this approach purging can be simple. For each bucket data file we
> > check the last entry (since data is sorted by time in the bucket data file) and
> > delete the file if that entry has expired.
> >
> > Writing to bucket data file can be simple. We will not update value of a
> > key, but always add a new entry for the key when its value changes.
> > Cons- multiple entries for a key.
> > If the tuples arrive in order, then we may never have to re-write a
> > bucket data file that is complete.
> >
> > Reading is a problem here. The whole bucket needs to be de-serialized to
> > find a key, since data is no longer sorted by key on disk. If the query for a key
> > specifies a time range, then that read can be optimized.
> >
> >
> > With Tim's approach, purging can be triggered asynchronously at regular
> > intervals; it may even delete a data file which hasn't been updated for
> > some time and whose latest entry has expired.
> > Even though the writes may not be that complicated with this approach,
> > updating values when the length of a value changes (for example, in a join
> > operation a value is an appended list) may result in many small stray files.
> >
> > Chandni
> >
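A minimal sketch of the append-only <time><key><value> idea quoted above, using an in-memory list as a stand-in for one bucket data file (all names are assumptions, not Apex code):

```python
# Assumed model of one bucket data file: an append-only list of
# (time, key, value) entries, appended in arrival order.
def put(entries, t, key, value):
    entries.append((t, key, value))  # never update in place; always append

def file_expired(entries, expiry_t):
    """The whole file can be dropped once even its last (latest) entry expired."""
    return bool(entries) and entries[-1][0] < expiry_t

def get(entries, key):
    """Full scan, newest entry for the key wins (data is not sorted by key)."""
    for t, k, v in reversed(entries):
        if k == key:
            return v
    return None
```

This shows both points at once: writes stay trivial (pure append), while reads pay the full-scan cost and duplicate entries accumulate per key.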
