accumulo-dev mailing list archives

From Eric Newton <>
Subject Re: Time based locality groups
Date Thu, 08 Mar 2012 00:39:46 GMT
Something like this:

    partition, meta = partitioner.choose(key, value, meta)

The partition can be a string, which is used to look up the partition's
configuration.  The meta information can be used by queries to avoid
including files from the partition in queries.  The metadata would be saved
at the close of the file.

During a query, files could be filtered based on some arbitrary query data:

    files = partitioner.selectFiles(files, query)

I like it! It might also be nice to indicate some sort of "estimated"
percent of keys processed, and the type of compaction occurring (flush,
partial, everything):

    partition, meta = partitioner.choose(key, value, meta, percent,

Is there any other tablet-level information we might want to provide to a
partitioner?  Perhaps the source partition of the key/value?
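
To make the choose/selectFiles idea above concrete, here is a minimal Python sketch of a partitioner that buckets entries by age, in the spirit of the time-based groups proposed in the quoted message below. Everything in it is an assumption made for illustration: the names AgePartitioner, Key, FileInfo and select_files, the bucket boundaries, the use of a (min, max) timestamp pair as the per-partition meta, and the "file name is partition plus .rf" convention. None of this is an existing Accumulo API.

```python
from typing import Dict, List, NamedTuple, Tuple

DAY_MS = 24 * 60 * 60 * 1000  # one day in milliseconds


class Key(NamedTuple):
    """Hypothetical stand-in for a key; only the fields used here."""
    row: str
    timestamp: int


class FileInfo(NamedTuple):
    """Hypothetical per-file record: the meta saved when the file closes."""
    name: str
    partition: str
    min_ts: int
    max_ts: int


# Meta threaded through choose(): partition name -> (min_ts, max_ts) seen so far.
Meta = Dict[str, Tuple[int, int]]


class AgePartitioner:
    """Assumed example: partitions entries by age relative to a fixed 'now'."""

    def __init__(self, now_ms: int) -> None:
        self.now_ms = now_ms

    def choose(self, key: Key, value: str, meta: Meta) -> Tuple[str, Meta]:
        age = self.now_ms - key.timestamp
        if age < DAY_MS:
            partition = "day"
        elif age < 7 * DAY_MS:
            partition = "week"
        elif age < 30 * DAY_MS:
            partition = "month"
        else:
            partition = "older"
        # Widen the timestamp range recorded for the chosen partition.
        lo, hi = meta.get(partition, (key.timestamp, key.timestamp))
        new_meta = dict(meta)
        new_meta[partition] = (min(lo, key.timestamp), max(hi, key.timestamp))
        return partition, new_meta

    def select_files(self, files: List[FileInfo],
                     query: Tuple[int, int]) -> List[FileInfo]:
        # Keep only files whose timestamp range overlaps the query range.
        start, end = query
        return [f for f in files if f.max_ts >= start and f.min_ts <= end]


# Usage: partition three entries, then select files for "the last day".
now = 100 * DAY_MS
partitioner = AgePartitioner(now)
meta: Meta = {}
part1, meta = partitioner.choose(Key("r1", now - 1000), "v1", meta)
part2, meta = partitioner.choose(Key("r2", now - 3 * DAY_MS), "v2", meta)
part3, meta = partitioner.choose(Key("r3", now - 60 * DAY_MS), "v3", meta)

# Pretend each partition was written to its own file and its meta saved at close.
files = [FileInfo(name + ".rf", name, lo, hi) for name, (lo, hi) in meta.items()]
recent = partitioner.select_files(files, (now - DAY_MS, now))
```

With this shape, a scan for recent data only has to open the files whose saved range overlaps the query, and an age-off compaction could skip or drop whole files by the same test.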


On Wed, Mar 7, 2012 at 6:54 PM, Keith Turner <> wrote:

> Replying to myself :)
> The more I think about this, it seems that locality groups could be
> handled by plugins that can partition the data and select locality
> groups in any way they like. Want locality groups based on row suffix?
> Go ahead and write the plugin.
> The plugin would be used for compaction-time partitioning and scan-time
> locality group selection.  Users could pass options to the
> locality group plugin at scan time just like options are passed to
> iterators.  Maybe this is an extension or further generalization of
> the existing iterator framework; I have not thought that through far
> enough.
> Keith
> On Wed, Mar 7, 2012 at 6:22 PM, Keith Turner <> wrote:
> > We regularly have questions from users about querying new data and
> > aging off old data.  I was thinking about how we could better support
> > this need in 1.5.  One thing that occurred to me is having locality
> > groups that are based on timestamp instead of column family.  For
> > example, a locality group for each month.  Alternatively, we could have
> > groups for < day old, < week old, < month old, < year old.  We would need
> > a way for users to define these.
> >
> > This would make scanning a table for recent data much faster.  Also
> > dropping old data could be made much faster by just dropping entire
> > locality groups at compaction time.
> >
> > One thing that irks me about this is: should column family and time
> > based locality groups be mutually exclusive (i.e. an RFile has one or
> > the other, not both)?  If they are not, then the order in which they are
> > partitioned is important for query performance and would
> > probably need to be user configurable.
> >
> > Thoughts?
> >
> > Keith
