accumulo-dev mailing list archives

From "Keith Turner (Commented) (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-452) Generalize locality groups
Date Thu, 08 Mar 2012 19:13:58 GMT


Keith Turner commented on ACCUMULO-452:

I think keeping a min/max timestamp per file may satisfy some use cases, but not
all.  It would certainly be helpful.  The compaction algorithm may need to be modified, as
Adam suggested, to make it more effective.  The way major compaction currently works in Accumulo,
older data will eventually end up in the largest file.  If the goal is to avoid reading this file
under certain circumstances, the user has no explicit control over that.  Also, if you
want to age off older data, you will probably still need to read this entire file to do that.

If a user wants to scan the last 6 months of data, for example, and the largest file overlaps this
time range but only 10% of the data in the file matches the range, then a lot of data needs
to be filtered.  Does HBase do anything special to deal with this case?
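
To make the per-file min/max timestamp idea concrete, here is a minimal sketch in Java. The
names FileMeta and TimeRangePruner are hypothetical and are not Accumulo classes; the sketch
only shows which files a time-range scan could skip if each file recorded its timestamp bounds.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-file metadata (an assumption for illustration, not an Accumulo class).
final class FileMeta {
    final String name;
    final long minTs; // smallest timestamp of any entry in the file
    final long maxTs; // largest timestamp of any entry in the file

    FileMeta(String name, long minTs, long maxTs) {
        this.name = name;
        this.minTs = minTs;
        this.maxTs = maxTs;
    }

    // True if the file may contain entries with timestamps in [start, end].
    boolean overlaps(long start, long end) {
        return minTs <= end && maxTs >= start;
    }
}

final class TimeRangePruner {
    // Returns the files a scan over [start, end] still has to open, in input order.
    static List<FileMeta> filesToRead(List<FileMeta> files, long start, long end) {
        List<FileMeta> result = new ArrayList<>();
        for (FileMeta f : files) {
            if (f.overlaps(start, end)) {
                result.add(f);
            }
        }
        return result;
    }
}
```

With made-up bounds, a large file covering timestamps 0 to 900 still overlaps a scan of 800 to
1000, so it has to be read and filtered even though most of its entries fall outside the range.
Only files whose bounds lie entirely outside the scan range can be skipped.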

Why limit partitioning to only locality groups? 
 * Increases model complexity.  I think this is true, although the complexity of the locality
group model itself is not increased.  If you understand partitioning on column families, you will
easily understand the concept of partitioning on any part of the key.  It certainly does increase
the complexity of the Bigtable model as a whole, though.  It would certainly give users more
rope to hang themselves.  Personally, I am not opposed to this.
 * Increases code complexity.  I do not think this is true.  This would actually simplify
the code and make this functionality much easier to test in isolation.  I have found this
with iterators: they dramatically decreased the complexity of the scan code.  When iterators
were first introduced, the scan loop was starting to get fairly complex.  This seems a lot
cleaner than customizing the current code to meet needs.  Of course, end users may not care
about the complexity of the Accumulo source code.  They just want it to solve their problems.
 * There are no compelling use cases.  These must exist.  I think the original time-based locality
group is one.  Is there a better, simpler way to achieve this?  That would remove this use case.
 The HBase design is simpler in terms of the model, but the code sounds more complex.  Also,
this model does not give the user explicit control without allowing them to configure the compaction
process in some complex way.
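
As a rough illustration of partitioning on a part of the key other than the column family, here
is a hedged Java sketch of a timestamp-based partitioner. Everything in it is an assumption made
for illustration: SimpleKey, the Partitioner interface, TimeBucketPartitioner, and the
"minTimestamp" option name are hypothetical, and none of this is taken from the attached
PartitionerDesign.txt or from Accumulo's API.

```java
import java.util.Map;

// Hypothetical simplified key; Accumulo's real Key has more fields.
final class SimpleKey {
    final String row;
    final String family;
    final long timestamp;

    SimpleKey(String row, String family, long timestamp) {
        this.row = row;
        this.family = family;
        this.timestamp = timestamp;
    }
}

// Hypothetical plugin interface (an assumption, not the design attached to the issue).
interface Partitioner {
    // Compaction time: decides which partition a key is written to.
    String partitionFor(SimpleKey key);

    // Scan time: false means the partition can be skipped for these scan options.
    boolean mayContain(String partition, Map<String, String> scanOptions);
}

// Groups keys into fixed-width timestamp buckets.
final class TimeBucketPartitioner implements Partitioner {
    private final long bucketWidth;

    TimeBucketPartitioner(long bucketWidth) {
        this.bucketWidth = bucketWidth;
    }

    @Override
    public String partitionFor(SimpleKey key) {
        return Long.toString(Math.floorDiv(key.timestamp, bucketWidth));
    }

    @Override
    public boolean mayContain(String partition, Map<String, String> scanOptions) {
        String min = scanOptions.get("minTimestamp"); // hypothetical option name
        if (min == null) {
            return true; // no time restriction, so every partition must be read
        }
        long bucket = Long.parseLong(partition);
        long bucketEnd = (bucket + 1) * bucketWidth - 1; // largest timestamp in the bucket
        return bucketEnd >= Long.parseLong(min);
    }
}
```

In this sketch a scan that passes a minimum timestamp skips whole buckets of older data, and
ageing off old data could drop old buckets without reading them. The same interface shape would
apply to partitioning on any other part of the key.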

> Generalize locality groups
> --------------------------
>                 Key: ACCUMULO-452
>                 URL:
>             Project: Accumulo
>          Issue Type: New Feature
>            Reporter: Keith Turner
>             Fix For: 1.5.0
>         Attachments: PartitionerDesign.txt
> Locality groups are a neat feature, but there is no reason to limit partitioning to column
> families.  Data could be partitioned based on any criteria.  For example, if a user is interested
> in querying recent data and ageing off old data, partitioning locality groups based on timestamp
> would be useful.  This could be accomplished by letting users specify a partitioner plugin
> that is used at compaction and scan time.  Scans would need an ability to pass options to
> the partitioner.
