accumulo-notifications mailing list archives

From "Adam Fuchs (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-1802) use case for future configurability of major compactions
Date Tue, 22 Oct 2013 20:08:41 GMT


Adam Fuchs commented on ACCUMULO-1802:

I have seen several use cases lately that lead me to agree that we should consider other compaction
strategies. Some of the factors you might want a compaction strategy to optimize are:
1. Number of blocks read concurrently for a single query
2. Number of times a key/value pair is written to disk
3. Total number of files stored in HDFS
4. Efficiency of deleting data

Some of the additional use cases I've seen that would lead to different optimal compaction
algorithms are:
1. Time-series data and log data that is stored in roughly temporal order. In these cases,
once a record is written its "neighborhood" (things that sort close by) is not updated. We
can't help factor 1 by compacting frequently, since the ranges of files generated by minor
compaction are mostly distinct.
2. Use of one locality group at a time. This could be done to add features to existing rows
as the result of an ML process or something like it. With our current strategy, we are compacting
files together that have completely distinct locality groups. This doesn't help with factors
1 and 4, and hurts factor 2.
3. Inverted indexing and graph storage with an expiration date or age-off. I think this is
part of the use case Eric refers to. In this case, data is written in essentially random order,
but is deleted in temporal order. We could get tricky and optimize factor 4 at some cost to
factors 1, 2, and 3.
4. Document-partitioned indexing with really big tablets. In this case, we end up relying
more on the log-structured merge tree to sort data than the bucket sorting that comes with
organic tablet splits. Non-uniform updates across the tablet space could be optimized by having
multiple files output by the big major compactions, such that the files' ranges are non-overlapping.
Basically, when we do a major compaction to include lots of small files in a narrower range
than the whole tablet, we don't want to have to rewrite the data from the entire tablet. The
benefit of this optimization grows with frequent updates, deletions, and aggregation in a sub-range
of a tablet.
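
A toy model may help make factor 1 and use cases 1 and 4 concrete. The sketch below is not Accumulo code: it assumes each file is described only by a hypothetical (first_key, last_key) pair and finds the files whose key range overlaps a given range. The same check estimates how many files a query must read and which files a sub-range compaction would need as input.

```python
from typing import List, Tuple

# Hypothetical file descriptor: (first_key, last_key), both inclusive.
FileRange = Tuple[str, str]


def files_overlapping(files: List[FileRange], lo: str, hi: str) -> List[FileRange]:
    """Return the files whose key range intersects [lo, hi]."""
    return [f for f in files if f[0] <= hi and lo <= f[1]]


# Use case 1: data arrives in roughly temporal order, so the files
# produced by minor compaction cover mostly distinct ranges.
temporal = [
    ("20131001", "20131007"),
    ("20131008", "20131014"),
    ("20131015", "20131021"),
]

# Data written in essentially random order: every file spans most of the tablet.
random_order = [("a", "z"), ("b", "y"), ("a", "x")]

# A narrow query touches one file in the temporal layout without any compaction,
# but all three files in the random-order layout.
temporal_hits = files_overlapping(temporal, "20131009", "20131010")
random_hits = files_overlapping(random_order, "m", "n")
```

If major compactions wrote several output files with non-overlapping ranges, as in use case 4, a later compaction over a narrow range could take only the files returned by a check like this as input and leave the rest of the tablet's data alone.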

> use case for future configurability of major compactions
> --------------------------------------------------------
>                 Key: ACCUMULO-1802
>                 URL:
>             Project: Accumulo
>          Issue Type: Sub-task
>          Components: tserver
>            Reporter: Eric Newton
>             Fix For: 1.6.0
> The default compaction strategy has a tendency to put the oldest data in the largest
> files.  This leads to a lot of work when it is time to age off data.
> One could imagine a compaction strategy that would split data into separate files based
> on the timestamp.  Additionally, if the min/max timestamps for a file were known, old data
> could be aged off by deleting whole files.
> Augment the configurable compaction strategy to support multiple output files, and to save and use
> extra metadata in each file.
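
The whole-file age-off idea in the description can be sketched as follows. This is an illustration only, assuming each file carries hypothetical min_ts/max_ts metadata and that an entry expires when its timestamp is below a cutoff. None of the names correspond to an existing Accumulo API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical per-file metadata: the smallest and largest entry timestamps."""
    name: str
    min_ts: int
    max_ts: int


def plan_age_off(
    files: List[FileMeta], cutoff_ts: int
) -> Tuple[List[FileMeta], List[FileMeta], List[FileMeta]]:
    """Classify files for age-off; entries with timestamp < cutoff_ts are expired.

    Returns (droppable, needs_rewrite, untouched):
      droppable     - every entry is expired, so the file can be deleted whole
      needs_rewrite - mixed content, so a compaction must filter the file
      untouched     - no entry is expired
    """
    droppable = [f for f in files if f.max_ts < cutoff_ts]
    needs_rewrite = [f for f in files if f.min_ts < cutoff_ts <= f.max_ts]
    untouched = [f for f in files if f.min_ts >= cutoff_ts]
    return droppable, needs_rewrite, untouched


files = [
    FileMeta("F1", 0, 99),     # entirely older than the cutoff
    FileMeta("F2", 50, 150),   # straddles the cutoff
    FileMeta("F3", 100, 199),  # entirely at or after the cutoff
]
droppable, needs_rewrite, untouched = plan_age_off(files, cutoff_ts=100)
```

The narrower the timestamp span of each file, the more files fall into the first group and the less data has to be rewritten, which is the motivation for splitting compaction output by timestamp.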

This message was sent by Atlassian JIRA
