accumulo-notifications mailing list archives

From "Keith Turner (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-4501) Add support to RFile to track and store the histogram
Date Tue, 18 Oct 2016 12:44:58 GMT


Keith Turner commented on ACCUMULO-4501:

[~elserj] as promised on IRC, here is a write-up.  This covers what [~rweeks] and I discussed
at the Accumulo Summit Hackathon.

Users could configure, per table, an implementation of CompactionSummarizer.  

  interface Counters {
    void increment(String counter, long amount);
    void increment(ByteSequence counter, long amount);

    // I thought of use cases where I would want to add a prefix to the counter.
    // We could offer this as a primitive so that each user does not have to figure
    // out how to do this.  Simple examples of use cases would be "fam:" and "vis:"
    // prefixes for counting column families and visibilities.
    void increment(String prefix, ByteSequence counter, long amount);
  }

  interface CompactionSummarizer {
    void summarize(Key k, Value v, Counters counters);
  }
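
To make the proposal concrete, here is a minimal sketch of what a user-supplied summarizer might look like, using the "fam:" and "vis:" prefixes mentioned above. It is self-contained, so Key, Value, and Counters are simplified stand-ins rather than the real Accumulo types (String counters only), and FamilyVisibilitySummarizer and MapCounters are hypothetical names made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Stand-ins for Accumulo's Key and Value so the sketch compiles on its own.
// Only the fields used below are modeled.
final class Key {
  final String row, family, visibility;
  Key(String row, String family, String visibility) {
    this.row = row;
    this.family = family;
    this.visibility = visibility;
  }
}

final class Value {
  final byte[] bytes;
  Value(byte[] bytes) { this.bytes = bytes; }
}

// Simplified form of the proposed Counters interface (assumption: String counters only).
interface Counters {
  void increment(String prefix, String counter, long amount);
}

interface CompactionSummarizer {
  void summarize(Key k, Value v, Counters counters);
}

// Hypothetical summarizer: counts entries per column family and per visibility.
final class FamilyVisibilitySummarizer implements CompactionSummarizer {
  @Override
  public void summarize(Key k, Value v, Counters counters) {
    counters.increment("fam:", k.family, 1);
    counters.increment("vis:", k.visibility, 1);
  }
}

// Trivial in-memory Counters, used only to inspect the resulting histogram.
final class MapCounters implements Counters {
  final Map<String, Long> counts = new TreeMap<>();

  @Override
  public void increment(String prefix, String counter, long amount) {
    counts.merge(prefix + counter, amount, Long::sum);
  }
}
```

Feeding each key-value pair written during a compaction through summarize() would leave one counter per distinct family and one per distinct visibility in MapCounters.counts.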

When a CompactionSummarizer is configured, Accumulo could do the following at compaction time.

 * Compute a histogram during compaction by calling the CompactionSummarizer for each
   key-value pair added to the RFile
 * Limit the histogram to a max size
 * Store histogram in RFile
 * Store name of summarizer in RFile
 * Store if histogram exceeded max size in RFile
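
The size limit and the "exceeded max size" flag from the list above could be sketched roughly as follows. BoundedHistogram is a hypothetical name, and the policy shown (keep updating existing counters, drop new ones once the limit is reached) is only an assumption; the discussion did not settle what should happen when the limit is hit.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: a histogram that admits at most maxSize distinct counters
// and remembers whether it ever had to refuse one.
final class BoundedHistogram {
  private final int maxSize;
  private final Map<String, Long> counts = new HashMap<>();
  private boolean exceededMaxSize = false;

  BoundedHistogram(int maxSize) {
    this.maxSize = maxSize;
  }

  void increment(String counter, long amount) {
    if (!counts.containsKey(counter) && counts.size() >= maxSize) {
      // Assumed policy: a new counter that would exceed the limit is dropped,
      // and the histogram is flagged as incomplete.
      exceededMaxSize = true;
      return;
    }
    counts.merge(counter, amount, Long::sum);
  }

  Map<String, Long> counts() {
    return counts;
  }

  boolean exceededMaxSize() {
    return exceededMaxSize;
  }
}
```

The flag would be stored next to the histogram in the RFile so that readers can tell a complete histogram from a truncated one.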

We could modify rfile-info to print this information when it's present in an RFile.  We could
also offer a user-level API to fetch this information.  The API could offer the following.

 * Require the user to specify the name of the CompactionSummarizer they want histograms for.
   This is so that RFiles containing histograms generated by a different CompactionSummarizer
   can be ignored.
 * Allow the user to compute a histogram for a row range.
 * Along with the returned histogram, indicate if histograms were missing from RFiles or
   exceeded max size.

We discussed an implementation similar to the BatchScanner, in that it would send requests out
to tablet servers to fetch info in parallel.  Histograms could be combined at the tablet, tablet
server, and client.  Thinking about this a little more after the summit, I realized this
implementation may double count files that span multiple tablets.  Another possible
implementation would be to gather the unique set of files in the range, and then farm the
files out to the tablet servers, aggregating the histograms.  This approach may make it hard
to cache the serialized histograms.  We also discussed whether the in-memory map should keep
a histogram, but came to no conclusion on this.
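
The double-counting concern and the "unique set of files" alternative can be illustrated with a small sketch. It assumes each file's stored histogram is available as a plain map; HistogramMerger and its parameters are hypothetical, and the sketch ignores how the work would actually be distributed across tablet servers.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: merges per-file histograms for a set of tablets,
// visiting each file once even if several tablets reference it.
final class HistogramMerger {
  // tabletFiles:    tablet id -> names of the files that tablet references
  // fileHistograms: file name -> histogram stored in that file
  static Map<String, Long> merge(Map<String, Set<String>> tabletFiles,
                                 Map<String, Map<String, Long>> fileHistograms) {
    // Gather the unique set of files first, so a file shared by several
    // tablets contributes its histogram only once.
    Set<String> uniqueFiles = new TreeSet<>();
    for (Set<String> files : tabletFiles.values()) {
      uniqueFiles.addAll(files);
    }

    Map<String, Long> total = new TreeMap<>();
    for (String file : uniqueFiles) {
      Map<String, Long> histogram = fileHistograms.get(file);
      if (histogram == null) {
        continue; // file has no stored histogram; a real API would report this
      }
      histogram.forEach((counter, count) -> total.merge(counter, count, Long::sum));
    }
    return total;
  }
}
```

Summing per tablet instead, without the deduplication step, would add a shared file's histogram once for every tablet that references it.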

> Add support to RFile to track and store the histogram
> -----------------------------------------------------
>                 Key: ACCUMULO-4501
>                 URL:
>             Project: Accumulo
>          Issue Type: Sub-task
>          Components: client, tserver
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>          Time Spent: 1h
>  Remaining Estimate: 0h
> Modify RFile such that it can build the histogram and store it in an RFile.
> Reading the RFile would deserialize the histogram back into memory.

