accumulo-notifications mailing list archives

From: "Keith Turner (JIRA)" <j...@apache.org>
Subject: [jira] [Comment Edited] (ACCUMULO-1266) Automatically determine when a full major compaction would benefit scans
Date: Thu, 11 Apr 2013 17:49:16 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629154#comment-13629154 ]

Keith Turner edited comment on ACCUMULO-1266 at 4/11/13 5:47 PM:
-----------------------------------------------------------------

Major compactions of a subset of a tablet's files and minor compactions should be considered.
If 10% of a tablet's k/v are compacted, dropping 50% of data, then maybe 10% could be deducted
from recentlyDropped and recentlyRead.
                
      was (Author: kturner):
    Major compactions of a subset of a tablet's files and minor compactions should be considered.
    If 10% of a tablet's k/v are compacted, dropping 50% of data, then maybe 5% could be deducted
    from recentlyDropped and 10% from recentlyRead.
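
As a rough illustration of the edited comment (deducting the compacted fraction from both counters), here is a minimal Java sketch. The names ScanStatsDecay, ScanStats, and deductForPartialCompaction are hypothetical and not part of Accumulo; only recentlyRead and recentlyDropped come from the ticket. It assumes the fraction is measured in key/values compacted relative to the tablet's total.

```java
// Hypothetical sketch, not Accumulo code: scales down per-tablet scan statistics
// after a compaction that covered only part of the tablet's key/values.
final class ScanStatsDecay {

    // Counters named after the variables used in the ticket.
    static final class ScanStats {
        long recentlyRead;    // k/v read by recent scans before iterators were applied
        long recentlyDropped; // k/v suppressed by scan-time iterators

        ScanStats(long recentlyRead, long recentlyDropped) {
            this.recentlyRead = recentlyRead;
            this.recentlyDropped = recentlyDropped;
        }
    }

    // Deducts the compacted fraction (compactedKeyValues / totalKeyValues, capped at 1)
    // from both counters, as the edited comment suggests.
    static void deductForPartialCompaction(ScanStats stats,
                                           long compactedKeyValues,
                                           long totalKeyValues) {
        if (totalKeyValues <= 0 || compactedKeyValues <= 0) {
            return; // nothing was compacted, or the total is unknown
        }
        double fraction = Math.min(1.0, (double) compactedKeyValues / totalKeyValues);
        stats.recentlyRead -= Math.round(stats.recentlyRead * fraction);
        stats.recentlyDropped -= Math.round(stats.recentlyDropped * fraction);
    }
}
```

For instance, with recentlyRead at 30 million and recentlyDropped at 20 million, a compaction covering 300,000 of a tablet's 3 million k/v would leave the counters at 27 million and 18 million.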
                  
> Automatically determine when a full major compaction would benefit scans
> ------------------------------------------------------------------------
>
>                 Key: ACCUMULO-1266
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1266
>             Project: Accumulo
>          Issue Type: New Feature
>            Reporter: Keith Turner
>
> In the following situation, there is a tipping point where it becomes beneficial to
> do a full major compaction.
>  * a tablet is frequently scanned
>  * scan time iterators suppress a lot of data
>  * a full major compaction would also suppress that data
> Examples of this are tablets with lots of deletes, versions that are suppressed, data
> that is combined, and data that is filtered.
> If tablet servers kept track of statistics about scans, could these be used to determine
> when it's beneficial to automatically compact? In the following simple example, it seems
> obvious that a major compaction would be beneficial. In this example, scans over the last
> hour have had to examine and throw away 20 million unneeded keys. A lot of scan work could
> have been saved by doing a major compaction.
>  * all scans over tabletA within the last hour have read 30 million keys and returned 10 million keys
>  * tabletA has 3 million keys
>  * a major compaction would reduce tabletA to 1 million keys and result in future scans returning all keys read.
> One complicating factor is that major compactions may have a different set of iterators
> configured. Therefore it's possible that scans may filter a lot of data while major
> compactions do not. Could possibly keep track of the ratio of data dropped by compactions
> and the ratio of data dropped by scans. This could be used when deciding if a major
> compaction should be done to improve scan performance.
> What other situations can cause unnecessary major compactions and need to be defended against?
> In the case where a compaction of just the data in memory would benefit scans, ACCUMULO-519
> may solve the problem that this ticket is looking to solve.
> So what should the formula be?  
> {code:java}
>   // k/v             : key/value pairs
>   // recentlyRead    : total number of k/v read by recent scans, before applying iterators
>   //                   (recentlyRead - recentlyDropped equals # of k/v returned to users)
>   // majcDropRatio   : ratio of k/v dropped by recent major compactions
>   // totalKeyValues  : total # of k/v in tablet
>   // R               : a user-configurable ratio, like the current major compaction ratio
>   //                   that is based on files
>   if (recentlyRead * majcDropRatio > R * totalKeyValues) {
>      doFullMajorCompaction();
>      resetScanStats();
>   }
> {code}
> The example formula above has an issue: it may initiate a major compaction even when scans
> are not reading a part of the tablet that drops data. The formula below tries to remedy this.
> {code:java}
>   // k/v             : key/value pairs
>   // recentlyDropped : number of k/v dropped by recent scans
>   // recentlyRead    : total number of k/v read by recent scans, before applying iterators
>   //                   (recentlyRead - recentlyDropped equals # of k/v returned to users)
>   // majcDropRatio   : ratio of k/v dropped by recent major compactions
>   // totalKeyValues  : total # of k/v in tablet
>   // R               : a user-configurable ratio, like the current major compaction ratio
>   //                   that is based on files
>   if ((recentlyDropped > R * totalKeyValues) && (recentlyRead * majcDropRatio > R * totalKeyValues)) {
>      doFullMajorCompaction();
>      resetScanStats();
>   }
> {code}
> An issue with the above is that the total # of key values for a tablet may not be accurate
> because of bulk imports and splits.
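
To make the two formulas in the description concrete, here is a minimal Java sketch applied to the tabletA numbers. The class name CompactionHeuristic, the method names, and the value R = 3.0 are assumptions chosen for illustration only. majcDropRatio = 2/3 is derived from the example (a compaction reducing 3 million keys to 1 million), and recentlyDropped = 20 million is the 30 million keys read minus the 10 million returned.

```java
// Hypothetical sketch of the two candidate formulas from the ticket; not Accumulo code.
final class CompactionHeuristic {

    // First formula: looks only at how much was read and how much compactions tend to drop.
    static boolean firstFormula(long recentlyRead, double majcDropRatio,
                                long totalKeyValues, double r) {
        return recentlyRead * majcDropRatio > r * totalKeyValues;
    }

    // Second formula: additionally requires that scans actually dropped enough k/v.
    static boolean secondFormula(long recentlyRead, long recentlyDropped,
                                 double majcDropRatio, long totalKeyValues, double r) {
        double threshold = r * totalKeyValues;
        return recentlyDropped > threshold
                && recentlyRead * majcDropRatio > threshold;
    }
}
```

With R = 3.0 the threshold is 9 million, so `secondFormula(30_000_000L, 20_000_000L, 2.0 / 3.0, 3_000_000L, 3.0)` returns true. If the same scans had dropped nothing (recentlyDropped = 0), `firstFormula` would still return true while `secondFormula` would return false, which is the case the second formula is meant to guard against.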

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
