accumulo-notifications mailing list archives

From GitBox <>
Subject [GitHub] keith-turner opened a new issue #564: Add multiple compaction thread pools and allow multiple compactions per tablet
Date Wed, 18 Jul 2018 15:58:13 GMT
keith-turner opened a new issue #564: Add multiple compaction thread pools and allow multiple
compactions per tablet
   Currently there is a single thread pool/executor for compactions and only a single compaction
can run per tablet.  This can cause problems when a user initiates a single long-running filter
or transform compaction, because new files build up and are not compacted.  Ideally a long-running
compaction for a tablet could run in executor1 while the tablet's new files are compacted
in executor2.
   The current user pluggable CompactionStrategy class is not well suited for handling this
case of multiple executors and compactions per tablet.  The following design is better suited
for managing this concurrency in a way that is easy to understand.  In this design the CompactionManager
and CompactionPrioritizer are user pluggable.  Currently, prioritization of queued compactions
is not configurable.
   | Functional components | Description |
   |-----------------------|-------------|
   | CompactionJob         | Immutable class that describes work to be done.  Contains list of files to compact, info about iterators for user compactions, info about output file (like compression type). |
   | CompactionManager     | Per table class that decides what compactions to do for a tablet.  Can create and cancel compaction jobs.  Can see list of existing jobs.  Can submit multiple jobs for a table as long as files are disjoint.  This class decides which executor should process a job. |
   | CompactionPrioritizer | Per executor class that decides which compaction job to execute next. |
   | CompactionExecutor    | Each tablet server has one or more executors that process compaction jobs.  These are configured system wide.  Number of threads, rate limits, and max files per compaction are some things that can be configured.  If a job exceeds the max files, then the executor will process it in multiple passes. |
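
   The components in the table could be modeled roughly as follows.  This is only a sketch:
every type name, field, and method signature here is a hypothetical illustration and not an
existing Accumulo API.  It shows an immutable job, the cancel/submit result a manager might
return, and the two pluggable interfaces.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: these types and signatures are illustrative, not Accumulo APIs.

/** Immutable description of one unit of compaction work. */
final class CompactionJob {
    private final Set<String> files;   // files to compact
    private final String executor;     // executor chosen by the manager
    private final String compression;  // example of output file info

    CompactionJob(Collection<String> files, String executor, String compression) {
        // Defensive copy: later changes to the caller's collection cannot alter the job.
        this.files = Collections.unmodifiableSet(new LinkedHashSet<>(files));
        this.executor = executor;
        this.compression = compression;
    }

    Set<String> files() { return files; }
    String executor() { return executor; }
    String compression() { return compression; }

    /** Jobs for the same tablet may coexist only if they share no files. */
    boolean isDisjointFrom(CompactionJob other) {
        return Collections.disjoint(files, other.files);
    }
}

/** What a manager returns: jobs to cancel and new jobs to run. */
final class CompactionPlan {
    final List<CompactionJob> toCancel;
    final List<CompactionJob> toSubmit;

    CompactionPlan(List<CompactionJob> toCancel, List<CompactionJob> toSubmit) {
        this.toCancel = Collections.unmodifiableList(new ArrayList<>(toCancel));
        this.toSubmit = Collections.unmodifiableList(new ArrayList<>(toSubmit));
    }
}

/** Per table, user pluggable: decides what to compact and on which executor. */
interface CompactionManager {
    CompactionPlan plan(Set<String> tabletFiles, List<CompactionJob> existingJobs);
}

/** Per executor, user pluggable: the job that compares lowest would run next. */
interface CompactionPrioritizer extends Comparator<CompactionJob> {}
```

   Modeling the prioritizer as a Comparator is one possible choice; it would let an executor
keep its queue in a priority queue ordered by user code.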
   One major goal with this design is to make it easy for the user to write code that avoids
concurrency mayhem.  The underlying idea is that a compaction manager will be called in
the following way.
    * System gathers a snapshot of tablet files and current compaction jobs.
    * System calls compaction manager with gathered snapshots.
    * The compaction manager returns jobs to cancel and new jobs to run.
    * If the set of files and/or jobs has changed the decisions are ignored and the manager
is called again.
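
   The call sequence above resembles an optimistic retry loop.  A minimal sketch follows,
assuming a per-tablet version counter is used to detect changes; the class, the counter, and
the use of plain strings for files and jobs are all assumptions made for illustration, not
part of the proposal.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of the call protocol; names are illustrative, not Accumulo APIs.
final class TabletCompactionState {
    /** Immutable view of files and jobs at one point in time. */
    static final class Snapshot {
        final Set<String> files;
        final Set<String> jobs;
        final long version;

        Snapshot(Set<String> files, Set<String> jobs, long version) {
            this.files = Collections.unmodifiableSet(new HashSet<>(files));
            this.jobs = Collections.unmodifiableSet(new HashSet<>(jobs));
            this.version = version;
        }
    }

    /** The manager's answer: jobs to cancel and new jobs to run. */
    static final class Decision {
        final Set<String> cancel;
        final Set<String> submit;

        Decision(Set<String> cancel, Set<String> submit) {
            this.cancel = cancel;
            this.submit = submit;
        }
    }

    private final Set<String> files = new HashSet<>();
    private final Set<String> jobs = new HashSet<>();
    private long version = 0; // bumped on every change to files or jobs

    synchronized void addFile(String file) {
        if (files.add(file)) version++;
    }

    synchronized Snapshot snapshot() {
        return new Snapshot(files, jobs, version);
    }

    synchronized Set<String> jobs() {
        return new HashSet<>(jobs);
    }

    /** Applies the decision only if nothing changed since the snapshot was taken. */
    synchronized boolean tryApply(Snapshot snap, Decision d) {
        if (snap.version != version) return false; // stale: ignore the decision
        boolean changed = jobs.removeAll(d.cancel);
        changed |= jobs.addAll(d.submit);
        if (changed) version++;
        return true;
    }

    /** Calls the manager until a decision is based on current state; returns the call count. */
    int plan(Function<Snapshot, Decision> manager) {
        int calls = 0;
        while (true) {
            Snapshot snap = snapshot();
            Decision d = manager.apply(snap); // runs outside the lock, on immutable data
            calls++;
            if (tryApply(snap, d)) return calls;
        }
    }
}
```

   Because the manager only ever sees an immutable snapshot, user code in this sketch needs
no locking of its own; a stale decision is simply discarded and the manager is asked again.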
   With this model the prioritizer is dealing with immutable jobs that will not magically
change when it is time to run the job (which is what happens with the current compaction
strategy).  This makes reasoning about creating, canceling, and prioritizing jobs sane.
   The following is an example of how this might work.  In this example assume executor E1
is intended for small compactions and executor E2 is for large compactions. Small vs large
could be a function of the input file sizes.
    * Tablet T1 has three files F1,F2,F3
    * Compaction manager decides to compact F2 and F3 on executor E1 as job J1
    * A new file F4 is added to T1
    * J1 is still queued on E1
    * Compaction manager decides to cancel J1 and compact F1, F2, F3, and F4 on executor E2 as job J2
    * Nothing changed, so J1 is canceled and J2 is submitted. 
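
   The T1 walk-through above can be replayed with a small size-based manager.  This is a
sketch under invented assumptions: the policy (compact only small files until the tablet
exceeds a file count, then compact everything), the thresholds, the file sizes, and all
names are hypothetical and were chosen only so that the example produces the same decisions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: the policy and thresholds are invented for illustration.
final class SizeBasedPlanner {
    static final class Job {
        final String id;
        final Set<String> files;
        final String executor;

        Job(String id, Set<String> files, String executor) {
            this.id = id;
            this.files = Collections.unmodifiableSet(new TreeSet<>(files));
            this.executor = executor;
        }
    }

    static final class Plan {
        final List<Job> cancel;
        final List<Job> submit;

        Plan(List<Job> cancel, List<Job> submit) {
            this.cancel = cancel;
            this.submit = submit;
        }
    }

    private final long smallFileSize; // files at or below this size count as small
    private final long smallJobSize;  // jobs at or below this total size go to E1
    private final int maxFiles;       // above this file count, compact everything
    private int nextJobId = 1;

    SizeBasedPlanner(long smallFileSize, long smallJobSize, int maxFiles) {
        this.smallFileSize = smallFileSize;
        this.smallJobSize = smallJobSize;
        this.maxFiles = maxFiles;
    }

    /** Decides on jobs from a snapshot of file sizes and the currently queued jobs. */
    Plan plan(Map<String, Long> fileSizes, List<Job> queued) {
        Set<String> inputs = new TreeSet<>();
        if (fileSizes.size() > maxFiles) {
            inputs.addAll(fileSizes.keySet());
        } else {
            for (Map.Entry<String, Long> e : fileSizes.entrySet()) {
                if (e.getValue() <= smallFileSize) inputs.add(e.getKey());
            }
        }
        Plan nothing = new Plan(Collections.<Job>emptyList(), Collections.<Job>emptyList());
        if (inputs.size() < 2) return nothing;

        List<Job> cancel = new ArrayList<>();
        for (Job j : queued) {
            if (j.files.equals(inputs)) return nothing;                // already queued
            if (!Collections.disjoint(j.files, inputs)) cancel.add(j); // overlapping job
        }

        long total = 0;
        for (String f : inputs) total += fileSizes.get(f);
        String executor = total <= smallJobSize ? "E1" : "E2";         // small vs large
        Job job = new Job("J" + nextJobId++, inputs, executor);
        return new Plan(cancel, Collections.singletonList(job));
    }
}
```

   With sizes F1=100, F2=1, F3=2 (arbitrary units) and thresholds 10, 50, and 3, the first
call queues F2 and F3 on E1 as J1; after F4=3 is added, the second call cancels J1 and
queues all four files on E2 as J2.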
   For user-initiated compactions, compaction strategies would still be used for compatibility.
 The behavior should be the following:
     * Cancel existing queued jobs (that are system initiated) and prevent more jobs from
     * Wait for any running jobs to complete
     * Apply the user's strategy and create a job.
     * Ask the compaction manager which executor the job should be queued on. 
