accumulo-notifications mailing list archives

From "Shawn Walker (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ACCUMULO-4355) Provide more granular control for bulk import operations
Date Mon, 27 Jun 2016 17:02:51 GMT
Shawn Walker created ACCUMULO-4355:
--------------------------------------

             Summary: Provide more granular control for bulk import operations
                 Key: ACCUMULO-4355
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4355
             Project: Accumulo
          Issue Type: Wish
          Components: master, tserver
            Reporter: Shawn Walker
            Assignee: Shawn Walker


Accumulo currently provides mechanisms to initiate bulk imports and to list bulk imports in
progress.  Scheduling of bulk import requests is not entirely deterministic, and most of the
execution of a bulk-import request is done in a non-preemptable manner.  As such, any bulk
import which takes very long to complete can block bulk imports with higher operational priority
for significant periods.

To better support bulk-import-heavy applications, it would be nice if Accumulo would offer
additional mechanisms for controlling the scheduling and execution of bulk imports, such as
the abilities to:

* Pause/resume a bulk import in progress.
* Prioritize/reprioritize bulk import requests.
* Cancel a bulk import in progress.  If possible, cancelling a partially completed bulk import
request should result in a rollback of changes.  That is, a bulk import should either succeed
or make no changes.
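
As a rough illustration of the controls listed above, the following Java sketch keeps bulk
import requests in memory and lets a caller pause, resume, reprioritize, and cancel them.
Every name here (BulkImportScheduler, Request, pollNext, and so on) is hypothetical and not
part of Accumulo, and the convention that a larger number means higher priority is an
assumption.  Pausing or cancelling a request that is already running only flips a flag; a
real worker would have to check that flag at safe points.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; none of these types exist in Accumulo.
class BulkImportScheduler {
  enum State { QUEUED, PAUSED, RUNNING, CANCELLED }

  static final class Request {
    final String id;
    int priority;         // assumption: larger value = more urgent
    final long sequence;  // submission order, used to break priority ties
    State state = State.QUEUED;

    Request(String id, int priority, long sequence) {
      this.id = id;
      this.priority = priority;
      this.sequence = sequence;
    }
  }

  private static final Comparator<Request> MOST_URGENT_FIRST =
      Comparator.comparingInt((Request r) -> r.priority).reversed()
          .thenComparingLong(r -> r.sequence);

  private final Map<String, Request> requests = new HashMap<>();
  private long nextSequence = 0;

  synchronized void submit(String id, int priority) {
    requests.put(id, new Request(id, priority, nextSequence++));
  }

  synchronized void reprioritize(String id, int priority) {
    lookup(id).priority = priority;
  }

  synchronized void pause(String id) {
    Request r = lookup(id);
    if (r.state == State.QUEUED || r.state == State.RUNNING) {
      r.state = State.PAUSED;
    }
  }

  synchronized void resume(String id) {
    Request r = lookup(id);
    if (r.state == State.PAUSED) {
      r.state = State.QUEUED;
    }
  }

  synchronized void cancel(String id) {
    lookup(id).state = State.CANCELLED;
  }

  /** Picks the most urgent queued request, marks it RUNNING, and returns its id. */
  synchronized Optional<String> pollNext() {
    Optional<Request> next = requests.values().stream()
        .filter(r -> r.state == State.QUEUED)
        .min(MOST_URGENT_FIRST);
    next.ifPresent(r -> r.state = State.RUNNING);
    return next.map(r -> r.id);
  }

  private Request lookup(String id) {
    Request r = requests.get(id);
    if (r == null) {
      throw new IllegalArgumentException("unknown bulk import: " + id);
    }
    return r;
  }
}
```

Rollback on cancel is deliberately left out of this sketch; it only covers scheduling.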

Additionally, for multitenant situations, it would be nice if Accumulo would:

* Provide multiple queues for bulk import requests.  Each queue would have its requests worked
serially in priority order.  Requests in separate queues should be worked in parallel, or
have time distributed among the queues in such a manner as to make work appear roughly parallel.
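
One way to picture the multi-queue idea is a dispatcher that visits the queues round-robin
and, on each visit, hands out that queue's highest-priority request.  The Java sketch below
is only an illustration under assumptions: MultiQueueDispatcher and its methods are
hypothetical names, a larger number is assumed to mean higher priority, and the sketch does
not track completion, so it does not by itself enforce that a queue works one request at a
time.

```java
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch; none of these types exist in Accumulo.
class MultiQueueDispatcher {
  static final class Request {
    final String id;
    final int priority;  // assumption: larger value = more urgent
    final long sequence; // submission order, used to break priority ties

    Request(String id, int priority, long sequence) {
      this.id = id;
      this.priority = priority;
      this.sequence = sequence;
    }
  }

  private static final Comparator<Request> MOST_URGENT_FIRST =
      Comparator.comparingInt((Request r) -> r.priority).reversed()
          .thenComparingLong(r -> r.sequence);

  private final Map<String, PriorityQueue<Request>> queues = new HashMap<>();
  private final Deque<String> rotation = new ArrayDeque<>(); // round-robin order of queue names
  private long nextSequence = 0;

  void submit(String queueName, String id, int priority) {
    PriorityQueue<Request> queue = queues.get(queueName);
    if (queue == null) {
      queue = new PriorityQueue<>(MOST_URGENT_FIRST);
      queues.put(queueName, queue);
      rotation.addLast(queueName); // new queues join the end of the rotation
    }
    queue.add(new Request(id, priority, nextSequence++));
  }

  /**
   * Returns the id of the next request to work, giving each queue one turn
   * in rotation, or null when every queue is empty.
   */
  String pollNext() {
    int queueCount = rotation.size();
    for (int i = 0; i < queueCount; i++) {
      String name = rotation.pollFirst();
      rotation.addLast(name); // this queue goes to the back regardless of outcome
      Request r = queues.get(name).poll();
      if (r != null) {
        return r.id;
      }
    }
    return null;
  }
}
```

With queues "tenantA" holding a1 (priority 1) and a2 (priority 9) and "tenantB" holding b1,
successive calls to pollNext() yield a2, b1, a1, so a long backlog in one queue does not
starve the other.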

----
Implementation-wise, I'm thinking of rewriting much of the current bulk-loading logic.  While
the current logic depends upon multiple threads executing (potentially long-duration) blocking
RPC calls, I'd like to move to a more event-driven/message-passing model backed by a persistent
state machine.
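
To make the "persistent state machine" idea concrete, here is a minimal Java sketch in which
each bulk import advances only in response to events, and every accepted event is appended
to a journal so the state can be rebuilt after a restart.  The states, events, and class
name are all invented for illustration, and the in-memory List stands in for whatever
durable storage would actually be used.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; states and events are illustrative, not Accumulo's.
class BulkImportStateMachine {
  enum State { SUBMITTED, ASSIGNING, LOADING, PAUSED, DONE, CANCELLED }
  enum Event { START, FILES_ASSIGNED, PAUSE, RESUME, ALL_LOADED, CANCEL }

  private static final Map<State, Map<Event, State>> TRANSITIONS =
      new EnumMap<State, Map<Event, State>>(State.class);

  static {
    allow(State.SUBMITTED, Event.START, State.ASSIGNING);
    allow(State.ASSIGNING, Event.FILES_ASSIGNED, State.LOADING);
    allow(State.LOADING, Event.PAUSE, State.PAUSED);
    allow(State.PAUSED, Event.RESUME, State.LOADING);
    allow(State.LOADING, Event.ALL_LOADED, State.DONE);
    // Any state that has not finished may be cancelled.
    allow(State.SUBMITTED, Event.CANCEL, State.CANCELLED);
    allow(State.ASSIGNING, Event.CANCEL, State.CANCELLED);
    allow(State.LOADING, Event.CANCEL, State.CANCELLED);
    allow(State.PAUSED, Event.CANCEL, State.CANCELLED);
  }

  private static void allow(State from, Event on, State to) {
    Map<Event, State> row = TRANSITIONS.get(from);
    if (row == null) {
      row = new EnumMap<Event, State>(Event.class);
      TRANSITIONS.put(from, row);
    }
    row.put(on, to);
  }

  private final List<Event> journal; // stand-in for durable storage
  private State state = State.SUBMITTED;

  /** Rebuilds the current state by replaying previously journaled events. */
  BulkImportStateMachine(List<Event> journal) {
    this.journal = journal;
    for (Event e : journal) {
      state = next(state, e);
    }
  }

  /** Validates the event, journals it, then applies the transition. */
  State handle(Event e) {
    State target = next(state, e); // throws before anything is journaled
    journal.add(e);
    state = target;
    return state;
  }

  State state() {
    return state;
  }

  private static State next(State from, Event e) {
    Map<Event, State> row = TRANSITIONS.get(from);
    State to = (row == null) ? null : row.get(e);
    if (to == null) {
      throw new IllegalStateException("event " + e + " not allowed in state " + from);
    }
    return to;
  }
}
```

Because handlers return immediately after recording a transition, no thread has to sit in a
long blocking call; whatever delivers the events is outside the scope of this sketch.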

Current ideas I'm playing around with (very tentative):
* Creating a new table {{accumulo.bulk_load_queues}} to keep track of bulk load progress.
* Distributing bulk load orchestration via a mechanism similar to tablet assignment instead
of the current blocking RPC calls (LoadFiles.java:156).
* Implementing something akin to a two-phase commit to achieve rollback behavior on failure.
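
The two-phase-commit idea in the last bullet can be sketched as follows: every participant
first stages the import without making it visible, and the import is committed only if all
of them succeed; otherwise everything staged so far is rolled back.  This is a simplified
illustration under assumptions, not a design.  Participant, InMemoryTablet, and run are
hypothetical names, rollback is assumed to be idempotent, and a real coordinator would also
need a durable log and handling for failures during the commit phase.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these types exist in Accumulo.
class TwoPhaseBulkImport {
  /** Hypothetical participant, for example one tablet receiving files. */
  interface Participant {
    boolean prepare(String importId); // stage the files and vote yes (true) or no (false)
    void commit(String importId);     // make staged files visible
    void rollback(String importId);   // discard staged files; assumed idempotent
  }

  /** Returns true if the import committed everywhere, false if it was rolled back. */
  static boolean run(String importId, List<Participant> participants) {
    List<Participant> touched = new ArrayList<>();
    // Phase 1: ask every participant to stage the import.
    for (Participant p : participants) {
      touched.add(p);
      boolean ok;
      try {
        ok = p.prepare(importId);
      } catch (RuntimeException e) {
        ok = false; // treat an exception as a "no" vote
      }
      if (!ok) {
        for (Participant q : touched) {
          q.rollback(importId);
        }
        return false;
      }
    }
    // Phase 2: everyone voted yes, so make the import visible.
    for (Participant p : touched) {
      p.commit(importId);
    }
    return true;
  }

  /** Toy participant used only to exercise the protocol. */
  static final class InMemoryTablet implements Participant {
    final Set<String> visible = new HashSet<>();
    private final Set<String> staged = new HashSet<>();
    private final boolean failPrepare;

    InMemoryTablet(boolean failPrepare) {
      this.failPrepare = failPrepare;
    }

    @Override
    public boolean prepare(String importId) {
      if (failPrepare) {
        return false;
      }
      staged.add(importId);
      return true;
    }

    @Override
    public void commit(String importId) {
      if (staged.remove(importId)) {
        visible.add(importId);
      }
    }

    @Override
    public void rollback(String importId) {
      staged.remove(importId);
    }
  }
}
```

With one healthy InMemoryTablet and one constructed with failPrepare set to true, run
returns false and the healthy tablet's visible set stays empty, which is the "either succeed
or make no changes" behavior asked for above.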



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
