accumulo-notifications mailing list archives

From "Christopher Tubbs (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (ACCUMULO-4355) Provide more granular control for bulk import operations
Date Fri, 01 Jul 2016 17:49:10 GMT

     [ https://issues.apache.org/jira/browse/ACCUMULO-4355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christopher Tubbs updated ACCUMULO-4355:
----------------------------------------
    Fix Version/s: 1.9.0

> Provide more granular control for bulk import operations
> --------------------------------------------------------
>
>                 Key: ACCUMULO-4355
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4355
>             Project: Accumulo
>          Issue Type: Wish
>          Components: master, tserver
>            Reporter: Shawn Walker
>            Assignee: Shawn Walker
>             Fix For: 1.9.0
>
>
> Accumulo currently provides mechanisms to initiate bulk imports and to list bulk imports
in progress.  Scheduling of bulk import requests is not entirely deterministic, and most of
the execution of a bulk import request is done in a non-preemptable manner.  As such, any
bulk import that takes a long time to complete can block bulk imports with higher operational
priority for significant periods.
> To better support bulk-import-heavy applications, it would be nice if Accumulo would
offer additional mechanisms for controlling the scheduling and execution of bulk imports,
such as the abilities to:
> * Pause/resume a bulk import in progress.
> * Prioritize/reprioritize bulk import requests.
> * Cancel a bulk import in progress.  If possible, cancelling a partially completed bulk
import request should result in a rollback of its changes.  That is, a bulk import should either
succeed or make no changes.
> Additionally, for multitenant situations, it would be nice if Accumulo would:
> * Provide multiple queues for bulk import requests.  Each queue would have its requests
worked serially in priority order.  Requests in separate queues should be worked in parallel,
or have time distributed among the queues in such a manner as to make work appear roughly parallel.
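
The multi-queue behaviour described above can be illustrated with a small sketch. Nothing here is Accumulo API: `BulkRequest`, `BulkQueueScheduler`, the queue names, and the convention that a larger number means higher priority are all assumptions made for illustration. Each named queue hands out its requests in priority order, and a round-robin cursor spreads turns across queues so that work appears roughly parallel.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical request record; not part of Accumulo.
final class BulkRequest {
  final String id;
  final int priority; // assumption: larger value = more urgent
  final long seq;     // submission order, used to break priority ties

  BulkRequest(String id, int priority, long seq) {
    this.id = id;
    this.priority = priority;
    this.seq = seq;
  }
}

// Hypothetical scheduler: one priority queue per tenant, served round-robin.
final class BulkQueueScheduler {
  // Highest priority first; among equal priorities, earliest submission first.
  private static final Comparator<BulkRequest> ORDER =
      Comparator.comparingInt((BulkRequest r) -> r.priority)
          .reversed()
          .thenComparingLong(r -> r.seq);

  // LinkedHashMap keeps queues in creation order so round-robin is stable.
  private final Map<String, PriorityQueue<BulkRequest>> queues = new LinkedHashMap<>();
  private long nextSeq = 0;
  private int cursor = 0;

  void submit(String queueName, String id, int priority) {
    queues.computeIfAbsent(queueName, k -> new PriorityQueue<>(ORDER))
        .add(new BulkRequest(id, priority, nextSeq++));
  }

  // Returns the id of the next request to work, or null if every queue is empty.
  String next() {
    List<String> names = new ArrayList<>(queues.keySet());
    for (int i = 0; i < names.size(); i++) {
      int idx = (cursor + i) % names.size();
      PriorityQueue<BulkRequest> q = queues.get(names.get(idx));
      if (!q.isEmpty()) {
        cursor = (idx + 1) % names.size(); // the following queue gets the next turn
        return q.poll().id;
      }
    }
    return null;
  }
}
```

In this sketch, reprioritizing a request would amount to removing it from its queue and resubmitting it with a new priority; a real implementation would also need to persist the queues and coordinate across servers.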
> ----
> Implementation-wise, I'm thinking of rewriting much of the current bulk-loading logic.
 While the current logic depends upon multiple threads executing (potentially long-duration)
blocking RPC calls, I'd like to move to a more event-driven/message-passing model backed by
a persistent state machine.
> Current ideas I'm playing around with (very tentative):
> * Creating a new table {{accumulo.bulk_load_queues}} to keep track of bulk load progress.
> * Distributing bulk load orchestration via a mechanism similar to tablet assignment instead
of the current blocking RPC calls (LoadFiles.java:156).
> * Implementing something akin to a two-phase commit to achieve rollback behavior on failure.
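
As a rough illustration of the "succeed or make no changes" goal, the sketch below models a bulk import as a tiny state machine with a prepare phase and a commit phase. It is not the proposed design: `ImportState`, `TwoPhaseBulkImport`, and the `canPrepare` predicate are hypothetical names, tablets are stand-in strings, and state is held in memory, whereas the proposal calls for persisting it (for example in the suggested {{accumulo.bulk_load_queues}} table).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical lifecycle states for one bulk import request.
enum ImportState { QUEUED, PREPARING, COMMITTED, ROLLED_BACK }

// Hypothetical all-or-nothing import: files become visible only if every
// tablet accepts the prepare step; otherwise staged work is discarded.
final class TwoPhaseBulkImport {
  private ImportState state = ImportState.QUEUED;
  private final List<String> prepared = new ArrayList<>(); // staged, not yet visible
  private final List<String> visible = new ArrayList<>();  // committed changes

  ImportState run(List<String> tablets, Predicate<String> canPrepare) {
    state = ImportState.PREPARING;
    // Phase 1: stage the change on every tablet without making it visible.
    for (String tablet : tablets) {
      if (!canPrepare.test(tablet)) {
        prepared.clear(); // roll back: drop everything staged so far
        state = ImportState.ROLLED_BACK;
        return state;
      }
      prepared.add(tablet);
    }
    // Phase 2: every tablet prepared successfully, so publish all at once.
    visible.addAll(prepared);
    prepared.clear();
    state = ImportState.COMMITTED;
    return state;
  }

  ImportState state() {
    return state;
  }

  List<String> visibleTablets() {
    return new ArrayList<>(visible);
  }
}
```

A persistent version would record each state transition durably before acting on it, so that a restarted process could resume or roll back from the last recorded state; pause and cancel requests could then be checked between steps of the prepare loop.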



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
