accumulo-notifications mailing list archives

From "Keith Turner (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-4813) Accepting mapping file for bulk import
Date Fri, 16 Feb 2018 07:59:00 GMT


Keith Turner commented on ACCUMULO-4813:

[~m-hogue] it would be additive. I think it would be nice to deprecate the current bulk import
process in favor of this.  This could be done by adding new APIs that do bulk import with
a mapping file and deprecating the current APIs.

> Accepting mapping file for bulk import
> --------------------------------------
>                 Key: ACCUMULO-4813
>                 URL:
>             Project: Accumulo
>          Issue Type: Sub-task
>            Reporter: Keith Turner
>            Priority: Major
>             Fix For: 2.0.0
> During bulk import, inspecting files to determine where they go is expensive and slow. 
In order to spread the cost, Accumulo has an internal mechanism to distribute the work of
inspecting files across random tablet servers.  Because this internal process takes time and
consumes resources on the cluster, users want control over it.  The best way to give this
control may be to externalize it by allowing bulk imports to supply a mapping file.  This
mapping file would specify the ranges where files should be loaded.  If Accumulo provided an
API to help produce this file, then that work could be done in MapReduce or Spark.  This
would give users all the control they want over when and where this computation is done, and
it would fit naturally into the process used to create the bulk files. 
> To make bulk import fast, this mapping file should have the following properties (one
possible serialized form is sketched after the list).
>  * Key in file is a range
>  * Value in file is a list of files
>  * Ranges are non-overlapping
>  * File is sorted by range/key
>  * Has a mapping for every non-empty file in the bulk import directory
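> As a minimal sketch (all names here are hypothetical, not a committed API), one entry in
the serialized mapping file might look like:
> 
>     import java.util.List;
> 
>     // Hypothetical shape of one mapping file entry; names are illustrative only.
>     // Range bounds follow Accumulo's tablet convention: prevEndRow exclusive,
>     // endRow inclusive, with null meaning -infinity/+infinity.
>     class RowRange {
>       byte[] prevEndRow;
>       byte[] endRow;
>     }
> 
>     class MappingEntry {
>       RowRange range;     // key: a non-overlapping tablet row range
>       List<String> files; // value: rfiles with data falling in the range
>     }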
> If Accumulo provides APIs to do the following operations, then producing the file could be
written as a map/reduce job (possible signatures are sketched below).
>  * For a given rfile, produce a list of row ranges where the file should be loaded. 
These row ranges would be based on tablets.
>  * Merge (row range, list of files) pairs
>  * Serialize (row range, list of files) pairs
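> A rough sketch of what such helper APIs might look like (the issue describes the
operations, not signatures, so everything here is hypothetical; RowRange is the sketch
above):
> 
>     import java.util.List;
>     import java.util.SortedMap;
>     import java.util.SortedSet;
> 
>     // Hypothetical helper API for producing the mapping file in MapReduce/Spark.
>     interface BulkMappingSupport {
> 
>       // Map phase: inspect one rfile and return the tablet row ranges it overlaps.
>       List<RowRange> computeLoadRanges(String rfilePath, SortedSet<RowRange> tabletRanges);
> 
>       // Reduce phase: merge (range, files) pairs produced for individual rfiles.
>       SortedMap<RowRange, List<String>> merge(
>           Iterable<SortedMap<RowRange, List<String>>> partialMappings);
> 
>       // Output phase: write the merged mapping, sorted by range, to the mapping file.
>       void serialize(SortedMap<RowRange, List<String>> mapping, String mappingFile);
>     }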
> With a mapping file, the bulk import algorithm could be written as follows (a rough sketch
of the loop follows the list).  This could all be executed in the master with no need to run
inspection tasks on random tablet servers.
>  * Sanity check the mapping file
>  ** Ensure it is in sorted order
>  ** Ensure ranges are non-overlapping
>  ** Ensure each file in the directory has at least one entry in the mapping file
>  ** Ensure all splits in the mapping file exist in the table
>  * Since the file is sorted, do a merged read of the mapping file and the metadata table,
looping over the following operations for each tablet until all files are loaded.
>  ** Read the loaded files for the tablet
>  ** Read the files to load for the range
>  ** For any files not yet loaded, send an async load message to the tablet server
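> Loosely sketched (the metadata access and load RPC below are placeholders, not existing
Accumulo calls), the master-side loop might look like:
> 
>     import java.util.List;
> 
>     // All types and calls here are placeholders sketching the proposed algorithm.
>     interface TabletMeta {
>       RowRange range();
>       List<String> loadedFiles();
>       String server();
>     }
> 
>     class BulkLoadDriver {
>       interface MappingFileReader {
>         List<String> filesFor(RowRange range);
>       }
> 
>       // Re-scan metadata and re-send async load messages until every file in
>       // the mapping is loaded. Messages are fire-and-forget, so the loop simply
>       // converges as tablet servers record loaded files in the metadata table.
>       void drive(Iterable<TabletMeta> tabletsInRowOrder, MappingFileReader mapping) {
>         boolean done = false;
>         while (!done) {
>           done = true;
>           for (TabletMeta tablet : tabletsInRowOrder) { // merged, sorted read
>             for (String file : mapping.filesFor(tablet.range())) {
>               if (!tablet.loadedFiles().contains(file)) {
>                 sendAsyncLoad(tablet.server(), tablet.range(), file);
>                 done = false;
>               }
>             }
>           }
>         }
>       }
> 
>       void sendAsyncLoad(String server, RowRange range, String file) {
>         // placeholder: RPC asking the tablet server to load the file, no wait
>       }
>     }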
> The above algorithm can just keep scanning the metadata table and sending async load
messages until the bulk import is complete.  Since the load messages are async, the bulk
load of a large number of files could potentially be very fast.
> The bulk load operation can easily handle tablets splitting during the operation
by matching a single range in the file to multiple tablets.  However, attempting to handle
merges would be much trickier.  It would probably be simplest to fail the operation if
a merge is detected, and this can be done in a very clean way.  Once the bulk import
operation has the table lock, merges cannot happen.  So after getting the table lock, the
bulk import operation can ensure all splits in the file exist in the table, and abort before
doing any work if the condition is not met.  A failed check indicates a merge happened
between generating the mapping file and doing the bulk import.
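> As a sketch of that precondition check (the split set could come from
TableOperations.listSplits(); the rest is illustrative plumbing):
> 
>     import java.util.Set;
>     import org.apache.hadoop.io.Text;
> 
>     class MergeCheck {
>       // Run after taking the table lock. A mapping-file range boundary that is
>       // no longer a split implies a merge happened after the mapping file was
>       // generated, so abort before doing any work.
>       static void ensureSplitsExist(Set<Text> tableSplits, Iterable<Text> mappingEndRows) {
>         for (Text endRow : mappingEndRows) {
>           if (endRow != null && !tableSplits.contains(endRow)) { // null = +inf
>             throw new IllegalStateException("Boundary " + endRow
>                 + " is not a split in the table; a merge likely occurred."
>                 + " Aborting bulk import.");
>           }
>         }
>       }
>     }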
> Hopefully the mapping file plus the algorithm that sends async load messages can dramatically
speed up bulk import operations.  This may lessen the need for other things like prioritizing
bulk import.  To measure this, it would be very useful to create a bulk import performance test
that can create many files with very little data and measure the time it takes to load them.
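> A sketch of such a test using the public RFile writer API; the import call itself is left
as a comment since it is exactly the piece this issue proposes to change:
> 
>     import org.apache.accumulo.core.client.rfile.RFile;
>     import org.apache.accumulo.core.client.rfile.RFileWriter;
>     import org.apache.accumulo.core.data.Key;
>     import org.apache.accumulo.core.data.Value;
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.io.Text;
> 
>     // Create many tiny rfiles so the test is dominated by per-file overhead
>     // rather than data volume, then time the import call under test.
>     public class TinyBulkFilesTest {
>       public static void main(String[] args) throws Exception {
>         FileSystem fs = FileSystem.get(new Configuration());
>         int numFiles = 1000;
>         for (int i = 0; i < numFiles; i++) {
>           String file = String.format("/tmp/bulkperf/f%04d.rf", i);
>           try (RFileWriter writer =
>               RFile.newWriter().to(file).withFileSystem(fs).build()) {
>             writer.startDefaultLocalityGroup();
>             writer.append(new Key(new Text(String.format("row%04d", i))),
>                 new Value(new byte[0]));
>           }
>         }
>         long start = System.nanoTime();
>         // importDirectory("/tmp/bulkperf") would go here, using whichever bulk
>         // import path is being measured (current API or mapping-file based).
>         System.out.printf("loaded %d files in %d ms%n", numFiles,
>             (System.nanoTime() - start) / 1_000_000);
>       }
>     }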
