accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-3602) BatchScanner optimization for AccumuloInputFormat
Date Tue, 24 Mar 2015 03:09:53 GMT


Josh Elser commented on ACCUMULO-3602:

Left some comments on your most recent commit in that branch. Thanks for sharing, [~echeipesh]!
Some bigger concerns that we'll need to talk through:

* Resource management on the BatchScanner. Are we sure that it will get closed? I might have
just missed it happening elsewhere (via ScannerBase).
* All of the MapReduce classes/interfaces/etc under `org.apache.accumulo.core.client` are
(painfully) considered to be in the public API. Since we're following semver, it's unlikely
that we'll be able to make this change in 1.6.3 as you've currently targeted. We'd have to go
for 1.7.0. The mapred/mapreduce classes often fall into this gray area where we don't *really*
want them all to be public API, but they got grandfathered in. We'll have to keep this in mind.
* The duplication between RangeInputSplit and MultiRangeInputSplit is painful (I think you
called this out to begin with). Maybe there's some more consolidation that can be done with
RangeInputSplit? I'm worried about it because we've had a couple of bugs come out of RangeInputSplit
due to the duplicative code and insufficient testing.
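
To make the resource-management concern in the first bullet concrete, here is a minimal sketch of one ownership pattern: the reader that creates the scanner is also the one that closes it, exactly once. It deliberately avoids the real Accumulo and Hadoop types; `ScannerLike`, `OwningReader`, `FakeScanner` and `ScannerCloseSketch` are hypothetical stand-ins invented for illustration, not classes from the patch under review.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a scanner that holds server-side resources.
// This is NOT the Accumulo BatchScanner interface.
interface ScannerLike extends AutoCloseable, Iterable<String> {
    @Override
    void close(); // narrowed so callers need not handle a checked exception
}

// Hypothetical reader that owns its scanner and releases it exactly once.
class OwningReader implements AutoCloseable {
    private final ScannerLike scanner;
    private final Iterator<String> iter;
    private boolean closed = false;

    OwningReader(ScannerLike scanner) {
        this.scanner = scanner;
        this.iter = scanner.iterator();
    }

    boolean hasNext() { return !closed && iter.hasNext(); }

    String next() { return iter.next(); }

    @Override
    public void close() {
        if (!closed) {          // guard against a double close
            closed = true;
            scanner.close();
        }
    }
}

// Test double that counts how often close() is called.
class FakeScanner implements ScannerLike {
    int closeCalls = 0;
    private final List<String> rows;

    FakeScanner(List<String> rows) { this.rows = rows; }

    @Override public Iterator<String> iterator() { return rows.iterator(); }

    @Override public void close() { closeCalls++; }
}

class ScannerCloseSketch {
    // Drains the reader; try-with-resources closes it even if reading throws.
    static int drain(FakeScanner scanner) {
        int count = 0;
        try (OwningReader reader = new OwningReader(scanner)) {
            while (reader.hasNext()) {
                reader.next();
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        FakeScanner scanner = new FakeScanner(Arrays.asList("r1", "r2", "r3"));
        int rows = drain(scanner);
        System.out.println(rows + " rows read, close calls: " + scanner.closeCalls);
    }
}
```

A counting fake like `FakeScanner` is also one cheap way to unit test that the close actually happens, without standing up a cluster.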

As far as testing, you can try to take a look at That has a good
example of running a "mapreduce" job over a MiniAccumuloCluster. You could also check out
the MiniMRYarnCluster class provided by Hadoop/YARN itself which should hopefully give you
a more realistic test environment (as opposed to using the local MR runner).
also has some examples of running MR jobs against Accumulo which you could look at. In short,
the more you can unit test, the better. A test or two to ensure high-level functionality would
be good IMO.

> BatchScanner optimization for AccumuloInputFormat
> -------------------------------------------------
>                 Key: ACCUMULO-3602
>                 URL:
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: client
>    Affects Versions: 1.6.1, 1.6.2
>            Reporter: Eugene Cheipesh
>            Assignee: Eugene Cheipesh
>              Labels: performance
> Currently {{AccumuloInputFormat}} produces a split for each {{Range}} specified in the
configuration. Some table indexing schemes, for instance a z-order geospatial index, produce
a large number of small ranges, resulting in a large number of splits. This is specifically a concern
when using {{AccumuloInputFormat}} as a source for Spark RDD where each Split is mapped to
an RDD partition.
> A large number of small RDD partitions leads to poor parallelism on read and high overhead
on processing. A desirable alternative is to group ranges by tablet into a single split and
use {{BatchScanner}} to produce the records. Grouping by tablets is useful because it represents
Accumulo's attempt to distribute stored records and can be influenced by the user through table
> The grouping functionality already exists in the internal {{TabletLocator}} class. 
> The current proposal is to modify {{AbstractInputFormat}} such that it generates either {{RangeInputSplit}}
or {{MultiRangeInputSplit}} based on a new setting in {{InputConfigurator}}.  {{AccumuloInputFormat}}
would then be able to inspect the type of the split and instantiate an appropriate reader.
> The functionality of {{TabletLocator}} should be exposed as a public API in 1.7 as it
is useful for optimizations.
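
The grouping idea in the description can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the {{TabletLocator}} implementation: rows are plain strings, tablets are described only by a sorted list of end rows (the last tablet is unbounded), and `RowRange`, `TabletGroupingSketch`, `tabletIndex` and `groupByTablet` are hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical row range with inclusive bounds; not Accumulo's Range class.
final class RowRange {
    final String start; // inclusive
    final String end;   // inclusive

    RowRange(String start, String end) {
        this.start = start;
        this.end = end;
    }

    @Override public String toString() { return "[" + start + "," + end + "]"; }
}

class TabletGroupingSketch {
    // Index of the tablet holding `row`. Tablet i covers rows greater than
    // split i-1 and less than or equal to split i; the last tablet is open-ended.
    static int tabletIndex(List<String> sortedSplits, String row) {
        int pos = Collections.binarySearch(sortedSplits, row);
        // Found: the row is a tablet's end row. Not found: use the insertion point.
        return pos >= 0 ? pos : -(pos + 1);
    }

    // Bins every range into each tablet it overlaps. Each map entry would
    // correspond to one input split carrying several ranges.
    static Map<Integer, List<RowRange>> groupByTablet(List<String> sortedSplits,
                                                      List<RowRange> ranges) {
        Map<Integer, List<RowRange>> groups = new TreeMap<>();
        for (RowRange r : ranges) {
            int first = tabletIndex(sortedSplits, r.start);
            int last = tabletIndex(sortedSplits, r.end);
            for (int t = first; t <= last; t++) {
                groups.computeIfAbsent(t, k -> new ArrayList<>()).add(r);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("g", "n"); // three tablets
        List<RowRange> ranges = Arrays.asList(
            new RowRange("a", "b"), new RowRange("c", "d"),
            new RowRange("f", "h"), new RowRange("p", "q"));
        // Four ranges collapse into three groups, one per tablet.
        System.out.println(groupByTablet(splits, ranges));
    }
}
```

Note that the range `[f,h]` straddles a tablet boundary and therefore lands in two groups. The sketch does not clip it, so a real implementation would need to restrict each copy to its tablet's extent to avoid reading the same records twice.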

This message was sent by Atlassian JIRA
