accumulo-notifications mailing list archives

From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-3602) BatchScanner optimization for AccumuloInputFormat
Date Wed, 08 Apr 2015 21:25:13 GMT


ASF GitHub Bot commented on ACCUMULO-3602:

Github user joshelser commented on the pull request:
    > > Agreed, but is this a new issue?
    > I think the intent of this PR is new. I think the intent is to efficiently support
a MapReduce job that reads from many small ranges. If we are going to do that, then let's
do it in a way that scales better. 
    Re-reading the issue description, yes, you are correct. However, what Eugene has already
done is already an improvement over what currently exists, IMO. Assuming that the majority
of Ranges that his Spark workflow creates are non-overlapping (a hunch given the subject matter
-- very small boxes of interest), this still reduces the total memory use of splits and Ranges.
The worst case would be numerous very large ranges. We're never going to have the ideal solution
the first time around -- I just don't want this to stall because we come up with a different
solution to an already solved problem.
    I'm happy to continue the discussion on dealing with large numbers of ranges efficiently;
I'm just not convinced this is the place to do it.

> BatchScanner optimization for AccumuloInputFormat
> -------------------------------------------------
>                 Key: ACCUMULO-3602
>                 URL:
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: client
>    Affects Versions: 1.6.1, 1.6.2
>            Reporter: Eugene Cheipesh
>            Assignee: Eugene Cheipesh
>              Labels: performance
>             Fix For: 1.7.0
> Currently {{AccumuloInputFormat}} produces a split for each {{Range}} specified in the
configuration. Some table indexing schemes, for instance a z-order geospatial index, produce
a large number of small ranges, resulting in a large number of splits. This is specifically a concern
when using {{AccumuloInputFormat}} as a source for a Spark RDD, where each split is mapped to
an RDD partition.
> A large number of small RDD partitions leads to poor parallelism on read and high overhead
on processing. A desirable alternative is to group ranges by tablet into a single split and
use {{BatchScanner}} to produce the records. Grouping by tablets is useful because it represents
Accumulo's attempt to distribute stored records and can be influenced by the user through table
> The grouping functionality already exists in the internal {{TabletLocator}} class. 
> The current proposal is to modify {{AbstractInputFormat}} such that it generates either {{RangeInputSplit}}
or {{MultiRangeInputSplit}} based on a new setting in {{InputConfigurator}}.  {{AccumuloInputFormat}}
would then be able to inspect the type of the split and instantiate an appropriate reader.
> The functionality of {{TabletLocator}} should be exposed as a public API in 1.7 as it
is useful for optimizations.
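
For illustration only, the grouping step described above can be sketched in plain Java without any Accumulo classes. This is not the code from the pull request and does not use {{TabletLocator}}: the names RangeGrouping, RowRange, groupByTablet and LAST_TABLET are hypothetical, rows are plain strings, ranges are inclusive on both ends, and tablets are assumed to be identified by their end rows. A range that crosses a tablet boundary is listed under every tablet it touches. In a real job, each resulting group would presumably become one split whose ranges are handed to a {{BatchScanner}} via setRanges.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

class RangeGrouping {
    /** Hypothetical label for the final tablet, which has no end row. */
    static final String LAST_TABLET = "<";

    /** Inclusive row range [start, end]; a simplified stand-in for Accumulo's Range. */
    static final class RowRange {
        final String start;
        final String end;

        RowRange(String start, String end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + "," + end + "]";
        }
    }

    /**
     * Groups ranges by the tablet(s) they overlap. A tablet with end row E is
     * assumed to hold every row greater than the previous end row and up to
     * and including E.
     */
    static Map<String, List<RowRange>> groupByTablet(NavigableSet<String> tabletEndRows,
                                                     List<RowRange> ranges) {
        Map<String, List<RowRange>> groups = new TreeMap<>();
        for (RowRange range : ranges) {
            boolean covered = false;
            // Smallest tablet end row that is >= the start of the range.
            String first = tabletEndRows.ceiling(range.start);
            if (first != null) {
                for (String endRow : tabletEndRows.tailSet(first, true)) {
                    groups.computeIfAbsent(endRow, k -> new ArrayList<>()).add(range);
                    if (endRow.compareTo(range.end) >= 0) {
                        covered = true; // this tablet holds the end of the range
                        break;
                    }
                }
            }
            if (!covered) {
                // The range reaches past the last split point into the final tablet.
                groups.computeIfAbsent(LAST_TABLET, k -> new ArrayList<>()).add(range);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        NavigableSet<String> tabletEndRows = new TreeSet<>(Arrays.asList("g", "n"));
        List<RowRange> ranges = Arrays.asList(
            new RowRange("a", "b"), new RowRange("c", "d"), new RowRange("h", "i"),
            new RowRange("m", "p"), new RowRange("x", "z"));
        // One map entry per tablet, i.e. one candidate split per tablet.
        groupByTablet(tabletEndRows, ranges)
            .forEach((tablet, group) -> System.out.println(tablet + " -> " + group));
    }
}
```

With the two split points and five ranges in main, the sketch yields three groups rather than five per-range splits, which is the reduction the description is after.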

This message was sent by Atlassian JIRA
