accumulo-notifications mailing list archives

From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-3602) BatchScanner optimization for AccumuloInputFormat
Date Wed, 08 Apr 2015 20:53:12 GMT


ASF GitHub Bot commented on ACCUMULO-3602:

Github user keith-turner commented on the pull request:
    I was discussing the big picture behind this PR with @ctubbsii. It seems like this change
could encourage users to pass many ranges as configuration for the MapReduce job. This
could cause memory exhaustion for the job tracker.
    We discussed passing a function which generates a set of ranges, instead of passing lots
of ranges. The implementation would still use a batch scanner (or a scanner with a special
iterator, but it's harder to pass code to the tserver). Each input split could call a function
like the following, which deterministically creates a set of ranges. Those ranges could
then be used for the batch scanner.
    interface RangeGenerator {
      /**
       * @param tabletRange  The data range for the tablet over which the input split is executing
       * @param config a mysterious class that allows user to pass parameters to the function
       */
      List<Range> createRanges(Range tabletRange, Myst config);
    }
    When configuring the AccumuloInputFormat to use the batch scanner, the name of a class that implements
this interface would be provided. The ranges set on the job would be large ranges covering the portions
of the table to process. An input split would be created for each tablet that falls within those
large ranges, and for each input split the function would be called to possibly create many
more ranges.
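
To make the idea above concrete, here is a minimal sketch of what such a generator could look like. It is not Accumulo code: `RowRange` is a simplified stand-in for Accumulo's `Range`, a plain `Map` stands in for the unspecified config class, and `ExactRowGenerator` and the `rows` key are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class RangeGeneratorSketch {

  /** Simplified stand-in for Accumulo's Range: an inclusive [start, end] span of row IDs. */
  public static final class RowRange {
    public final String start;
    public final String end;

    public RowRange(String start, String end) {
      this.start = start;
      this.end = end;
    }

    boolean contains(String row) {
      return row.compareTo(start) >= 0 && row.compareTo(end) <= 0;
    }
  }

  /** The proposed interface, with a plain Map standing in for the unspecified config class. */
  public interface RangeGenerator {
    List<RowRange> createRanges(RowRange tabletRange, Map<String, String> config);
  }

  /** Hypothetical generator: one single-row range per configured row that lies in the tablet. */
  public static final class ExactRowGenerator implements RangeGenerator {
    @Override
    public List<RowRange> createRanges(RowRange tabletRange, Map<String, String> config) {
      List<RowRange> ranges = new ArrayList<>();
      for (String row : config.getOrDefault("rows", "").split(",")) {
        if (!row.isEmpty() && tabletRange.contains(row)) {
          ranges.add(new RowRange(row, row));
        }
      }
      return ranges;
    }
  }

  public static void main(String[] args) {
    RangeGenerator generator = new ExactRowGenerator();
    // The job configuration carries only a compact parameter, not the expanded ranges.
    Map<String, String> config = Map.of("rows", "a,h,m,z");
    // Pretend this input split covers a tablet holding rows "g" through "p".
    for (RowRange r : generator.createRanges(new RowRange("g", "p"), config)) {
      System.out.println("[" + r.start + ", " + r.end + "]");
    }
  }
}
```

The point of the design is that range expansion happens inside each input split, so the job configuration only has to carry the generator's class name and its parameters. A real generator would presumably derive its ranges from something more compact than a row list, such as a query region for a z-order index.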

> BatchScanner optimization for AccumuloInputFormat
> -------------------------------------------------
>                 Key: ACCUMULO-3602
>                 URL:
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: client
>    Affects Versions: 1.6.1, 1.6.2
>            Reporter: Eugene Cheipesh
>            Assignee: Eugene Cheipesh
>              Labels: performance
>             Fix For: 1.7.0
> Currently {{AccumuloInputFormat}} produces a split for each {{Range}} specified in the
configuration. Some table indexing schemes, for instance a z-order geospatial index, produce
a large number of small ranges, resulting in a large number of splits. This is specifically a concern
when using {{AccumuloInputFormat}} as a source for a Spark RDD, where each split is mapped to
an RDD partition.
> A large number of small RDD partitions leads to poor parallelism on read and high overhead
on processing. A desirable alternative is to group ranges by tablet into a single split and
use {{BatchScanner}} to produce the records. Grouping by tablets is useful because it represents
Accumulo's attempt to distribute stored records and can be influenced by the user through table
> The grouping functionality already exists in the internal {{TabletLocator}} class.
> The current proposal is to modify {{AbstractInputFormat}} such that it generates either {{RangeInputSplit}}
or {{MultiRangeInputSplit}} based on a new setting in {{InputConfigurator}}.  {{AccumuloInputFormat}}
would then be able to inspect the type of the split and instantiate an appropriate reader.
> The functionality of {{TabletLocator}} should be exposed as a public API in 1.7 as it
is useful for optimizations.
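
The grouping step described in the issue can be illustrated with a small self-contained sketch. This is not the {{TabletLocator}} implementation; it assumes rows are plain strings, ranges are inclusive {start, end} pairs, and tablets are defined by a sorted list of split points. The class and method names (`TabletGroupingSketch`, `tabletIndex`, `groupByTablet`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TabletGroupingSketch {

  /** Index of the tablet holding row: the first split point >= row, or splits.size() for the last tablet. */
  static int tabletIndex(List<String> sortedSplits, String row) {
    int i = 0;
    while (i < sortedSplits.size() && sortedSplits.get(i).compareTo(row) < 0) {
      i++;
    }
    return i;
  }

  /** Groups inclusive {start, end} row ranges by tablet; a range spanning several tablets is listed under each. */
  static Map<Integer, List<String[]>> groupByTablet(List<String> sortedSplits, List<String[]> ranges) {
    Map<Integer, List<String[]>> byTablet = new TreeMap<>();
    for (String[] range : ranges) {
      int first = tabletIndex(sortedSplits, range[0]);
      int last = tabletIndex(sortedSplits, range[1]);
      for (int tablet = first; tablet <= last; tablet++) {
        byTablet.computeIfAbsent(tablet, k -> new ArrayList<>()).add(range);
      }
    }
    return byTablet;
  }

  public static void main(String[] args) {
    // Split points "g" and "p" define three tablets: (-inf, g], (g, p], (p, +inf).
    List<String> splits = List.of("g", "p");
    List<String[]> ranges = List.of(
        new String[] {"a", "b"}, new String[] {"c", "d"},
        new String[] {"f", "h"}, new String[] {"x", "y"});
    // Each map entry would become one multi-range input split read with a batch scanner.
    groupByTablet(splits, ranges).forEach((tablet, rs) ->
        System.out.println("tablet " + tablet + ": " + rs.size() + " range(s)"));
  }
}
```

With one split per range, the four ranges in this example would produce four input splits. Grouped by tablet they produce three, and the number of splits stays bounded by the number of tablets no matter how many small ranges the index generates.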

This message was sent by Atlassian JIRA
