accumulo-notifications mailing list archives

From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-3602) BatchScanner optimization for AccumuloInputFormat
Date Tue, 31 Mar 2015 22:56:54 GMT


ASF GitHub Bot commented on ACCUMULO-3602:

GitHub user echeipesh opened a pull request:

    ACCUMULO-3602 BatchScanner optimization for AccumuloInputFormat


You can merge this pull request into a Git repository by running:

    $ git pull ACCUMULO-3602

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #25

commit 8da593302e2c702d8a775752d5f5620d2ddb3345
Author: Eugene Cheipesh <>
Date:   2015-03-23T20:43:46Z

    Added support for grouping ranges per tablet when using AccumuloInputFormat

commit 47a2fae8423ab383364228469055904803751c1a
Author: Eugene Cheipesh <>
Date:   2015-03-25T15:08:20Z

    Avoid casts by using AccumuloInputSplit interface

commit b4101a37949271a1b388ac5bd3f1b1d14e00d272
Author: Eugene Cheipesh <>
Date:   2015-03-25T15:10:05Z

    upgrade warning to exception if batch scan is requested but not available

commit ad83769d96224c55f909ae4ba0d6482c7555bf1c
Author: Eugene Cheipesh <>
Date:   2015-03-25T15:11:24Z

    close underlying scanner with the RecordReader

commit 46a45c3582289730905ee4a68fa3fcf8f5245ccd
Author: Eugene Cheipesh <>
Date:   2015-03-31T19:08:37Z

    Upgrade AccumuloInputSplit to abstract class for code reuse

commit 7a09974b4ae731c78e7a99ec94d02c3a7a5ad0b1
Author: Eugene Cheipesh <>
Date:   2015-03-31T19:53:58Z

    merging upstream master

commit 41e79cf9b71b48be9acf2a44a44a523476c45024
Author: Eugene Cheipesh <>
Date:   2015-03-31T20:00:42Z

    fix merge errors

commit 7b3beebb2ba023e8862f45c89b61435a7b404368
Author: Eugene Cheipesh <>
Date:   2015-03-31T20:32:06Z

    add batch scan setters/getters to AccumuloInputFormat

commit c84c8c035414d6f32a864b573f7f6a1f6c16551a
Author: Eugene Cheipesh <>
Date:   2015-03-31T21:47:17Z

    fix casting error

commit 35495142d007303cbba3c083a07abbe1a859e289
Author: Eugene Cheipesh <>
Date:   2015-03-31T21:47:35Z

    make findbugs happy

commit 7ac1cef68d651b9c19bfc57db28a718630ae37d1
Author: Eugene Cheipesh <>
Date:   2015-03-31T21:47:48Z

    test BatchInputSplit generation

commit 9a477220a205816aa8641d2b362c3b2c395c83f0
Author: Eugene Cheipesh <>
Date:   2015-03-31T21:58:26Z

    match code flow between mapred and mapreduce
    also: killing dead code for “iterators”

commit 23f6d400b4794ea8c30570f51fcdae09ba33a17c
Author: Eugene Cheipesh <>
Date:   2015-03-31T22:21:59Z

    test all splits for correct type, not just head

commit 17fa276c8dc06c6036bc7a4c9ccc2a8cf23f90e9
Author: Eugene Cheipesh <>
Date:   2015-03-31T22:55:32Z

    Test MR job with BatchScan option enable


> BatchScanner optimization for AccumuloInputFormat
> -------------------------------------------------
>                 Key: ACCUMULO-3602
>                 URL:
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: client
>    Affects Versions: 1.6.1, 1.6.2
>            Reporter: Eugene Cheipesh
>            Assignee: Eugene Cheipesh
>              Labels: performance
> Currently {{AccumuloInputFormat}} produces a split for each {{Range}} specified in the
configuration. Some table indexing schemes, for instance a z-order geospatial index, produce
a large number of small ranges, resulting in a large number of splits. This is a particular concern
when using {{AccumuloInputFormat}} as a source for a Spark RDD, where each split is mapped to
an RDD partition.
> A large number of small RDD partitions leads to poor parallelism on read and high overhead
on processing. A desirable alternative is to group ranges by tablet into a single split and
use {{BatchScanner}} to produce the records. Grouping by tablet is useful because it reflects
Accumulo's attempt to distribute stored records and can be influenced by the user through table
> The grouping functionality already exists in the internal {{TabletLocator}} class. 
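
To make the grouping idea concrete, here is a minimal self-contained sketch. It does not use
{{TabletLocator}} or any Accumulo class. Everything in it (TabletRangeGrouping, RowRange,
groupByTablet, LAST_TABLET) is a hypothetical stand-in, and it assumes tablets are described
only by their inclusive end rows. A range that spans several tablets is added to each of them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: groups row ranges by the tablets they overlap.
// Not Accumulo code; tablets are modeled only by their inclusive end rows.
final class TabletRangeGrouping {

    /** Hypothetical stand-in for an Accumulo Range: an inclusive row interval. */
    static final class RowRange {
        final String startRow;
        final String endRow;

        RowRange(String startRow, String endRow) {
            this.startRow = startRow;
            this.endRow = endRow;
        }
    }

    /** Key used for the last tablet, which has no end row. */
    static final String LAST_TABLET = "<last>";

    /**
     * Groups ranges by tablet. splitPoints holds the inclusive end row of
     * every tablet except the last one.
     */
    static Map<String, List<RowRange>> groupByTablet(TreeSet<String> splitPoints,
                                                     List<RowRange> ranges) {
        Map<String, List<RowRange>> groups = new TreeMap<>();
        for (RowRange range : ranges) {
            // First tablet whose end row is >= the range start; null means the last tablet.
            String tablet = splitPoints.ceiling(range.startRow);
            while (true) {
                String key = (tablet == null) ? LAST_TABLET : tablet;
                groups.computeIfAbsent(key, k -> new ArrayList<>()).add(range);
                // Stop at the last tablet, or once this tablet covers the range end.
                if (tablet == null || tablet.compareTo(range.endRow) >= 0) {
                    break;
                }
                tablet = splitPoints.higher(tablet);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        TreeSet<String> splitPoints = new TreeSet<>(Arrays.asList("g", "p"));
        List<RowRange> ranges = Arrays.asList(
            new RowRange("a", "b"), new RowRange("c", "d"),
            new RowRange("f", "h"), new RowRange("q", "z"));
        // Print how many ranges landed in each tablet group.
        groupByTablet(splitPoints, ranges).forEach(
            (tablet, group) -> System.out.println(tablet + ": " + group.size() + " range(s)"));
    }
}
```

In this model each map entry would become one split, so the number of splits is bounded by the
number of tablets touched rather than by the number of ranges.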
> The current proposal is to modify {{AbstractInputFormat}} so that it generates either {{RangeInputSplit}}
or {{MultiRangeInputSplit}} based on a new setting in {{InputConfigurator}}. {{AccumuloInputFormat}}
would then be able to inspect the type of the split and instantiate an appropriate reader.
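
The split selection described above could look roughly like the following sketch. It is
self-contained and does not touch the real Accumulo or Hadoop classes: SplitSelectionSketch,
SketchSplit, SingleRangeSplit, MultiRangeSplit, createSplits and readerFor are hypothetical
names, ranges are plain strings, and the "reader" is just a label naming the scanner type
that would be used.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of choosing a split type from a setting and a reader from the split type.
final class SplitSelectionSketch {

    /** Common base type, so callers can handle both split kinds uniformly. */
    abstract static class SketchSplit {
        abstract List<String> ranges();
    }

    /** One range per split: the current behavior described in the issue. */
    static final class SingleRangeSplit extends SketchSplit {
        private final String range;

        SingleRangeSplit(String range) {
            this.range = range;
        }

        @Override
        List<String> ranges() {
            return List.of(range);
        }
    }

    /** All ranges of one tablet in a single split: the proposed behavior. */
    static final class MultiRangeSplit extends SketchSplit {
        private final List<String> ranges;

        MultiRangeSplit(List<String> ranges) {
            this.ranges = List.copyOf(ranges);
        }

        @Override
        List<String> ranges() {
            return ranges;
        }
    }

    /** Builds splits from ranges already grouped by tablet, depending on the setting. */
    static List<SketchSplit> createSplits(Map<String, List<String>> rangesByTablet,
                                          boolean batchScan) {
        List<SketchSplit> splits = new ArrayList<>();
        for (List<String> tabletRanges : rangesByTablet.values()) {
            if (batchScan) {
                splits.add(new MultiRangeSplit(tabletRanges));
            } else {
                for (String range : tabletRanges) {
                    splits.add(new SingleRangeSplit(range));
                }
            }
        }
        return splits;
    }

    /** Inspects the split type and returns the name of the scanner a reader would use. */
    static String readerFor(SketchSplit split) {
        if (split instanceof MultiRangeSplit) {
            return "BatchScanner";
        }
        return "Scanner";
    }

    public static void main(String[] args) {
        Map<String, List<String>> byTablet = new LinkedHashMap<>();
        byTablet.put("tablet1", List.of("a-b", "c-d"));
        byTablet.put("tablet2", List.of("q-z"));
        for (SketchSplit split : createSplits(byTablet, true)) {
            System.out.println(readerFor(split) + " over " + split.ranges());
        }
    }
}
```

Keeping both split kinds under one base type means the reader selection is the only place
that needs to know which kind it received.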
> The functionality of {{TabletLocator}} should be exposed as a public API in 1.7 as it
is useful for optimizations.

This message was sent by Atlassian JIRA
