accumulo-notifications mailing list archives

From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-3602) BatchScanner optimization for AccumuloInputFormat
Date Fri, 10 Apr 2015 18:15:13 GMT


ASF GitHub Bot commented on ACCUMULO-3602:

Github user echeipesh commented on the pull request:
    @keith-turner I don't think the range decomposition interface approach would handle my
case without some kind of intermediate storage level, as I am reading it anyway. The Range
-> List<Range> interface does not have enough parameters to determine the query plan.
Also, if the query is user-generated, then a named class cannot be present to encapsulate
the parameters.
    From this standpoint it seems there are two options:
    - Use some persistence layer that would be populated by the query planner
      - Perhaps the Configuration object would still need a value to find the correct location
    - Expand the interface to include some kind of `QueryConfiguration` interface that
could be implemented by the user and would be passed in.
    The second approach sounds more interesting but it's hard to judge it since I don't think
I have that problem. 
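
As a rough illustration of the second option, here is a minimal sketch of a decomposition interface that also receives user-supplied parameters. Every name in it (`RowRange`, `RangeDecomposer`, `EvenDecomposer`, `maxRangesPerInput`) is hypothetical and is not part of Accumulo or of this PR; only the `QueryConfiguration` name comes from the comment above, and its single method is an assumption. `RowRange` is a plain stand-in for Accumulo's `Range` so the sketch compiles without Accumulo on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Accumulo's Range: a half-open interval [start, end) over numeric row ids.
final class RowRange {
    final long start;
    final long end;

    RowRange(long start, long end) {
        if (end < start) {
            throw new IllegalArgumentException("end must not be before start");
        }
        this.start = start;
        this.end = end;
    }
}

// Hypothetical: query-time parameters implemented by the user and passed to the decomposer.
interface QueryConfiguration {
    int maxRangesPerInput();
}

// Hypothetical expansion of the Range -> List<Range> interface with an extra parameter.
interface RangeDecomposer {
    List<RowRange> decompose(RowRange range, QueryConfiguration config);
}

// Example implementation: cut a range into at most maxRangesPerInput() pieces of equal width.
final class EvenDecomposer implements RangeDecomposer {
    @Override
    public List<RowRange> decompose(RowRange range, QueryConfiguration config) {
        int pieces = Math.max(1, config.maxRangesPerInput());
        long width = range.end - range.start;
        List<RowRange> out = new ArrayList<>();
        if (width == 0) {
            out.add(range); // nothing to split
            return out;
        }
        long step = Math.max(1, (width + pieces - 1) / pieces); // ceiling division
        for (long s = range.start; s < range.end; s += step) {
            out.add(new RowRange(s, Math.min(range.end, s + step)));
        }
        return out;
    }
}
```

Because the configuration is an interface rather than a named class, a caller could pass the parameters inline, for example `new EvenDecomposer().decompose(new RowRange(0, 100), () -> 4)`. How such an object would be serialized into the job configuration is left open here, as it is in the discussion.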
    To clarify, what is the underlying concern: that the Configuration object gets too large
and the (de)serialization becomes too expensive or not even possible?
    My particular use case reflects the fact that small ranges introduce too much
overhead when they translate to mappers (in MR) and partitions (in Spark). So it would be
nice if there were some kind of minimum batch; a tablet maps nicely to that.
    The only other addition to this problem, which is not explored in this PR, is to divide
the sequence of ranges per tablet server into a user-configurable number of "balanced" groups.
Balanced in this case could mean something like: an approximately equal number of possible row
results. The idea would be to match an "x mappers per host" design. It sounds like it would be
great, but I haven't tested anything like that yet. Could that be another PR if it works out?
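
One way the balancing idea could look, purely as a sketch: a greedy heuristic that assigns each range, largest first, to whichever group currently has the smallest estimated row total. This is not something the PR implements. `WeightedRange`, `RangeBalancer`, and the `estimatedRows` field are hypothetical names, and where a row estimate would come from is an open question the comment does not address.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical: a range paired with an estimate of how many rows it may return.
final class WeightedRange {
    final String name;
    final long estimatedRows;

    WeightedRange(String name, long estimatedRows) {
        this.name = name;
        this.estimatedRows = estimatedRows;
    }
}

final class RangeBalancer {
    // Split the ranges of one tablet server into `groups` lists with roughly equal row totals.
    static List<List<WeightedRange>> balance(List<WeightedRange> ranges, int groups) {
        if (groups < 1) {
            throw new IllegalArgumentException("groups must be at least 1");
        }
        List<List<WeightedRange>> out = new ArrayList<>();
        for (int i = 0; i < groups; i++) {
            out.add(new ArrayList<>());
        }
        long[] totals = new long[groups];

        // Largest ranges first, so the small ones can even out the totals at the end.
        List<WeightedRange> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong((WeightedRange r) -> r.estimatedRows).reversed());

        for (WeightedRange r : sorted) {
            int lightest = 0;
            for (int i = 1; i < groups; i++) {
                if (totals[i] < totals[lightest]) {
                    lightest = i;
                }
            }
            out.get(lightest).add(r);
            totals[lightest] += r.estimatedRows;
        }
        return out;
    }
}
```

Calling `RangeBalancer.balance(ranges, x)` once per tablet server would yield x groups per host, each of which could back one mapper or partition. The greedy rule is a heuristic and does not guarantee an optimal split.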

> BatchScanner optimization for AccumuloInputFormat
> -------------------------------------------------
>                 Key: ACCUMULO-3602
>                 URL:
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: client
>    Affects Versions: 1.6.1, 1.6.2
>            Reporter: Eugene Cheipesh
>            Assignee: Eugene Cheipesh
>              Labels: performance
>             Fix For: 1.7.0
> Currently {{AccumuloInputFormat}} produces a split for each {{Range}} specified in the
configuration. Some table indexing schemes, for instance a z-order geospatial index, produce
a large number of small ranges, resulting in a large number of splits. This is specifically a concern
when using {{AccumuloInputFormat}} as a source for a Spark RDD, where each split is mapped to
an RDD partition.
> A large number of small RDD partitions leads to poor parallelism on read and high overhead
on processing. A desirable alternative is to group ranges by tablet into a single split and
use {{BatchScanner}} to produce the records. Grouping by tablets is useful because it represents
Accumulo's attempt to distribute stored records and can be influenced by the user through table
> The grouping functionality already exists in the internal {{TabletLocator}} class. 
> The current proposal is to modify {{AbstractInputFormat}} such that it generates either {{RangeInputSplit}}
or {{MultiRangeInputSplit}} based on a new setting in {{InputConfigurator}}.  {{AccumuloInputFormat}}
would then be able to inspect the type of the split and instantiate an appropriate reader.
> The functionality of {{TabletLocator}} should be exposed as a public API in 1.7 as it
is useful for optimizations.
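
The grouping described in the issue can be illustrated with a small, self-contained sketch. It does not use {{TabletLocator}} or any other Accumulo class: `SimpleRange`, `TabletGrouper`, and the `LAST_TABLET` sentinel are hypothetical, rows are plain strings, and a tablet is assumed to cover the rows after the previous split point up to and including its own. The point is only to show many small ranges collapsing into one list per tablet, where each list could then back a single split.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical stand-in for a row range with inclusive start and end rows.
final class SimpleRange {
    final String startRow;
    final String endRow;

    SimpleRange(String startRow, String endRow) {
        this.startRow = startRow;
        this.endRow = endRow;
    }
}

final class TabletGrouper {
    // Hypothetical key for the final tablet, which has no end row.
    static final String LAST_TABLET = "<last>";

    // Group ranges by the tablet(s) they overlap, keyed by tablet end row.
    // A range that spans several tablets is listed under each of them.
    static Map<String, List<SimpleRange>> groupByTablet(TreeSet<String> splitPoints,
                                                        List<SimpleRange> ranges) {
        Map<String, List<SimpleRange>> byTablet = new TreeMap<>();
        for (SimpleRange r : ranges) {
            // The tablet holding startRow ends at the first split point >= startRow.
            String tabletEnd = splitPoints.ceiling(r.startRow);
            while (true) {
                String key = (tabletEnd == null) ? LAST_TABLET : tabletEnd;
                byTablet.computeIfAbsent(key, k -> new ArrayList<>()).add(r);
                if (tabletEnd == null || tabletEnd.compareTo(r.endRow) >= 0) {
                    break; // the range ends inside this tablet
                }
                tabletEnd = splitPoints.higher(tabletEnd); // move on to the next tablet
            }
        }
        return byTablet;
    }
}
```

With split points `g` and `n`, the ranges `a..c`, `d..f`, `e..k`, and `p..z` fall into three groups instead of four separate splits, with `e..k` appearing under both the `g` and `n` tablets. In the real proposal each group would be handed to a {{BatchScanner}} rather than scanned range by range.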

This message was sent by Atlassian JIRA
