accumulo-dev mailing list archives

From echeipesh <...@git.apache.org>
Subject [GitHub] accumulo pull request: ACCUMULO-3602 BatchScanner optimization for...
Date Fri, 10 Apr 2015 18:14:27 GMT
Github user echeipesh commented on the pull request:

    https://github.com/apache/accumulo/pull/25#issuecomment-91641526
  
    @keith-turner I don't think the range decomposition interface approach would handle my
case without some kind of intermediate storage level, at least as I read it. The
`Range -> List<Range>` interface does not have enough parameters to determine the query plan.
Also, if the query is user generated, then a named class cannot be present to encapsulate
the parameters.
    
    From this standpoint it seems there are two options:
    - Use some persistence layer that would be populated by the query planner
      - Perhaps the Configuration object would still need a value to find the correct location
    - Expand the interface to include some kind of `QueryConfiguration` interface that
could be implemented by the user and would be passed in.
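
    A minimal sketch of what the second option might look like. Everything here apart from
the `QueryConfiguration` name is hypothetical: `RangeDecomposer`, `LongRange` (a simplified
stand-in for an Accumulo `Range`), the `get(key, defaultValue)` method, and the `chunk.size`
key are made up for illustration and are not part of this PR or of the Accumulo API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical: user-implemented parameters passed to the decomposer at planning time. */
interface QueryConfiguration {
    String get(String key, String defaultValue);
}

/** Hypothetical stand-in for a key range: half-open [start, end) over longs. */
final class LongRange {
    final long start;
    final long end;

    LongRange(long start, long end) {
        this.start = start;
        this.end = end;
    }
}

/** Hypothetical expanded interface: (Range, QueryConfiguration) -> List<Range>. */
interface RangeDecomposer {
    List<LongRange> decompose(LongRange range, QueryConfiguration conf);
}

/** Example decomposer: splits a range into chunks of at most "chunk.size" keys. */
final class FixedSizeDecomposer implements RangeDecomposer {
    @Override
    public List<LongRange> decompose(LongRange range, QueryConfiguration conf) {
        // The chunk size comes from the user-supplied configuration, not from the range.
        long chunk = Long.parseLong(conf.get("chunk.size", "100"));
        if (chunk <= 0) {
            throw new IllegalArgumentException("chunk.size must be positive");
        }
        List<LongRange> out = new ArrayList<>();
        for (long s = range.start; s < range.end; s += chunk) {
            out.add(new LongRange(s, Math.min(s + chunk, range.end)));
        }
        return out;
    }
}
```

    Because the sketched `QueryConfiguration` has a single method, a user-generated query could
supply it as a lambda, for example `(key, def) -> "40"`, without needing a named class.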
    
    The second approach sounds more interesting but it's hard to judge it since I don't think
I have that problem. 
    
    To clarify, what is the underlying concern? Is it that the Configuration object gets too
large and the (de)serialization becomes too expensive, or even impossible?
    
    My particular use case reflects the fact that small ranges introduce too much
overhead when they translate to mappers (in MapReduce) and partitions (in Spark). So it would
be nice to have some kind of minimum batch size, and a tablet maps nicely to that.
    
    The only other addition to this problem, which is not explored in this PR, is to divide
the sequence of ranges per tablet server into a user-configurable number of "balanced" groups.
Balanced in this case could mean something like an approximately equal number of possible row
results. The idea would be to match the "x mappers per host" design. It sounds like it would
be great, but I haven't tested anything like that yet. Could be another PR if it works out?
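
    To make the "balanced groups" idea concrete, here is an untested sketch using a greedy
heuristic (largest estimate first, always into the currently lightest group). It is not code
from this PR. `RangeBalancer`, `balance`, and the `estimatedRows` function are hypothetical
names, and how a per-range row estimate would actually be obtained is left open.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.ToLongFunction;

/** Hypothetical helper: splits one tablet server's ranges into roughly equal groups. */
final class RangeBalancer {

    /** A group of ranges plus the running total of their estimated row counts. */
    private static final class Group<T> {
        final List<T> items = new ArrayList<>();
        long weight = 0;
    }

    /**
     * Greedy partitioning: visit ranges from largest to smallest estimate and
     * put each one into the group with the smallest total so far.
     */
    static <T> List<List<T>> balance(List<T> ranges,
                                     ToLongFunction<T> estimatedRows,
                                     int numGroups) {
        if (numGroups <= 0) {
            throw new IllegalArgumentException("numGroups must be positive");
        }
        List<Group<T>> groups = new ArrayList<>();
        PriorityQueue<Group<T>> lightest =
            new PriorityQueue<>(Comparator.comparingLong((Group<T> g) -> g.weight));
        for (int i = 0; i < numGroups; i++) {
            Group<T> g = new Group<>();
            groups.add(g);
            lightest.add(g);
        }

        List<T> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong(estimatedRows).reversed());

        for (T range : sorted) {
            // Remove the lightest group, update it, then re-insert so the queue reorders.
            Group<T> g = lightest.poll();
            g.items.add(range);
            g.weight += estimatedRows.applyAsLong(range);
            lightest.add(g);
        }

        List<List<T>> result = new ArrayList<>();
        for (Group<T> g : groups) {
            result.add(g.items);
        }
        return result;
    }
}
```

    With `numGroups` set to the desired number of mappers per host, each returned group would
become one input split. The heuristic only approximates an even split, so some groups may
end up noticeably heavier than others when a few ranges dominate.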


