accumulo-dev mailing list archives

From joshelser <...@git.apache.org>
Subject [GitHub] accumulo pull request: ACCUMULO-3602 BatchScanner optimization for...
Date Wed, 08 Apr 2015 21:24:35 GMT
Github user joshelser commented on the pull request:

    https://github.com/apache/accumulo/pull/25#issuecomment-91041003
  
    > > Agreed, but is this a new issue?
    
    > I think the intent of this PR is new. I think the intent is to efficiently support
a MapReduce job that reads from many small ranges. If we are going to do that, then let's
do it in a way that scales better.
    
    Re-reading the issue description, yes, you are correct. However, what Eugene has done
is already an improvement over what currently exists, IMO. Assuming that the majority
of Ranges that his Spark workflow creates are non-overlapping (a hunch given the subject matter
-- very small boxes of interest), this still reduces the total memory use of splits and Ranges.
The worst case would be numerous very large ranges. We're never going to have the ideal solution
the first time around -- I just don't want this to stall because we come up with a different
solution to an already solved problem.
    
    I'm happy to continue the discussion for dealing with large numbers of ranges efficiently;
I'm just not convinced this is the place to do it.

