accumulo-dev mailing list archives

From keith-turner <...@git.apache.org>
Subject [GitHub] accumulo pull request: ACCUMULO-3602 BatchScanner optimization for...
Date Fri, 10 Apr 2015 19:19:05 GMT
Github user keith-turner commented on the pull request:

    https://github.com/apache/accumulo/pull/25#issuecomment-91656749
  
    > To clarify, what is the underlying concern, that the Configuration object gets too
large and the (de)serialization becomes too expensive or not even possible?
    
    Yeah, that was my concern. I am also assuming that the Configuration has
to fit into memory both on the client submitting the M/R job and on the jobtracker.
    
    > My particular use case is reflection on the fact that small ranges introduce too
much overhead when they translate to mappers (in MR) and partitions (in spark). So it would
be nice if there is some kind of minimum batch, tablet maps nicely to that.
    
    I think what you are doing to make AIF handle this case well is useful. I have had multiple
users ask about this type of thing before. Your changes make it possible for a map reduce
job to handle a large number of ranges efficiently, but a user can still only pass a
set of ranges that fits in memory. I don't think that problem has to be solved in this PR,
but I think it's useful to think ahead w.r.t. the API. After this change, it would be cool
if we were not limited by the number of ranges that can fit into memory.
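
    To illustrate the kind of API shape I mean, here is a rough sketch, not a proposal for the
actual AIF interface. It pulls ranges lazily from an iterator and hands them out in fixed-size
batches, so only one batch is ever held in memory. `RangeBatcher` and `batches` are hypothetical
names that do not exist in Accumulo, and the element type is left generic so that plain strings
can stand in for ranges.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical helper, not part of Accumulo: groups a stream of ranges into
// batches without ever materializing the full set of ranges.
public class RangeBatcher {

    /**
     * Returns an iterator over batches of at most batchSize elements.
     * Elements are read from the source only when a batch is requested.
     */
    public static <T> Iterator<List<T>> batches(final Iterator<T> source, final int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        return new Iterator<List<T>>() {
            @Override
            public boolean hasNext() {
                return source.hasNext();
            }

            @Override
            public List<T> next() {
                if (!source.hasNext()) {
                    throw new NoSuchElementException();
                }
                // Fill one batch; the last batch may be smaller than batchSize.
                List<T> batch = new ArrayList<T>(batchSize);
                while (batch.size() < batchSize && source.hasNext()) {
                    batch.add(source.next());
                }
                return batch;
            }

            @Override
            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }
}
```

    With something like this, the ranges could come from a file or another table instead of an
in-memory collection, and each batch could be turned into one split or scanned on its own.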


