accumulo-dev mailing list archives

From keith-turner <...@git.apache.org>
Subject [GitHub] accumulo issue #180: Taking a crack at new summarization API
Date Wed, 09 Nov 2016 15:02:08 GMT
Github user keith-turner commented on the issue:

    https://github.com/apache/accumulo/pull/180
  
    > For example, we would want to avoid storing 1M CVs if a user had that many in a table
(for some reason).
    
    I think we should address this issue in some way while considering the following.
    
     * Fetching summaries should be relatively fast.  Gigantic summaries will stymie this
goal.
     * When a user's summarizer does produce a gigantic summary, it would be nice if we helped
them debug it.
    
    I am thinking one way to accomplish these goals is to store gigantic summaries, but only
read summaries under a certain size.  The size of a serialized summary could be written first,
so that when a summary is read, its size is the first piece of information available.  If the
summary is over a certain size, an error could be logged and that file would be treated as if
it had no summary.  We could also add an enum that indicates gigantic summaries were present.
Since the summary is still stored, the user would have a chance to use rfile-info to look at
what's in the summary for debugging.
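    
    A minimal sketch of the size-prefix idea, for illustration only.  This is not Accumulo
code: the class `SizePrefixedSummary`, the `ReadStatus` enum, and the `maxBytes` limit are all
hypothetical names, and the real RFile layout may differ.  It writes the serialized length
first, then on read checks that length before loading any summary bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical helper; none of these names come from the Accumulo code base.
class SizePrefixedSummary {

    // Hypothetical enum reporting whether a usable summary was found.
    enum ReadStatus { PRESENT, TOO_LARGE }

    static final class ReadResult {
        final ReadStatus status;
        final byte[] data; // null when status is TOO_LARGE

        ReadResult(ReadStatus status, byte[] data) {
            this.status = status;
            this.data = data;
        }
    }

    // Always store the summary, however large, with its length written first.
    static void write(DataOutputStream out, byte[] summary) throws IOException {
        out.writeInt(summary.length);
        out.write(summary);
    }

    // Read the length first; only load the bytes when they are under the limit.
    static ReadResult read(DataInputStream in, int maxBytes) throws IOException {
        int size = in.readInt();
        if (size < 0) {
            throw new IOException("Corrupt summary size: " + size);
        }
        if (size > maxBytes) {
            // Log the problem and treat the file as if it had no summary.
            System.err.println("Summary of " + size + " bytes exceeds limit of "
                    + maxBytes + " bytes; ignoring it");
            int remaining = size;
            while (remaining > 0) {
                int skipped = in.skipBytes(remaining);
                if (skipped <= 0) {
                    break; // end of stream reached before the summary ended
                }
                remaining -= skipped;
            }
            return new ReadResult(ReadStatus.TOO_LARGE, null);
        }
        byte[] data = new byte[size];
        in.readFully(data);
        return new ReadResult(ReadStatus.PRESENT, data);
    }
}
```

    Because the oversized bytes are skipped rather than dropped at write time, they remain in
the file for later inspection, while the fast path never allocates a buffer for them.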
    
    We also need to stress in the javadoc that summaries are intended to be small.


