accumulo-dev mailing list archives

From Dylan Hutchison <>
Subject Re: using combiner vs. building stats cache
Date Sat, 29 Aug 2015 04:03:41 GMT
On the BatchWriter: I believe it starts flushing data in the background once
enough mutations have been added to consume half of the max memory set in
BatchWriterConfig.  One approach is to set an appropriate max memory, never
flush manually, and let the BatchWriter handle flushing as data is added to
it.  Make sure to call close() after everything is finished.  Of course,
this does not solve the data-loss problem if your app crashes before
close()...
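As a rough sketch of that approach (assuming the 1.x Connector API; the table name and the `connector` handle are placeholders, and the specific memory/latency values are illustrative, not recommendations):

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeUnit;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;

// Configure the writer's buffer once, then let it flush on its own.
BatchWriterConfig config = new BatchWriterConfig()
    .setMaxMemory(10 * 1024 * 1024)       // background flushing kicks in as this fills
    .setMaxLatency(30, TimeUnit.SECONDS)  // bounds how long a mutation sits unflushed
    .setMaxWriteThreads(4);

// try-with-resources guarantees close(), which flushes remaining mutations.
try (BatchWriter writer = connector.createBatchWriter("mytable", config)) {
  Mutation m = new Mutation("row1");
  m.put("counts", "foo", new Value("1".getBytes(StandardCharsets.UTF_8)));
  writer.addMutation(m);
}
```

Setting a max latency in addition to max memory also narrows the crash window, since mutations are pushed out on a timer even when the buffer is far from full.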

On the other hand, the idea Josh presented is to do something called
pre-summing: if the entry ('foo', 1) is added twice to your buffer before
the buffer is flushed, then why not take the two ('foo', 1) entries out and
replace them with a single ('foo', 2) entry to write to your combining
table?  If you take this approach you will have to use your own buffer
rather than the automatic one inside BatchWriter.  It's a simple concept
but requires extra programming and only makes sense when you really want
that last ounce of performance at scale.  Using an LRU cache with
write-on-eviction is a place to start.  It's also harder to do when the
terms you want to count are spread uniformly at random in your input
documents, since a collision in your buffer is then unlikely unless the
buffer is large or the number of unique terms is small.
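A minimal sketch of that LRU write-on-eviction buffer, in plain Java (the class and method names are hypothetical; the `evicted` list stands in for writing a mutation to the combining table):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Pre-summing buffer: repeated counts for the same term collapse in memory,
// and the least-recently-used entry is written out ("evicted") once the
// buffer exceeds maxEntries.
public class PreSummingBuffer {
  private final int maxEntries;
  private final List<Map.Entry<String, Long>> evicted = new ArrayList<>();
  private final LinkedHashMap<String, Long> cache;

  public PreSummingBuffer(int maxEntries) {
    this.maxEntries = maxEntries;
    // accessOrder=true makes iteration order least-recently-used first
    this.cache = new LinkedHashMap<String, Long>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
        if (size() > PreSummingBuffer.this.maxEntries) {
          // In a real app this is where you'd build and add a Mutation.
          evicted.add(new AbstractMap.SimpleEntry<>(eldest.getKey(), eldest.getValue()));
          return true;
        }
        return false;
      }
    };
  }

  public void add(String term, long count) {
    cache.merge(term, count, Long::sum);  // pre-sum colliding terms
  }

  public List<Map.Entry<String, Long>> evictedEntries() { return evicted; }
  public Map<String, Long> remaining() { return cache; }

  public static void main(String[] args) {
    PreSummingBuffer buf = new PreSummingBuffer(2);
    buf.add("foo", 1);
    buf.add("foo", 1);   // collapses to ("foo", 2) inside the buffer
    buf.add("bar", 1);
    buf.add("baz", 1);   // buffer over capacity: LRU entry "foo" is evicted
    System.out.println("evicted=" + buf.evictedEntries());  // foo written once as 2
  }
}
```

The payoff is in the eviction: the two ('foo', 1) adds leave the buffer as a single ('foo', 2) write, halving the mutations sent to the combining table for that term.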

Conclusion: use the BatchWriter's own mechanisms for an easy and
well-performing solution.  Implement your own buffering (or another way to
do pre-summing) for bonus brownie performance points.

On Fri, Aug 28, 2015 at 4:37 PM, z11373 <> wrote:

> Thanks Josh and Adam!
> My bad, I looked at the code again; actually we only call flush at the
> end (the override function we have is only called at the end), so I have
> another issue here: mutations will be lost if the app crashes. I will
> think more about how to mitigate this issue.
> Thanks for mentioning the batch writer semantics. Luckily in our case
> the count doesn't need to be so accurate, as it's more for the optimizer
> to re-order queries based on cardinality. The stats discrepancy would
> need to be big enough to skew the result; otherwise it won't matter
> much. These are good tips though, and I'll pay attention to them in the
> future.
> Thanks,
> Z
