accumulo-dev mailing list archives

From: Adam Fuchs <afu...@apache.org>
Subject: Re: using combiner vs. building stats cache
Date: Fri, 28 Aug 2015 17:49:24 GMT
Calling flush after every write will probably slow you down by more than
1000x, since the flush call is on the order of 10-100ms. Keeping a buffer
of data at your client and only flushing when the buffer is full is usually
a pretty decent strategy. That way you can replay from the buffer in case
of a client failure. Many upstream processing systems (like Kafka and
Flume) have something like a checkpoint marker that you might be able to
leverage for this purpose.
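
For illustration, here is a minimal sketch of that client-side buffering
pattern. The class name and all of the sizes and thresholds below are
placeholders, not tuned recommendations:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.MutationsRejectedException;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Mutation;

public class BufferedIngest {
  private static final int CHECKPOINT_SIZE = 10000;

  private final BatchWriter writer;
  // Our own copy of not-yet-flushed mutations, so a restarted client can
  // replay everything since the last checkpoint.
  private final List<Mutation> pending = new ArrayList<Mutation>();

  public BufferedIngest(Connector conn, String table)
      throws TableNotFoundException {
    BatchWriterConfig cfg = new BatchWriterConfig()
        .setMaxMemory(50 * 1024 * 1024)      // client-side buffer size
        .setMaxLatency(2, TimeUnit.MINUTES)  // flush at least this often
        .setMaxWriteThreads(4);
    writer = conn.createBatchWriter(table, cfg);
  }

  public void write(Mutation m) throws MutationsRejectedException {
    writer.addMutation(m);
    pending.add(m);
    if (pending.size() >= CHECKPOINT_SIZE) {
      writer.flush();   // one 10-100ms round trip per batch, not per mutation
      pending.clear();  // checkpoint: these mutations are now durable
    }
  }
}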

One tricky issue is that the BatchWriter has an underlying semantic of "at
least once", meaning it is possible under some failure conditions to ingest
data multiple times. With combiners that means your values could end up
being inconsistent. It is not possible to get "once and only once"
semantics with the BatchWriter. Depending on how much you care about your
counts being accurate under these failure modes, this may not be a problem
for you. If it is, you may want to do something a bit more complicated like
write data using bulk imports [1] or implement some type of lambda
architecture [2] to get eventually consistent counts.
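
For the bulk route, the client-side call is small. A sketch, assuming an
existing Connector and that sorted RFiles have already been produced (the
table name and HDFS paths are made up):

// Assumes sorted RFiles were already written to /bulk/files, e.g. by a
// MapReduce job using AccumuloFileOutputFormat. Files are handed to the
// tablet servers once, rather than replayed as individual mutations.
connector.tableOperations().importDirectory(
    "stats",           // destination table (placeholder)
    "/bulk/files",     // directory of RFiles in HDFS
    "/bulk/failures",  // files that cannot be loaded are moved here
    false);            // setTime=false: keep the timestamps in the files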

Cheers,
Adam

[1] https://accumulo.apache.org/1.7/accumulo_user_manual.html#_bulk_ingest
[2] https://en.wikipedia.org/wiki/Lambda_architecture


On Fri, Aug 28, 2015 at 12:08 PM, z11373 <z11373@outlook.com> wrote:

> Thanks Dylan, and late chimer Josh, who is always helpful...
>
> After Dylan's reply, I did a quick experiment:
> 1. Set a SummingCombiner on the table for all scopes (scan, minor and
> major compaction)
> 2. Delete the default vers iterator from the table (just so I could see
> whether the rows got 'combined' or not)
> 3. Insert row id = 'foo' and value = 1
> 4. Insert row id = 'foo' and value = 1
> 5. Scan returns 1 row: 'foo', 2 (correct, as expected)
> 6. Delete the summing combiner, so the table now has no iterators
> 7. Scan the table again: now it returns 2 rows (both are 'foo', 1)
>
> Then I deleted the table and redid all the steps above, except that I
> replaced step #5 with "flush -w". At step #7, it now returns 1 row:
> 'foo', 2 (this is what I want: the combiner result got persisted instead
> of being recalculated every time).
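>
> For reference, here is roughly the Java equivalent of those steps (the
> table name 'stats', the column 'cf:count', and the Connector 'conn' are
> all placeholders):
>
> import java.util.EnumSet;
> import org.apache.accumulo.core.client.BatchWriter;
> import org.apache.accumulo.core.client.BatchWriterConfig;
> import org.apache.accumulo.core.client.Connector;
> import org.apache.accumulo.core.client.IteratorSetting;
> import org.apache.accumulo.core.data.Mutation;
> import org.apache.accumulo.core.data.Value;
> import org.apache.accumulo.core.iterators.Combiner;
> import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
> import org.apache.accumulo.core.iterators.LongCombiner;
> import org.apache.accumulo.core.iterators.user.SummingCombiner;
>
> public class CombinerExperiment {
>   public static void run(Connector conn) throws Exception {
>     // step 1: SummingCombiner; attachIterator defaults to all scopes
>     IteratorSetting iter = new IteratorSetting(10, "sum", SummingCombiner.class);
>     LongCombiner.setEncodingType(iter, LongCombiner.Type.STRING);
>     Combiner.setCombineAllColumns(iter, true);
>     conn.tableOperations().attachIterator("stats", iter);
>
>     // step 2: drop the default versioning iterator
>     conn.tableOperations().removeIterator("stats", "vers",
>         EnumSet.allOf(IteratorScope.class));
>
>     // steps 3-4: insert 'foo' -> 1 twice, with distinct timestamps so
>     // the two cells don't collapse into a single version
>     BatchWriter bw = conn.createBatchWriter("stats", new BatchWriterConfig());
>     for (long ts = 1; ts <= 2; ts++) {
>       Mutation m = new Mutation("foo");
>       m.put("cf", "count", ts, new Value("1".getBytes()));
>       bw.addMutation(m);
>     }
>     bw.close();
>
>     // the "flush -w" variant: force a minor compaction and wait for it,
>     // which persists the combined 'foo', 2 into an RFile
>     conn.tableOperations().flush("stats", null, null, true);
>   }
> }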
>
> Therefore the approach I was considering, writing the snapshot to
> another table (to avoid re-aggregating on every scan), is no longer
> needed, since Accumulo already takes care of this. After compaction
> there will be 1 row for each unique key, holding the aggregated value.
> Cool!
>
> Thanks for the tips, Josh. We are using a BatchWriter, so throughput
> should be better. But I just looked at our code, and it looks like we
> call batchWriter.flush() after every addMutation() call, which doesn't
> seem like a good use of the batch writer...
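>
> Concretely, we are doing the equivalent of this (sketch; "bw" and "m"
> stand in for our actual writer and mutation):
>
> bw.addMutation(m);
> bw.flush();  // forces a 10-100ms round trip for every single mutation
>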
> I am curious how people normally batch their inserts/updates. The
> process may crash, and then we'd unfortunately lose those changes :-(
>
> Thanks,
> Z
