accumulo-dev mailing list archives

From Russ Weeks <>
Subject Re: using combiner vs. building stats cache
Date Thu, 27 Aug 2015 17:42:55 GMT
On Thu, Aug 27, 2015 at 9:33 AM z11373 <> wrote:

> Russ: I like your idea (indeed best of both worlds), so during compaction
> time, we can store that stats info to another table (but this time it will
> be only a single row, hence won't affect query time). So I can add the
> code to insert to another table in the reduce() of my custom combiner,
> right? Or is there a better way?

No, don't trigger the table snapshot or compaction from inside your
combiner. I'd do it as a scheduled task via cron or something like that. A
full major compaction is generally seen as a big job, but it might not be
in this case if most of the data has already been combined. Alternatively,
if you can isolate a range of rows to be compacted you can pass that into
TableOperations.compact to speed things up.
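A range-limited compaction can also be kicked off from the Accumulo shell, which is convenient for a cron job. This is a sketch; the table name "stats" and the row bounds are hypothetical, and the Java equivalent is TableOperations.compact(tableName, startRow, endRow, flush, wait):

```shell
# Compact only the rows from 'foo' up to 'foz' in the (hypothetical)
# "stats" table; -w waits for the compaction to finish.
compact -t stats -b foo -e foz -w
```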

I think the only way to guarantee that your scans of the snapshot are
dealing with totally compacted data is to compact after the snapshot. But I
think if you want both the original table and the snapshot to get the
benefit of compaction, you'd want to compact before the snapshot and accept
the risk that there might be a little bit of uncompacted data in the
snapshot.

Honestly, this is how I *think* it should all work, but there are probably
people on this list who are more familiar with combiners, snapshots and
compaction than me.
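The compact-before-snapshot ordering described above could look like this in the Accumulo shell, using clonetable to take the snapshot (table names are hypothetical; rows ingested between the two commands would remain uncompacted in the clone):

```shell
# Full major compaction of the (hypothetical) "stats" table, waiting
# for it to finish, then snapshot the mostly-compacted table.
compact -t stats -w
clonetable stats stats_snapshot
```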

> Let's say we have a table called TEMP_STATS to which we apply the custom
> combiner.
> During ingestion, we simply insert a row, i.e. ['foo', 1], to the table.
> Next time insert ['foo', 1], and so on. Let's say we have 10 rows of
> 'foo', so reading that word would return 'foo', 10 (thanks to the
> combiner). Now I want to delete only 1 row, so that it'd return 'foo', 9
> instead. What is the best way to do this?

If all you're doing in your stats table is tracking counts, then you could
insert 'foo':-1 and the count will be adjusted correctly. If you're also
tracking mins and maxes, you'll need a different approach... which I would
be fascinated to understand because it seems like a very hard problem.
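The insert-a-negative-cell trick can be sanity-checked outside Accumulo with a toy model of what a summing combiner does at read/compaction time (plain Java, not the real combiner API; names are hypothetical):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class StatsSim {
    // Models a summing combiner: all cells for a key collapse to their sum.
    static long combine(List<Long> cells) {
        long sum = 0;
        for (long v : cells) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Ten inserts of ['foo', 1] ...
        List<Long> cells = new ArrayList<>(Collections.nCopies(10, 1L));
        // ... then one "delete" expressed as a -1 adjustment cell.
        cells.add(-1L);
        System.out.println("foo -> " + combine(cells)); // foo -> 9
    }
}
```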


> One option I can think of is to apply another identifier, i.e. a seq
> number, so it'd insert ['foo', 1, 1], ['foo', 2, 1], and so on (the
> second number will be the seq# and can be stored as the column
> qualifier). Then I have to modify the combiner to make it also return the
> highest seq# (i.e. 'foo', 10, 10). When deleting one item only, I could
> just delete 'foo', :10, and it would only mark that row as deleted. Any
> other better approach?
> Thanks,
> Z
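The seq# scheme quoted above can also be sketched as a toy model in plain Java (not the real combiner API; names are hypothetical). One caveat it makes visible: after deleting the highest seq#, the reported max drops, so a later insert would reuse that seq#.

```java
import java.util.Arrays;
import java.util.NavigableMap;
import java.util.TreeMap;

public class SeqStats {
    // Models a combiner over cells keyed by seq# (the column qualifier):
    // returns {count, highest seq#} for one row.
    static long[] combine(NavigableMap<Integer, Long> cells) {
        long sum = 0;
        for (long v : cells.values()) {
            sum += v;
        }
        int maxSeq = cells.isEmpty() ? 0 : cells.lastKey();
        return new long[] {sum, maxSeq};
    }

    public static void main(String[] args) {
        NavigableMap<Integer, Long> foo = new TreeMap<>();
        for (int seq = 1; seq <= 10; seq++) {
            foo.put(seq, 1L); // insert ['foo', seq, 1]
        }
        System.out.println(Arrays.toString(combine(foo))); // [10, 10]

        // Delete exactly one cell: 'foo' at qualifier 10 (the max seq#).
        foo.remove(foo.lastKey());
        System.out.println(Arrays.toString(combine(foo))); // [9, 9]
    }
}
```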
