accumulo-dev mailing list archives

From Josh Elser <>
Subject Re: another question on summing combiner
Date Fri, 09 Oct 2015 15:28:44 GMT
If you were doing a batch job to just recompute the stats, I'd probably 
make a new table and then rename it, replacing your old stats table. 
The swap itself can be problematic, though: you have to make sure clients 
that are still writing data end up writing to the new table. Can you 
quiesce ingest temporarily?

In short, this is hard to do correctly (there are low-probability edge 
cases that could still leave the table inaccurate). Have you considered 
just running the system for a while and seeing how skewed your stats 
actually get?

It kind of sounds like the easier problem to solve is whether or not 
some record exists in your system and then you can know definitively 
whether or not you need to even process that record again (much less 
update the stats table).
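
To make that last idea concrete, here is a minimal sketch in plain Python, 
with dicts standing in for the main and stats tables. The helper names 
(stat_key_of, put_record, delete_record) and the "foo:1" key layout are 
hypothetical, not anything from Accumulo or from your schema. The check and 
the write are also not atomic here, so a real system with concurrent 
writers would need something stronger.

```python
def stat_key_of(row_key):
    # Hypothetical mapping from a main-table key to the stats key it counts toward.
    return row_key.split(":", 1)[0]


def put_record(main, stats, row_key, value):
    """Insert or update a record; write +1 to stats only if the key is new."""
    is_new = row_key not in main
    main[row_key] = value
    if is_new:
        k = stat_key_of(row_key)
        stats[k] = stats.get(k, 0) + 1  # what a summing combiner does with a +1
    return is_new


def delete_record(main, stats, row_key):
    """Delete a record; write -1 to stats only if the key actually existed."""
    existed = row_key in main
    if existed:
        del main[row_key]
        k = stat_key_of(row_key)
        stats[k] = stats.get(k, 0) - 1
    return existed


main, stats = {}, {}
put_record(main, stats, "foo:1", "a")
put_record(main, stats, "foo:1", "b")   # update of an existing key, not counted again
put_record(main, stats, "bar:1", "c")
delete_record(main, stats, "bar:9")     # key never existed, so no -1 is written
print(stats)  # {'foo': 1, 'bar': 1}
```

With the existence check in place, repeated updates and no-op deletes stop 
skewing the counts, which is the drift the batch job was meant to repair.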

z11373 wrote:
> Revisiting this topic: if I go with option #2, i.e. having a batch job fix
> the stats table, I am no longer sure it will work. The stats table already
> has the summing combiner enabled, so the batch job can't just write the
> recomputed values; they would be added to the existing ones.
> For example:
> Current stats table contains:
> foo     | 2
> bar     | 3
> test    | 1
> The batch job scans the main table and updates the stats table. Let's say
> the actual stats are foo=1, bar=4, test=1; the final stats table would
> then become:
> foo     | 3
> bar     | 7
> test    | 2
> It would be correct if the batch job removed the summing combiner from the
> table, but then another process (not the batch job) might update a particular
> key, overwriting the correct value written by the batch job. We can't tolerate
> the system being offline; otherwise we could refresh the stats during that
> downtime. Any idea how to solve this problem?
> Unfortunately there is an inherent problem with the summing combiner: adding
> a key that already exists in the main table behaves just like an update, but
> my current logic still adds <key>|1 to the stats table, so if we have many
> updates, some values in the stats table will be far off. Deletes are similar:
> deleting a key that doesn't exist is a no-op for the main table, but the app
> logic still adds <key>|-1 to the stats table. This is the reason we're
> thinking of having a batch job to 'fix' the stats table, but that has its
> own problem too :-(
> Thanks,
> Z
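
The arithmetic in the quoted example can be reproduced with a small Python 
sketch that mimics summing-combiner semantics using plain dicts (apply_sum 
is a hypothetical helper, not an Accumulo API). It also shows one possible 
variation that is not proposed in the thread: writing the difference 
between the recomputed and stored values. That variation is only right if 
nothing else writes to the stats table between the read and the write, 
which is exactly the edge case that makes this hard.

```python
def apply_sum(table, updates):
    """Mimic a summing combiner: every written value is added to the stored one."""
    out = dict(table)
    for key, value in updates.items():
        out[key] = out.get(key, 0) + value
    return out


current = {"foo": 2, "bar": 3, "test": 1}   # what the stats table holds now
actual = {"foo": 1, "bar": 4, "test": 1}    # what the batch job recomputed

# Writing the recomputed values directly: the combiner sums them in.
naive = apply_sum(current, actual)
print(naive)      # {'foo': 3, 'bar': 7, 'test': 2}

# Assumption for illustration only: write deltas instead of absolute values.
deltas = {key: actual[key] - current.get(key, 0) for key in actual}
corrected = apply_sum(current, deltas)
print(corrected)  # {'foo': 1, 'bar': 4, 'test': 1}
```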
