accumulo-dev mailing list archives

From Josh Elser <>
Subject Re: using combiner vs. building stats cache
Date Fri, 28 Aug 2015 15:01:43 GMT
Late chime in: Dylan and Russ are on the money. A combiner is the way to go.

And since there was some confusion on the matter: a table with 100 '1' 
values for a given key would require the tablet server to sum those 
values at scan time and return the single sum to the client. After the 
table compacts (assuming a full compaction), those 100 '1' values would 
be rewritten on disk as a single '100' value. The beauty of this is 
that, as an application, you don't have to know whether the values were 
combined by a tablet server before you saw them or whether the value 
came directly from disk. You just get a strongly consistent view of the 
data in your table.
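For concreteness, the setup being described might look like this in the Java client API (a sketch, assuming a Connector named `conn`, a stats table named `stats`, and a column family `counts`; the iterator name "countsum" and priority 10 are arbitrary choices):

```java
// Configure a SummingCombiner on the "counts" column family. Values are
// encoded as strings, so a '1' written by the client sums cleanly.
IteratorSetting setting = new IteratorSetting(10, "countsum", SummingCombiner.class);
SummingCombiner.setEncodingType(setting, LongCombiner.Type.STRING);
SummingCombiner.setColumns(setting,
    Collections.singletonList(new IteratorSetting.Column("counts")));
// Attaching without an explicit scope set applies the combiner at scan,
// minor-compaction, and major-compaction time.
conn.tableOperations().attachIterator("stats", setting);
```

Because the combiner runs at all three scopes, scans always see combined values even before any compaction has rewritten the files.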

Assuming you do go the combiner route, beware of writing a single '1' 
update for every "term" you see. If you can batch updates to your stats 
table before writing to Accumulo (splitting the combination work between 
your client and the servers), you should see better throughput than 
sending many individual updates.
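The client-side half of that batching can be as simple as pre-summing counts in a map before turning them into mutations (a sketch; `TermBatcher` is a hypothetical helper, not part of Accumulo):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Pre-aggregate term counts client-side so each term produces one update
// of 'n' against the stats table instead of n updates of '1'.
class TermBatcher {
    static Map<String, Long> aggregate(List<String> terms) {
        Map<String, Long> counts = new HashMap<>();
        for (String term : terms) {
            counts.merge(term, 1L, Long::sum); // client's share of the combining work
        }
        return counts;
    }
}
```

Each map entry then becomes a single Mutation (one value '3' for "foo" rather than three '1's), and the server-side combiner still correctly merges collisions across separate batches.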

Dylan Hutchison wrote:
> Sounds like you have the idea now Z.  There are three places an iterator
> can be applied: scan time, minor compaction time, and major compaction
> time.  Minor compactions help your case a lot-- when enough entries are
> written to a tablet server that the tablet server needs to dump them to a
> new Hadoop RFile, the minor compaction iterators run on the entries as they
> stream to the RFile.  This means that each RFile has only one entry for
> each unique (row, column family, column qualifier) tuple.
> Entries with the same (row, column family, column qualifier) in distinct
> RFiles will get combined at the next major compaction, or on the fly during
> the next scan.
>> For example, let's say there are 100 rows of [foo, 1]; will it actually be
>> 'combined' to a single row [foo, 100]?
> Careful-- Accumulo's combiners combine on Keys with identical row, column
> family and column qualifier.  You'd have to write a fancier iterator if
> you want to combine all the entries that share the same row.  Let us know
> if you need help doing that.
> On Thu, Aug 27, 2015 at 3:09 PM, z11373<>  wrote:
>> Thanks again Russ!
>> "but it might not be in this case if most of the data has already been
>> combined"
>> Does this mean Accumulo actually combines and persists the combined result
>> after the scan/compaction (depending on which op the combiner is applied
>> to)? For example, let's say there are 100 rows of [foo, 1]; will it
>> actually be 'combined' into a single row [foo, 100]? If so, then the
>> combiner is not expensive.
>> Wow, using the -1 approach is brilliant; I hadn't even thought of it
>> before. Yes, this will work for my case because I only need to know the
>> count.
>> Thanks,
>> Z
>> --
>> View this message in context:
>> Sent from the Developers mailing list archive at
