accumulo-user mailing list archives

From Russ Weeks <rwe...@newbrightidea.com>
Subject Re: Combiner behaviour
Date Wed, 19 Mar 2014 22:40:04 GMT
Hi, Josh,

Thanks very much for your response. I think I get what you're saying, but
it's kind of blowing my mind.

Are you saying that if I first set up an iterator that took my key/value
pairs like,

000200001ccaac30 meta:size []    1807
000200001ccaac30 meta:source []    data2
000200001cdaac30 meta:filename []    doc02985453
000200001cdaac30 meta:size []    656
000200001cdaac30 meta:source []    data2
000200001cfaac30 meta:filename []    doc04484522
000200001cfaac30 meta:size []    565
000200001cfaac30 meta:source []    data2
000200001dcaac30 meta:filename []    doc03342958

And emitted something like,

0 meta:size [] 1807
0 meta:size [] 656
0 meta:size [] 565

And then applied a SummingCombiner at a lower priority than that iterator,
then... it should work, right?

I'll give it a try.
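
To make the idea concrete, here is a rough plain-Java simulation of the two stages: one pass that keeps only meta:size entries and rewrites their row to "0", then a pass that sums adjacent entries sharing the same row, family and qualifier. Everything in it (the Entry class, rewriteRows, sumAdjacent, sampleScan) is a hypothetical stand-in and not part of the Accumulo API. Whether a real SummingCombiner stacked on top of a key-rewriting iterator behaves the same way is exactly what still needs testing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation only; none of these types come from the Accumulo API.
class RewriteThenSum {
    // Hypothetical stand-in for one key/value pair (visibility omitted).
    static final class Entry {
        final String row, family, qualifier, value;

        Entry(String row, String family, String qualifier, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.value = value;
        }

        boolean sameColumn(Entry o) {
            return row.equals(o.row) && family.equals(o.family) && qualifier.equals(o.qualifier);
        }
    }

    // Stage 1: keep only meta:size entries and rewrite every row ID to "0".
    static List<Entry> rewriteRows(List<Entry> scanned) {
        List<Entry> out = new ArrayList<>();
        for (Entry e : scanned) {
            if (e.family.equals("meta") && e.qualifier.equals("size")) {
                out.add(new Entry("0", e.family, e.qualifier, e.value));
            }
        }
        return out;
    }

    // Stage 2: sum adjacent entries that share row, family and qualifier.
    static List<Entry> sumAdjacent(List<Entry> rewritten) {
        List<Entry> out = new ArrayList<>();
        for (Entry e : rewritten) {
            int last = out.size() - 1;
            if (last >= 0 && out.get(last).sameColumn(e)) {
                long sum = Long.parseLong(out.get(last).value) + Long.parseLong(e.value);
                out.set(last, new Entry(e.row, e.family, e.qualifier, Long.toString(sum)));
            } else {
                out.add(e);
            }
        }
        return out;
    }

    // The sample rows from this message.
    static List<Entry> sampleScan() {
        return Arrays.asList(
            new Entry("000200001ccaac30", "meta", "size", "1807"),
            new Entry("000200001ccaac30", "meta", "source", "data2"),
            new Entry("000200001cdaac30", "meta", "filename", "doc02985453"),
            new Entry("000200001cdaac30", "meta", "size", "656"),
            new Entry("000200001cdaac30", "meta", "source", "data2"),
            new Entry("000200001cfaac30", "meta", "filename", "doc04484522"),
            new Entry("000200001cfaac30", "meta", "size", "565"),
            new Entry("000200001cfaac30", "meta", "source", "data2"),
            new Entry("000200001dcaac30", "meta", "filename", "doc03342958"));
    }

    public static void main(String[] args) {
        for (Entry e : sumAdjacent(rewriteRows(sampleScan()))) {
            // prints: 0 meta:size [] 3028
            System.out.println(e.row + " " + e.family + ":" + e.qualifier + " [] " + e.value);
        }
    }
}
```

The simulation sees all nine entries in one stream. On a real cluster each tablet server would only see its own tablets, so this would produce one partial sum per scanned range rather than a single total.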

Regards,
-Russ


On Wed, Mar 19, 2014 at 3:33 PM, Josh Elser <josh.elser@gmail.com> wrote:

> Russ,
>
> Remember that your data is distributed across the nodes in your
> cluster by tablet.
>
> A tablet, at the very minimum, will contain one row. Another way to say the
> same thing is that a row will never be split across multiple tablets. The
> only guarantee you get from Accumulo here is that you can use a combiner to
> do your combination across one row.
>
> However, when you combine (pun not intended) another SKVI
> (SortedKeyValueIterator) with the Combiner, you can do more merging of
> that intermediate "combined value" from each row before returning it to
> the client. You can think of this approach as doing a multi-level summation.
>
> This still requires one final sum on the client side, but you should get
> quite the reduction with this approach over doing the entire sum client
> side. You sum the meta:size column in parallel across parts of the table
> (server-side) and then client-side you sum the sums from each part.
>
> I can sketch this out in more detail if it's not clear. HTH
>
>
> On 3/19/14, 6:18 PM, Russ Weeks wrote:
>
>> The Accumulo manual states that combiners can be applied to values which
>> share the same rowID, column family, and column qualifier. Is there any
>> way to adjust this behaviour? I have rows that look like,
>>
>> 000200001ccaac30 meta:size []    1807
>> 000200001ccaac30 meta:source []    data2
>> 000200001cdaac30 meta:filename []    doc02985453
>> 000200001cdaac30 meta:size []    656
>> 000200001cdaac30 meta:source []    data2
>> 000200001cfaac30 meta:filename []    doc04484522
>> 000200001cfaac30 meta:size []    565
>> 000200001cfaac30 meta:source []    data2
>> 000200001dcaac30 meta:filename []    doc03342958
>>
>> and I'd like to sum up all the values of meta:size across all rows.  I
>> know I can scan the sizes and sum them on the client side, but I was
>> hoping there would be a way to do this inside my cluster. Is mapreduce
>> my only option here?
>>
>> Thanks,
>> -Russ
>>
>
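
The multi-level summation Josh describes in the quoted reply can be sketched in plain Java as follows. This is a simulation under stated assumptions, not Accumulo code: the class MultiLevelSum, its methods, and the split of the three sample meta:size values into two tablets are all hypothetical, chosen only to show partial sums computed per part of the table followed by one final sum on the client.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation only; none of this is the Accumulo API.
class MultiLevelSum {
    // Server side (simulated): each tablet reduces its own meta:size
    // values to a single partial sum.
    static List<Long> partialSums(List<List<Long>> sizesByTablet) {
        List<Long> partials = new ArrayList<>();
        for (List<Long> tablet : sizesByTablet) {
            long sum = 0;
            for (long size : tablet) {
                sum += size;
            }
            partials.add(sum);
        }
        return partials;
    }

    // Client side: sum the sums returned by each part of the table.
    static long clientTotal(List<Long> partials) {
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }

    // Hypothetical split: two rows land in one tablet, the third in another.
    static List<List<Long>> sampleTablets() {
        return Arrays.asList(
            Arrays.asList(1807L, 656L),
            Arrays.asList(565L));
    }

    public static void main(String[] args) {
        List<Long> partials = partialSums(sampleTablets());
        // prints: partials=[2463, 565] total=3028
        System.out.println("partials=" + partials + " total=" + clientTotal(partials));
    }
}
```

The client still does some work, but it adds one number per part of the table instead of one number per row.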
