accumulo-user mailing list archives

From John Vines <vi...@apache.org>
Subject Re: Combiner behaviour
Date Wed, 19 Mar 2014 22:43:06 GMT
Be careful when changing row values, especially outside of the tablet
range, as I believe it can cause the data to be dropped or rejected.


On Wed, Mar 19, 2014 at 6:40 PM, Russ Weeks <rweeks@newbrightidea.com> wrote:

> Hi, Josh,
>
> Thanks very much for your response. I think I get what you're saying, but
> it's kind of blowing my mind.
>
> Are you saying that if I first set up an iterator that took my key/value
> pairs like,
>
> 000200001ccaac30 meta:size []    1807
> 000200001ccaac30 meta:source []    data2
> 000200001cdaac30 meta:filename []    doc02985453
> 000200001cdaac30 meta:size []    656
> 000200001cdaac30 meta:source []    data2
> 000200001cfaac30 meta:filename []    doc04484522
> 000200001cfaac30 meta:size []    565
> 000200001cfaac30 meta:source []    data2
> 000200001dcaac30 meta:filename []    doc03342958
>
> And emitted something like,
>
> 0 meta:size [] 1807
> 0 meta:size [] 656
> 0 meta:size [] 565
>
> And then applied a SummingCombiner at a lower priority than that iterator,
> then... it should work, right?
>
> I'll give it a try.
>
> Regards,
> -Russ
>
>
> On Wed, Mar 19, 2014 at 3:33 PM, Josh Elser <josh.elser@gmail.com> wrote:
>
>> Russ,
>>
>> Keep in mind that your data is distributed across multiple nodes in your
>> cluster by tablet.
>>
>> A tablet will, at the very minimum, contain one row. Another way to say the
>> same thing is that a row will never be split across multiple tablets. The
>> only guarantee you get from Accumulo here is that you can use a combiner to
>> do your combination across one row.
>>
>> However, when you combine (pun not intended) another SKVI with the
>> Combiner, you can do more merging of that intermediate "combined value"
>> from each row before returning it to the client. You can think of this
>> approach as doing a multi-level summation.
>>
>> This still requires one final sum on the client side, but you should get
>> quite the reduction with this approach over doing the entire sum client
>> side. You sum the meta:size column in parallel across parts of the table
>> (server-side) and then client-side you sum the sums from each part.
>>
>> I can sketch this out in more detail if it's not clear. HTH
>>
>>
>> On 3/19/14, 6:18 PM, Russ Weeks wrote:
>>
>>> The Accumulo manual states that combiners can be applied to values which
>>> share the same rowID, column family, and column qualifier. Is there any
>>> way to adjust this behaviour? I have rows that look like,
>>>
>>> 000200001ccaac30 meta:size []    1807
>>> 000200001ccaac30 meta:source []    data2
>>> 000200001cdaac30 meta:filename []    doc02985453
>>> 000200001cdaac30 meta:size []    656
>>> 000200001cdaac30 meta:source []    data2
>>> 000200001cfaac30 meta:filename []    doc04484522
>>> 000200001cfaac30 meta:size []    565
>>> 000200001cfaac30 meta:source []    data2
>>> 000200001dcaac30 meta:filename []    doc03342958
>>>
>>> and I'd like to sum up all the values of meta:size across all rows.  I
>>> know I can scan the sizes and sum them on the client side, but I was
>>> hoping there would be a way to do this inside my cluster. Is MapReduce
>>> my only option here?
>>>
>>> Thanks,
>>> -Russ
>>>
>>
>
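
The multi-level summation Josh describes in the quoted message can be
sketched in plain Java, with no Accumulo dependency. This is only an
illustration under stated assumptions: each tablet is modeled as a sorted map
from a "row column" string to its value, the class and method names
(MultiLevelSum, partialSum, clientSum, sampleTablets) are hypothetical, and
the split of the sample rows into two tablets is made up. In a real cluster
the per-tablet step would run inside a server-side iterator rather than in
client code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MultiLevelSum {
    // Level 1 (would run server-side): sum one column within a single tablet.
    // Keys are "row column" strings; this layout is an assumption of the sketch.
    static long partialSum(Map<String, String> tablet, String column) {
        long sum = 0;
        for (Map.Entry<String, String> e : tablet.entrySet()) {
            if (e.getKey().endsWith(" " + column)) {
                sum += Long.parseLong(e.getValue());
            }
        }
        return sum;
    }

    // Level 2 (client-side): sum the partial sums returned by each tablet.
    static long clientSum(List<Long> partials) {
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }

    // Runs both levels over a list of tablets.
    static long total(List<Map<String, String>> tablets, String column) {
        List<Long> partials = new ArrayList<>();
        for (Map<String, String> tablet : tablets) {
            partials.add(partialSum(tablet, column));
        }
        return clientSum(partials);
    }

    // Sample rows from the thread, split into two tablets for illustration.
    // A row is never split across tablets, so each row lives in exactly one map.
    static List<Map<String, String>> sampleTablets() {
        Map<String, String> t1 = new TreeMap<>();
        t1.put("000200001ccaac30 meta:size", "1807");
        t1.put("000200001ccaac30 meta:source", "data2");
        t1.put("000200001cdaac30 meta:filename", "doc02985453");
        t1.put("000200001cdaac30 meta:size", "656");
        t1.put("000200001cdaac30 meta:source", "data2");

        Map<String, String> t2 = new TreeMap<>();
        t2.put("000200001cfaac30 meta:filename", "doc04484522");
        t2.put("000200001cfaac30 meta:size", "565");
        t2.put("000200001cfaac30 meta:source", "data2");
        t2.put("000200001dcaac30 meta:filename", "doc03342958");

        List<Map<String, String>> tablets = new ArrayList<>();
        tablets.add(t1);
        tablets.add(t2);
        return tablets;
    }

    public static void main(String[] args) {
        System.out.println(total(sampleTablets(), "meta:size"));
    }
}
```

The client only ever sees one number per tablet instead of every meta:size
entry, which is where the reduction in transferred data comes from.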
