accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: Combiner behaviour
Date Wed, 19 Mar 2014 22:51:56 GMT
Ummm, you got the gist of it (I may have misspoken in what I initially said).

My first thought was to make an iterator that filters down to the
columns that you want. It doesn't look like the core includes an
iterator that will do this efficiently for you (although I know I've
done something similar in the past). This iterator would scan the rows
of your table, returning just the columns you want:

000200001ccaac30 meta:size []    1807
000200001cdaac30 meta:size []    656
000200001cfaac30 meta:size []    565
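
To make the projection step concrete, here is a minimal sketch in plain Python. It only models what such an iterator would emit; it is not Accumulo API code, and the names ENTRIES and project_columns are made up for illustration. The data is the sample from the original question.

```python
# Plain-Python model of a column-projecting iterator; not Accumulo API code.
# Entries are (row, family, qualifier, value) tuples in sorted key order.
ENTRIES = [
    ("000200001ccaac30", "meta", "size", "1807"),
    ("000200001ccaac30", "meta", "source", "data2"),
    ("000200001cdaac30", "meta", "filename", "doc02985453"),
    ("000200001cdaac30", "meta", "size", "656"),
    ("000200001cdaac30", "meta", "source", "data2"),
    ("000200001cfaac30", "meta", "filename", "doc04484522"),
    ("000200001cfaac30", "meta", "size", "565"),
    ("000200001cfaac30", "meta", "source", "data2"),
    ("000200001dcaac30", "meta", "filename", "doc03342958"),
]

def project_columns(entries, family, qualifier):
    """Yield only the entries whose column is family:qualifier (hypothetical helper)."""
    for row, cf, cq, value in entries:
        if cf == family and cq == qualifier:
            yield (row, cf, cq, value)

# Keep only meta:size, preserving the sorted row order of the input.
sizes = list(project_columns(ENTRIES, "meta", "size"))
```

With the sample data, sizes holds the three meta:size entries listed above, in row order.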

Then, we could put the summing combiner on top of that iterator to sum
those values and get back a single key/value pair. The row in the key
you return should be the last row you included in the sum. This way, if
the batchscanner retries under the hood, you'll resume where you left
off and won't double-count things.

(You could even sum at most N rows before returning an intermediate
sum, to better parallelize things.)

000200001cfaac30 meta:size []    3028
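
A rough model of the summing step, again in plain Python rather than against the Accumulo API. The names sum_in_batches, max_rows, and resume_after are hypothetical, and the sketch assumes one meta:size entry per row, as in the sample data. It shows why keying each partial sum on the last row it covers lets a retry resume without double-counting.

```python
# Plain-Python model of the summing step; not Accumulo API code.
# Input: (row, value) pairs for meta:size, one per row, in sorted row order.
SIZES = [
    ("000200001ccaac30", "1807"),
    ("000200001cdaac30", "656"),
    ("000200001cfaac30", "565"),
]

def sum_in_batches(size_entries, max_rows=None):
    """Yield (last_row, partial_sum); flush every max_rows rows if max_rows is set."""
    total, count, last_row = 0, 0, None
    for row, value in size_entries:
        total += int(value)
        count += 1
        last_row = row
        if max_rows is not None and count == max_rows:
            yield (last_row, total)
            total, count = 0, 0
    if count > 0:
        # Flush whatever is left over at the end of the scanned range.
        yield (last_row, total)

def resume_after(size_entries, last_row):
    """Model a retry: continue strictly after the last row already returned."""
    return [(row, value) for row, value in size_entries if row > last_row]

whole = list(sum_in_batches(SIZES))
first = next(sum_in_batches(SIZES, max_rows=2))
rest = list(sum_in_batches(resume_after(SIZES, first[0])))
```

Here whole is the single pair shown above. With max_rows=2, first covers the first two rows, and resuming after first[0] picks up only the remaining row, so nothing is counted twice.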

So, each "ScanSession" (what the batchscanner is doing underneath the
hood) would return one such value to you, and your client would do a
final summation over those values.

The final stack would be {(data from accumulo) > SKVI
(SortedKeyValueIterator) to project columns > summing combiner} >
final summation, where {...} denotes work done server-side. This is one
of those places where the Accumulo API really shines.
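
The whole stack can be modelled end to end in a few lines of plain Python. This is only a sketch: the split of rows into TABLET_A and TABLET_B is an assumption made up for the example, server_side and client_total are hypothetical names, and a thread pool stands in for the batchscanner talking to several tablet servers at once.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical split of the example rows across two tablets (an assumption).
TABLET_A = [
    ("000200001ccaac30", "meta:size", "1807"),
    ("000200001ccaac30", "meta:source", "data2"),
    ("000200001cdaac30", "meta:filename", "doc02985453"),
    ("000200001cdaac30", "meta:size", "656"),
    ("000200001cdaac30", "meta:source", "data2"),
]
TABLET_B = [
    ("000200001cfaac30", "meta:filename", "doc04484522"),
    ("000200001cfaac30", "meta:size", "565"),
    ("000200001cfaac30", "meta:source", "data2"),
    ("000200001dcaac30", "meta:filename", "doc03342958"),
]

def server_side(tablet, column="meta:size"):
    """Model of {project columns > sum}: return (last_row, partial_sum) or None."""
    last_row, total = None, 0
    for row, col, value in tablet:
        if col == column:
            last_row, total = row, total + int(value)
    return None if last_row is None else (last_row, total)

def client_total(tablets):
    """Run the server-side model per tablet in parallel, then sum the partial sums."""
    with ThreadPoolExecutor() as pool:
        partials = [p for p in pool.map(server_side, tablets) if p is not None]
    return partials, sum(total for _, total in partials)

partials, grand_total = client_total([TABLET_A, TABLET_B])
```

Each tablet hands back one small key/value pair instead of every meta:size entry, so the client only adds up as many numbers as there are partial results.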

On 3/19/14, 6:40 PM, Russ Weeks wrote:
> Hi, Josh,
>
> Thanks very much for your response. I think I get what you're saying,
> but it's kind of blowing my mind.
>
> Are you saying that if I first set up an iterator that took my key/value
> pairs like,
>
> 000200001ccaac30 meta:size []    1807
> 000200001ccaac30 meta:source []    data2
> 000200001cdaac30 meta:filename []    doc02985453
> 000200001cdaac30 meta:size []    656
> 000200001cdaac30 meta:source []    data2
> 000200001cfaac30 meta:filename []    doc04484522
> 000200001cfaac30 meta:size []    565
> 000200001cfaac30 meta:source []    data2
> 000200001dcaac30 meta:filename []    doc03342958
>
> And emitted something like,
>
> 0 meta:size [] 1807
> 0 meta:size [] 656
> 0 meta:size [] 565
>
> And then applied a SummingCombiner at a lower priority than that
> iterator, then... it should work, right?
>
> I'll give it a try.
>
> Regards,
> -Russ
>
>
> On Wed, Mar 19, 2014 at 3:33 PM, Josh Elser <josh.elser@gmail.com> wrote:
>
>     Russ,
>
>     Remember about the distribution of data across multiple nodes in
>     your cluster by tablet.
>
>     A tablet, at the very minimum, will contain one row. Another way to
>     say the same thing is that a row will never be split across multiple
>     tablets. The only guarantee you get from Accumulo here is that you
>     can use a combiner to do your combination across one row.
>
>     However, when you combine (pun not intended) another SKVI with the
>     Combiner, you can do more merging of that intermediate "combined
>     value" from each row before returning back to the client. You can
>     think of this approach as doing a multi-level summation.
>
>     This still requires one final sum on the client side, but you should
>     get quite the reduction with this approach over doing the entire sum
>     client side. You sum the meta:size column in parallel across parts
>     of the table (server-side) and then client-side you sum the sums
>     from each part.
>
>     I can sketch this out in more detail if it's not clear. HTH
>
>
>     On 3/19/14, 6:18 PM, Russ Weeks wrote:
>
>         The accumulo manual states that combiners can be applied to
>         values which share the same rowID, column family, and column
>         qualifier. Is there any way to adjust this behaviour? I have
>         rows that look like,
>
>         000200001ccaac30 meta:size []    1807
>         000200001ccaac30 meta:source []    data2
>         000200001cdaac30 meta:filename []    doc02985453
>         000200001cdaac30 meta:size []    656
>         000200001cdaac30 meta:source []    data2
>         000200001cfaac30 meta:filename []    doc04484522
>         000200001cfaac30 meta:size []    565
>         000200001cfaac30 meta:source []    data2
>         000200001dcaac30 meta:filename []    doc03342958
>
>         and I'd like to sum up all the values of meta:size across all
>         rows. I know I can scan the sizes and sum them on the client
>         side, but I was hoping there would be a way to do this inside
>         my cluster. Is mapreduce my only option here?
>
>         Thanks,
>         -Russ
>
>
