beam-user mailing list archives

From Matthias Baetens
Subject GroupByKey and CombineFn: internals
Date Tue, 22 Nov 2016 11:32:07 GMT
Hi there,

I have some questions about the internal workings of these two concepts and
where I could find more info on them (so I might be able to solve similar
problems myself in the future). Here we go:

+ When doing a GroupByKey, when does the shuffling actually take place?
Could it be that the behaviour is not the same when using a CombineFn to
aggregate as when using a SerializableFunction? (I have a feeling that in
the first case not all the values for a key get shuffled to one machine,
while in the second case they do.)

+ When using accumulators in a CombineFn, what are the actual internals? Are
there any docs on this? The problem I run into is that, when I try adding
elements to an ArrayList and then merging the ArrayLists, the output is an
empty list. The problem could probably be solved by using a
SerializableFunction to combine everything at once, but you might lose the
advantages of parallelisation in that case (~ above).
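
For illustration, the accumulator pattern described above can be sketched in plain Java. This is a minimal sketch under assumptions, not the original pipeline code: the class name ListCombineSketch and the String element type are hypothetical, and the class only mirrors the four CombineFn method names (createAccumulator, addInput, mergeAccumulators, extractOutput) without depending on Beam. It shows the contract each step is expected to honour: addInput returns the updated accumulator, and mergeAccumulators folds every partial accumulator into the one it returns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a list-collecting CombineFn; it does not use Beam.
public class ListCombineSketch {

    // Each group of inputs starts from a fresh, empty accumulator.
    public static List<String> createAccumulator() {
        return new ArrayList<>();
    }

    // Fold one input into the accumulator and return the updated accumulator.
    public static List<String> addInput(List<String> accumulator, String input) {
        accumulator.add(input);
        return accumulator;
    }

    // Fold all partial accumulators (possibly built separately) into one result.
    public static List<String> mergeAccumulators(Iterable<List<String>> accumulators) {
        List<String> merged = createAccumulator();
        for (List<String> partial : accumulators) {
            merged.addAll(partial);
        }
        return merged;
    }

    // Turn the final accumulator into the output value.
    public static List<String> extractOutput(List<String> accumulator) {
        return accumulator;
    }

    public static void main(String[] args) {
        // Simulate two partial accumulators that are merged afterwards.
        List<String> first = addInput(addInput(createAccumulator(), "a"), "b");
        List<String> second = addInput(createAccumulator(), "c");
        List<String> result =
            extractOutput(mergeAccumulators(Arrays.asList(first, second)));
        System.out.println(result); // prints [a, b, c]
    }
}
```

In this sketch an empty result would appear if addInput returned a new list instead of the one it modified, or if mergeAccumulators returned a fresh list without copying the partial lists into it. Whether either applies to the real code is not known from the description.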

Thanks a lot :)



*Matthias Baetens*

*datatonic | data power unleashed*
office +44 203 668 3680  |  mobile +44 74 918 20646

Level24 | 1 Canada Square | Canary Wharf | E14 5AB London
