flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From William Saar <William.S...@king.com>
Subject Working with data locality in streaming using groupBy?
Date Fri, 05 Jun 2015 13:00:46 GMT
How can I maintain a local state, for instance a ConcurrentHashMap, across different steps
in a streaming chain on a single machine/process?
Static variable? (This doesn't seem to work well when running locally as it gets shared across
multiple instances, a common "pipeline" store would be helpful)

Is it OK to checkpoint such a local state in a single map operation at the beginning of the
pipeline, or does it need to be done for every function?

Will multiple groupBy steps using the same key selector output pass data to the same machines?
(To preserve data locality)

How can I do a fold/reduce operation that only returns its result after a full window has
been processed, even when the processing in the window includes streams that have been distributed
and merged from different machines using groupBy?

My scenario is as follows
I want to build up and partition a large state across different machines by using groupBy
on a stream. The processing occurs in a window and some processing needs to be done on multiple
machines so I want to do additional groupBy operators to pass partial results to other machines.
Pseudo code:

flattenedWindowStream = streamSource.groupBy(myKeySelector). // Initial paritioning
map(localStateSaverCheckpointMapper).  //Checkpoint that saves local state, just passes through
the data

localAndRemoteStream =  flattenedWindowStream.split(event -> canBeProcessedLocally(event)
? "local" : "remote" );
remoteStream = localAndRemoteStream.select("remote").
map(partialProcessing).  // Partially process what I can with my local state
groupBy(myKeySelector). // Send the partial processing to the machines that own the rest of
the data
globalResult = localAndRemoteStream.select("local").map(process).union(remoteStream).broadcast();
// Broadcast all fully processed results to all machines

globalResult.fold().addSink(globalWindowOutputSink)  // fold/reduce, I want a result based
on the full contents of the window

Any help would be greatly appreciated!


View raw message