samza-dev mailing list archives

From Jef G <j...@dataminr.com>
Subject buffering records for join
Date Wed, 11 Oct 2017 20:56:04 GMT
Hello. My team is looking into Samza for real-time processing. We
would like to run a directed graph of jobs, where the records in each job's
input streams are joined on a common key. We already have functionality that
performs the join by buffering records from the input streams until certain
conditions are met and then passing the joined results downstream.
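To make the question concrete, here is a minimal sketch of the kind of buffering join we mean, reduced to plain Java with a "both sides present" join condition. The names (JoinBuffer, add) are illustrative only, not Samza API; in a real task this logic would sit behind process().

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative buffering join: records from two input streams are held in
// per-key buffers; whenever a record arrives and the other side already has
// records for that key, joined pairs are emitted.
class JoinBuffer {
    private final Map<String, List<String>> left = new HashMap<>();
    private final Map<String, List<String>> right = new HashMap<>();

    // Buffer a record from one side; return any joined pairs, formatted as
    // "leftValue|rightValue", or an empty list if the key has no match yet.
    public List<String> add(String key, String value, boolean fromLeft) {
        Map<String, List<String>> mine = fromLeft ? left : right;
        Map<String, List<String>> other = fromLeft ? right : left;
        List<String> joined = new ArrayList<>();
        List<String> matches = other.get(key);
        if (matches != null) {
            for (String m : matches) {
                joined.add(fromLeft ? value + "|" + m : m + "|" + value);
            }
        }
        // Keep the record buffered so later arrivals on the other side
        // can still join against it.
        mine.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        return joined;
    }
}
```

The failure-recovery question below is about what happens to the contents of those per-key buffers when the job restarts.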

We are wondering about the best way to integrate this functionality into a
Samza job. After looking over the API we see two possibilities:

1. Use a StreamTask that adds records to a buffer. This is the method the
"ad event" example uses. But we are concerned that the framework commits
a StreamTask's offset after process() completes, so if the job fails,
records still sitting in the buffer are permanently lost.

2. Use an AsyncStreamTask that adds records to a buffer, along with their
TaskCallbacks. When records are eventually joined and processed, complete
their callbacks. This method seems promising, but it requires setting
task.max.concurrency very high - possibly in the tens of thousands in our
case. Are we likely to run into any issues doing that?
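For reference, option 2 would look roughly like the following. TaskCallback here is a stub standing in for Samza's org.apache.samza.task.TaskCallback, and the join condition is again simplified to one record per side per key; this is a sketch of the idea, not a working task.

```java
import java.util.HashMap;
import java.util.Map;

// Stub for org.apache.samza.task.TaskCallback, so the sketch is
// self-contained.
interface TaskCallback {
    void complete();
    void failure(Throwable t);
}

// A buffered record together with the callback that must not be completed
// until the record has been joined.
class PendingRecord {
    final String value;
    final TaskCallback callback;
    PendingRecord(String value, TaskCallback callback) {
        this.value = value;
        this.callback = callback;
    }
}

// Sketch of option 2: hold each record's TaskCallback in the buffer; when a
// key has a record on both sides, emit the join and complete both callbacks,
// which lets the framework checkpoint past those records.
class CallbackJoinBuffer {
    private final Map<String, PendingRecord> left = new HashMap<>();
    private final Map<String, PendingRecord> right = new HashMap<>();

    // Would be called from processAsync(envelope, collector, coordinator, cb).
    public void add(String key, String value, boolean fromLeft, TaskCallback cb) {
        Map<String, PendingRecord> mine = fromLeft ? left : right;
        Map<String, PendingRecord> other = fromLeft ? right : left;
        PendingRecord match = other.remove(key);
        if (match == null) {
            // No match yet: park the record and its callback.
            mine.put(key, new PendingRecord(value, cb));
            return;
        }
        // Joined: the downstream send would happen here, then both callbacks
        // complete so their offsets become eligible for checkpointing.
        match.callback.complete();
        cb.complete();
    }
}
```

The concern about task.max.concurrency comes from this structure: every parked record holds an uncompleted callback, so the number of concurrently outstanding callbacks equals the buffer depth.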

Are there any other options that we overlooked? What is the best approach?

-Jef G
