I'm writing an application to generate reports of a real time stream of sale records (product, quantity, price, time) that I get from a websocket, and I plan to use Kafka and Flink to do so. The result of the pipeline should be, for instance, "the average price for a certain product in the last 3 hours, minute by minute", "the volume (sum of quantities) for a certain product in the last hour, second by second", etc.
I want to balance the load on a cluster given that I'm receiving way too many records to even keep them in memory in a single computer, around 10,000 each second. I intend to distribute them on a Kafka topic "sales", with multiple partitions (one for each product). That way all the records for a given product are kept on the same node but different products are processed on different nodes.
There are some things I still don't get about how Kafka and Flink are supposed to work together. The following questions may be rather general and that's because I've no experience with either of them, so any additional insight or advice you can provide is very welcome.
Keeping all records for a given product on the same node as mentioned should lower the latency since all consumers will handle only local data. That works in theory only, because if I'm not wrong Flink executes the operators over its own cluster of nodes.
How does Flink know where to process what? What happens if there are more nodes running Flink than Kafka, or the other way around? In case I have one pair of Kafka and Flink node on each node on the cluster (AWS), how am I sure that nodes get the correct data (ie, only local data)?
In the case of computing the volume of a product in a sliding window and provided the quantities are 10, 3, 6, 1, 0, 15, .... For the first window I may need to compute 10+3+6+1, for the second 3+6+1+0, for the third 6+1+0+15, and so on. The problem here is that in each case I compute some of the sums multiple times (6+1 for instance is computed 3 times), and it is even worse considering a window may have thousands of records and that some operations are not that simple (standard deviation for instance).
Is there a way to reuse the result of previous operations to avoid this problem? Any performance improvement to apply on this cases?
The pipeline of operations to get all the information for a report is really big. It may require an average of prices, volume, standard deviation, etc. But some of these operations can be executed concurrently.
How do I define the workflow (topology) so that certain "steps" are executed concurrently, while others wait for all the previous steps to be completed before proceeding?
The producer and all the consumers have in common many Java classes, so for simplicity I intend to create and launch them from a single application/process, maybe creating one thread for each one if required.
Is there any problem with that? Any advantage of creating an independent application for each producer and consumer as shown in the documentation?