kafka-dev mailing list archives

From Adam Bellemare <adam.bellem...@gmail.com>
Subject Synchronized consumption + processing based on timestamps?
Date Wed, 03 Jul 2019 01:01:23 GMT
Hi All,

The use-case is pretty simple. Let's say we have a history of events with
the following:
key=userId, value=(timestamp, productId)

and we want to remap it to (just as we would with an internal topic):
key=productId, value=(original_timestamp, userId)
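For concreteness, the remap itself is just a key/value transpose; a minimal Python sketch (function name and tuple shapes are mine, purely illustrative, not from any client library):

```python
def remap(user_id, event):
    """Re-key a (userId -> (timestamp, productId)) record
    into (productId -> (original_timestamp, userId))."""
    timestamp, product_id = event
    return product_id, (timestamp, user_id)

# Example: a click by "user1" on "prodA" at t=1000 becomes keyed by product.
print(remap("user1", (1000, "prodA")))  # ('prodA', (1000, 'user1'))
```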

Now, say I have 30 days of backlog and 2 partitions for the input topic. I
spin up two instances and let them process the data from the start of time,
but one instance is only half as powerful (less CPU, memory, etc.), such that
instance 0 processes X events/sec while instance 1 processes X/2 events/sec.

My question is: Are there *any* clients (Kafka Streams, Spark, Flink, etc.,
or otherwise) that would allow these two consumers to remain in sync *according
to their timestamps*? I don't want to see events with an original_timestamp of
today (from instance 0) interleaved with events from 15 days ago (from the
underpowered instance 1). Yes, I realize this would bring my throughput
down, but I am looking for any existing tech that would effectively say *"cap
the time difference of events coming out of this repartition processor at
60 seconds max"*.
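To make the desired behavior concrete, here is a toy simulation of that cap; it is not based on any existing client API, and every name and the pausing policy are hypothetical. Each consumer is paused whenever its next event would run more than max_skew seconds ahead of the slowest consumer's watermark (its last emitted timestamp):

```python
def paced_consume(streams, max_skew=60):
    """Simulate N consumers, one per partition, where a consumer is paused
    whenever its next event's timestamp would run more than max_skew
    seconds ahead of the slowest consumer's watermark.
    Each stream is a list of event timestamps in ascending order.
    Returns the emitted (instance, timestamp) sequence."""
    pos = [0] * len(streams)                  # next index per instance
    wm = [float("-inf")] * len(streams)       # per-instance watermark
    out = []

    def active():
        return [i for i in range(len(streams)) if pos[i] < len(streams[i])]

    while active():
        slowest = min(wm[i] for i in active())
        progressed = False
        for i in active():
            ts = streams[i][pos[i]]
            # pause a consumer that would race too far ahead of the slowest
            if slowest != float("-inf") and ts - slowest > max_skew:
                continue
            out.append((i, ts))
            wm[i] = ts
            pos[i] += 1
            progressed = True
        if not progressed:
            # everyone is blocked: the slowest stream has a gap in its
            # data, so force it forward to raise the floor
            i = min(active(), key=lambda j: wm[j])
            ts = streams[i][pos[i]]
            out.append((i, ts))
            wm[i] = ts
            pos[i] += 1
    return out

# A fast partition and a slow one with a gap; with max_skew=15 the fast
# consumer is held back until the slow one catches up.
print(paced_consume([[0, 10, 20, 30], [0, 25]], max_skew=15))
```

A real implementation would presumably hang this policy off the consumer's pause/resume mechanism rather than a lock-step loop, but the invariant is the same: no instance's emitted timestamps run more than max_skew ahead of the slowest instance.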

Currently, I am not aware of ANY open-source solutions that do this for
Kafka, but if someone has heard otherwise I would love to know.
Alternatively, perhaps this could lead to a KIP.

Thanks
Adam
