kafka-dev mailing list archives

From "Matthias J. Sax" <matth...@confluent.io>
Subject Re: Synchronized consumption + processing based on timestamps?
Date Wed, 03 Jul 2019 03:27:21 GMT
I think you can only achieve this if

1) you don't use two clients, but only one client that reads both partitions (see the sketch below)

or

2) you let both clients exchange data about their time progress
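
A minimal sketch of option 1 with the plain Java consumer, in case it is useful: assign both partitions to a single KafkaConsumer, track the last-seen record timestamp per partition, and pause() whichever partition runs ahead of the other by more than the allowed skew. The topic name, serializers, and the 60-second cap are illustrative, and it assumes the record timestamp carries the event time in question.

import java.time.Duration;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class TimeAlignedConsumer {

    private static final long MAX_SKEW_MS = 60_000L; // the "60 seconds max" cap

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // manual assignment, no group

        TopicPartition p0 = new TopicPartition("user-events", 0); // illustrative topic name
        TopicPartition p1 = new TopicPartition("user-events", 1);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Arrays.asList(p0, p1));

            // last-seen record timestamp per partition; 0 = nothing seen yet
            Map<TopicPartition, Long> lastTs = new HashMap<>();
            lastTs.put(p0, 0L);
            lastTs.put(p1, 0L);

            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(100))) {
                    lastTs.put(new TopicPartition(rec.topic(), rec.partition()), rec.timestamp());
                    // remap key=userId/value=(timestamp, productId) to
                    // key=productId/value=(original_timestamp, userId) and produce it here
                }
                // Throttle whichever partition has run ahead by more than the cap;
                // resume both once they are back within the allowed skew.
                long t0 = lastTs.get(p0);
                long t1 = lastTs.get(p1);
                if (t0 - t1 > MAX_SKEW_MS) {
                    consumer.pause(Collections.singleton(p0));
                    consumer.resume(Collections.singleton(p1));
                } else if (t1 - t0 > MAX_SKEW_MS) {
                    consumer.pause(Collections.singleton(p1));
                    consumer.resume(Collections.singleton(p0));
                } else {
                    consumer.resume(Arrays.asList(p0, p1));
                }
            }
        }
    }
}

Option 2 would need an out-of-band channel instead, e.g. each client publishing its current stream time to a control topic and pausing itself whenever it is more than the allowed skew ahead of the other.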


-Matthias


On 7/2/19 6:01 PM, Adam Bellemare wrote:
> Hi All
> 
> The use case is pretty simple. Let's say we have a history of events with
> the following:
> key=userId, value = (timestamp, productId)
> 
> and we want to remap it to (just as we would with an internal topic):
> key=productId, value=(original_timestamp, userId)
> 
> Now, say I have 30 days of backlog, and 2 partitions for the input topic. I
> spin up two instances and let them process the data from the start of time,
> but one instance is only half as powerful (less CPU, memory, etc.), such that
> instance 0 processes X events/sec while instance 1 processes X/2 events/sec.
> 
> My question is: Are there *any* clients (Kafka Streams, Spark, Flink, or
> otherwise) that would allow these two consumers to remain in sync *according
> to their timestamps*? I don't want to see events with an original_timestamp of
> today (from instance 0) interleaved with events from 15 days ago (from the
> underpowered instance 1). Yes, I do realize this would bring my throughput
> down, but I am looking for any existing tech that would effectively say *"cap
> the time difference of events coming out of this repartition processor at
> 60 seconds max"*.
> 
> Currently, I am not aware of ANY open source solutions that do this for
> Kafka, but if someone has heard otherwise I would love to know.
> Alternatively, perhaps this could lead to a KIP.
> 
> Thanks
> Adam
> 

