kafka-dev mailing list archives

From Zahari Dichev <zaharidic...@gmail.com>
Subject Re: Throwing away prefetched records optimisation.
Date Wed, 17 Oct 2018 18:55:29 GMT
Hi there Ryanne,

Thanks for the response! There is most likely quite a lot that I am
missing here, but after reading the docs, it seems to me that the
pause/resume API has been provided with the very purpose of implementing
bespoke flow control. Given that, I see it as quite natural to be able to
pause and resume as needed without facing the problems outlined in my
previous email. So if we are going about this totally wrong and using
pause/resume the wrong way, feel free to elaborate. I am really not
trying to argue my case here, just genuinely attempting to understand
what can be done on our end to improve the Akka Streams integration.
Thanks in advance :)

Zahari

On Wed, Oct 17, 2018 at 5:49 PM Ryanne Dolan <ryannedolan@gmail.com> wrote:

> Zahari,
>
> It sounds to me like this problem is due to Akka attempting to implement
> additional backpressure on top of the Consumer API. I'd suggest they not do
> that, and then this problem goes away.
>
> Ryanne
>
> On Wed, Oct 17, 2018 at 7:35 AM Zahari Dichev <zaharidichev@gmail.com>
> wrote:
>
> > Hi there,
> >
> > Are there any opinions on the matter described in my previous email? I
> > think this is quite important when it comes to implementing any
> > non-trivial functionality that relies on pause/resume. Of course, if I
> > am mistaken, feel free to elaborate.
> >
> > Thanks,
> > Zahari
> >
> > On Tue, Oct 16, 2018 at 10:29 AM Zahari Dichev <zaharidichev@gmail.com>
> > wrote:
> >
> > > Hi there Kafka developers,
> > >
> > > I am currently trying to find a solution to an issue that has been
> > > manifesting itself in the Akka Streams implementation of the Kafka
> > > connector. When it comes to consuming messages, the implementation
> > > relies heavily on the fact that we can pause and resume partitions. In
> > > some situations, when a single consumer instance is shared among
> > > several streams, we might end up frequently pausing and resuming a
> > > set of topic partitions, which is the main facility that allows us to
> > > implement backpressure. This, however, has certain disadvantages,
> > > especially when there are two consumers that differ in processing
> > > speed.
> > >
> > > To articulate the issue more clearly, imagine that a consumer
> > > maintains assignments for two topic partitions, *TP1* and *TP2*. This
> > > consumer is shared by two streams, S1 and S2. When we have demand
> > > from only one of the streams, *S1*, we pause the other topic
> > > partition, *TP2*, and call *poll()* on the consumer to retrieve only
> > > the records for the demanded topic partition, *TP1*. The result is
> > > that all the records that have been prefetched for *TP2* are thrown
> > > away by the fetcher ("*Not returning fetched records for assigned
> > > partition TP2 since it is no longer fetchable*"). If we extrapolate
> > > that to multiple streams sharing the same consumer, we might quickly
> > > end up in a situation where we throw away prefetched data quite
> > > often. This does not seem like the most efficient approach, and in
> > > fact it produces quite a lot of overlapping fetch requests, as
> > > illustrated in the following issue:
> > >
> > > https://github.com/akka/alpakka-kafka/issues/549
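
For concreteness, here is a minimal toy model (in Python, not Kafka's actual
Java implementation) of the discard behaviour described above; the class and
method shapes are invented purely for illustration:

```python
# Toy model of the behaviour described above: records already prefetched for
# a partition that is paused at poll() time are dropped because the
# partition is "no longer fetchable". Not Kafka's real implementation.
from collections import defaultdict

class ToyFetcher:
    def __init__(self):
        self.completed_fetches = defaultdict(list)  # partition -> records
        self.paused = set()

    def pause(self, tp):
        self.paused.add(tp)

    def poll(self):
        out = {}
        for tp in list(self.completed_fetches):
            records = self.completed_fetches.pop(tp)
            if tp in self.paused:
                # "Not returning fetched records for assigned partition TP2
                #  since it is no longer fetchable" -- data is thrown away
                continue
            out[tp] = records
        return out

f = ToyFetcher()
f.completed_fetches["TP1"] = ["r1", "r2"]  # prefetched before the pause
f.completed_fetches["TP2"] = ["r3", "r4"]
f.pause("TP2")                             # only S1 has demand right now
records = f.poll()
# records == {"TP1": ["r1", "r2"]}; TP2's prefetched records are gone and
# must be fetched again after resume(), wasting the earlier fetch.
```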
> > >
> > > I am writing this email to get some initial opinions on a KIP I have
> > > been thinking about. What if we gave the clients of the Consumer API
> > > a bit more control over what to do with this prefetched data? Two
> > > options I am wondering about:
> > >
> > > 1. Introduce a configuration setting, such as
> > > *"return-prefetched-data-for-paused-topic-partitions = false"* (have
> > > to think of a better name), which when enabled will return what is
> > > prefetched instead of throwing it away when *poll()* is called. Since
> > > the amount of data is bounded by the maximum size of the prefetch, we
> > > can cap the number of records returned. The client of the Consumer
> > > API can then be responsible for keeping that data around and using it
> > > when appropriate (i.e. when demand is present).
> > >
> > > 2. Introduce a facility to pass in a buffer into which the
> > > prefetched records are drained when *poll()* is called and paused
> > > partitions have some prefetched records.
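
To sketch what the two options might look like, here is a hypothetical toy
model (again Python, not the real Kafka Consumer API): the `return_paused`
flag mimics option 1's proposed config setting and the `spill` argument
mimics option 2's caller-supplied buffer. Both names are invented here for
illustration only.

```python
# Hypothetical sketch of the two proposed options; not an actual Kafka API.
from collections import defaultdict

class ToyConsumer:
    def __init__(self):
        self.prefetched = defaultdict(list)  # partition -> buffered records
        self.paused = set()

    def pause(self, tp):
        self.paused.add(tp)

    def poll(self, return_paused=False, spill=None):
        out = {}
        for tp in list(self.prefetched):
            records = self.prefetched.pop(tp)
            if tp not in self.paused:
                out[tp] = records
            elif return_paused:           # option 1: hand the records back
                out[tp] = records
            elif spill is not None:       # option 2: drain into the buffer
                spill[tp].extend(records)
            # else: current behaviour -- records are silently dropped
        return out

c = ToyConsumer()
c.prefetched["TP1"] = ["r1", "r2"]
c.prefetched["TP2"] = ["r3", "r4"]
c.pause("TP2")                            # only S1 has demand
spill = defaultdict(list)
live = c.poll(spill=spill)
# live == {"TP1": ["r1", "r2"]}; spill["TP2"] == ["r3", "r4"], ready to be
# served to the TP2 stream once demand returns, with no refetch needed.
```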
> > >
> > > Any opinions on the matter are welcome. Thanks a lot!
> > >
> > > Zahari Dichev
> > >
> >
>
