kafka-dev mailing list archives

From "Matthias J. Sax" <matth...@confluent.io>
Subject Re: Throwing away prefetched records optimisation.
Date Sun, 21 Oct 2018 21:59:24 GMT
It's spread out across multiple classes...

A good starting point is here:

It implements the main loop that polls, addRecordsToTasks() (ie, puts the
records into buffers), and processes records.
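In sketch form, the pattern looks like this (a simplified, runnable model of the loop, not the actual StreamThread code; partitions and records are reduced to plain Strings for illustration):

```java
import java.util.*;

// Simplified model of the Streams-style main loop: poll, put records
// into a per-partition buffer (addRecordsToTasks), then process from
// the buffers. This is a sketch of the pattern only.
public class MainLoopSketch {
    // one FIFO buffer per topic-partition
    final Map<String, Deque<String>> buffers = new HashMap<>();

    // stands in for consumer.poll(): records grouped by partition
    Map<String, List<String>> poll() {
        return Map.of("TP1", List.of("r1", "r2"));
    }

    void addRecordsToTasks(Map<String, List<String>> polled) {
        polled.forEach((tp, recs) ->
            buffers.computeIfAbsent(tp, k -> new ArrayDeque<>()).addAll(recs));
    }

    // take at most one buffered record per partition and process it
    void processOneRound() {
        for (Deque<String> buf : buffers.values()) {
            String record = buf.poll();
            if (record != null) {
                System.out.println("processing " + record);
            }
        }
    }

    public static void main(String[] args) {
        MainLoopSketch loop = new MainLoopSketch();
        loop.addRecordsToTasks(loop.poll());  // buffer everything polled
        loop.processOneRound();               // prints "processing r1"
        System.out.println(loop.buffers.get("TP1")); // [r2] still buffered
    }
}
```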

pause/resume is done here:




Not blocking forever makes sense of course. I did not mean it literally.
However, longer poll() blocking times should allow you to drain the
consumer buffer better.
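The "drain, then pause" idea could be sketched like this (a simplified, runnable model: the consumer is faked as a queue of poll results, and the method name is illustrative only):

```java
import java.util.*;

// Sketch of "drain before pause": keep polling (buffering every record)
// until a poll returns nothing for the partition, then pause it, so no
// prefetched data is left inside the consumer to be thrown away. The
// consumer is faked as a queue of poll results to keep this runnable.
public class DrainBeforePause {
    static List<String> drainThenPause(Deque<List<String>> pollResults) {
        List<String> buffer = new ArrayList<>();
        while (true) {
            // stands in for consumer.poll(longTimeout)
            List<String> records = pollResults.isEmpty()
                    ? List.of() : pollResults.poll();
            if (records.isEmpty()) {
                // fetch buffer drained: now it is safe to call pause()
                return buffer;
            }
            buffer.addAll(records);
        }
    }

    public static void main(String[] args) {
        // two fetches already in flight, then an empty poll
        Deque<List<String>> fakePolls = new ArrayDeque<>(List.of(
                List.of("r1", "r2"), List.of("r3"), List.of()));
        System.out.println(drainThenPause(fakePolls)); // [r1, r2, r3]
    }
}
```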


On 10/21/18 12:54 PM, Zahari Dichev wrote:
> Thanks for your feedback Matthias, Do you think you can point me to the
> part where Kafka streams deals with all of that so I can take a look. Will
> try and see whether your suggested approach works for us before trying to
> argue my point further. Just a small thing to mention though, blocking on
> the poll forever is not something that I am convinced we would like to do.
> Zahari
> On Sun, Oct 21, 2018 at 10:11 PM Matthias J. Sax <matthias@confluent.io>
> wrote:
>>> You have one consumer that is quite slow, so let's say it calls
>>> poll every 5 seconds, while you need to call poll every 1 second to
>>> issue a heartbeat (these are made up numbers of course).
>> Heartbeats are sent by a background thread. And there is a second
>> configuration `max.poll.interval.ms` that you can increase for a
>> slow consumer. Kafka Streams sets it to `Integer.MAX_VALUE` for
>> example.
>> For the remainder, I agree that using a buffer on top might not be a
>> perfect solution. That is what I meant by:
>>> (not 100% percent guarantee, depending on fetch size and
>>> max.poll.records etc, but with high probability)
>> However, I believe that if you configure the consumer accordingly,
>> you can drain the fetch buffer if you block on `poll()` forever.
>> I have to admit that I am not familiar with the details of
>> pipelining fetch requests, though. The general idea is to make sure
>> to drain the internal buffer of the consumer before you call
>> `pause()`.
>> Curious to learn why this would not work? How does the pipelining of
>> fetch requests work in detail?
>> -Matthias
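The buffer-on-top idea quoted above can also drive the pause/resume decision from buffer occupancy. A minimal sketch (hypothetical method names, partitions as plain Strings, not the actual Streams code):

```java
import java.util.*;

// Buffer-occupancy policy: pause partitions that still have buffered
// records waiting (no need to fetch more), resume partitions whose
// buffer has been fully drained. Names and types are illustrative only.
public class PauseResumePolicy {
    static Set<String> partitionsToPause(Map<String, Integer> buffered) {
        Set<String> pause = new TreeSet<>();
        buffered.forEach((tp, n) -> { if (n > 0) pause.add(tp); });
        return pause;
    }

    static Set<String> partitionsToResume(Map<String, Integer> buffered) {
        Set<String> resume = new TreeSet<>();
        buffered.forEach((tp, n) -> { if (n == 0) resume.add(tp); });
        return resume;
    }

    public static void main(String[] args) {
        Map<String, Integer> buffered = Map.of("TP1", 3, "TP2", 0);
        System.out.println(partitionsToPause(buffered));  // [TP1]
        System.out.println(partitionsToResume(buffered)); // [TP2]
    }
}
```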
>> On 10/19/18 1:36 PM, Zahari Dichev wrote:
>>> Hi there Matthias,
>>> Very useful thoughts indeed. I have considered the exact same
>>> approach but what worries me a bit is that I do not think that will
>>> certainly solve the issue. Imagine the following situation. You have
>>> one consumer that is quite slow, so let's say it calls poll every 5
>>> seconds, while you need to call poll every 1 second to issue a
>>> heartbeat (these are made up numbers of course).
>>> So our consumer calls poll at T0, grabs some data, puts it in a
>>> buffer and calls pause on the topic partition. The fetcher tries to
>>> pipeline and issues a fetch request, and at some point the data
>>> arrives. At that point we have some data in the buffer and we can do
>>> whatever we want with it, but there is also some data living within
>>> the consumer/fetcher. Approximately one second later we call poll
>>> again because we need to. We are not getting any data because our
>>> partition is paused, and this is good because if we got data we
>>> would not know what to do with it, as our client is still busy
>>> crunching the data from the first poll. So far so good. What happens
>>> though is that the pre-fetched data gets thrown away upon calling
>>> poll as it's "no longer fetchable...". Maybe I am not fully
>>> understanding your suggested approach, but I don't think it would
>>> solve this problem.
>>> Zahari
>>> On Fri, Oct 19, 2018 at 8:10 PM Matthias J. Sax <matthias@confluent.io>
>>> wrote:
>>>> Just my 2 cents.
>>>> I am not 100% sure if we would need to change the consumer for this.
>>>> While I still think, that KIP-349 might be valuable, it seems to be
>>>> complementary/orthogonal to the issue discussed here.
>>>> For Kafka Streams, we have a related scenario, and what Kafka Streams
>>>> does is to add its own buffer on top of the consumer. Thus, for each
>>>> `poll()` all data is put into this buffer, and now Streams can decide
>>>> which record to process first. For buffers that have data, we can call
>>>> `pause()` without losing fetched data (not a 100% guarantee, depending
>>>> on fetch size and max.poll.records etc, but with high probability),
>>>> and if a buffer gets empty we `resume()` partitions.
>>>> As Akka Streams builds on top of the consumer, it could implement a
>>>> similar pattern. Of course, one cannot use `auto.commit` on the
>>>> consumer, but commits need to be managed manually (ie, only data that
>>>> was taken out of the buffer and actually processed can be committed).
>>>> For the `MessageChooser` idea, I also still think it might be useful,
>>>> but it's unclear to me if this should be a consumer feature or built
>>>> on top of the consumer (maybe it could be a Streams feature, as
>>>> Streams is built on top of the consumer). Thoughts?
>>>> -Matthias
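The manual-commit bookkeeping mentioned above could be sketched like this (a simplified, runnable model: partitions are plain Strings; real code would use TopicPartition and consumer.commitSync()):

```java
import java.util.*;

// Manual-commit bookkeeping: only records actually taken out of the
// buffer and processed contribute to the committed offset. Kafka
// commits the offset of the NEXT record to read, hence offset + 1.
public class ManualCommitSketch {
    final Map<String, Long> offsetsToCommit = new HashMap<>();

    void markProcessed(String tp, long offset) {
        // keep the highest "next offset" seen per partition
        offsetsToCommit.merge(tp, offset + 1, Math::max);
    }

    public static void main(String[] args) {
        ManualCommitSketch c = new ManualCommitSketch();
        c.markProcessed("TP1", 41);
        c.markProcessed("TP1", 42);  // processed up to and including 42
        System.out.println(c.offsetsToCommit.get("TP1")); // 43
        // in real code: consumer.commitSync(offsets built from this map)
    }
}
```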
>>>> On 10/18/18 9:27 AM, Jan Filipiak wrote:
>>>>> The idea for you would be that MessageChooser could hang on to the
>>>>> prefetched messages.
>>>>> ccing cmccabe@apache.org
>>>>> @Collin
>>>>> just for you to see that MessageChooser is a powerful abstraction.
>>>>> :)
>>>>> Best jan
>>>>> On 18.10.2018 13:59, Zahari Dichev wrote:
>>>>>> Jan,
>>>>>> Quite insightful indeed. I think your propositions are valid.
>>>>>> Ryanne,
>>>>>> I understand that consumers are using a pull model... And yes,
>>>>>> indeed if a consumer is not ready for more records it surely
>>>>>> should not call *poll()*. Except that it needs to do so
>>>>>> periodically in order to indicate that it's alive. Forget about
>>>>>> the "backpressure", I guess I was wrong with phrasing this, so
>>>>>> let's not get caught up on it.
>>>>>> You say pause/resume can be used to prioritise certain
>>>>>> topics/partitions over others. And indeed this is the case. So
>>>>>> instead of thinking of it in terms of backpressure, let's put it
>>>>>> in a different way. The Akka streams connector would like to
>>>>>> prioritise certain topics over others, using one consumer
>>>>>> instance. On top of that, add the detail that the priorities
>>>>>> change quite frequently (which translates to calling pause/resume
>>>>>> frequently). So all that being said, what would be a proper way
>>>>>> to handle the situation without throwing the pre-fetched records
>>>>>> away when calling poll on a consumer that happens to have a topic
>>>>>> that was recently paused (and that might be un-paused soon)? Am I
>>>>>> the only one who considers that an actual problem with the use of
>>>>>> pause/resume? Not sure how to explain the situation in a better
>>>>>> way..
>>>>>> Zahari
>>>>>> Zahari
>>>>>> On Thu, Oct 18, 2018 at 9:46 AM Zahari Dichev <zaharidichev@gmail.com>
>>>>>> wrote:
>>>>>>> Thanks a lot Jan,
>>>>>>> I will read it.
>>>>>>> Zahari
>>>>>>> On Thu, Oct 18, 2018 at 9:31 AM Jan Filipiak <Jan.Filipiak@trivago.com>
>>>>>>> wrote:
>>>>>>>> especially my suggestions ;)
>>>>>>>> On 18.10.2018 08:30, Jan Filipiak wrote:
>>>>>>>>> Hi Zahari,
>>>>>>>>> would you be willing to scan through the KIP-349 discussion a
>>>>>>>>> little? I think it has suggestions that could be interesting
>>>>>>>>> for you.
>>>>>>>>> Best Jan
>>>>>>>>> On 16.10.2018 09:29, Zahari Dichev wrote:
>>>>>>>>>> Hi there Kafka developers,
>>>>>>>>>> I am currently trying to find a solution to an issue that has
>>>>>>>>>> been manifesting itself in the Akka streams implementation of
>>>>>>>>>> the Kafka connector. When it comes to consuming messages, the
>>>>>>>>>> connector relies heavily on the fact that we can pause and
>>>>>>>>>> resume partitions. In some situations when a single consumer
>>>>>>>>>> instance is shared among several streams, we might end up with
>>>>>>>>>> frequently pausing and unpausing a set of topic partitions,
>>>>>>>>>> which is the main facility that allows us to implement back
>>>>>>>>>> pressure. This however has certain disadvantages, especially
>>>>>>>>>> when there are two consumers that differ in terms of
>>>>>>>>>> processing speed.
>>>>>>>>>> To articulate the issue more clearly, imagine that a consumer
>>>>>>>>>> maintains assignments for two topic partitions *TP1* and
>>>>>>>>>> *TP2*. The consumer is shared by two streams - S1 and S2. So
>>>>>>>>>> effectively when we have demand from only one of the streams -
>>>>>>>>>> *S1*, we will pause one of the topic partitions - *TP2* - and
>>>>>>>>>> call *poll()* on the consumer to only retrieve records for the
>>>>>>>>>> demanded topic partition - *TP1*. The result of that is that
>>>>>>>>>> all the records that have been prefetched for *TP2* are now
>>>>>>>>>> thrown away by the fetcher ("*Not returning fetched records
>>>>>>>>>> for assigned partition TP2 since it is no longer fetchable*").
>>>>>>>>>> If we extrapolate that to multiple streams sharing the same
>>>>>>>>>> consumer, we might quickly end up in a situation where we
>>>>>>>>>> throw prefetched data away quite often. This does not seem
>>>>>>>>>> like the most efficient approach, and in fact produces quite a
>>>>>>>>>> lot of overlapping fetch requests, as illustrated in the
>>>>>>>>>> following issue:
>>>>>>>>>> https://github.com/akka/alpakka-kafka/issues/549
>>>>>>>>>> I am writing this email to get some initial opinions on a KIP
>>>>>>>>>> I was thinking about. What if we give the clients of the
>>>>>>>>>> Consumer API a bit more control over what to do with this
>>>>>>>>>> prefetched data? Two options I am wondering about:
>>>>>>>>>> 1. Introduce a configuration setting, such as
>>>>>>>>>> *"return-prefetched-data-for-paused-topic-partitions = false"*
>>>>>>>>>> (have to think of a better name), which when set to true will
>>>>>>>>>> return what is prefetched instead of throwing it away on
>>>>>>>>>> calling *poll()*. Since this is an amount of data that is
>>>>>>>>>> bounded by the maximum size of the prefetch, we can control
>>>>>>>>>> the maximum number of records returned. The client of the
>>>>>>>>>> consumer API can then be responsible for keeping that data
>>>>>>>>>> around and using it when appropriate (i.e. when demand is
>>>>>>>>>> present).
>>>>>>>>>> 2. Introduce a facility to pass in a buffer into which the
>>>>>>>>>> prefetched records are drained when poll is called and paused
>>>>>>>>>> partitions have some prefetched records.
>>>>>>>>>> Any opinions on the matter are welcome. Thanks a lot!
>>>>>>>>>> Zahari Dichev
