flume-user mailing list archives

From 金剑 <jinjia...@gmail.com>
Subject Re: Batchsize in kafka sink
Date Wed, 30 Sep 2015 01:06:27 GMT
You should still be cautious when using the async producer. The Kafka
async producer can throw messages away silently in some situations. For
example, when our Kafka broker ran out of disk, the producer simply
discarded every message after that point. We had configured a spillable
memory channel, but of course the producer is not aware of it. I took a
look at the source code, and that is indeed the current logic; you don't
even get an exception.
If throughput is really a concern, it is better to run a test first.
Batching in Flume may be enough.
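
If you stay with the synchronous producer, here is a minimal sketch of
such a setup (assuming Flume 1.6's SPILLABLEMEMORY channel and KafkaSink;
the agent, channel, and sink names and the paths are placeholders):

  agent.channels.c1.type = SPILLABLEMEMORY
  # events beyond memoryCapacity spill over to the file-backed store
  agent.channels.c1.memoryCapacity = 10000
  agent.channels.c1.overflowCapacity = 1000000
  agent.channels.c1.checkpointDir = /var/lib/flume/checkpoint
  agent.channels.c1.dataDirs = /var/lib/flume/data

  agent.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
  agent.sinks.k1.channel = c1
  agent.sinks.k1.brokerList = broker1:9092
  agent.sinks.k1.topic = mytopic
  agent.sinks.k1.batchSize = 100
  # keep the default synchronous producer so a failed send raises an
  # exception and the batch rolls back into the channel
  agent.sinks.k1.kafka.producer.type = sync

With the synchronous producer the sink commits the channel transaction
only after the send returns, so a broker failure rolls the batch back
into the channel instead of disappearing silently.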

Best Regards!

Jian Jin

2015-09-27 22:03 GMT-04:00 Sharninder <sharninder@gmail.com>:

> That does make sense. Thanks Gonzalo. We do use the async producer with
> the default kafka batch.num.messages. We don't care about a few messages
> being lost in the event of a crash or something, so I think we'll continue
> using the async producer, but picking up X messages in a single
> transaction will surely help with reducing IO on the Flume server.
>
> Thanks a lot.
>
> --
> Sharninder
>
>
> On Sun, Sep 27, 2015 at 3:16 PM, Gonzalo Herreros <gherreros@gmail.com>
> wrote:
>
>> There are subtle but significant differences.
>>
>> When you configure "batchSize" on the sink, you are specifying how many
>> messages are taken from the channel in a single transaction (as in any
>> other sink).
>> The Kafka property "batch.num.messages" (specified in the Flume config
>> as "kafka.batch.num.messages"), on the other hand, sets the batch size
>> an asynchronous producer uses when sending messages to the broker. By
>> default the producer is synchronous, so that property does nothing.
>>
>> If you use the synchronous producer (the default), the messages taken
>> from the channel as a batch (100 by default) are sent together to the
>> Kafka broker.
>> If you switch the producer to async it gets more complicated: by
>> default "batch.num.messages" is 200, so the sink will take 100 messages
>> from the channel and commit that transaction, but those messages are
>> kept in the producer's memory until another 100 arrive (so there is a
>> risk of losing messages).
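>>
>> To make the two knobs concrete, here is a sketch of the relevant sink
>> properties (assuming Flume 1.6's KafkaSink with the old 0.8.x producer
>> settings; the agent and sink names are placeholders):
>>
>>   agent.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
>>   agent.sinks.k1.brokerList = broker1:9092
>>   agent.sinks.k1.topic = mytopic
>>   # Flume side: events taken per channel transaction
>>   agent.sinks.k1.batchSize = 100
>>   # Kafka side: only takes effect with the async producer
>>   agent.sinks.k1.kafka.producer.type = async
>>   agent.sinks.k1.kafka.batch.num.messages = 200
>>
>> With those numbers the sink commits two channel transactions before the
>> async producer flushes anything to the broker, which is exactly the
>> window in which messages can be lost.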
>>
>> I would stay away from the async producer in a Flume sink, because you
>> want the sink to control the pace (a file or memory channel will be
>> faster anyway), so it doesn't need to buffer in memory and risk message
>> loss. An async producer is useful when the client is an online
>> application that you don't want to delay.
>>
>> Answering your question: if you don't specify any batching properties,
>> the sink will deliver messages in batches of 100 by default, which is
>> probably fine in most cases.
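>>
>> In other words, a sink block with no batching properties at all just
>> relies on those defaults (names here are placeholders):
>>
>>   agent.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
>>   agent.sinks.k1.brokerList = broker1:9092
>>   agent.sinks.k1.topic = mytopic
>>   # batchSize defaults to 100; the producer defaults to synchronous
>>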
>> Hope that makes sense.
>>
>> Regards,
>> Gonzalo
>>
>>
>> On 26 September 2015 at 05:19, Sharninder <sharninder@gmail.com> wrote:
>>
>>> Anyone?
>>>
>>> > On 25-Sep-2015, at 3:51 PM, Sharninder <sharninder@gmail.com> wrote:
>>> >
>>> > Hi,
>>> >
>>> > We want to move to the built-in Kafka sink from our own custom
>>> implementation, and I have a question about the batchSize config parameter.
>>> >
>>> > Looking at the code of the sink, I can tell that batchSize is used
>>> to construct the list of keyed messages fed to the producer.
>>> >
>>> > My question is: what is the difference between this variable and
>>> the Kafka batch.num.messages parameter?
>>> >
>>> > Is the Flume parameter necessary?
>>> >
>>> > --
>>> > Sharninder
>>> >
>>> >
>>>
>>
>>
>
>
> --
> Sharninder
>
>
>
