Mailing-List: contact jira-help@kafka.apache.org; run by ezmlm
Precedence: bulk
Reply-To: jira@kafka.apache.org
Date: Thu, 27 Jul 2017 21:47:00 +0000 (UTC)
From: "Jiangjie Qin (JIRA)" <jira@apache.org>
To: jira@kafka.apache.org
Message-ID: <JIRA.13088865.1500576070000.31113.1501192020113@Atlassian.JIRA>
In-Reply-To: <JIRA.13088865.1500576070000@Atlassian.JIRA>
References: <JIRA.13088865.1500576070000@Atlassian.JIRA> <JIRA.13088865.1500576070418@jira-lw-us.apache.org>
Subject: [jira] [Commented] (KAFKA-5621) The producer should retry expired
 batches when retries are enabled
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Thu, 27 Jul 2017 21:47:17 -0000


    [ https://issues.apache.org/jira/browse/KAFKA-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16103967#comment-16103967 ] 

Jiangjie Qin commented on KAFKA-5621:
-------------------------------------

[~apurva] I am trying to understand the following statement
{quote}
On the other hand, for an application, partitions are not really independent (and especially so if you use transactions). If one partition is down, it makes sense to wait for it to be ready before continuing. So we would want to handle as many errors internally as possible. It would mean blocking sends once the queue is too large and not expiring batches in the queue. This simplifies the application programming model.
{quote}

Is it really different between applications and MM when a partition cannot make progress? It seems that in both cases the users would want to know about it at some point and handle it. I think retries also serve this purpose; otherwise we may block forever. If I understand correctly, what this ticket proposes is just to extend the batch expiration time from request.timeout.ms to request.timeout.ms * retries, whereas KIP-91 proposes an additional explicit configuration for the batch expiration time instead of deriving it from the request timeout. The two do not seem very different, except that KIP-91 decouples the configurations from each other.
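
To make the comparison concrete, here is a minimal Java sketch of the two ways of arriving at a batch expiration time. The class and method names are hypothetical and are not producer APIs, and the 30 s request timeout, 3 retries and 120 s explicit value are assumed numbers chosen only for illustration.

```java
import java.time.Duration;

// Hypothetical illustration only; none of these names exist in the Kafka producer.
class BatchExpirySketch {

    // This ticket, as read above: expiration is derived from the
    // request timeout and the retry count.
    static Duration derivedExpiry(Duration requestTimeout, int retries) {
        return requestTimeout.multipliedBy(retries);
    }

    // KIP-91 style: expiration is its own setting and does not
    // depend on request.timeout.ms or retries.
    static Duration explicitExpiry(Duration configuredExpiry) {
        return configuredExpiry;
    }

    public static void main(String[] args) {
        Duration requestTimeout = Duration.ofMillis(30_000); // assumed value
        int retries = 3;                                     // assumed value

        // 30000 ms * 3 = 90000 ms
        System.out.println(derivedExpiry(requestTimeout, retries).toMillis());

        // The explicit setting can be changed without touching the two above.
        System.out.println(explicitExpiry(Duration.ofMillis(120_000)).toMillis());
    }
}
```

In the derived form, changing either request.timeout.ms or retries also moves the expiration time; in the explicit form the three values can be tuned independently.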

KAFKA-5494 is a good improvement. Regarding the error/anomaly handling, if we are willing to make public interface changes given that the next release would be 1.0.0, I am thinking of the following configurations:
1. request.timeout.ms - needed for wire timeout
2. expiry.ms - the expiration time for a message. This is the approximate time after which a message is expired if it cannot be sent out, for whatever reason, once it is ready for sending (i.e. the batch is ready). In the worst case a message would be expired (expiry.ms + request.timeout.ms) after it becomes ready for sending (note that the user defines when a message is ready for sending by specifying linger.ms and batch.size). expiry.ms should be longer than request.timeout.ms, e.g. 2x or 3x.

The following configs are optional and will be decided by the producer if not specified:
3. min.retries - When this config is specified, the producer will retry at least min.retries times, even if that causes the message to stay in the producer longer than expiry.ms. This avoids the case where the producer cannot retry even once. When retrying, the producer will do exponential backoff internally. This could default to 1.
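
A rough Java sketch of how these three settings could interact, under stated assumptions: expiry.ms and min.retries are only proposed in this comment and are not existing producer configs, every class and method name below is hypothetical, and the doubling backoff with a cap is just one possible reading of "exponential backoff".

```java
// Hypothetical model of the proposed settings; not Kafka producer code.
class ProposedExpirySketch {
    final long requestTimeoutMs; // request.timeout.ms (wire timeout)
    final long expiryMs;         // proposed expiry.ms
    final int minRetries;        // proposed min.retries

    ProposedExpirySketch(long requestTimeoutMs, long expiryMs, int minRetries) {
        // The proposal says expiry.ms should be longer than request.timeout.ms;
        // rejecting other values is a choice made for this sketch.
        if (expiryMs <= requestTimeoutMs) {
            throw new IllegalArgumentException("expiry.ms should exceed request.timeout.ms");
        }
        this.requestTimeoutMs = requestTimeoutMs;
        this.expiryMs = expiryMs;
        this.minRetries = minRetries;
    }

    // Worst case: a request is sent just before expiry.ms elapses and
    // then takes the full request timeout to fail.
    long worstCaseExpiryMs() {
        return expiryMs + requestTimeoutMs;
    }

    // A batch may be expired only after expiry.ms has passed since it became
    // ready AND it has been retried at least min.retries times.
    boolean shouldExpire(long msSinceReady, int retriesSoFar) {
        return msSinceReady >= expiryMs && retriesSoFar >= minRetries;
    }

    // Assumed backoff shape: baseMs doubled per attempt, capped at maxMs.
    static long backoffMs(long baseMs, int attempt, long maxMs) {
        int shift = Math.min(Math.max(attempt, 0), 30); // keep the shift bounded
        return Math.min(maxMs, baseMs * (1L << shift));
    }

    public static void main(String[] args) {
        // Assumed values: request.timeout.ms = 30 s, expiry.ms = 3x that, min.retries = 1.
        ProposedExpirySketch cfg = new ProposedExpirySketch(30_000, 90_000, 1);
        System.out.println(cfg.worstCaseExpiryMs());        // 120000
        System.out.println(cfg.shouldExpire(100_000, 0));   // false: not yet retried once
        System.out.println(cfg.shouldExpire(100_000, 1));   // true
    }
}
```

With min.retries = 1, a batch that is already past expiry.ms but has never been retried is kept for one more attempt, which is the case item 3 is meant to cover.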

Hopefully this gives us a cleaner configuration set for the producer.

> The producer should retry expired batches when retries are enabled
> ------------------------------------------------------------------
>
>                 Key: KAFKA-5621
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5621
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Apurva Mehta
>             Fix For: 1.0.0
>
>
> Today, when a batch is expired in the accumulator, a {{TimeoutException}} is raised to the user.
> It might be better for the producer to retry the expired batch up to the configured number of retries. This is more intuitive from the user's point of view.
> Further, the proposed behavior makes it easier for applications like mirror maker to provide ordering guarantees even when batches expire. Today, they would resend the expired batch and it would get added to the back of the queue, causing the output ordering to be different from the input ordering.

