Date: Thu, 27 Jul 2017 21:47:00 +0000 (UTC)
From: "Jiangjie Qin (JIRA)"
To: jira@kafka.apache.org
Subject: [jira] [Commented] (KAFKA-5621) The producer should retry expired batches when retries are enabled

    [ https://issues.apache.org/jira/browse/KAFKA-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16103967#comment-16103967 ]

Jiangjie Qin commented on KAFKA-5621:
-------------------------------------

[~apurva] I am trying to understand the following statement:
{quote}
On the other hand, for an application, partitions are not really independent (and especially so if you use transactions). If one partition is down, it makes sense to wait for it to be ready before continuing. So we would want to handle as many errors internally as possible. It would mean blocking sends once the queue is too large and not expiring batches in the queue. This simplifies the application programming model.
{quote}
Is it really different for applications and MirrorMaker (MM) when a partition cannot make progress? It seems that in both cases the users would want to know about it at some point and handle it. I think retries also serve this purpose; otherwise we may block forever.

If I understand right, what this ticket is proposing is just to extend the batch expiration time from request.timeout.ms to request.timeout.ms * retries. KIP-91 proposes having an additional explicit configuration for that batch expiration time instead of deriving it from the request timeout. The two do not seem very different, except that KIP-91 decouples the configurations from each other.

KAFKA-5494 is a good improvement. Regarding the error/anomaly handling, if we are willing to make public interface changes given that the next release would be 1.0.0, I am thinking of the following configurations:
1. request.timeout.ms - needed for the wire timeout.
2. expiry.ms - the expiration time for a message. This is an approximate time to expire a message if it cannot be sent out for whatever reason after it is ready for sending (the batch is ready). In the worst case a message would be expired (expiry.ms + request.timeout.ms) after it is ready for sending (note that the user defines when the message is ready for sending by specifying linger.ms and batch.size). expiry.ms should be longer than request.timeout.ms, e.g. 2x or 3x.

The following configs are optional and will be decided by the producer if not specified:
3. min.retries - When this config is specified, the producer will retry at least min.retries times even if that causes the message to stay in the producer longer than expiry.ms. This is to avoid the case where the producer cannot retry even once. When retrying, the producer will do exponential backoff internally. This could default to 1.

Hopefully this gives us a cleaner configuration set for the producer.

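To make the arithmetic concrete, here is a minimal sketch of how these settings might interact. It is only one reading of the proposal, not producer code: expiry.ms and min.retries are the names suggested in this comment rather than existing producer configs, and `ProducerTimeouts`, `derived_expiry_ms` and `should_expire` are hypothetical helpers invented for illustration. Exponential backoff is not modeled.

```python
from dataclasses import dataclass


def derived_expiry_ms(request_timeout_ms: int, retries: int) -> int:
    """Batch expiration as this ticket would derive it: request.timeout.ms * retries."""
    return request_timeout_ms * retries


@dataclass(frozen=True)
class ProducerTimeouts:
    """Hypothetical holder for the proposed settings (all durations in milliseconds)."""
    request_timeout_ms: int  # request.timeout.ms, the wire timeout
    expiry_ms: int           # proposed expiry.ms, counted from when the batch is ready
    min_retries: int = 1     # proposed min.retries, suggested default of 1

    def __post_init__(self) -> None:
        # The comment recommends expiry.ms be longer than request.timeout.ms (e.g. 2x or 3x).
        if self.expiry_ms <= self.request_timeout_ms:
            raise ValueError("expiry.ms should be longer than request.timeout.ms")
        if self.min_retries < 0:
            raise ValueError("min.retries must not be negative")

    def worst_case_expiry_ms(self) -> int:
        # A request sent just before expiry.ms elapses may still use a full request timeout.
        return self.expiry_ms + self.request_timeout_ms


def should_expire(cfg: ProducerTimeouts, ms_since_ready: int, retries_done: int) -> bool:
    """Expire only once expiry.ms has passed AND min.retries retries were attempted."""
    return ms_since_ready >= cfg.expiry_ms and retries_done >= cfg.min_retries


if __name__ == "__main__":
    cfg = ProducerTimeouts(request_timeout_ms=30_000, expiry_ms=90_000)
    print(cfg.worst_case_expiry_ms())         # 120000
    print(should_expire(cfg, 100_000, 0))     # False: no retry has been attempted yet
    print(should_expire(cfg, 100_000, 1))     # True
```

With request.timeout.ms = 30 s, the ticket's derived expiry with 3 retries and an explicit expiry.ms of 90 s give the same bound, which is the sense in which the two approaches are close; the explicit setting simply lets the two values be tuned independently.
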
> The producer should retry expired batches when retries are enabled
> ------------------------------------------------------------------
>
>                 Key: KAFKA-5621
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5621
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Apurva Mehta
>             Fix For: 1.0.0
>
>
> Today, when a batch is expired in the accumulator, a {{TimeoutException}} is raised to the user.
> It might be better for the producer to retry the expired batch up to the configured number of retries. This is more intuitive from the user's point of view.
> Further, the proposed behavior makes it easier for applications like mirror maker to provide ordering guarantees even when batches expire. Today, they would resend the expired batch and it would get added to the back of the queue, causing the output ordering to be different from the input ordering.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)