kafka-jira mailing list archives

From "Apurva Mehta (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (KAFKA-5477) TransactionalProducer sleeps unnecessarily long during back to back transactions
Date Wed, 21 Jun 2017 00:48:00 GMT

     [ https://issues.apache.org/jira/browse/KAFKA-5477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apurva Mehta updated KAFKA-5477:
    Priority: Blocker  (was: Major)

> TransactionalProducer sleeps unnecessarily long during back to back transactions
> --------------------------------------------------------------------------------
>                 Key: KAFKA-5477
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5477
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions:
>            Reporter: Apurva Mehta
>            Assignee: Apurva Mehta
>            Priority: Blocker
>             Fix For:
> I am running some perf tests for EOS, and there is a severe performance impact with our
> default `retry.backoff.ms` setting.
> Here is the issue:
> # When we commit a transaction, the producer sends an `EndTxn` request to the coordinator.
> The coordinator writes the `PrepareCommit` message to the transaction log and then returns
> the response to the client. It writes the transaction markers and the final `CompleteCommit`
> message asynchronously.
> # In the meantime, if the client starts another transaction, it will send an `AddPartitions`
> request on the next `Sender.run` loop. If the markers haven't been written yet, the
> coordinator will return a retriable `CONCURRENT_TRANSACTIONS` error to the client.
> # The current behavior in the producer is to sleep for `retryBackoffMs` before retrying
> the request. The current default is 100ms, so the producer will sleep for 100ms before
> resending the `AddPartitions` request. This puts a floor on the latency for back-to-back
> transactions (see the sketch below).
> The impact: back-to-back transactions (the typical use case for Streams) would have a
> latency floor of 100ms, i.e. at most roughly ten transactions per second per producer.
> Ideally, we don't want to sleep the full 100ms in this particular case, because the
> retry is 'expected'.
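> To see this floor in practice, here is a minimal sketch (broker address, topic name, and
> transactional id are placeholder values) that runs back-to-back transactions and prints the
> per-transaction latency; with the default `retry.backoff.ms` of 100ms, every transaction
> after the first should take at least ~100ms:
> {code:java}
> import java.util.Properties;
> import org.apache.kafka.clients.producer.KafkaProducer;
> import org.apache.kafka.clients.producer.ProducerConfig;
> import org.apache.kafka.clients.producer.ProducerRecord;
>
> public class BackToBackTxnDemo {
>     public static void main(String[] args) {
>         Properties props = new Properties();
>         props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
>         props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
>                   "org.apache.kafka.common.serialization.StringSerializer");
>         props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
>                   "org.apache.kafka.common.serialization.StringSerializer");
>         // Setting a transactional.id makes the producer transactional.
>         props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "perf-test-txn");
>
>         try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
>             producer.initTransactions();
>             for (int i = 0; i < 100; i++) {
>                 long start = System.nanoTime();
>                 producer.beginTransaction();
>                 // The first send of a new transaction triggers the `AddPartitions`
>                 // request. If the previous transaction's markers are not yet
>                 // written, the coordinator answers CONCURRENT_TRANSACTIONS and
>                 // the producer backs off for retry.backoff.ms before retrying.
>                 producer.send(new ProducerRecord<>("test-topic", "key", "value-" + i));
>                 producer.commitTransaction();
>                 long elapsedMs = (System.nanoTime() - start) / 1_000_000;
>                 System.out.println("txn " + i + " took " + elapsedMs + " ms");
>             }
>         }
>     }
> }
> {code}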
> The options are: 
> # do nothing, and let Streams override `retry.backoff.ms` in its producer to 10ms when
> EOS is enabled (since they have a HOTFIX patch out anyway); see the config sketch below.
> # Introduce a special, non-configurable `transactionRetryBackoffMs` variable and hard-code
> it to a low value which applies to all transactional requests.
> # do nothing and fix it properly in 
> Option 2 as stated is a one-line fix. If we want to lower the backoff just for this
> particular error, it would be a slightly bigger change (10-15 lines).
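> For reference, option 1 amounts to a plain client-side config override; a minimal sketch
> (the transactional id is a placeholder value):
> {code:java}
> import java.util.Properties;
> import org.apache.kafka.clients.producer.ProducerConfig;
>
> Properties props = new Properties();
> props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "my-txn-id");
> // Lower the global retry backoff so the 'expected' CONCURRENT_TRANSACTIONS
> // retry costs ~10ms instead of ~100ms. Note this shortens the backoff for
> // *all* retriable errors, not just this one.
> props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 10);
> {code}
> That global effect is the downside of option 1: genuine failures would also be retried more
> aggressively, which is why a backoff scoped to transactional requests (option 2), or to this
> particular error, may be preferable.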
