kafka-jira mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Di Shang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (KAFKA-6706) NETWORK_EXCEPTION and REQUEST_TIMED_OUT in mirrormaker producer after 1.0 broker upgrade
Date Thu, 29 Mar 2018 05:33:00 GMT

     [ https://issues.apache.org/jira/browse/KAFKA-6706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Di Shang updated KAFKA-6706:
----------------------------
       Priority: Blocker  (was: Major)
    Description: 
We have 2 clusters A and B with 4 brokers each, we use mirrormaker to replicate topics from A to B. 
 We recently upgraded our brokers from 0.10.2.0 to 1.0.0, after the upgrade we started seeing the mirrormaker task showing producer errors and intermittently dying. 
 We tried using 1.0.0 and 0.10.2.0 mirrormaker, both have the same problem. Downgrading cluster B brokers back to 0.10.2.0 and the problem went away, so we think it's a server side problem.

There are 2 types of errors: REQUEST_TIMED_OUT and NETWORK_EXCEPTION. For testing, I used a topic *logging* with 20 partitions and 3 replicas (same on cluster A and B), the source topic has 50+ million msg.

(this is from mirrormaker 1.0 at info level, the 0.10.2.0 log is very similar)
{noformat}
22 Mar 2018 02:16:07.407 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 35122 on topic-partition logging-7, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
 22 Mar 2018 02:17:49.731 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 51572 on topic-partition logging-7, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
 22 Mar 2018 02:18:33.903 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 57785 on topic-partition logging-5, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
 22 Mar 2018 02:21:21.399 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 85406 on topic-partition logging-18, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
 22 Mar 2018 02:25:22.278 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 128047 on topic-partition logging-5, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
 22 Mar 2018 02:26:17.154 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 137049 on topic-partition logging-18, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
 22 Mar 2018 02:27:57.358 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 153976 on topic-partition logging-5, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
 22 Mar 2018 02:27:57.779 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 154077 on topic-partition logging-2, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
 22 Mar 2018 02:27:57.780 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 154077 on topic-partition logging-10, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
 22 Mar 2018 02:27:57.780 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 154077 on topic-partition logging-18, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
 22 Mar 2018 02:27:57.781 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 154077 on topic-partition logging-14, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
 22 Mar 2018 02:27:57.781 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 154077 on topic-partition logging-6, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
 22 Mar 2018 02:29:12.378 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 163175 on topic-partition logging-18, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
 22 Mar 2018 02:31:47.067 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 192333 on topic-partition logging-17, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
 22 Mar 2018 02:32:26.198 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 195400 on topic-partition logging-18, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
 22 Mar 2018 02:33:16.280 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 206708 on topic-partition logging-18, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
 22 Mar 2018 02:33:46.428 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 206830 on topic-partition logging-18, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
 22 Mar 2018 12:53:50.848 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 734439 on topic-partition logging-3, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
 22 Mar 2018 16:21:45.557 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 941307 on topic-partition logging-6, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT{noformat}
The NETWORK_EXCEPTION always comes in group of 5 with partition leaders on the same broker (there are 5 partition leader/broker). 
 The REQUEST_TIMED_OUT seems to be from random individual partition.

I turned on trace and it showed how the NETWORK_EXCEPTION is causing the task to die.
 (I filtered out the important lines in this log segment)
{noformat}
22 Mar 2018 02:06:20.154 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.NetworkClient trace(line:135) [Producer clientId=producer-1] Completed receive from node 44 for PRODUCE with correlation id 71596, received {responses=[{topic=logging,partition_responses=[{partition=17,error_code=0,base_offset=28466678,log_append_time=-1,log_start_offset=24408600},{partition=1,error_code=0,base_offset=11611076,log_append_time=-1,log_start_offset=7562939},{partition=5,error_code=0,base_offset=28467700,log_append_time=-1,log_start_offset=24416182},{partition=9,error_code=0,base_offset=32762076,log_append_time=-1,log_start_offset=30108807},{partition=13,error_code=0,base_offset=11612712,log_append_time=-1,log_start_offset=7564490}]}],throttle_time_ms=0}
22 Mar 2018 02:06:20.154 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.Sender trace(line:129) [Producer clientId=producer-1] Received produce response from node 44 with correlation id 71596
22 Mar 2018 02:06:20.154 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.ProducerBatch done(line:171) Successfully produced messages to logging-9 with base offset 32762076.
22 Mar 2018 02:06:20.154 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.ProducerBatch done(line:171) Successfully produced messages to logging-13 with base offset 11612712.
22 Mar 2018 02:06:20.154 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.ProducerBatch done(line:171) Successfully produced messages to logging-17 with base offset 28466678.
22 Mar 2018 02:06:20.155 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.ProducerBatch done(line:171) Successfully produced messages to logging-1 with base offset 11611076.
22 Mar 2018 02:06:20.155 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.ProducerBatch done(line:171) Successfully produced messages to logging-5 with base offset 28467700.
22 Mar 2018 02:06:20.155 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.Sender trace(line:123) [Producer clientId=producer-1] Nodes with data ready to send: [broker44:9020 (id: 44 rack: null)]
22 Mar 2018 02:06:20.155 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.NetworkClient trace(line:135) [Producer clientId=producer-1] Sending PRODUCE {acks=-1,timeout=30000,partitionSizes=[logging-1=647,logging-9=645,logging-5=647,logging-13=645]} with correlation id 71599 to node 44
22 Mar 2018 02:06:20.155 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.Sender trace(line:129) [Producer clientId=producer-1] Sent produce request to 44: (type=ProduceRequest, acks=-1, timeout=30000, partitionRecords=({logging-1=[(record=DefaultRecord(offset=0, timestamp=1521353381705, key=0 bytes, value=1048 bytes))], logging-9=[(record=DefaultRecord(offset=0, timestamp=1521353381983, key=0 bytes, value=1048 bytes))], logging-5=[(record=DefaultRecord(offset=0, timestamp=1521353380881, key=0 bytes, value=1048 bytes))], logging-13=[(record=DefaultRecord(offset=0, timestamp=1521353381288, key=0 bytes, value=1048 bytes))]}), transactionalId=''
22 Mar 2018 02:06:20.156 [mirrormaker-thread-0] TRACE org.apache.kafka.clients.producer.KafkaProducer trace(line:135) [Producer clientId=producer-1] Sending record ProducerRecord(topic=logging, partition=null, headers=RecordHeaders(headers = [], isReadOnly = true), key=null, value=[B@cf56dcea, timestamp=1521353382731) with callback kafka.tools.MirrorMaker$MirrorMakerProducerCallback@e7d47b69 to topic logging partition 17
22 Mar 2018 02:06:20.156 [mirrormaker-thread-0] TRACE org.apache.kafka.clients.producer.internals.RecordAccumulator trace(line:135) [Producer clientId=producer-1] Allocating a new 16384 byte message buffer for topic logging partition 17
22 Mar 2018 02:06:20.156 [mirrormaker-thread-0] TRACE org.apache.kafka.clients.producer.KafkaProducer trace(line:129) [Producer clientId=producer-1] Waking up the sender since topic logging partition 17 is either full or getting a new batch
22 Mar 2018 02:06:20.164 [mirrormaker-thread-0] TRACE org.apache.kafka.clients.producer.KafkaProducer trace(line:135) [Producer clientId=producer-1] Sending record ProducerRecord(topic=logging, partition=null, headers=RecordHeaders(headers = [], isReadOnly = true), key=null, value=[B@d6e47439, timestamp=1521353385041) with callback kafka.tools.MirrorMaker$MirrorMakerProducerCallback@60717e86 to topic logging partition 17
22 Mar 2018 02:06:20.172 [mirrormaker-thread-0] TRACE org.apache.kafka.clients.producer.KafkaProducer trace(line:135) [Producer clientId=producer-1] Sending record ProducerRecord(topic=logging, partition=null, headers=RecordHeaders(headers = [], isReadOnly = true), key=null, value=[B@613a02ed, timestamp=1521353387109) with callback kafka.tools.MirrorMaker$MirrorMakerProducerCallback@563a082b to topic logging partition 17
...
30s later (request.timeout.ms=30000)
...
22 Mar 2018 02:06:50.156 [kafka-producer-network-thread | producer-1] DEBUG org.apache.kafka.clients.NetworkClient debug(line:183) [Producer clientId=producer-1] Disconnecting from node 44 due to request timeout.
22 Mar 2018 02:06:50.156 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.NetworkClient trace(line:135) [Producer clientId=producer-1] Cancelled request {acks=-1,timeout=30000,partitionSizes=[logging-1=647,logging-9=645,logging-5=647,logging-13=645]} with correlation id 71599 due to node 44 being disconnected
22 Mar 2018 02:06:50.156 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.Sender trace(line:129) [Producer clientId=producer-1] Cancelled request with header RequestHeader(apiKey=PRODUCE, apiVersion=5, clientId=producer-1, correlationId=71599) due to node 44 being disconnected
22 Mar 2018 02:06:50.156 [kafka-producer-network-thread | producer-1] WARN  org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 71599 on topic-partition logging-1, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
22 Mar 2018 02:06:50.157 [kafka-producer-network-thread | producer-1] WARN  org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 71599 on topic-partition logging-9, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
22 Mar 2018 02:06:50.157 [kafka-producer-network-thread | producer-1] WARN  org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 71599 on topic-partition logging-5, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
22 Mar 2018 02:06:50.157 [kafka-producer-network-thread | producer-1] WARN  org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 71599 on topic-partition logging-13, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
22 Mar 2018 02:06:50.157 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.Sender trace(line:123) [Producer clientId=producer-1] Expired 1 batches in accumulator
22 Mar 2018 02:06:50.157 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.ProducerBatch done(line:174) Failed to produce messages to logging-17.
org.apache.kafka.common.errors.TimeoutException: Expiring 25 record(s) for logging-17: 30001 ms has passed since batch creation plus linger time
22 Mar 2018 02:06:50.162 [kafka-producer-network-thread | producer-1] ERROR org.apache.kafka.clients.producer.internals.ErrorLoggingCallback onCompletion(line:52) Error when sending message to topic logging with key: null, value: 1047 bytes with error:
org.apache.kafka.common.errors.TimeoutException: Expiring 25 record(s) for logging-17: 30001 ms has passed since batch creation plus linger time
22 Mar 2018 02:06:50.162 [kafka-producer-network-thread | producer-1] INFO  kafka.tools.MirrorMaker$ info(line:72) Closing producer due to send failure.
22 Mar 2018 02:06:50.163 [kafka-producer-network-thread | producer-1] INFO  org.apache.kafka.clients.producer.KafkaProducer info(line:341) [Producer clientId=producer-1] Closing the Kafka producer with timeoutMillis = 0 ms.
22 Mar 2018 02:06:50.163 [kafka-producer-network-thread | producer-1] INFO  org.apache.kafka.clients.producer.KafkaProducer info(line:341) [Producer clientId=producer-1] Proceeding to force close the producer since pending requests could not be completed within timeout 0 ms.
22 Mar 2018 02:06:50.164 [kafka-producer-network-thread | producer-1] DEBUG org.apache.kafka.clients.producer.KafkaProducer debug(line:177) [Producer clientId=producer-1] Kafka producer has been closed
{noformat}
 The first few lines shows that the producer just received a response from its previous produce request to broker 44 with partitions \{1,5,9,13,17} (leaders). 
 The next produce request it sent only contained batches for partitions \{1,5,9,13}, however 1ms later new data arrived for partition 17 but since it missed the current request to that broker it had to sit in the producer and wait for the next send round. 
 Somehow the produce request for \{1,5,9,13} never got a response and timed out. Normally these batches can be retried and mirrormaker will keep working, however because of the leftover batch for partition 17 expired, mirrormaker detected data loss and will shutdown.

So I think the root problem is the REQUEST_TIMED_OUT where the server intermittently times out for the produce request after upgrading to 1.0. 
 In the lucky case where all batches can be retried (they seem to always succeed on the first retry anyway), it can randomly introduce 30s of latency to data replication, and if unlucky a batch get left out and expires, the task immediately dies. 
 We observed that this usually happens under high load with large number of msg from the source topic to mirror (I can reliably recreate the errors by restarting the task with a new consumer group id and auto.offset.reset=earliest so it got 50 mil msg to mirror from the start of the topic)

Increasing producer linger.ms and batch.size seems to some degree reduce the error rate under same heavy workload.

I tried to enable trace on the server but all I see are logs (massive) from ReplicaFetcher and lines like this:
{noformat}
Bytes written as part of multi-send call: 2431, total bytes written so far: 2431, expected bytes to write: 2431{noformat}
Is there any log on the server side that can be correlated to the producer request and help debugging this?

This is the trace log config I tried to use on the server but couldn't see anything useful:
{noformat}
 rootLogger.level=trace
 appender.console.type=Console
 appender.console.name=STDOUT
 rootLogger.appenderRef.stdout.ref=STDOUT{noformat}
 

Here's the mirrormaker producer config:
{noformat}
	acks = all
	batch.size = 16384
	bootstrap.servers = [...]
	buffer.memory = 33554432
	client.id = 
	compression.type = gzip
	connections.max.idle.ms = 540000
	enable.idempotence = false
	interceptor.classes = null
	key.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer
	linger.ms = 0
	max.block.ms = 9223372036854775807
	max.in.flight.requests.per.connection = 1
	max.request.size = 2097152
	metadata.max.age.ms = 300000
	metric.reporters = []
	metrics.num.samples = 2
	metrics.recording.level = INFO
	metrics.sample.window.ms = 30000
	partitioner.class = class org.apache.kafka.clients.producer.internals.DefaultPartitioner
	receive.buffer.bytes = 32768
	reconnect.backoff.max.ms = 1000
	reconnect.backoff.ms = 50
	request.timeout.ms = 30000
	retries = 2147483647
	retry.backoff.ms = 100
	sasl.jaas.config = null
	sasl.kerberos.kinit.cmd = /usr/bin/kinit
	sasl.kerberos.min.time.before.relogin = 60000
	sasl.kerberos.service.name = null
	sasl.kerberos.ticket.renew.jitter = 0.05
	sasl.kerberos.ticket.renew.window.factor = 0.8
	sasl.mechanism = GSSAPI
	security.protocol = PLAINTEXT
	send.buffer.bytes = 131072
	ssl.cipher.suites = null
	ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
	ssl.endpoint.identification.algorithm = null
	ssl.key.password = null
	ssl.keymanager.algorithm = IbmX509
	ssl.keystore.location = null
	ssl.keystore.password = null
	ssl.keystore.type = JKS
	ssl.protocol = TLS
	ssl.provider = null
	ssl.secure.random.implementation = null
	ssl.trustmanager.algorithm = PKIX
	ssl.truststore.location = null
	ssl.truststore.password = null
	ssl.truststore.type = JKS
	transaction.timeout.ms = 60000
	transactional.id = null
	value.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer{noformat}

  was:
We have 2 clusters A and B with 4 brokers each, we use mirrormaker to replicate topics from A to B. 
 We recently upgraded our brokers from 0.10.2.0 to 1.0.0, after the upgrade we started seeing the mirrormaker task showing producer errors and intermittently dying. 
 We tried using 1.0.0 and 0.10.2.0 mirrormaker, both have the same problem. Downgrading cluster B brokers back to 0.10.2.0 and the problem went away, so we think it's a server side problem.

There are 2 types of errors: REQUEST_TIMED_OUT and NETWORK_EXCEPTION. For testing, I used a topic *logging* with 20 partitions and 3 replicas (same on cluster A and B), the source topic has 50+ million msg.

(this is from mirrormaker 1.0 at info level, the 0.10.2.0 log is very similar)
{noformat}
22 Mar 2018 02:16:07.407 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 35122 on topic-partition logging-7, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
 22 Mar 2018 02:17:49.731 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 51572 on topic-partition logging-7, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
 22 Mar 2018 02:18:33.903 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 57785 on topic-partition logging-5, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
 22 Mar 2018 02:21:21.399 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 85406 on topic-partition logging-18, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
 22 Mar 2018 02:25:22.278 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 128047 on topic-partition logging-5, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
 22 Mar 2018 02:26:17.154 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 137049 on topic-partition logging-18, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
 22 Mar 2018 02:27:57.358 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 153976 on topic-partition logging-5, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
 22 Mar 2018 02:27:57.779 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 154077 on topic-partition logging-2, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
 22 Mar 2018 02:27:57.780 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 154077 on topic-partition logging-10, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
 22 Mar 2018 02:27:57.780 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 154077 on topic-partition logging-18, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
 22 Mar 2018 02:27:57.781 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 154077 on topic-partition logging-14, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
 22 Mar 2018 02:27:57.781 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 154077 on topic-partition logging-6, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
 22 Mar 2018 02:29:12.378 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 163175 on topic-partition logging-18, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
 22 Mar 2018 02:31:47.067 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 192333 on topic-partition logging-17, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
 22 Mar 2018 02:32:26.198 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 195400 on topic-partition logging-18, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
 22 Mar 2018 02:33:16.280 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 206708 on topic-partition logging-18, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
 22 Mar 2018 02:33:46.428 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 206830 on topic-partition logging-18, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
 22 Mar 2018 12:53:50.848 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 734439 on topic-partition logging-3, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
 22 Mar 2018 16:21:45.557 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 941307 on topic-partition logging-6, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT{noformat}
The NETWORK_EXCEPTION always comes in group of 5 with partition leaders on the same broker (there are 5 partition leader/broker). 
 The REQUEST_TIMED_OUT seems to be from random individual partition.

I turned on trace and it showed how the NETWORK_EXCEPTION is causing the task to die.
 (I filtered out the important lines in this log segment)
{noformat}
22 Mar 2018 02:06:20.154 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.NetworkClient trace(line:135) [Producer clientId=producer-1] Completed receive from node 44 for PRODUCE with correlation id 71596, received {responses=[{topic=logging,partition_responses=[{partition=17,error_code=0,base_offset=28466678,log_append_time=-1,log_start_offset=24408600},{partition=1,error_code=0,base_offset=11611076,log_append_time=-1,log_start_offset=7562939},{partition=5,error_code=0,base_offset=28467700,log_append_time=-1,log_start_offset=24416182},{partition=9,error_code=0,base_offset=32762076,log_append_time=-1,log_start_offset=30108807},{partition=13,error_code=0,base_offset=11612712,log_append_time=-1,log_start_offset=7564490}]}],throttle_time_ms=0}
22 Mar 2018 02:06:20.154 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.Sender trace(line:129) [Producer clientId=producer-1] Received produce response from node 44 with correlation id 71596
22 Mar 2018 02:06:20.154 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.ProducerBatch done(line:171) Successfully produced messages to logging-9 with base offset 32762076.
22 Mar 2018 02:06:20.154 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.ProducerBatch done(line:171) Successfully produced messages to logging-13 with base offset 11612712.
22 Mar 2018 02:06:20.154 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.ProducerBatch done(line:171) Successfully produced messages to logging-17 with base offset 28466678.
22 Mar 2018 02:06:20.155 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.ProducerBatch done(line:171) Successfully produced messages to logging-1 with base offset 11611076.
22 Mar 2018 02:06:20.155 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.ProducerBatch done(line:171) Successfully produced messages to logging-5 with base offset 28467700.
22 Mar 2018 02:06:20.155 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.Sender trace(line:123) [Producer clientId=producer-1] Nodes with data ready to send: [broker44:9020 (id: 44 rack: null)]
22 Mar 2018 02:06:20.155 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.NetworkClient trace(line:135) [Producer clientId=producer-1] Sending PRODUCE {acks=-1,timeout=30000,partitionSizes=[logging-1=647,logging-9=645,logging-5=647,logging-13=645]} with correlation id 71599 to node 44
22 Mar 2018 02:06:20.155 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.Sender trace(line:129) [Producer clientId=producer-1] Sent produce request to 44: (type=ProduceRequest, acks=-1, timeout=30000, partitionRecords=({logging-1=[(record=DefaultRecord(offset=0, timestamp=1521353381705, key=0 bytes, value=1048 bytes))], logging-9=[(record=DefaultRecord(offset=0, timestamp=1521353381983, key=0 bytes, value=1048 bytes))], logging-5=[(record=DefaultRecord(offset=0, timestamp=1521353380881, key=0 bytes, value=1048 bytes))], logging-13=[(record=DefaultRecord(offset=0, timestamp=1521353381288, key=0 bytes, value=1048 bytes))]}), transactionalId=''
22 Mar 2018 02:06:20.156 [mirrormaker-thread-0] TRACE org.apache.kafka.clients.producer.KafkaProducer trace(line:135) [Producer clientId=producer-1] Sending record ProducerRecord(topic=logging, partition=null, headers=RecordHeaders(headers = [], isReadOnly = true), key=null, value=[B@cf56dcea, timestamp=1521353382731) with callback kafka.tools.MirrorMaker$MirrorMakerProducerCallback@e7d47b69 to topic logging partition 17
22 Mar 2018 02:06:20.156 [mirrormaker-thread-0] TRACE org.apache.kafka.clients.producer.internals.RecordAccumulator trace(line:135) [Producer clientId=producer-1] Allocating a new 16384 byte message buffer for topic logging partition 17
22 Mar 2018 02:06:20.156 [mirrormaker-thread-0] TRACE org.apache.kafka.clients.producer.KafkaProducer trace(line:129) [Producer clientId=producer-1] Waking up the sender since topic logging partition 17 is either full or getting a new batch
22 Mar 2018 02:06:20.164 [mirrormaker-thread-0] TRACE org.apache.kafka.clients.producer.KafkaProducer trace(line:135) [Producer clientId=producer-1] Sending record ProducerRecord(topic=logging, partition=null, headers=RecordHeaders(headers = [], isReadOnly = true), key=null, value=[B@d6e47439, timestamp=1521353385041) with callback kafka.tools.MirrorMaker$MirrorMakerProducerCallback@60717e86 to topic logging partition 17
22 Mar 2018 02:06:20.172 [mirrormaker-thread-0] TRACE org.apache.kafka.clients.producer.KafkaProducer trace(line:135) [Producer clientId=producer-1] Sending record ProducerRecord(topic=logging, partition=null, headers=RecordHeaders(headers = [], isReadOnly = true), key=null, value=[B@613a02ed, timestamp=1521353387109) with callback kafka.tools.MirrorMaker$MirrorMakerProducerCallback@563a082b to topic logging partition 17
...
30s later (request.timeout.ms=30000)
...
22 Mar 2018 02:06:50.156 [kafka-producer-network-thread | producer-1] DEBUG org.apache.kafka.clients.NetworkClient debug(line:183) [Producer clientId=producer-1] Disconnecting from node 44 due to request timeout.
22 Mar 2018 02:06:50.156 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.NetworkClient trace(line:135) [Producer clientId=producer-1] Cancelled request {acks=-1,timeout=30000,partitionSizes=[logging-1=647,logging-9=645,logging-5=647,logging-13=645]} with correlation id 71599 due to node 44 being disconnected
22 Mar 2018 02:06:50.156 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.Sender trace(line:129) [Producer clientId=producer-1] Cancelled request with header RequestHeader(apiKey=PRODUCE, apiVersion=5, clientId=producer-1, correlationId=71599) due to node 44 being disconnected
22 Mar 2018 02:06:50.156 [kafka-producer-network-thread | producer-1] WARN  org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 71599 on topic-partition logging-1, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
22 Mar 2018 02:06:50.157 [kafka-producer-network-thread | producer-1] WARN  org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 71599 on topic-partition logging-9, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
22 Mar 2018 02:06:50.157 [kafka-producer-network-thread | producer-1] WARN  org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 71599 on topic-partition logging-5, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
22 Mar 2018 02:06:50.157 [kafka-producer-network-thread | producer-1] WARN  org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 71599 on topic-partition logging-13, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
22 Mar 2018 02:06:50.157 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.Sender trace(line:123) [Producer clientId=producer-1] Expired 1 batches in accumulator
22 Mar 2018 02:06:50.157 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.ProducerBatch done(line:174) Failed to produce messages to logging-17.
org.apache.kafka.common.errors.TimeoutException: Expiring 25 record(s) for logging-17: 30001 ms has passed since batch creation plus linger time
22 Mar 2018 02:06:50.162 [kafka-producer-network-thread | producer-1] ERROR org.apache.kafka.clients.producer.internals.ErrorLoggingCallback onCompletion(line:52) Error when sending message to topic logging with key: null, value: 1047 bytes with error:
org.apache.kafka.common.errors.TimeoutException: Expiring 25 record(s) for logging-17: 30001 ms has passed since batch creation plus linger time
22 Mar 2018 02:06:50.162 [kafka-producer-network-thread | producer-1] INFO  kafka.tools.MirrorMaker$ info(line:72) Closing producer due to send failure.
22 Mar 2018 02:06:50.163 [kafka-producer-network-thread | producer-1] INFO  org.apache.kafka.clients.producer.KafkaProducer info(line:341) [Producer clientId=producer-1] Closing the Kafka producer with timeoutMillis = 0 ms.
22 Mar 2018 02:06:50.163 [kafka-producer-network-thread | producer-1] INFO  org.apache.kafka.clients.producer.KafkaProducer info(line:341) [Producer clientId=producer-1] Proceeding to force close the producer since pending requests could not be completed within timeout 0 ms.
22 Mar 2018 02:06:50.164 [kafka-producer-network-thread | producer-1] DEBUG org.apache.kafka.clients.producer.KafkaProducer debug(line:177) [Producer clientId=producer-1] Kafka producer has been closed
{noformat}
 The first few lines shows that the producer just received a response from its previous produce request to broker 44 with partitions \{1,5,9,13,17} (leaders). 
 The next produce request it sent only contained batches for partitions \{1,5,9,13}, however 1ms later new data arrived for partition 17 but since it missed the current request to that broker it had to sit in the producer and wait for the next send round. 
 Somehow the produce request for \{1,5,9,13} never got a response and timed out. Normally these batches can be retried and mirrormaker will keep working, however because of the leftover batch for partition 17 expired, mirrormaker detected data loss and will shutdown.

So I think the root problem here is that the server intermittently times out for the produce request after upgrading to 1.0. 
 In the lucky case where all batches can be retried (they seem to always succeed on the first retry anyway), it can randomly introduce 30s of latency to data replication, and if unlucky a batch get left out and expires, the task immediately dies. 
 We observed that this usually happens under high load (I can reliably recreate the errors by restarting the task with a new consumer group id and auto.offset.reset=earliest so it got 50 mil msg to mirror)

Increasing producer linger.ms and batch.size seems to some degree reduce the error rate under same heavy workload.

I tried to enable trace on the server but all I see are logs (massive) from ReplicaFetcher and lines like this:
{noformat}
Bytes written as part of multi-send call: 2431, total bytes written so far: 2431, expected bytes to write: 2431{noformat}
Is there any log on the server side that can be correlated to the producer request and help debugging this?
 Also I still got no idea what the REQUEST_TIMED_OUT meant and what's causing it.

This is the trace log config I tried to use on the server but couldn't see anything useful:
{noformat}
 rootLogger.level=trace
 appender.console.type=Console
 appender.console.name=STDOUT
 rootLogger.appenderRef.stdout.ref=STDOUT{noformat}
 

Here's the mirrormaker producer config:
{noformat}
	acks = all
	batch.size = 16384
	bootstrap.servers = [...]
	buffer.memory = 33554432
	client.id = 
	compression.type = gzip
	connections.max.idle.ms = 540000
	enable.idempotence = false
	interceptor.classes = null
	key.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer
	linger.ms = 0
	max.block.ms = 9223372036854775807
	max.in.flight.requests.per.connection = 1
	max.request.size = 2097152
	metadata.max.age.ms = 300000
	metric.reporters = []
	metrics.num.samples = 2
	metrics.recording.level = INFO
	metrics.sample.window.ms = 30000
	partitioner.class = class org.apache.kafka.clients.producer.internals.DefaultPartitioner
	receive.buffer.bytes = 32768
	reconnect.backoff.max.ms = 1000
	reconnect.backoff.ms = 50
	request.timeout.ms = 30000
	retries = 2147483647
	retry.backoff.ms = 100
	sasl.jaas.config = null
	sasl.kerberos.kinit.cmd = /usr/bin/kinit
	sasl.kerberos.min.time.before.relogin = 60000
	sasl.kerberos.service.name = null
	sasl.kerberos.ticket.renew.jitter = 0.05
	sasl.kerberos.ticket.renew.window.factor = 0.8
	sasl.mechanism = GSSAPI
	security.protocol = PLAINTEXT
	send.buffer.bytes = 131072
	ssl.cipher.suites = null
	ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
	ssl.endpoint.identification.algorithm = null
	ssl.key.password = null
	ssl.keymanager.algorithm = IbmX509
	ssl.keystore.location = null
	ssl.keystore.password = null
	ssl.keystore.type = JKS
	ssl.protocol = TLS
	ssl.provider = null
	ssl.secure.random.implementation = null
	ssl.trustmanager.algorithm = PKIX
	ssl.truststore.location = null
	ssl.truststore.password = null
	ssl.truststore.type = JKS
	transaction.timeout.ms = 60000
	transactional.id = null
	value.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer{noformat}


We cannot proceed with Kafka 1.0 roll out until this is resolved

> NETWORK_EXCEPTION and REQUEST_TIMED_OUT in mirrormaker producer after 1.0 broker upgrade
> ----------------------------------------------------------------------------------------
>
>                 Key: KAFKA-6706
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6706
>             Project: Kafka
>          Issue Type: Bug
>          Components: core, network
>    Affects Versions: 1.0.0
>            Reporter: Di Shang
>            Priority: Blocker
>              Labels: mirror-maker
>
> We have 2 clusters A and B with 4 brokers each, we use mirrormaker to replicate topics from A to B. 
>  We recently upgraded our brokers from 0.10.2.0 to 1.0.0, after the upgrade we started seeing the mirrormaker task showing producer errors and intermittently dying. 
>  We tried using 1.0.0 and 0.10.2.0 mirrormaker, both have the same problem. Downgrading cluster B brokers back to 0.10.2.0 and the problem went away, so we think it's a server side problem.
> There are 2 types of errors: REQUEST_TIMED_OUT and NETWORK_EXCEPTION. For testing, I used a topic *logging* with 20 partitions and 3 replicas (same on cluster A and B), the source topic has 50+ million msg.
> (this is from mirrormaker 1.0 at info level, the 0.10.2.0 log is very similar)
> {noformat}
> 22 Mar 2018 02:16:07.407 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 35122 on topic-partition logging-7, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
>  22 Mar 2018 02:17:49.731 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 51572 on topic-partition logging-7, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
>  22 Mar 2018 02:18:33.903 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 57785 on topic-partition logging-5, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
>  22 Mar 2018 02:21:21.399 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 85406 on topic-partition logging-18, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
>  22 Mar 2018 02:25:22.278 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 128047 on topic-partition logging-5, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
>  22 Mar 2018 02:26:17.154 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 137049 on topic-partition logging-18, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
>  22 Mar 2018 02:27:57.358 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 153976 on topic-partition logging-5, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
>  22 Mar 2018 02:27:57.779 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 154077 on topic-partition logging-2, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
>  22 Mar 2018 02:27:57.780 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 154077 on topic-partition logging-10, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
>  22 Mar 2018 02:27:57.780 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 154077 on topic-partition logging-18, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
>  22 Mar 2018 02:27:57.781 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 154077 on topic-partition logging-14, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
>  22 Mar 2018 02:27:57.781 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 154077 on topic-partition logging-6, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
>  22 Mar 2018 02:29:12.378 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 163175 on topic-partition logging-18, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
>  22 Mar 2018 02:31:47.067 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 192333 on topic-partition logging-17, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
>  22 Mar 2018 02:32:26.198 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 195400 on topic-partition logging-18, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
>  22 Mar 2018 02:33:16.280 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 206708 on topic-partition logging-18, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
>  22 Mar 2018 02:33:46.428 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 206830 on topic-partition logging-18, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
>  22 Mar 2018 12:53:50.848 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 734439 on topic-partition logging-3, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT
>  22 Mar 2018 16:21:45.557 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 941307 on topic-partition logging-6, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT{noformat}
> The NETWORK_EXCEPTION always comes in group of 5 with partition leaders on the same broker (there are 5 partition leader/broker). 
>  The REQUEST_TIMED_OUT seems to be from random individual partition.
> I turned on trace and it showed how the NETWORK_EXCEPTION is causing the task to die.
>  (I filtered out the important lines in this log segment)
> {noformat}
> 22 Mar 2018 02:06:20.154 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.NetworkClient trace(line:135) [Producer clientId=producer-1] Completed receive from node 44 for PRODUCE with correlation id 71596, received {responses=[{topic=logging,partition_responses=[{partition=17,error_code=0,base_offset=28466678,log_append_time=-1,log_start_offset=24408600},{partition=1,error_code=0,base_offset=11611076,log_append_time=-1,log_start_offset=7562939},{partition=5,error_code=0,base_offset=28467700,log_append_time=-1,log_start_offset=24416182},{partition=9,error_code=0,base_offset=32762076,log_append_time=-1,log_start_offset=30108807},{partition=13,error_code=0,base_offset=11612712,log_append_time=-1,log_start_offset=7564490}]}],throttle_time_ms=0}
> 22 Mar 2018 02:06:20.154 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.Sender trace(line:129) [Producer clientId=producer-1] Received produce response from node 44 with correlation id 71596
> 22 Mar 2018 02:06:20.154 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.ProducerBatch done(line:171) Successfully produced messages to logging-9 with base offset 32762076.
> 22 Mar 2018 02:06:20.154 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.ProducerBatch done(line:171) Successfully produced messages to logging-13 with base offset 11612712.
> 22 Mar 2018 02:06:20.154 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.ProducerBatch done(line:171) Successfully produced messages to logging-17 with base offset 28466678.
> 22 Mar 2018 02:06:20.155 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.ProducerBatch done(line:171) Successfully produced messages to logging-1 with base offset 11611076.
> 22 Mar 2018 02:06:20.155 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.ProducerBatch done(line:171) Successfully produced messages to logging-5 with base offset 28467700.
> 22 Mar 2018 02:06:20.155 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.Sender trace(line:123) [Producer clientId=producer-1] Nodes with data ready to send: [broker44:9020 (id: 44 rack: null)]
> 22 Mar 2018 02:06:20.155 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.NetworkClient trace(line:135) [Producer clientId=producer-1] Sending PRODUCE {acks=-1,timeout=30000,partitionSizes=[logging-1=647,logging-9=645,logging-5=647,logging-13=645]} with correlation id 71599 to node 44
> 22 Mar 2018 02:06:20.155 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.Sender trace(line:129) [Producer clientId=producer-1] Sent produce request to 44: (type=ProduceRequest, acks=-1, timeout=30000, partitionRecords=({logging-1=[(record=DefaultRecord(offset=0, timestamp=1521353381705, key=0 bytes, value=1048 bytes))], logging-9=[(record=DefaultRecord(offset=0, timestamp=1521353381983, key=0 bytes, value=1048 bytes))], logging-5=[(record=DefaultRecord(offset=0, timestamp=1521353380881, key=0 bytes, value=1048 bytes))], logging-13=[(record=DefaultRecord(offset=0, timestamp=1521353381288, key=0 bytes, value=1048 bytes))]}), transactionalId=''
> 22 Mar 2018 02:06:20.156 [mirrormaker-thread-0] TRACE org.apache.kafka.clients.producer.KafkaProducer trace(line:135) [Producer clientId=producer-1] Sending record ProducerRecord(topic=logging, partition=null, headers=RecordHeaders(headers = [], isReadOnly = true), key=null, value=[B@cf56dcea, timestamp=1521353382731) with callback kafka.tools.MirrorMaker$MirrorMakerProducerCallback@e7d47b69 to topic logging partition 17
> 22 Mar 2018 02:06:20.156 [mirrormaker-thread-0] TRACE org.apache.kafka.clients.producer.internals.RecordAccumulator trace(line:135) [Producer clientId=producer-1] Allocating a new 16384 byte message buffer for topic logging partition 17
> 22 Mar 2018 02:06:20.156 [mirrormaker-thread-0] TRACE org.apache.kafka.clients.producer.KafkaProducer trace(line:129) [Producer clientId=producer-1] Waking up the sender since topic logging partition 17 is either full or getting a new batch
> 22 Mar 2018 02:06:20.164 [mirrormaker-thread-0] TRACE org.apache.kafka.clients.producer.KafkaProducer trace(line:135) [Producer clientId=producer-1] Sending record ProducerRecord(topic=logging, partition=null, headers=RecordHeaders(headers = [], isReadOnly = true), key=null, value=[B@d6e47439, timestamp=1521353385041) with callback kafka.tools.MirrorMaker$MirrorMakerProducerCallback@60717e86 to topic logging partition 17
> 22 Mar 2018 02:06:20.172 [mirrormaker-thread-0] TRACE org.apache.kafka.clients.producer.KafkaProducer trace(line:135) [Producer clientId=producer-1] Sending record ProducerRecord(topic=logging, partition=null, headers=RecordHeaders(headers = [], isReadOnly = true), key=null, value=[B@613a02ed, timestamp=1521353387109) with callback kafka.tools.MirrorMaker$MirrorMakerProducerCallback@563a082b to topic logging partition 17
> ...
> 30s later (request.timeout.ms=30000)
> ...
> 22 Mar 2018 02:06:50.156 [kafka-producer-network-thread | producer-1] DEBUG org.apache.kafka.clients.NetworkClient debug(line:183) [Producer clientId=producer-1] Disconnecting from node 44 due to request timeout.
> 22 Mar 2018 02:06:50.156 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.NetworkClient trace(line:135) [Producer clientId=producer-1] Cancelled request {acks=-1,timeout=30000,partitionSizes=[logging-1=647,logging-9=645,logging-5=647,logging-13=645]} with correlation id 71599 due to node 44 being disconnected
> 22 Mar 2018 02:06:50.156 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.Sender trace(line:129) [Producer clientId=producer-1] Cancelled request with header RequestHeader(apiKey=PRODUCE, apiVersion=5, clientId=producer-1, correlationId=71599) due to node 44 being disconnected
> 22 Mar 2018 02:06:50.156 [kafka-producer-network-thread | producer-1] WARN  org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 71599 on topic-partition logging-1, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
> 22 Mar 2018 02:06:50.157 [kafka-producer-network-thread | producer-1] WARN  org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 71599 on topic-partition logging-9, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
> 22 Mar 2018 02:06:50.157 [kafka-producer-network-thread | producer-1] WARN  org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 71599 on topic-partition logging-5, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
> 22 Mar 2018 02:06:50.157 [kafka-producer-network-thread | producer-1] WARN  org.apache.kafka.clients.producer.internals.Sender warn(line:251) [Producer clientId=producer-1] Got error produce response with correlation id 71599 on topic-partition logging-13, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
> 22 Mar 2018 02:06:50.157 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.Sender trace(line:123) [Producer clientId=producer-1] Expired 1 batches in accumulator
> 22 Mar 2018 02:06:50.157 [kafka-producer-network-thread | producer-1] TRACE org.apache.kafka.clients.producer.internals.ProducerBatch done(line:174) Failed to produce messages to logging-17.
> org.apache.kafka.common.errors.TimeoutException: Expiring 25 record(s) for logging-17: 30001 ms has passed since batch creation plus linger time
> 22 Mar 2018 02:06:50.162 [kafka-producer-network-thread | producer-1] ERROR org.apache.kafka.clients.producer.internals.ErrorLoggingCallback onCompletion(line:52) Error when sending message to topic logging with key: null, value: 1047 bytes with error:
> org.apache.kafka.common.errors.TimeoutException: Expiring 25 record(s) for logging-17: 30001 ms has passed since batch creation plus linger time
> 22 Mar 2018 02:06:50.162 [kafka-producer-network-thread | producer-1] INFO  kafka.tools.MirrorMaker$ info(line:72) Closing producer due to send failure.
> 22 Mar 2018 02:06:50.163 [kafka-producer-network-thread | producer-1] INFO  org.apache.kafka.clients.producer.KafkaProducer info(line:341) [Producer clientId=producer-1] Closing the Kafka producer with timeoutMillis = 0 ms.
> 22 Mar 2018 02:06:50.163 [kafka-producer-network-thread | producer-1] INFO  org.apache.kafka.clients.producer.KafkaProducer info(line:341) [Producer clientId=producer-1] Proceeding to force close the producer since pending requests could not be completed within timeout 0 ms.
> 22 Mar 2018 02:06:50.164 [kafka-producer-network-thread | producer-1] DEBUG org.apache.kafka.clients.producer.KafkaProducer debug(line:177) [Producer clientId=producer-1] Kafka producer has been closed
> {noformat}
>  The first few lines shows that the producer just received a response from its previous produce request to broker 44 with partitions \{1,5,9,13,17} (leaders). 
>  The next produce request it sent only contained batches for partitions \{1,5,9,13}, however 1ms later new data arrived for partition 17 but since it missed the current request to that broker it had to sit in the producer and wait for the next send round. 
>  Somehow the produce request for \{1,5,9,13} never got a response and timed out. Normally these batches can be retried and mirrormaker will keep working, however because of the leftover batch for partition 17 expired, mirrormaker detected data loss and will shutdown.
> So I think the root problem is the REQUEST_TIMED_OUT where the server intermittently times out for the produce request after upgrading to 1.0. 
>  In the lucky case where all batches can be retried (they seem to always succeed on the first retry anyway), it can randomly introduce 30s of latency to data replication, and if unlucky a batch get left out and expires, the task immediately dies. 
>  We observed that this usually happens under high load with large number of msg from the source topic to mirror (I can reliably recreate the errors by restarting the task with a new consumer group id and auto.offset.reset=earliest so it got 50 mil msg to mirror from the start of the topic)
> Increasing producer linger.ms and batch.size seems to some degree reduce the error rate under same heavy workload.
> I tried to enable trace on the server but all I see are logs (massive) from ReplicaFetcher and lines like this:
> {noformat}
> Bytes written as part of multi-send call: 2431, total bytes written so far: 2431, expected bytes to write: 2431{noformat}
> Is there any log on the server side that can be correlated to the producer request and help debugging this?
> This is the trace log config I tried to use on the server but couldn't see anything useful:
> {noformat}
>  rootLogger.level=trace
>  appender.console.type=Console
>  appender.console.name=STDOUT
>  rootLogger.appenderRef.stdout.ref=STDOUT{noformat}
>  
> Here's the mirrormaker producer config:
> {noformat}
> 	acks = all
> 	batch.size = 16384
> 	bootstrap.servers = [...]
> 	buffer.memory = 33554432
> 	client.id = 
> 	compression.type = gzip
> 	connections.max.idle.ms = 540000
> 	enable.idempotence = false
> 	interceptor.classes = null
> 	key.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer
> 	linger.ms = 0
> 	max.block.ms = 9223372036854775807
> 	max.in.flight.requests.per.connection = 1
> 	max.request.size = 2097152
> 	metadata.max.age.ms = 300000
> 	metric.reporters = []
> 	metrics.num.samples = 2
> 	metrics.recording.level = INFO
> 	metrics.sample.window.ms = 30000
> 	partitioner.class = class org.apache.kafka.clients.producer.internals.DefaultPartitioner
> 	receive.buffer.bytes = 32768
> 	reconnect.backoff.max.ms = 1000
> 	reconnect.backoff.ms = 50
> 	request.timeout.ms = 30000
> 	retries = 2147483647
> 	retry.backoff.ms = 100
> 	sasl.jaas.config = null
> 	sasl.kerberos.kinit.cmd = /usr/bin/kinit
> 	sasl.kerberos.min.time.before.relogin = 60000
> 	sasl.kerberos.service.name = null
> 	sasl.kerberos.ticket.renew.jitter = 0.05
> 	sasl.kerberos.ticket.renew.window.factor = 0.8
> 	sasl.mechanism = GSSAPI
> 	security.protocol = PLAINTEXT
> 	send.buffer.bytes = 131072
> 	ssl.cipher.suites = null
> 	ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
> 	ssl.endpoint.identification.algorithm = null
> 	ssl.key.password = null
> 	ssl.keymanager.algorithm = IbmX509
> 	ssl.keystore.location = null
> 	ssl.keystore.password = null
> 	ssl.keystore.type = JKS
> 	ssl.protocol = TLS
> 	ssl.provider = null
> 	ssl.secure.random.implementation = null
> 	ssl.trustmanager.algorithm = PKIX
> 	ssl.truststore.location = null
> 	ssl.truststore.password = null
> 	ssl.truststore.type = JKS
> 	transaction.timeout.ms = 60000
> 	transactional.id = null
> 	value.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message