kafka-dev mailing list archives

From "Apurva Mehta (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-4526) Transient failure in ThrottlingTest.test_throttled_reassignment
Date Mon, 19 Dec 2016 21:29:58 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-4526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762361#comment-15762361
] 

Apurva Mehta commented on KAFKA-4526:
-------------------------------------

I had a look at the logs from one of the failures, and here is the problem: 

# The test has two phases: a bulk produce phase, which seeds the topic with enough data that
throttled reassignment can actually be exercised, and the regular produce-consume-validate
loop.
# We start the reassignment, and then run the produce-consume-validate loop to ensure that
no new messages are lost during reassignment.
# Because the produce-consume-validate pattern uses structured (integer) data in phase two,
we require that the consumer start from the end of the log and also start before the producer
begins producing messages. If this is true, then the consumer will read and validate all the
messages sent by the producer. The test has a `wait_until` block, but that only checks for
the existence of the process. 
# What is seen in the logs is that the producer starts and begins producing messages _before_
the consumer fetches the metadata for all the partitions. As a result, the consumer misses
the initial messages, which is consistent across all test failures.
# This can be explained by the recent changes in ducktape: thanks to paramiko, running commands
on worker machines is much faster since ssh connections are reused. Hence, the producer starts
much faster than before, causing the initial set of messages to be missed by the consumer
some of the time.
# The fix is to avoid using the PID of the consumer as a proxy for 'the consumer is ready'.
Something like 'partitions assigned' would be a more reliable proxy for the consumer being
ready. Note that the original PR of the test had a timeout between consumer and producer start,
since there was no more robust way to ensure that the consumer was initialized before the
producer started. But since the use of timeouts is --rightly!-- discouraged, it was removed.
Adding suitable metrics would be a step in the right direction.
# The next step is to leverage a suitable metric (such as partitions assigned, if it exists),
or to add one to the console consumer, so that we can ensure the consumer is initialized before
starting the producer.
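The readiness-check change described above can be sketched as follows. This is a simplified,
self-contained illustration under stated assumptions, not the actual ducktape or console
consumer API: `wait_until` is a minimal stand-in for ducktape's helper, and
`FakeConsoleConsumer` and its fields are hypothetical.

```python
import time

def wait_until(condition, timeout_sec, backoff_sec=0.1, err_msg=""):
    """Minimal stand-in for a ducktape-style wait_until helper."""
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        if condition():
            return
        time.sleep(backoff_sec)
    raise TimeoutError(err_msg)

class FakeConsoleConsumer:
    """Hypothetical consumer whose readiness can be queried two ways."""
    def __init__(self):
        self.pid = 1234                 # process exists as soon as it is spawned
        self.assigned_partitions = set()  # empty until metadata is fetched

    def poll_metadata(self):
        # Simulate the consumer eventually fetching partition metadata.
        self.assigned_partitions = {0, 1, 2}

consumer = FakeConsoleConsumer()

# Fragile check (what the test does today): passes as soon as the process
# exists, even though no partitions are assigned yet, so the producer can
# start before the consumer is actually ready.
wait_until(lambda: consumer.pid is not None, timeout_sec=1)
assert not consumer.assigned_partitions

# Proposed check: block until partitions are actually assigned before
# letting the producer start.
consumer.poll_metadata()
wait_until(lambda: len(consumer.assigned_partitions) > 0, timeout_sec=1,
           err_msg="Consumer never received partition assignments")
```

The point of the sketch is only the ordering guarantee: the second `wait_until` cannot return
until the consumer has assignments, so messages produced afterwards cannot be missed for this
reason.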

> Transient failure in ThrottlingTest.test_throttled_reassignment
> ---------------------------------------------------------------
>
>                 Key: KAFKA-4526
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4526
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Ewen Cheslack-Postava
>            Assignee: Apurva Mehta
>              Labels: system-test-failure, system-tests
>             Fix For: 0.10.2.0
>
>
> This test is seeing transient failures sometimes
> {quote}
> Module: kafkatest.tests.core.throttling_test
> Class:  ThrottlingTest
> Method: test_throttled_reassignment
> Arguments:
> {
>   "bounce_brokers": false
> }
> {quote}
> This happens with both bounce_brokers = true and false. Fails with
> {quote}
> AssertionError: 1646 acked message did not make it to the Consumer. They are: 0, 1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19...plus 1626 more. Total Acked:
174799, Total Consumed: 173153. We validated that the first 1000 of these missing messages
correctly made it into Kafka's data files. This suggests they were lost on their way to the
consumer.
> {quote}
> See http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2016-12-12--001.1481535295--apache--trunk--62e043a/report.html
for an example.
> Note that there are a number of similar bug reports for different tests: https://issues.apache.org/jira/issues/?jql=text%20~%20%22acked%20message%20did%20not%20make%20it%20to%20the%20Consumer%22%20and%20project%20%3D%20Kafka
I am wondering if we have a wrong ack setting somewhere that we should be specifying as acks=all
but is only defaulting to 0?
> It also seems interesting that the missing messages in these recent failures seem to
always start at 0...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
