kafka-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jeff Widman (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (KAFKA-5381) ERROR Uncaught exception in scheduled task 'delete-expired-consumer-offsets' (kafka.utils.KafkaScheduler)
Date Mon, 05 Jun 2017 21:45:04 GMT

     [ https://issues.apache.org/jira/browse/KAFKA-5381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Jeff Widman updated KAFKA-5381:
-------------------------------
    Description: 
We have a 6 node cluster of 0.10.0.1 brokers. Broker 4 had a hardware problem, so we re-assigned
all its partitions to other brokers. We immediately started observing the error described
in KAFKA-4362 from several of our consumers.

However, on broker 6, we also started seeing the following exceptions in {{KafkaScheduler}}
which have a somewhat similar-looking traceback:
{code}
[2017-06-03 17:23:57,926] ERROR Uncaught exception in scheduled task 'delete-expired-consumer-offsets'
(kafka.utils.KafkaScheduler)
java.lang.IllegalArgumentException: Message format version for partition 50 not found
        at kafka.coordinator.GroupMetadataManager$$anonfun$14.apply(GroupMetadataManager.scala:633)
        at kafka.coordinator.GroupMetadataManager$$anonfun$14.apply(GroupMetadataManager.scala:633)
        at scala.Option.getOrElse(Option.scala:121)
        at kafka.coordinator.GroupMetadataManager.kafka$coordinator$GroupMetadataManager$$getMessageFormatVersionAndTimestamp(GroupMetadataManager.scala:632)
        at kafka.coordinator.GroupMetadataManager$$anonfun$2$$anonfun$10.apply(GroupMetadataManager.scala:560)
        at kafka.coordinator.GroupMetadataManager$$anonfun$2$$anonfun$10.apply(GroupMetadataManager.scala:551)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.immutable.List.map(List.scala:285)
        at kafka.coordinator.GroupMetadataManager$$anonfun$2.apply$mcI$sp(GroupMetadataManager.scala:551)
        at kafka.coordinator.GroupMetadataManager$$anonfun$2.apply(GroupMetadataManager.scala:543)
        at kafka.coordinator.GroupMetadataManager$$anonfun$2.apply(GroupMetadataManager.scala:543)
        at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:231)
        at kafka.utils.CoreUtils$.inWriteLock(CoreUtils.scala:239)
        at kafka.coordinator.GroupMetadataManager.kafka$coordinator$GroupMetadataManager$$deleteExpiredOffsets(GroupMetadataManager.scala:543)
        at kafka.coordinator.GroupMetadataManager$$anonfun$1.apply$mcV$sp(GroupMetadataManager.scala:87)
        at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110)
        at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:56)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
{code}

This appears to be similar root cause to KAFKA-4362 where {{GroupMetadataManager.getMessageFormatVersionAndTimestamp}}
is throwing the error due to the offset partition being moved, but I'm unclear whether the
fix for that version also fixed the {{KafkaScheduler}} or if more work needs to be done here.

When the partition re-assignment was done, I do not know how this was executed by our Ops
team. I suspect they used the {{kafka-reassign-partitions.sh}} script and just gave it five
brokers rather than six. From my understanding, this would have randomly re-assigned all partitions
(I don't think its sticky), so probably at least one partition from the {{__consumer_offsets}}
topic was removed from broker 6. However, if that was the case, I would have expected all
brokers to have had these partitions removed and be throwing this error. But our logging infrastructure
shows that this error was only happening on broker 6, not on the other brokers. Not sure why
that is.

  was:
We have a 6 node cluster of 0.10.0.1 brokers. Broker 4 had a hardware problem, so we re-assigned
all its partitions to other brokers. We immediately started observing the error described
in KAFKA-4362 from several of our consumers.

However, on broker 6, we also started seeing the following exceptions in {{KafkaScheduler}}
which have a somewhat similar-looking traceback:
{code}
[2017-06-03 17:23:57,926] ERROR Uncaught exception in scheduled task 'delete-expired-consumer-offsets'
(kafka.utils.KafkaScheduler)
java.lang.IllegalArgumentException: Message format version for partition 50 not found
        at kafka.coordinator.GroupMetadataManager$$anonfun$14.apply(GroupMetadataManager.scala:633)
        at kafka.coordinator.GroupMetadataManager$$anonfun$14.apply(GroupMetadataManager.scala:633)
        at scala.Option.getOrElse(Option.scala:121)
        at kafka.coordinator.GroupMetadataManager.kafka$coordinator$GroupMetadataManager$$getMessageFormatVersionAndTimestamp(GroupMetadataManager.scala:632)
        at kafka.coordinator.GroupMetadataManager$$anonfun$2$$anonfun$10.apply(GroupMetadataManager.scala:560)
        at kafka.coordinator.GroupMetadataManager$$anonfun$2$$anonfun$10.apply(GroupMetadataManager.scala:551)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.immutable.List.map(List.scala:285)
        at kafka.coordinator.GroupMetadataManager$$anonfun$2.apply$mcI$sp(GroupMetadataManager.scala:551)
        at kafka.coordinator.GroupMetadataManager$$anonfun$2.apply(GroupMetadataManager.scala:543)
        at kafka.coordinator.GroupMetadataManager$$anonfun$2.apply(GroupMetadataManager.scala:543)
        at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:231)
        at kafka.utils.CoreUtils$.inWriteLock(CoreUtils.scala:239)
        at kafka.coordinator.GroupMetadataManager.kafka$coordinator$GroupMetadataManager$$deleteExpiredOffsets(GroupMetadataManager.scala:543)
        at kafka.coordinator.GroupMetadataManager$$anonfun$1.apply$mcV$sp(GroupMetadataManager.scala:87)
        at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110)
        at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:56)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
{code}

This appears to be similar root cause to KAFKA-4362 where {{GroupMetadataManager.getMessageFormatVersionAndTimestamp}}
is throwing the error due to the offset partition being moved, but I'm unclear whether the
fix for that version also fixed the {{KafkaScheduler}} or if more work needs to be done here.


> ERROR Uncaught exception in scheduled task 'delete-expired-consumer-offsets' (kafka.utils.KafkaScheduler)
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-5381
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5381
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jeff Widman
>
> We have a 6 node cluster of 0.10.0.1 brokers. Broker 4 had a hardware problem, so we
re-assigned all its partitions to other brokers. We immediately started observing the error
described in KAFKA-4362 from several of our consumers.
> However, on broker 6, we also started seeing the following exceptions in {{KafkaScheduler}}
which have a somewhat similar-looking traceback:
> {code}
> [2017-06-03 17:23:57,926] ERROR Uncaught exception in scheduled task 'delete-expired-consumer-offsets'
(kafka.utils.KafkaScheduler)
> java.lang.IllegalArgumentException: Message format version for partition 50 not found
>         at kafka.coordinator.GroupMetadataManager$$anonfun$14.apply(GroupMetadataManager.scala:633)
>         at kafka.coordinator.GroupMetadataManager$$anonfun$14.apply(GroupMetadataManager.scala:633)
>         at scala.Option.getOrElse(Option.scala:121)
>         at kafka.coordinator.GroupMetadataManager.kafka$coordinator$GroupMetadataManager$$getMessageFormatVersionAndTimestamp(GroupMetadataManager.scala:632)
>         at kafka.coordinator.GroupMetadataManager$$anonfun$2$$anonfun$10.apply(GroupMetadataManager.scala:560)
>         at kafka.coordinator.GroupMetadataManager$$anonfun$2$$anonfun$10.apply(GroupMetadataManager.scala:551)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>         at scala.collection.immutable.List.foreach(List.scala:381)
>         at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>         at scala.collection.immutable.List.map(List.scala:285)
>         at kafka.coordinator.GroupMetadataManager$$anonfun$2.apply$mcI$sp(GroupMetadataManager.scala:551)
>         at kafka.coordinator.GroupMetadataManager$$anonfun$2.apply(GroupMetadataManager.scala:543)
>         at kafka.coordinator.GroupMetadataManager$$anonfun$2.apply(GroupMetadataManager.scala:543)
>         at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:231)
>         at kafka.utils.CoreUtils$.inWriteLock(CoreUtils.scala:239)
>         at kafka.coordinator.GroupMetadataManager.kafka$coordinator$GroupMetadataManager$$deleteExpiredOffsets(GroupMetadataManager.scala:543)
>         at kafka.coordinator.GroupMetadataManager$$anonfun$1.apply$mcV$sp(GroupMetadataManager.scala:87)
>         at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110)
>         at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:56)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
> This appears to be similar root cause to KAFKA-4362 where {{GroupMetadataManager.getMessageFormatVersionAndTimestamp}}
is throwing the error due to the offset partition being moved, but I'm unclear whether the
fix for that version also fixed the {{KafkaScheduler}} or if more work needs to be done here.
> When the partition re-assignment was done, I do not know how this was executed by our
Ops team. I suspect they used the {{kafka-reassign-partitions.sh}} script and just gave it
five brokers rather than six. From my understanding, this would have randomly re-assigned
all partitions (I don't think its sticky), so probably at least one partition from the {{__consumer_offsets}}
topic was removed from broker 6. However, if that was the case, I would have expected all
brokers to have had these partitions removed and be throwing this error. But our logging infrastructure
shows that this error was only happening on broker 6, not on the other brokers. Not sure why
that is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message