kafka-jira mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-6649) ReplicaFetcher stopped after non fatal exception is thrown
Date Wed, 14 Mar 2018 04:17:00 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398074#comment-16398074 ]

ASF GitHub Bot commented on KAFKA-6649:

huxihx opened a new pull request #4707: KAFKA-6649: Should catch OutOfRangeException for ReplicaFetcherThread
URL: https://github.com/apache/kafka/pull/4707
   `AbstractFetcherThread.processFetchRequest` should catch OffsetOutOfRangeException lest
the thread be forcibly stopped.
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
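The fix idea can be sketched in miniature. The following is a hedged, self-contained Scala sketch, not the actual Kafka code: `OffsetOutOfRangeException` is stood in for by a local class, and `processPartitionData` is a hypothetical per-partition handler. The point is only that catching the non-fatal exception per partition keeps it from escaping `doWork()` and killing the fetcher thread.

```scala
// Stand-in for org.apache.kafka.common.errors.OffsetOutOfRangeException.
class OffsetOutOfRangeException(msg: String) extends RuntimeException(msg)

object FetcherSketch {
  // Hypothetical per-partition handler; throws when the requested offset is
  // outside the log's range (a negative offset is used as a trivial trigger).
  def processPartitionData(partition: String, offset: Long): Unit =
    if (offset < 0) throw new OffsetOutOfRangeException(s"bad offset for $partition")

  // Handle the exception per partition and carry on with the remaining
  // partitions, returning the ones that failed instead of dying.
  def processFetchRequest(fetches: Seq[(String, Long)]): Seq[String] = {
    val failed = scala.collection.mutable.ArrayBuffer.empty[String]
    fetches.foreach { case (partition, offset) =>
      try processPartitionData(partition, offset)
      catch {
        case _: OffsetOutOfRangeException =>
          // e.g. reset the fetch offset for this partition instead of
          // letting the exception propagate out of the fetch loop
          failed += partition
      }
    }
    failed.toSeq
  }
}
```

In the sketch, one bad partition no longer ends processing for the others, which is the behaviour the PR title describes.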

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:

> ReplicaFetcher stopped after non fatal exception is thrown
> ----------------------------------------------------------
>                 Key: KAFKA-6649
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6649
>             Project: Kafka
>          Issue Type: Bug
>          Components: replication
>    Affects Versions: 1.0.0, 1.1.0, 1.0.1
>            Reporter: Julio Ng
>            Priority: Major
> We have seen several under-replicated partitions, usually triggered by topic creation.
After digging into the logs, we see the following:
> {noformat}
> [2018-03-12 22:40:17,641] ERROR [ReplicaFetcher replicaId=12, leaderId=0, fetcherId=1]
Error due to (kafka.server.ReplicaFetcherThread)
> kafka.common.KafkaException: Error processing data for partition [[TOPIC_NAME_REMOVED]]-84
offset 2098535
>  at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:204)
>  at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:169)
>  at scala.Option.foreach(Option.scala:257)
>  at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:169)
>  at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:166)
>  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>  at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply$mcV$sp(AbstractFetcherThread.scala:166)
>  at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:166)
>  at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:166)
>  at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250)
>  at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:164)
>  at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
>  at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> Caused by: org.apache.kafka.common.errors.OffsetOutOfRangeException: Cannot increment
the log start offset to 2098535 of partition [[TOPIC_NAME_REMOVED]]-84 since it is larger
than the high watermark -1
> [2018-03-12 22:40:17,641] INFO [ReplicaFetcher replicaId=12, leaderId=0, fetcherId=1]
Stopped (kafka.server.ReplicaFetcherThread){noformat}
> It looks like after the ReplicaFetcherThread is stopped, the replicas start to
lag behind, presumably because we are no longer fetching from the leader. Examining
the ShutdownableThread.scala object further:
> {noformat}
> override def run(): Unit = {
>  info("Starting")
>  try {
>    while (isRunning)
>      doWork()
>  } catch {
>    case e: FatalExitError =>
>      shutdownInitiated.countDown()
>      shutdownComplete.countDown()
>      info("Stopped")
>      Exit.exit(e.statusCode())
>    case e: Throwable =>
>      if (isRunning)
>        error("Error due to", e)
>  } finally {
>    shutdownComplete.countDown()
>  }
>  info("Stopped")
> }{noformat}
> For the Throwable (non-fatal) case, it simply exits the while loop and the thread stops
doing work. I am not sure whether this is the intended behavior of ShutdownableThread,
or whether the exception should be caught so that we keep calling doWork().
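The failure mode described above can be modeled in a few lines of Scala. This is a sketch of the control flow only, not the real `ShutdownableThread`: once a non-fatal `Throwable` reaches the `catch`, control has already left the `while` loop, so logging the error cannot resume the work.

```scala
object ShutdownableLoop {
  // Simplified model of ShutdownableThread.run: returns how the loop ended.
  def run(doWork: () => Unit, isRunning: () => Boolean): String =
    try {
      while (isRunning()) doWork()
      "stopped cleanly" // isRunning flipped to false
    } catch {
      case e: Throwable =>
        // mirrors the `error("Error due to", e)` branch: the error is only
        // logged, and the while loop above is not re-entered
        s"stopped by ${e.getMessage}"
    }
}

// doWork throws on its third invocation; the loop ends even though isRunning
// never became false -- the replica-fetcher symptom in miniature.
object LoopDemo {
  def main(args: Array[String]): Unit = {
    var calls = 0
    val outcome = ShutdownableLoop.run(
      () => { calls += 1; if (calls == 3) throw new RuntimeException("OffsetOutOfRange") },
      () => true
    )
    println(s"$outcome after $calls calls") // stopped by OffsetOutOfRange after 3 calls
  }
}
```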

This message was sent by Atlassian JIRA
