Date: Fri, 17 Jan 2014 09:25:19 +0000 (UTC)
From: "Bill Farner (JIRA)"
To: dev@aurora.incubator.apache.org
Subject: [jira] [Updated] (AURORA-51) Scheduler stalls during startup if storage recovery fails

     [ https://issues.apache.org/jira/browse/AURORA-51?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Farner updated
AURORA-51:
------------------------------

    Summary: Scheduler stalls during startup if storage recovery fails  (was: Scheduler stalls during startup if )

> Scheduler stalls during startup if storage recovery fails
> ---------------------------------------------------------
>
>                 Key: AURORA-51
>                 URL: https://issues.apache.org/jira/browse/AURORA-51
>             Project: Aurora
>          Issue Type: Bug
>          Components: Scheduler
>            Reporter: Bill Farner
>            Assignee: Bill Farner
>            Priority: Critical
>
> If SchedulerLifecycle encounters a RuntimeException while initializing storage, it takes no action to abort. The result is a leader in ZK that will never make progress and requires human intervention (killing the process).
> It would be prudent to consider a sweeping improvement in the course of fixing this, such as initiating a shutdown on any uncaught exception when transitioning in SchedulerLifecycle.
> {noformat}
> E0117 09:04:17.426 THREAD21 org.apache.zookeeper.ClientCnxn$EventThread.processEvent: Error while calling watcher
> org.apache.aurora.scheduler.storage.log.LogStorage$RecoveryFailedException: org.apache.aurora.scheduler.log.Log$Stream$StreamAccessException: Problem reading from log
>     at org.apache.aurora.scheduler.storage.log.LogStorage.recover(LogStorage.java:329)
>     at com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87)
>     at org.apache.aurora.scheduler.storage.log.LogStorage$2.execute(LogStorage.java:303)
>     at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:138)
>     at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult$Quiet.apply(Storage.java:155)
>     at org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:146)
>     at com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87)
>     at org.apache.aurora.scheduler.storage.ForwardingStore.write(ForwardingStore.java:105)
>     at org.apache.aurora.scheduler.storage.log.LogStorage.write(LogStorage.java:475)
>     at org.apache.aurora.scheduler.storage.log.LogStorage.start(LogStorage.java:298)
>     at org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.start(CallOrderEnforcingStorage.java:94)
>     at org.apache.aurora.scheduler.SchedulerLifecycle$5.execute(SchedulerLifecycle.java:240)
>     at org.apache.aurora.scheduler.SchedulerLifecycle$5.execute(SchedulerLifecycle.java:237)
>     at com.twitter.common.base.Closures$4.execute(Closures.java:120)
>     at com.twitter.common.base.Closures$4.execute(Closures.java:120)
>     at com.twitter.common.base.Closures$3.execute(Closures.java:98)
>     at com.twitter.common.util.StateMachine.transition(StateMachine.java:191)
>     at org.apache.aurora.scheduler.SchedulerLifecycle$SchedulerCandidateImpl.onLeading(SchedulerLifecycle.java:446)
>     at com.twitter.common.zookeeper.SingletonService$1.onElected(SingletonService.java:168)
>     at com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange(CandidateImpl.java:155)
>     at com.twitter.common.zookeeper.Group$GroupMonitor.setMembers(Group.java:665)
>     at com.twitter.common.zookeeper.Group$GroupMonitor.watchGroup(Group.java:638)
>     at com.twitter.common.zookeeper.Group$GroupMonitor.access$900(Group.java:579)
>     at com.twitter.common.zookeeper.Group$GroupMonitor$2.get(Group.java:600)
>     at com.twitter.common.zookeeper.Group$GroupMonitor$2.get(Group.java:597)
>     at com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:109)
>     at com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:107)
>     at com.twitter.common.util.BackoffHelper.doUntilResult(BackoffHelper.java:127)
>     at com.twitter.common.util.BackoffHelper.doUntilSuccess(BackoffHelper.java:107)
>     at com.twitter.common.zookeeper.Group$GroupMonitor.tryWatchGroup(Group.java:622)
>     at com.twitter.common.zookeeper.Group$GroupMonitor.access$1100(Group.java:579)
>     at com.twitter.common.zookeeper.Group$GroupMonitor$1.process(Group.java:591)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
> Caused by: org.apache.aurora.scheduler.log.Log$Stream$StreamAccessException: Problem reading from log
>     at org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream$2.hasNext(MesosLog.java:255)
>     at org.apache.aurora.scheduler.storage.log.LogManager$StreamManager.readFromBeginning(LogManager.java:190)
>     at org.apache.aurora.scheduler.storage.log.LogStorage.recover(LogStorage.java:323)
>     ... 33 more
> Caused by: org.apache.mesos.Log$OperationFailedException: Bad read range (includes pending entries)
>     at org.apache.mesos.Log$Reader.read(Native Method)
>     at org.apache.aurora.scheduler.log.mesos.MesosLogStreamModule$4.read(MesosLogStreamModule.java:168)
>     at org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream$2.hasNext(MesosLog.java:233)
>     ... 35 more
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
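
For illustration only, the "shutdown on any uncaught exception when transitioning" idea from the description could look roughly like the sketch below. Everything in it is an assumption made for the example: FailFastTransition, Step, and the injected shutdown hook are hypothetical names, not Aurora's SchedulerLifecycle or the com.twitter.common StateMachine API. The point is only that a RuntimeException thrown by a startup step (such as storage recovery) is turned into a single shutdown request instead of being swallowed by the calling watcher thread.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch; not the actual SchedulerLifecycle implementation. */
final class FailFastTransition {
  /** A startup step that may throw, for example starting storage. */
  interface Step {
    void run();
  }

  // Assumed shutdown hook; a real process might exit or give up leadership here.
  private final Runnable shutdown;
  private final AtomicBoolean shutdownRequested = new AtomicBoolean(false);

  FailFastTransition(Runnable shutdown) {
    this.shutdown = shutdown;
  }

  /**
   * Runs the step. Returns true if it completed normally. If it throws a
   * RuntimeException, logs it, invokes the shutdown hook at most once across
   * all calls, and returns false.
   */
  boolean transition(String name, Step step) {
    try {
      step.run();
      return true;
    } catch (RuntimeException e) {
      System.err.println("Transition '" + name + "' failed, initiating shutdown: " + e);
      if (shutdownRequested.compareAndSet(false, true)) {
        shutdown.run();
      }
      return false;
    }
  }

  boolean isShutdownRequested() {
    return shutdownRequested.get();
  }
}
```

With a wrapper of this shape, a failed recovery would cause the elected leader to stop rather than sit idle in ZK; whether the hook should exit the process or only abdicate leadership is a design choice this sketch leaves open.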