aurora-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mohammed, Ashfaque" <asmoham...@paypal.com.INVALID>
Subject Need help with aurora crash
Date Mon, 16 Mar 2015 18:30:21 GMT
Hi,

I am using aurora for running 1300 instances of jenkins CIs in a mesos cluster. Aurora is
running in a HA mode with a three node cluster. A few days ago aurora crashed and the end
log can be seen below (Note that I have changed the actual zookeeper IP addresses with IP1
and IP2).

Appreciate any help with regards to avoiding this kind of a crash in future.

Aurora version --> 0.6.1-snapshot

Thanks.
Ash.

============entries from the aurora log file below================================================
I0305 21:23:25.546087  8811 replica.cpp:661] Replica learned APPEND action at position 93907
I0305 21:23:25.546380  8811 replica.cpp:508] Replica received write request for position 93908
I0305 21:23:25.651371  8811 leveldb.cpp:343] Persisting action (524328 bytes) to leveldb took
104.79926ms
I0305 21:23:25.651468  8811 replica.cpp:676] Persisted action at 93908
I0305 21:23:25.768111  8809 replica.cpp:655] Replica received learned notice for position
93908
I0305 21:23:25.809016  8809 leveldb.cpp:343] Persisting action (524330 bytes) to leveldb took
40.811957ms
I0305 21:23:25.809109  8809 replica.cpp:676] Persisted action at 93908
I0305 21:23:25.809135  8809 replica.cpp:661] Replica learned APPEND action at position 93908
I0305 21:23:25.831048  8812 replica.cpp:508] Replica received write request for position 93909
I0305 21:23:25.859412  8812 leveldb.cpp:343] Persisting action (524328 bytes) to leveldb took
28.015678ms
I0305 21:23:25.859561  8812 replica.cpp:676] Persisted action at 93909
I0305 21:23:26.522222  8814 replica.cpp:655] Replica received learned notice for position
93909
I0305 21:23:26.552645  8814 leveldb.cpp:343] Persisting action (524330 bytes) to leveldb took
30.324183ms
I0305 21:23:26.552736  8814 replica.cpp:676] Persisted action at 93909
I0305 21:23:26.552763  8814 replica.cpp:661] Replica learned APPEND action at position 93909
I0305 21:23:32.186401  8810 network.hpp:423] ZooKeeper group memberships changed
I0305 21:23:32.186600  8815 group.cpp:658] Trying to get '/aurora/replicated-log/0000000012'
in ZooKeeper
I0305 21:23:32.197412  8815 group.cpp:658] Trying to get '/aurora/replicated-log/0000000016'
in ZooKeeper
I0305 21:23:32.208843  8815 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@IP1:8083,
log-replica(1)@IP2:8083 }
W0305 21:23:36.191 THREAD128 com.twitter.common.zookeeper.ServerSetImpl$ServerSetWatcher.notifyServerSetChange:
server set empty for path /aurora/scheduler
W0305 21:23:36.191 THREAD128 org.apache.aurora.scheduler.http.LeaderRedirect$SchedulerMonitor.onChange:
No schedulers in host set, will not redirect despite not being leader.
I0305 21:23:36.202 THREAD128 com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange: Candidate
/aurora/scheduler/singleton_candidate_0000000017 waiting for the next leader election, current
voting: [singleton_candidate_0000000012, singleton_candidate_0000000017]
I0305 21:28:36.605 THREAD128 com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange: Candidate
/aurora/scheduler/singleton_candidate_0000000017 is now leader of group: [singleton_candidate_0000000017]
I0305 21:28:36.608 THREAD128 com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
state machine transition STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
I0305 21:28:36.608 THREAD128 org.apache.aurora.scheduler.SchedulerLifecycle$6.execute: Elected
as leading scheduler!
I0305 21:28:36.628625  8815 log.cpp:656] Attempting to start the writer
I0305 21:28:36.629897  8809 replica.cpp:474] Replica received implicit promise request with
proposal 2
I0305 21:28:36.662684  8809 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took
32.691373ms
I0305 21:28:36.662773  8809 replica.cpp:342] Persisted promised to 2
I0305 21:28:42.186013  8809 network.hpp:423] ZooKeeper group memberships changed
I0305 21:28:42.186200  8814 group.cpp:658] Trying to get '/aurora/replicated-log/0000000016'
in ZooKeeper
I0305 21:28:42.197717  8813 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@IP1:8083
}
I0305 21:28:51.631983  8808 log.cpp:656] Attempting to start the writer
I0305 21:29:06.634974  8808 log.cpp:656] Attempting to start the writer
I0305 21:29:21.649663  8808 log.cpp:656] Attempting to start the writer
I0305 21:29:36.662432  8808 log.cpp:656] Attempting to start the writer
I0305 21:29:51.669358  8813 log.cpp:656] Attempting to start the writer
I0305 21:30:06.684087  8809 log.cpp:656] Attempting to start the writer
I0305 21:30:21.696292  8812 log.cpp:656] Attempting to start the writer
I0305 21:30:36.703007  8812 log.cpp:656] Attempting to start the writer
I0305 21:30:51.712620  8812 log.cpp:656] Attempting to start the writer
I0305 21:31:06.727669  8808 log.cpp:656] Attempting to start the writer
I0305 21:31:21.742475  8813 log.cpp:656] Attempting to start the writer
I0305 21:31:36.754021  8810 log.cpp:656] Attempting to start the writer
I0305 21:31:51.766616  8815 log.cpp:656] Attempting to start the writer
I0305 21:32:06.779209  8812 log.cpp:656] Attempting to start the writer
I0305 21:32:21.794214  8809 log.cpp:656] Attempting to start the writer
I0305 21:32:36.803851  8811 log.cpp:656] Attempting to start the writer
I0305 21:32:51.812130  8809 log.cpp:656] Attempting to start the writer
I0305 21:33:06.822957  8814 log.cpp:656] Attempting to start the writer
I0305 21:33:21.835047  8811 log.cpp:656] Attempting to start the writer
I0305 21:33:36.842078  8813 log.cpp:680] Attempting to append 5 bytes to the log
I0305 21:33:36.842 THREAD128 com.twitter.common.application.Lifecycle.shutdown: Shutting down
application
I0305 21:33:36.843 THREAD128 com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute:
Executing 7 shutdown commands.
I0305 21:33:36.854 THREAD128 org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute:
Shutdown initiated by: Thread: RedirectMonitor STARTING-EventThread (id 128)
java.lang.Thread.getStackTrace(Thread.java:1567)
  org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute(AppModule.java:159)
  com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute(ShutdownRegistry.java:88)
  com.twitter.common.application.Lifecycle.shutdown(Lifecycle.java:92)
  org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.disableLog(MesosLog.java:350)
  org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.mutate(MesosLog.java:369)
  org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.append(MesosLog.java:315)
  org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.readAll(MesosLog.java:238)
  org.apache.aurora.scheduler.storage.log.StreamManagerImpl.readFromBeginning(StreamManagerImpl.java:109)
  org.apache.aurora.scheduler.storage.log.LogStorage.recover(LogStorage.java:505)
  org.apache.aurora.scheduler.storage.log.LogStorage$$EnhancerByGuice$$598802ca.CGLIB$recover$6(<generated>:-1)
  org.apache.aurora.scheduler.storage.log.LogStorage$$EnhancerByGuice$$598802ca$$FastClassByGuice$$ea0e64a7.invoke(<generated>:-1)
  com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228)
  com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
  com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87)
  com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
  com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
  org.apache.aurora.scheduler.storage.log.LogStorage$$EnhancerByGuice$$598802ca.recover(<generated>:-1)
  org.apache.aurora.scheduler.storage.log.LogStorage$20.execute(LogStorage.java:485)
  org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:132)
  org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult$Quiet.apply(Storage.java:149)
  org.apache.aurora.scheduler.storage.mem.MemStorage$3.apply(MemStorage.java:147)
  org.apache.aurora.scheduler.storage.mem.MemStorage$3.apply(MemStorage.java:144)
  org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:137)
  org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$71c38b73.CGLIB$write$0(<generated>:-1)
  org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$71c38b73$$FastClassByGuice$$66aea947.invoke(<generated>:-1)
  com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228)
  com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
  org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101)
  com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
  com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
  org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$71c38b73.write(<generated>:-1)
  org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:144)
  org.apache.aurora.scheduler.storage.mem.MemStorage$$EnhancerByGuice$$cfe951ca.CGLIB$write$0(<generated>:-1)
  org.apache.aurora.scheduler.storage.mem.MemStorage$$EnhancerByGuice$$cfe951ca$$FastClassByGuice$$fb95f8d6.invoke(<generated>:-1)
  com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228)
  com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
  com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87)
  com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
  com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
  org.apache.aurora.scheduler.storage.mem.MemStorage$$EnhancerByGuice$$cfe951ca.write(<generated>:-1)
  org.apache.aurora.scheduler.storage.log.LogStorage.write(LogStorage.java:635)
  org.apache.aurora.scheduler.storage.log.LogStorage.start(LogStorage.java:479)
  org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.start(CallOrderEnforcingStorage.java:92)
  org.apache.aurora.scheduler.SchedulerLifecycle$6.execute(SchedulerLifecycle.java:257)
  org.apache.aurora.scheduler.SchedulerLifecycle$6.execute(SchedulerLifecycle.java:253)
  com.twitter.common.base.Closures$4.execute(Closures.java:120)
  org.apache.aurora.scheduler.SchedulerLifecycle$9.execute(SchedulerLifecycle.java:384)
  org.apache.aurora.scheduler.SchedulerLifecycle$9.execute(SchedulerLifecycle.java:380)
  com.twitter.common.base.Closures$4.execute(Closures.java:120)
  com.twitter.common.base.Closures$3.execute(Closures.java:98)
  com.twitter.common.util.StateMachine.transition(StateMachine.java:191)
  org.apache.aurora.scheduler.SchedulerLifecycle$SchedulerCandidateImpl.onLeading(SchedulerLifecycle.java:426)
  com.twitter.common.zookeeper.SingletonService$1.onElected(SingletonService.java:168)
  com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange(CandidateImpl.java:155)
  com.twitter.common.zookeeper.Group$GroupMonitor.setMembers(Group.java:665)
  com.twitter.common.zookeeper.Group$GroupMonitor.watchGroup(Group.java:638)
  com.twitter.common.zookeeper.Group$GroupMonitor.access$900(Group.java:579)
  com.twitter.common.zookeeper.Group$GroupMonitor$2.get(Group.java:600)
  com.twitter.common.zookeeper.Group$GroupMonitor$2.get(Group.java:597)
  com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:109)
  com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:107)
  com.twitter.common.util.BackoffHelper.doUntilResult(BackoffHelper.java:127)
  com.twitter.common.util.BackoffHelper.doUntilSuccess(BackoffHelper.java:107)
  com.twitter.common.zookeeper.Group$GroupMonitor.tryWatchGroup(Group.java:622)
  com.twitter.common.zookeeper.Group$GroupMonitor.access$1100(Group.java:579)
  com.twitter.common.zookeeper.Group$GroupMonitor$1.process(Group.java:591)
  org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
  org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
I0305 21:33:36.854 THREAD128 com.twitter.common.stats.TimeSeriesRepositoryImpl$3.execute:
Variable sampler shut down
I0305 21:33:36.855 THREAD128 org.apache.aurora.scheduler.http.JettyServerModule$HttpServerLauncher$2.execute:
Shutting down embedded http server
I0305 21:33:36.880 THREAD128 org.eclipse.jetty.server.handler.ContextHandler.doStop: stopped
o.e.j.s.ServletContextHandler{/,null}
I0305 21:33:36.938 THREAD128 com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
state machine transition LEADER_AWAITING_REGISTRATION -> DEAD
W0305 21:33:36.956 THREAD128 com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute:
Shutdown action failed.
com.twitter.common.util.StateMachine$IllegalStateTransitionException: State transition from
PREPARED to STOPPED is not allowed.
                at com.twitter.common.util.StateMachine.transition(StateMachine.java:183)
                at org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.stop(CallOrderEnforcingStorage.java:109)
                at org.apache.aurora.scheduler.SchedulerLifecycle$8.execute(SchedulerLifecycle.java:339)
                at org.apache.aurora.scheduler.SchedulerLifecycle$8.execute(SchedulerLifecycle.java:313)
                at com.twitter.common.base.Closures$4.execute(Closures.java:120)
                at com.twitter.common.base.Closures$3.execute(Closures.java:98)
                at com.twitter.common.util.StateMachine.transition(StateMachine.java:191)
                at org.apache.aurora.scheduler.SchedulerLifecycle$4.execute(SchedulerLifecycle.java:239)
                at com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute(ShutdownRegistry.java:88)
                at com.twitter.common.application.Lifecycle.shutdown(Lifecycle.java:92)
                at org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.disableLog(MesosLog.java:350)
                at org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.mutate(MesosLog.java:369)
                at org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.append(MesosLog.java:315)
                at org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.readAll(MesosLog.java:238)
                at org.apache.aurora.scheduler.storage.log.StreamManagerImpl.readFromBeginning(StreamManagerImpl.java:109)
                at org.apache.aurora.scheduler.storage.log.LogStorage.recover(LogStorage.java:505)
                at org.apache.aurora.scheduler.storage.log.LogStorage$$EnhancerByGuice$$598802ca.CGLIB$recover$6(<generated>)
                at org.apache.aurora.scheduler.storage.log.LogStorage$$EnhancerByGuice$$598802ca$$FastClassByGuice$$ea0e64a7.invoke(<generated>)
                at com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228)
                at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
                at com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87)
                at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
                at com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
                at org.apache.aurora.scheduler.storage.log.LogStorage$$EnhancerByGuice$$598802ca.recover(<generated>)
                at org.apache.aurora.scheduler.storage.log.LogStorage$20.execute(LogStorage.java:485)
                at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:132)
                at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult$Quiet.apply(Storage.java:149)
                at org.apache.aurora.scheduler.storage.mem.MemStorage$3.apply(MemStorage.java:147)
                at org.apache.aurora.scheduler.storage.mem.MemStorage$3.apply(MemStorage.java:144)
                at org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:137)
                at org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$71c38b73.CGLIB$write$0(<generated>)
                at org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$71c38b73$$FastClassByGuice$$66aea947.invoke(<generated>)
                at com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228)
                at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
                at org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101)
                at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
                at com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
                at org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$71c38b73.write(<generated>)
                at org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:144)
                at org.apache.aurora.scheduler.storage.mem.MemStorage$$EnhancerByGuice$$cfe951ca.CGLIB$write$0(<generated>)
                at org.apache.aurora.scheduler.storage.mem.MemStorage$$EnhancerByGuice$$cfe951ca$$FastClassByGuice$$fb95f8d6.invoke(<generated>)
                at com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228)
                at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
                at com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87)
                at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
                at com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
                at org.apache.aurora.scheduler.storage.mem.MemStorage$$EnhancerByGuice$$cfe951ca.write(<generated>)
                at org.apache.aurora.scheduler.storage.log.LogStorage.write(LogStorage.java:635)
                at org.apache.aurora.scheduler.storage.log.LogStorage.start(LogStorage.java:479)
                at org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.start(CallOrderEnforcingStorage.java:92)
                at org.apache.aurora.scheduler.SchedulerLifecycle$6.execute(SchedulerLifecycle.java:257)
                at org.apache.aurora.scheduler.SchedulerLifecycle$6.execute(SchedulerLifecycle.java:253)
                at com.twitter.common.base.Closures$4.execute(Closures.java:120)
                at org.apache.aurora.scheduler.SchedulerLifecycle$9.execute(SchedulerLifecycle.java:384)
                at org.apache.aurora.scheduler.SchedulerLifecycle$9.execute(SchedulerLifecycle.java:380)
                at com.twitter.common.base.Closures$4.execute(Closures.java:120)
                at com.twitter.common.base.Closures$3.execute(Closures.java:98)
                at com.twitter.common.util.StateMachine.transition(StateMachine.java:191)
                at org.apache.aurora.scheduler.SchedulerLifecycle$SchedulerCandidateImpl.onLeading(SchedulerLifecycle.java:426)
                at com.twitter.common.zookeeper.SingletonService$1.onElected(SingletonService.java:168)
                at com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange(CandidateImpl.java:155)
                at com.twitter.common.zookeeper.Group$GroupMonitor.setMembers(Group.java:665)
                at com.twitter.common.zookeeper.Group$GroupMonitor.watchGroup(Group.java:638)
                at com.twitter.common.zookeeper.Group$GroupMonitor.access$900(Group.java:579)
                at com.twitter.common.zookeeper.Group$GroupMonitor$2.get(Group.java:600)
                at com.twitter.common.zookeeper.Group$GroupMonitor$2.get(Group.java:597)
                at com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:109)
                at com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:107)
                at com.twitter.common.util.BackoffHelper.doUntilResult(BackoffHelper.java:127)
                at com.twitter.common.util.BackoffHelper.doUntilSuccess(BackoffHelper.java:107)
                at com.twitter.common.zookeeper.Group$GroupMonitor.tryWatchGroup(Group.java:622)
                at com.twitter.common.zookeeper.Group$GroupMonitor.access$1100(Group.java:579)
                at com.twitter.common.zookeeper.Group$GroupMonitor$1.process(Group.java:591)
                at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
                at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)

I0305 21:33:36.960 THREAD1 com.twitter.common.application.AppLauncher.run: Application run()
exited.


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message