aurora-issues mailing list archives

From "Bill Farner (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (AURORA-420) scheduler crash due to corrupt replica data?
Date Wed, 14 May 2014 19:29:15 GMT

    [ https://issues.apache.org/jira/browse/AURORA-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997915#comment-13997915
] 

Bill Farner edited comment on AURORA-420 at 5/14/14 7:27 PM:
-------------------------------------------------------------

By default, the scheduler will automatically fail over after 24 hours \[1\] of leading.  This
is a workaround for a limitation in the replicated log: we have no means to trigger LevelDB
compaction (see MESOS-184 for more details).
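As a rough illustration of the mechanism (a minimal sketch with hypothetical class and method names, not Aurora's actual SchedulerLifecycle code), the periodic failover amounts to arming a one-shot timer when the scheduler starts leading:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Minimal sketch: arm a one-shot timer when this scheduler becomes leader,
// and run the failover callback once the maximum leading period elapses.
// Class and method names here are illustrative, not Aurora's API.
public class LeadingTimer {
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();

    // Called when leadership is acquired; triggerFailover would transition
    // the lifecycle state machine (ACTIVE -> DEAD) and exit the process.
    public void onLeading(Runnable triggerFailover, long delay, TimeUnit unit) {
        executor.schedule(triggerFailover, delay, unit);
    }
}
```

With the 24-hour default this would be {{onLeading(failover, 24, TimeUnit.HOURS)}}; another scheduler then takes over leadership with a fresh local store, which is what keeps the scheduler's LevelDB from growing without bound in the absence of compaction.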

{quote}
Last time when we started scheduler after similar crash, all jobs were lost.
{quote}

This part is troubling, and something we have not seen.  Can you provide more details on your
setup?
- Is there some sort of supervisor (e.g. monit, upstart) restarting the scheduler on exit?
- How many schedulers are running in the cluster?
- Did you override the {{-native_log_quorum_size}} command line argument \[2\]?  If so, to
what value?
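For context on that last question: the replicated log requires a strict majority of replicas to persist a write before it is considered durable, so the quorum size has to be kept consistent with the number of schedulers in the cluster. The majority rule can be sketched as follows (an illustrative helper, not Aurora's code):

```java
// Illustrative helper for the majority-quorum rule used by the replicated
// log: a write is durable once a strict majority of replicas persist it.
// This is not Aurora's code; the class name is hypothetical.
public class QuorumSize {
    public static int quorumFor(int schedulers) {
        if (schedulers < 1) {
            throw new IllegalArgumentException("need at least one scheduler");
        }
        return schedulers / 2 + 1;
    }

    public static void main(String[] args) {
        // A 3-scheduler cluster needs a quorum of 2 and tolerates one failure.
        // A mis-set quorum (e.g. 1 in a 3-node cluster) risks divergent replicas.
        System.out.println(quorumFor(1)); // 1
        System.out.println(quorumFor(3)); // 2
        System.out.println(quorumFor(5)); // 3
    }
}
```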

{quote}
We were running around 30 apps in different slaves during the crash. The apps are still running
in slaves though.
{quote}

When the scheduler restarted, did it appear to have a completely blank database, or just stale?
In the master UI, did it show up as a new framework?

\[1\] https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/SchedulerModule.java#L54
\[2\] https://github.com/apache/incubator-aurora/blob/master/docs/deploying-aurora-scheduler.md



> scheduler crash due to corrupt replica data?
> --------------------------------------------
>
>                 Key: AURORA-420
>                 URL: https://issues.apache.org/jira/browse/AURORA-420
>             Project: Aurora
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 0.6.0
>            Reporter: Bhuvan Arumugam
>
> We are using latest as of https://github.com/apache/incubator-aurora/commit/90423243977f141002319f9cd4bd59bcee33aefe.
Technically it's 0.5.1-snapshot.
> The scheduler seems to crash due to corrupt data in the replica. It has crashed twice in
the last two days. Here is the log snippet.
> Last time we started the scheduler after a similar crash, all jobs were lost. We were
running around 30 apps on different slaves during the crash. The apps are still running on
the slaves, though. The slaves are shown as running in the master UI. The scheduler seems
to have trouble reconnecting to the running tasks when it comes back online. FWIW, we are
not using checkpointing.
> Can you let me know:
>   1. how to prevent the crashes?
>   2. how to recover jobs from the replica backup?
> {code}
> I0513 15:07:39.982774 25560 log.cpp:680] Attempting to append 125 bytes to the log
> I0513 15:07:39.982879 25545 coordinator.cpp:340] Coordinator attempting to write APPEND
action at position 29779
> I0513 15:07:39.983695 25543 replica.cpp:508] Replica received write request for position
29779
> I0513 15:07:39.986923 25543 leveldb.cpp:341] Persisting action (144 bytes) to leveldb
took 3.177192ms
> I0513 15:07:39.986961 25543 replica.cpp:676] Persisted action at 29779
> I0513 15:07:39.987192 25543 replica.cpp:655] Replica received learned notice for position
29779
> I0513 15:07:39.989861 25543 leveldb.cpp:341] Persisting action (146 bytes) to leveldb
took 2.637372ms
> I0513 15:07:39.989895 25543 replica.cpp:676] Persisted action at 29779
> I0513 15:07:39.989907 25543 replica.cpp:661] Replica learned APPEND action at position
29779
> I0513 22:07:46.621 THREAD5299 org.apache.aurora.scheduler.async.OfferQueue$OfferQueueImpl.addOffer:
Returning offers for 20140512-151150-360689681-5050-7152-6 for compaction.
> I0513 22:08:39.641 THREAD5301 org.apache.aurora.scheduler.async.OfferQueue$OfferQueueImpl.addOffer:
Returning offers for 20140512-151150-360689681-5050-7152-9 for compaction.
> I0513 22:10:20.474 THREAD29 org.apache.aurora.scheduler.SchedulerLifecycle$6$4.run: Triggering
automatic failover.
> I0513 22:10:20.475 THREAD29 com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
state machine transition ACTIVE -> DEAD
> I0513 15:10:20.486500 25562 sched.cpp:731] Stopping framework '2014-03-26-13:02:35-360689681-5050-31080-0000'
> I0513 22:10:20.486 THREAD29 com.twitter.common.util.StateMachine$Builder$1.execute: storage
state machine transition READY -> STOPPED
> W0513 22:10:20.486 THREAD24 com.twitter.common.zookeeper.ServerSetImpl$ServerSetWatcher.notifyServerSetChange:
server set empty for path /aurora/scheduler
> I0513 22:10:20.486 THREAD31 com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
state machine transition DEAD -> DEAD
> I0513 22:10:20.486 THREAD29 com.twitter.common.application.Lifecycle.shutdown: Shutting
down application
> I0513 22:10:20.487 THREAD31 org.apache.aurora.scheduler.SchedulerLifecycle$8.execute:
Shutdown already invoked, ignoring extra call.
> W0513 22:10:20.486 THREAD24 org.apache.aurora.scheduler.http.LeaderRedirect$SchedulerMonitor.onChange:
No schedulers in host set, will not redirect despite not being leader.
> I0513 22:10:20.487 THREAD29 com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute:
Executing 8 shutdown commands.
> W0513 22:10:20.488 THREAD24 com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange:
All candidates have temporarily left the group: Group /aurora/scheduler
> E0513 22:10:20.488 THREAD24 org.apache.aurora.scheduler.SchedulerLifecycle$SchedulerCandidateImpl.onDefeated:
Lost leadership, committing suicide.
> I0513 22:10:20.489 THREAD24 com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
state machine transition DEAD -> DEAD
> I0513 22:10:20.489 THREAD24 org.apache.aurora.scheduler.SchedulerLifecycle$8.execute:
Shutdown already invoked, ignoring extra call.
> I0513 22:10:20.491 THREAD29 org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute:
Shutdown initiated by: Thread: Lifecycle-0 (id 29)
> java.lang.Thread.getStackTrace(Thread.java:1588)
>   org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute(AppModule.java:151)
>   com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute(ShutdownRegistry.java:88)
>   com.twitter.common.application.Lifecycle.shutdown(Lifecycle.java:92)
>   org.apache.aurora.scheduler.SchedulerLifecycle$8.execute(SchedulerLifecycle.java:382)
>   org.apache.aurora.scheduler.SchedulerLifecycle$8.execute(SchedulerLifecycle.java:354)
>   com.twitter.common.base.Closures$4.execute(Closures.java:120)
>   com.twitter.common.base.Closures$3.execute(Closures.java:98)
>   com.twitter.common.util.StateMachine.transition(StateMachine.java:191)
>   org.apache.aurora.scheduler.SchedulerLifecycle$6$4.run(SchedulerLifecycle.java:287)
>   java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
>   java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
>   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   java.lang.Thread.run(Thread.java:744)
> I0513 22:10:20.491 THREAD29 com.twitter.common.stats.TimeSeriesRepositoryImpl$3.execute:
Variable sampler shut down
> I0513 22:10:20.491 THREAD29 org.apache.aurora.scheduler.thrift.ThriftServerLauncher$1.execute:
Stopping thrift server.
> I0513 22:10:20.491 THREAD29 org.apache.aurora.scheduler.thrift.ThriftServer.shutdown:
Received shutdown request, stopping server.
> I0513 22:10:20.491 THREAD29 org.apache.aurora.scheduler.thrift.ThriftServer.setStatus:
Moving from status ALIVE to STOPPING
> I0513 22:10:20.492 THREAD29 org.apache.aurora.scheduler.thrift.ThriftServer.setStatus:
Moving from status STOPPING to STOPPED
> I0513 22:10:20.492 THREAD29 com.twitter.common.application.modules.HttpModule$HttpServerLauncher$1.execute:
Shutting down embedded http server
> I0513 22:10:20.492 THREAD29 org.mortbay.log.Slf4jLog.info: Stopped SelectChannelConnector@0.0.0.0:8081
> I0513 22:10:20.594 THREAD29 com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
state machine transition DEAD -> DEAD
> I0513 22:10:20.594 THREAD29 org.apache.aurora.scheduler.SchedulerLifecycle$8.execute:
Shutdown already invoked, ignoring extra call.
> I0513 22:10:20.595 THREAD1 com.twitter.common.application.AppLauncher.run: Application
run() exited.
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
