aurora-issues mailing list archives

From "Bill Farner (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (AURORA-420) scheduler crash due to corrupt replica data?
Date Mon, 19 May 2014 22:41:40 GMT

     [ https://issues.apache.org/jira/browse/AURORA-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Farner resolved AURORA-420.
--------------------------------

    Resolution: Cannot Reproduce

Closing, as we don't have enough information to act on this issue.  I spoke with [~bhuvan]
offline, and unfortunately he does not have scheduler logs from this event, which are
critical for pinning down the data loss mentioned.  I strongly encourage [~bhuvan] or anyone else
experiencing similar data loss to reopen this ticket.

I believe we've already clarified the regular scheduler restarts, so in that regard the scheduler
is working as expected.

> scheduler crash due to corrupt replica data?
> --------------------------------------------
>
>                 Key: AURORA-420
>                 URL: https://issues.apache.org/jira/browse/AURORA-420
>             Project: Aurora
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 0.6.0
>            Reporter: Bhuvan Arumugam
>         Attachments: aurora-scheduler.log.previous-201405-13-1816
>
>
> We are using the latest code as of https://github.com/apache/incubator-aurora/commit/90423243977f141002319f9cd4bd59bcee33aefe.
Technically it's 0.5.1-snapshot.
> The scheduler seems to crash due to corrupt data in the replica. It has crashed twice in the last
2 days. Here is the log snippet.
> The last time we started the scheduler after a similar crash, all jobs were lost. We were
running around 30 apps on different slaves during the crash. The apps are still running on the
slaves, though. The slaves are shown as running in the master UI. The scheduler seems to have trouble
reconnecting to the running tasks when it comes back online. FWIW, we are not using checkpointing.
> Can you let me know:
>   1. How to prevent the crashes?
>   2. How to recover jobs from a replica backup?
> {code}
> I0513 15:07:39.982774 25560 log.cpp:680] Attempting to append 125 bytes to the log
> I0513 15:07:39.982879 25545 coordinator.cpp:340] Coordinator attempting to write APPEND
action at position 29779
> I0513 15:07:39.983695 25543 replica.cpp:508] Replica received write request for position
29779
> I0513 15:07:39.986923 25543 leveldb.cpp:341] Persisting action (144 bytes) to leveldb
took 3.177192ms
> I0513 15:07:39.986961 25543 replica.cpp:676] Persisted action at 29779
> I0513 15:07:39.987192 25543 replica.cpp:655] Replica received learned notice for position
29779
> I0513 15:07:39.989861 25543 leveldb.cpp:341] Persisting action (146 bytes) to leveldb
took 2.637372ms
> I0513 15:07:39.989895 25543 replica.cpp:676] Persisted action at 29779
> I0513 15:07:39.989907 25543 replica.cpp:661] Replica learned APPEND action at position
29779
> I0513 22:07:46.621 THREAD5299 org.apache.aurora.scheduler.async.OfferQueue$OfferQueueImpl.addOffer:
Returning offers for 20140512-151150-360689681-5050-7152-6 for compaction.
> I0513 22:08:39.641 THREAD5301 org.apache.aurora.scheduler.async.OfferQueue$OfferQueueImpl.addOffer:
Returning offers for 20140512-151150-360689681-5050-7152-9 for compaction.
> I0513 22:10:20.474 THREAD29 org.apache.aurora.scheduler.SchedulerLifecycle$6$4.run: Triggering
automatic failover.
> I0513 22:10:20.475 THREAD29 com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
state machine transition ACTIVE -> DEAD
> I0513 15:10:20.486500 25562 sched.cpp:731] Stopping framework '2014-03-26-13:02:35-360689681-5050-31080-0000'
> I0513 22:10:20.486 THREAD29 com.twitter.common.util.StateMachine$Builder$1.execute: storage
state machine transition READY -> STOPPED
> W0513 22:10:20.486 THREAD24 com.twitter.common.zookeeper.ServerSetImpl$ServerSetWatcher.notifyServerSetChange:
server set empty for path /aurora/scheduler
> I0513 22:10:20.486 THREAD31 com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
state machine transition DEAD -> DEAD
> I0513 22:10:20.486 THREAD29 com.twitter.common.application.Lifecycle.shutdown: Shutting
down application
> I0513 22:10:20.487 THREAD31 org.apache.aurora.scheduler.SchedulerLifecycle$8.execute:
Shutdown already invoked, ignoring extra call.
> W0513 22:10:20.486 THREAD24 org.apache.aurora.scheduler.http.LeaderRedirect$SchedulerMonitor.onChange:
No schedulers in host set, will not redirect despite not being leader.
> I0513 22:10:20.487 THREAD29 com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute:
Executing 8 shutdown commands.
> W0513 22:10:20.488 THREAD24 com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange:
All candidates have temporarily left the group: Group /aurora/scheduler
> E0513 22:10:20.488 THREAD24 org.apache.aurora.scheduler.SchedulerLifecycle$SchedulerCandidateImpl.onDefeated:
Lost leadership, committing suicide.
> I0513 22:10:20.489 THREAD24 com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
state machine transition DEAD -> DEAD
> I0513 22:10:20.489 THREAD24 org.apache.aurora.scheduler.SchedulerLifecycle$8.execute:
Shutdown already invoked, ignoring extra call.
> I0513 22:10:20.491 THREAD29 org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute:
Shutdown initiated by: Thread: Lifecycle-0 (id 29)
> java.lang.Thread.getStackTrace(Thread.java:1588)
>   org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute(AppModule.java:151)
>   com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute(ShutdownRegistry.java:88)
>   com.twitter.common.application.Lifecycle.shutdown(Lifecycle.java:92)
>   org.apache.aurora.scheduler.SchedulerLifecycle$8.execute(SchedulerLifecycle.java:382)
>   org.apache.aurora.scheduler.SchedulerLifecycle$8.execute(SchedulerLifecycle.java:354)
>   com.twitter.common.base.Closures$4.execute(Closures.java:120)
>   com.twitter.common.base.Closures$3.execute(Closures.java:98)
>   com.twitter.common.util.StateMachine.transition(StateMachine.java:191)
>   org.apache.aurora.scheduler.SchedulerLifecycle$6$4.run(SchedulerLifecycle.java:287)
>   java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
>   java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
>   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   java.lang.Thread.run(Thread.java:744)
> I0513 22:10:20.491 THREAD29 com.twitter.common.stats.TimeSeriesRepositoryImpl$3.execute:
Variable sampler shut down
> I0513 22:10:20.491 THREAD29 org.apache.aurora.scheduler.thrift.ThriftServerLauncher$1.execute:
Stopping thrift server.
> I0513 22:10:20.491 THREAD29 org.apache.aurora.scheduler.thrift.ThriftServer.shutdown:
Received shutdown request, stopping server.
> I0513 22:10:20.491 THREAD29 org.apache.aurora.scheduler.thrift.ThriftServer.setStatus:
Moving from status ALIVE to STOPPING
> I0513 22:10:20.492 THREAD29 org.apache.aurora.scheduler.thrift.ThriftServer.setStatus:
Moving from status STOPPING to STOPPED
> I0513 22:10:20.492 THREAD29 com.twitter.common.application.modules.HttpModule$HttpServerLauncher$1.execute:
Shutting down embedded http server
> I0513 22:10:20.492 THREAD29 org.mortbay.log.Slf4jLog.info: Stopped SelectChannelConnector@0.0.0.0:8081
> I0513 22:10:20.594 THREAD29 com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
state machine transition DEAD -> DEAD
> I0513 22:10:20.594 THREAD29 org.apache.aurora.scheduler.SchedulerLifecycle$8.execute:
Shutdown already invoked, ignoring extra call.
> I0513 22:10:20.595 THREAD1 com.twitter.common.application.AppLauncher.run: Application
run() exited.
> {code}
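
For anyone reading the excerpt above: it interleaves two log formats, and their clocks differ by about seven hours (compare the sched.cpp line at 15:10:20 with the scheduler lines at 22:10:20 around it). The sketch below is one possible way to put both formats on a single timeline. It is not part of Aurora or Mesos; the function name parse_line, the default year, and the seven-hour offset are assumptions made for illustration, and the offset in particular is only a guess that the native lines use local time while the scheduler lines use UTC. Lines wrapped by the mail archive would need to be rejoined before parsing.

```python
import re
from datetime import datetime, timedelta

# Native (glog-style) lines, e.g.
#   I0513 15:07:39.982774 25560 log.cpp:680] Attempting to append ...
GLOG = re.compile(
    r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6}) +(\d+) ([\w.]+:\d+)\] (.*)$"
)
# Scheduler (JVM) lines, e.g.
#   I0513 22:10:20.474 THREAD29 some.Class.method: Triggering ...
JVM = re.compile(
    r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d{3}) (THREAD\d+) (\S+): ?(.*)$"
)


def parse_line(line, year=2014, native_offset=timedelta(hours=7)):
    """Parse one unwrapped log line; return a dict, or None if unrecognized.

    year and native_offset are assumptions: the log lines carry no year,
    and the offset is inferred from the excerpt, not from any documentation.
    """
    text = line.lstrip("> ").rstrip()  # drop the email quote prefix
    for source, pattern in (("native", GLOG), ("jvm", JVM)):
        match = pattern.match(text)
        if match is None:
            continue
        level, mmdd, clock, thread, origin, message = match.groups()
        stamp = datetime.strptime(f"{year}{mmdd} {clock}", "%Y%m%d %H:%M:%S.%f")
        if source == "native":
            stamp += native_offset  # shift onto the scheduler's clock
        return {
            "source": source,
            "level": level,
            "time": stamp,
            "thread": thread,   # process id for native lines, THREADn for JVM lines
            "origin": origin,   # file:line for native lines, class.method for JVM lines
            "message": message,
        }
    return None
```

Sorting the parsed records by their "time" field gives one ordered view of the shutdown sequence. Stack-trace lines and other unrecognized text return None and can be attached to the preceding record if needed.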



--
This message was sent by Atlassian JIRA
(v6.2#6252)
