aurora-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bill Farner (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AURORA-897) aurora scheduler unable to restore from snapshot
Date Mon, 27 Oct 2014 21:53:34 GMT

    [ https://issues.apache.org/jira/browse/AURORA-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14185885#comment-14185885
] 

Bill Farner commented on AURORA-897:
------------------------------------

Thanks for filing!  The snapshot will be really helpful for debugging.  Feel free to attach
it here if you're comfortable with that, otherwise please reach out to me for a private exchange.

> aurora scheduler unable to restore from snapshot
> ------------------------------------------------
>
>                 Key: AURORA-897
>                 URL: https://issues.apache.org/jira/browse/AURORA-897
>             Project: Aurora
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 0.5.0
>            Reporter: Bhuvan Arumugam
>            Priority: Blocker
>
> Since this weekend, we are unable to upgrade scheduler. It fail as it couldn't restore
replicated log from snapshot. See following traceback for details.
> I couldn't replicate it in vagrant that have >2 jobs. It happen in one of our test
clusters, running with 1 scheduler. I could reproduce this problem, with restoring the snapshot
manually using {{aurora_admin scheduler_stage_recovery}} command. The backup size is 1.5M,
if it matter. I can share the snapshot if it'll help to debug.
> Scheduler works as of this commit: https://github.com/apache/incubator-aurora/commit/f2d0ca299ad759e9720a5a82547f7bce8a427d53
> Scheduler DOES NOT work as of this commit: https://github.com/apache/incubator-aurora/commit/3778330cd37bc8b96907654b496c15152dfecea7
> Commit delta: https://github.com/apache/incubator-aurora/compare/f2d0ca299ad759e9720a5a82547f7bce8a427d53...3778330cd37bc8b96907654b496c15152dfecea7
> I suspect one of commits related to AURORA-84 have caused this regression. The StorageBackfill
class was modified to fix that jira.
> {code}
> I1025 16:07:31.336 THREAD1 org.apache.aurora.scheduler.storage.log.LogStorage.replay:
Applying snapshot taken on Sat Oct 25 16:05:29 UTC 2014
> I1025 16:07:31.339 THREAD1 org.apache.aurora.scheduler.storage.log.SnapshotStoreImpl$9.execute:
Restoring snapshot.
> I1025 16:07:31.891233 37970 leveldb.cpp:438] Reading position from leveldb took 254951ns
> I1025 16:07:31.893616 37976 leveldb.cpp:438] Reading position from leveldb took 110726ns
> I1025 16:07:31.909817 37971 leveldb.cpp:438] Reading position from leveldb took 186434ns
> I1025 16:07:31.913910 37971 leveldb.cpp:438] Reading position from leveldb took 81185ns
> I1025 16:07:31.917501 37983 leveldb.cpp:438] Reading position from leveldb took 31971ns
> I1025 16:07:31.919 THREAD1 org.apache.aurora.scheduler.storage.StorageBackfill.backfill:
Backfilling task config job keys.
> I1025 16:07:32.023 THREAD1 org.apache.aurora.scheduler.storage.StorageBackfill.backfill:
Performing shard uniqueness sanity check.
> E1025 16:07:32.076 THREAD1 org.apache.aurora.scheduler.SchedulerLifecycle$9.execute:
Caught unchecked exception: java.lang.IllegalArgumentException
> java.lang.IllegalArgumentException
>         at com.google.common.base.Preconditions.checkArgument(Preconditions.java:111)
>         at org.apache.aurora.scheduler.base.JobKeys.assertValid(JobKeys.java:100)
>         at org.apache.aurora.scheduler.base.Query$Builder.byInstances(Query.java:337)
>         at org.apache.aurora.scheduler.base.Query.instanceScoped(Query.java:102)
>         at org.apache.aurora.scheduler.storage.StorageBackfill.guaranteeShardUniqueness(StorageBackfill.java:74)
>         at org.apache.aurora.scheduler.storage.StorageBackfill.access$100(StorageBackfill.java:44)
>         at org.apache.aurora.scheduler.storage.StorageBackfill$2.apply(StorageBackfill.java:158)
>         at org.apache.aurora.scheduler.storage.StorageBackfill$2.apply(StorageBackfill.java:151)
>         at org.apache.aurora.scheduler.storage.mem.MemTaskStore.mutateTasks(MemTaskStore.java:194)
>         at com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87)
>         at org.apache.aurora.scheduler.storage.log.WriteAheadStorage.mutateTasks(WriteAheadStorage.java:196)
>         at org.apache.aurora.scheduler.storage.StorageBackfill.backfill(StorageBackfill.java:151)
>         at org.apache.aurora.scheduler.SchedulerLifecycle$6$1.execute(SchedulerLifecycle.java:251)
>         at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:132)
>         at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult$Quiet.apply(Storage.java:149)
>         at org.apache.aurora.scheduler.storage.log.LogStorage$7.apply(LogStorage.java:562)
>         at org.apache.aurora.scheduler.storage.log.LogStorage$7.apply(LogStorage.java:559)
>         at org.apache.aurora.scheduler.storage.mem.MemStorage.doWork(MemStorage.java:175)
>         at org.apache.aurora.scheduler.storage.mem.MemStorage.access$400(MemStorage.java:59)
>         at org.apache.aurora.scheduler.storage.mem.MemStorage$4.apply(MemStorage.java:200)
>         at org.apache.aurora.scheduler.storage.mem.MemStorage$4.apply(MemStorage.java:197)
>         at org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:150)
>         at org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101)
>         at org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:197)
>         at org.apache.aurora.scheduler.storage.log.LogStorage.write(LogStorage.java:559)
>         at org.apache.aurora.scheduler.storage.log.LogStorage$2.execute(LogStorage.java:331)
>         at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:132)
>         at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult$Quiet.apply(Storage.java:149)
>         at org.apache.aurora.scheduler.storage.mem.MemStorage.doWork(MemStorage.java:175)
>         at org.apache.aurora.scheduler.storage.mem.MemStorage.access$400(MemStorage.java:59)
>         at org.apache.aurora.scheduler.storage.mem.MemStorage$4.apply(MemStorage.java:200)
>         at org.apache.aurora.scheduler.storage.mem.MemStorage$4.apply(MemStorage.java:197)
>         at org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:150)
>         at org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101)
>         at org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:197)
>         at org.apache.aurora.scheduler.storage.log.LogStorage.write(LogStorage.java:548)
>         at org.apache.aurora.scheduler.storage.log.LogStorage.start(LogStorage.java:320)
>         at org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.start(CallOrderEnforcingStorage.java:92)
>         at org.apache.aurora.scheduler.SchedulerLifecycle$6.execute(SchedulerLifecycle.java:248)
>         at org.apache.aurora.scheduler.SchedulerLifecycle$6.execute(SchedulerLifecycle.java:244)
>         at com.twitter.common.base.Closures$4.execute(Closures.java:120)
>         at org.apache.aurora.scheduler.SchedulerLifecycle$9.execute(SchedulerLifecycle.java:423)
>         at org.apache.aurora.scheduler.SchedulerLifecycle$9.execute(SchedulerLifecycle.java:419)
>         at com.twitter.common.base.Closures$4.execute(Closures.java:120)
>         at com.twitter.common.base.Closures$3.execute(Closures.java:98)
>         at com.twitter.common.util.StateMachine.transition(StateMachine.java:191)
>         at org.apache.aurora.scheduler.SchedulerLifecycle$SchedulerCandidateImpl.onLeading(SchedulerLifecycle.java:465)
>         at com.twitter.common.zookeeper.SingletonService$1.onElected(SingletonService.java:168)
>         at com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange(CandidateImpl.java:155)
>         at com.twitter.common.zookeeper.Group$GroupMonitor.setMembers(Group.java:665)
>         at com.twitter.common.zookeeper.Group$GroupMonitor.watchGroup(Group.java:638)
>         at com.twitter.common.zookeeper.Group$GroupMonitor.access$900(Group.java:579)
>         at com.twitter.common.zookeeper.Group$4.get(Group.java:551)
>         at com.twitter.common.zookeeper.Group$4.get(Group.java:548)
>         at com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:109)
>         at com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:107)
>         at com.twitter.common.util.BackoffHelper.doUntilResult(BackoffHelper.java:127)
>         at com.twitter.common.util.BackoffHelper.doUntilSuccess(BackoffHelper.java:107)
>         at com.twitter.common.zookeeper.Group.watch(Group.java:548)
>         at com.twitter.common.zookeeper.CandidateImpl.offerLeadership(CandidateImpl.java:136)
>         at com.twitter.common.zookeeper.SingletonService.lead(SingletonService.java:165)
>         at org.apache.aurora.scheduler.app.SchedulerMain.run(SchedulerMain.java:253)
>         at com.twitter.common.application.AppLauncher.run(AppLauncher.java:102)
>         at com.twitter.common.application.AppLauncher.launch(AppLauncher.java:181)
>         at com.twitter.common.application.AppLauncher.launch(AppLauncher.java:142)
>         at org.apache.aurora.scheduler.app.SchedulerMain.main(SchedulerMain.java:269)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message