Date: Wed, 18 Mar 2015 11:13:08 -0700
Subject: Re: Attempting to Push Aurora to ~11 requests/second
From: Bill Farner
To: "dev@aurora.incubator.apache.org"

If you can provide a script to trigger this in the vagrant environment, it
will be a tremendous help in finding the cause.

On Wednesday, March 18, 2015, Ryan Orr wrote:

> We're attempting to get Aurora to handle 11 job requests a second. We
> realize we're going to be limited by the resource-intensive nature of the
> CLI and will be giving Herc a shot here soon; however, we seem to get the
> scheduler to crash in less than an hour while submitting about 2
> jobs/second.
>
> The job we're submitting simply echoes "Hello" every minute for 5 minutes,
> then exits.
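For concreteness, a job like the one described above might be written roughly as follows in Aurora's Python-based configuration DSL. This is a sketch only: the cluster, role, environment, and resource values are placeholder assumptions, not details from the original report.

```python
# hello.aurora -- hypothetical reconstruction of the load-test job:
# echo "Hello" once a minute for 5 minutes, then exit.
hello = Process(
    name = 'hello',
    cmdline = 'for i in 1 2 3 4 5; do echo Hello; sleep 60; done')

hello_task = Task(
    name = 'hello',
    processes = [hello],
    # Deliberately tiny footprint so many instances fit on a test cluster.
    resources = Resources(cpu = 0.1, ram = 16*MB, disk = 16*MB))

jobs = [Job(
    cluster = 'devcluster',      # placeholder cluster name
    role = 'www-data',           # placeholder role
    environment = 'prod',
    name = 'hello',
    task = hello_task)]
```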
> Here's what the logs look like at the time of the crash:
>
> I0316 18:22:54.348743 20799 log.cpp:680] Attempting to append 1650 bytes to the log
> I0316 18:22:54.348808 20799 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 22668
> I0316 18:22:54.348989 20799 replica.cpp:508] Replica received write request for position 22668
> I0316 18:22:54.352383 20799 leveldb.cpp:343] Persisting action (1671 bytes) to leveldb took 3.370462ms
> I0316 18:22:54.352414 20799 replica.cpp:676] Persisted action at 22668
> I0316 18:22:54.352565 20799 replica.cpp:655] Replica received learned notice for position 22668
> I0316 18:22:54.356220 20799 leveldb.cpp:343] Persisting action (1673 bytes) to leveldb took 3.634224ms
> I0316 18:22:54.356253 20799 replica.cpp:676] Persisted action at 22668
> I0316 18:22:54.356266 20799 replica.cpp:661] Replica learned APPEND action at position 22668
> I0316 18:22:54.561 THREAD127 org.apache.zookeeper.ClientCnxn$SendThread.startConnect: Opening socket connection to server sandbox01/192.168.4.101:2181
> I0316 18:22:54.563 THREAD127 org.apache.zookeeper.ClientCnxn$SendThread.primeConnection: Socket connection established to sandbox01/192.168.4.101:2181, initiating session
> I0316 18:22:54.564 THREAD127 org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult: Session establishment complete on server sandbox01/192.168.4.101:2181, sessionid = 0x14c03fdca89bc0f, negotiated timeout = 4000
> I0316 18:22:55.374 THREAD143 org.apache.aurora.scheduler.async.OfferManager$OfferManagerImpl.addOffer: Returning offers for 20150304-172015-1711581376-5050-15195-S5 for compaction
> I0316 18:22:55.374 THREAD143 org.apache.aurora.scheduler.async.OfferManager$OfferManagerImpl.addOffer: Returning offers for 20150304-172015-1711581376-5050-15195-S3 for compaction
> I0316 18:22:55.394 THREAD137 com.twitter.common.util.StateMachine$Builder$1.execute: 1426265234214-root-prod-gl6e9DL4vM3J0XTxM0k9GMT4QeXRTgYz-0-8e0df05c-b6c7-40bd-924f-b858f39da316 state machine transition PENDING -> ASSIGNED
> I0316 18:22:55.394 THREAD137 org.apache.aurora.scheduler.state.TaskStateMachine.addFollowup: Adding work command SAVE_STATE for 1426265234214-root-prod-gl6e9DL4vM3J0XTxM0k9GMT4QeXRTgYz-0-8e0df05c-b6c7-40bd-924f-b858f39da316
> I0316 18:22:55.394 THREAD137 org.apache.aurora.scheduler.state.TaskAssigner$TaskAssignerImpl.assign: Offer on slave {URL} (id 20150304-172015-1711581376-5050-15195-S2) is being assigned task for 1426265234214-root-prod-gl6e9DL4vM3J0XTxM0k9GMT4QeXRTgYz-0-8e0df05c-b6c7-40bd-924f-b858f39da316.
> I0316 18:22:55.395851 20795 log.cpp:680] Attempting to append 1451 bytes to the log
> I0316 18:22:55.395921 20795 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 22669
> I0316 18:22:55.396108 20795 replica.cpp:508] Replica received write request for position 22669
> I0316 18:22:57.621 THREAD1905 org.apache.aurora.scheduler.mesos.MesosScheduler.Impl.logStatusUpdate: Received status update for task 1426265234214-root-prod-gl6e9DL4vM3J0XTxM0k9GMT4QeXRTgYz-0-8e0df05c-b6c7-40bd-924f-b858f39da316 in state TASK_STARTING: Initializing sandbox.
> I0316 18:22:58.396 THREAD137 com.twitter.common.application.Lifecycle.shutdown: Shutting down application
> I0316 18:22:58.396 THREAD137 com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute: Executing 7 shutdown commands.
> I0316 18:22:58.561 THREAD137 org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute: Shutdown initiated by: Thread: TaskScheduler-0 (id 137)
> java.lang.Thread.getStackTrace(Thread.java:1589)
>
> The stack trace doesn't show anything interesting; I can't figure out why
> it just randomly issues a shutdown.
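When sifting a large scheduler log capture for this failure, the useful line is the `Shutdown initiated by` entry quoted above. A small, hypothetical helper (not part of Aurora) to pull the initiating thread out of a captured log might look like this:

```python
# Hypothetical log-scanning helper: find which thread initiated the
# scheduler shutdown. The pattern matches the "Shutdown initiated by"
# line format shown in the log excerpt above.
import re

SHUTDOWN_RE = re.compile(r'Shutdown initiated by: Thread: (\S+) \(id (\d+)\)')

def shutdown_initiator(log_text):
    """Return (thread_name, thread_id) from the first shutdown line, or None."""
    m = SHUTDOWN_RE.search(log_text)
    return (m.group(1), int(m.group(2))) if m else None
```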
> Let me know if there's anything else that would be helpful from the log;
> we're on an internal network, so I can't just copy/paste the logs over.

-- 
-=Bill
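A reproduction script of the kind requested above could be as simple as a paced submission loop. The sketch below is a hypothetical starting point: the actual `aurora job create` command line and job key are assumptions to be swapped for whatever the test cluster uses, and by default it only exercises the pacing (dry run).

```python
#!/usr/bin/env python
# Hypothetical load-generation sketch: submit jobs at a fixed rate
# (e.g. ~2 jobs/second, the rate at which the crash was observed).
# The aurora CLI invocation is an assumption; adjust for your cluster.
import subprocess
import time

def submit_jobs(rate, count, command=None):
    """Submit `count` jobs at `rate` jobs per second.

    `command` is an argv list such as
    ['aurora', 'job', 'create', 'devcluster/www-data/prod/hello', 'hello.aurora'];
    when None, only the pacing loop runs (a dry run)."""
    interval = 1.0 / rate
    deadline = time.time()
    names = []
    for i in range(count):
        if command is not None:
            subprocess.check_call(command)
        names.append('hello-%d' % i)
        # Sleep until the next submission slot to hold the target rate.
        deadline += interval
        sleep_for = deadline - time.time()
        if sleep_for > 0:
            time.sleep(sleep_for)
    return names

if __name__ == '__main__':
    # Dry run: pace 5 submissions at 10/second without invoking any CLI.
    print(submit_jobs(10.0, 5))
```

Pointing `command` at the real CLI and letting it run for an hour should reproduce the shutdown described above, if the rate is indeed the trigger.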