Date: Wed, 18 Mar 2015 11:13:08 -0700
Subject: Re: Attempting to Push Aurora to ~11 requests/second
From: Bill Farner
To: "dev@aurora.incubator.apache.org"

If you can provide a script to trigger this in the vagrant environment, it
will be a tremendous help in finding the cause.

On Wednesday, March 18, 2015, Ryan Orr wrote:

> We're attempting to get Aurora to handle 11 job requests a second. We
> realize we're going to be limited by the resource-intensive nature of the
> CLI and will be giving Herc a shot here soon; however, we seem to get the
> scheduler to crash in less than an hour while submitting about 2
> jobs/second.
>
> The job we're submitting simply echoes "Hello" every minute for 5 minutes,
> then exits.
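For concreteness, a job like the one described above might be written roughly as follows in Aurora's Python-based configuration DSL. This is a sketch only: the cluster, role, environment, and resource values are placeholder assumptions, not details from the original report.

```python
# hello.aurora -- hypothetical reconstruction of the load-test job:
# echo "Hello" once a minute for 5 minutes, then exit.
hello = Process(
    name = 'hello',
    cmdline = 'for i in 1 2 3 4 5; do echo Hello; sleep 60; done')

hello_task = Task(
    name = 'hello',
    processes = [hello],
    # Deliberately tiny footprint so many instances fit on a test cluster.
    resources = Resources(cpu = 0.1, ram = 16*MB, disk = 16*MB))

jobs = [Job(
    cluster = 'devcluster',      # placeholder cluster name
    role = 'www-data',           # placeholder role
    environment = 'prod',
    name = 'hello',
    task = hello_task)]
```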
> Here's what the logs look like at the time of the crash:
>
> I0316 18:22:54.348743 20799 log.cpp:680] Attempting to append 1650 bytes to the log
> I0316 18:22:54.348808 20799 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 22668
> I0316 18:22:54.348989 20799 replica.cpp:508] Replica received write request for position 22668
> I0316 18:22:54.352383 20799 leveldb.cpp:343] Persisting action (1671 bytes) to leveldb took 3.370462ms
> I0316 18:22:54.352414 20799 replica.cpp:676] Persisted action at 22668
> I0316 18:22:54.352565 20799 replica.cpp:655] Replica received learned notice for position 22668
> I0316 18:22:54.356220 20799 leveldb.cpp:343] Persisting action (1673 bytes) to leveldb took 3.634224ms
> I0316 18:22:54.356253 20799 replica.cpp:676] Persisted action at 22668
> I0316 18:22:54.356266 20799 replica.cpp:661] Replica learned APPEND action at position 22668
> I0316 18:22:54.561 THREAD127 org.apache.zookeeper.ClientCnxn$SendThread.startConnect: Opening socket connection to server sandbox01/192.168.4.101:2181
> I0316 18:22:54.563 THREAD127 org.apache.zookeeper.ClientCnxn$SendThread.primeConnection: Socket connection established to sandbox01/192.168.4.101:2181, initiating session
> I0316 18:22:54.564 THREAD127 org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult: Session establishment complete on server sandbox01/192.168.4.101:2181, sessionid = 0x14c03fdca89bc0f, negotiated timeout = 4000
> I0316 18:22:55.374 THREAD143 org.apache.aurora.scheduler.async.OfferManager$OfferManagerImpl.addOffer: Returning offers for 20150304-172015-1711581376-5050-15195-S5 for compaction
> I0316 18:22:55.374 THREAD143 org.apache.aurora.scheduler.async.OfferManager$OfferManagerImpl.addOffer: Returning offers for 20150304-172015-1711581376-5050-15195-S3 for compaction
> I0316 18:22:55.394 THREAD137 com.twitter.common.util.StateMachine$Builder$1.execute: 1426265234214-root-prod-gl6e9DL4vM3J0XTxM0k9GMT4QeXRTgYz-0-8e0df05c-b6c7-40bd-924f-b858f39da316 state machine transition PENDING -> ASSIGNED
> I0316 18:22:55.394 THREAD137 org.apache.aurora.scheduler.state.TaskStateMachine.addFollowup: Adding work command SAVE_STATE for 1426265234214-root-prod-gl6e9DL4vM3J0XTxM0k9GMT4QeXRTgYz-0-8e0df05c-b6c7-40bd-924f-b858f39da316
> I0316 18:22:55.394 THREAD137 org.apache.aurora.scheduler.state.TaskAssigner$TaskAssignerImpl.assign: Offer on slave {URL} (id 20150304-172015-1711581376-5050-15195-S2) is being assigned task for 1426265234214-root-prod-gl6e9DL4vM3J0XTxM0k9GMT4QeXRTgYz-0-8e0df05c-b6c7-40bd-924f-b858f39da316.
> I0316 18:22:55.395851 20795 log.cpp:680] Attempting to append 1451 bytes to the log
> I0316 18:22:55.395921 20795 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 22669
> I0316 18:22:55.396108 20795 replica.cpp:508] Replica received write request for position 22669
> I0316 18:22:57.621 THREAD1905 org.apache.aurora.scheduler.mesos.MesosScheduler.Impl.logStatusUpdate: Received status update for task 1426265234214-root-prod-gl6e9DL4vM3J0XTxM0k9GMT4QeXRTgYz-0-8e0df05c-b6c7-40bd-924f-b858f39da316 in state TASK_STARTING: Initializing sandbox.
> I0316 18:22:58.396 THREAD137 com.twitter.common.application.Lifecycle.shutdown: Shutting down application
> I0316 18:22:58.396 THREAD137 com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute: Executing 7 shutdown commands.
> I0316 18:22:58.561 THREAD137 org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute: Shutdown initiated by: Thread: TaskScheduler-0 (id 137)
> java.lang.Thread.getStackTrace(Thread.java:1589)
>
> The stack trace doesn't show anything interesting; I can't figure out why
> it just randomly issues a shutdown.
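When sifting a large scheduler log capture for this failure, the useful line is the `Shutdown initiated by` entry quoted above. A small, hypothetical helper (not part of Aurora) to pull the initiating thread out of a captured log might look like this:

```python
# Hypothetical log-scanning helper: find which thread initiated the
# scheduler shutdown. The pattern matches the "Shutdown initiated by"
# line format shown in the log excerpt above.
import re

SHUTDOWN_RE = re.compile(r'Shutdown initiated by: Thread: (\S+) \(id (\d+)\)')

def shutdown_initiator(log_text):
    """Return (thread_name, thread_id) from the first shutdown line, or None."""
    m = SHUTDOWN_RE.search(log_text)
    return (m.group(1), int(m.group(2))) if m else None
```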
> Let me know if there's anything else that would be helpful from the log;
> we're on an internal network, so I can't just copy/paste the logs over.

-- 
-=Bill
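A reproduction script of the kind requested above could be as simple as a paced submission loop. The sketch below is a hypothetical starting point: the actual `aurora job create` command line and job key are assumptions to be swapped for whatever the test cluster uses, and by default it only exercises the pacing (dry run).

```python
#!/usr/bin/env python
# Hypothetical load-generation sketch: submit jobs at a fixed rate
# (e.g. ~2 jobs/second, the rate at which the crash was observed).
# The aurora CLI invocation is an assumption; adjust for your cluster.
import subprocess
import time

def submit_jobs(rate, count, command=None):
    """Submit `count` jobs at `rate` jobs per second.

    `command` is an argv list such as
    ['aurora', 'job', 'create', 'devcluster/www-data/prod/hello', 'hello.aurora'];
    when None, only the pacing loop runs (a dry run)."""
    interval = 1.0 / rate
    deadline = time.time()
    names = []
    for i in range(count):
        if command is not None:
            subprocess.check_call(command)
        names.append('hello-%d' % i)
        # Sleep until the next submission slot to hold the target rate.
        deadline += interval
        sleep_for = deadline - time.time()
        if sleep_for > 0:
            time.sleep(sleep_for)
    return names

if __name__ == '__main__':
    # Dry run: pace 5 submissions at 10/second without invoking any CLI.
    print(submit_jobs(10.0, 5))
```

Pointing `command` at the real CLI and letting it run for an hour should reproduce the shutdown described above, if the rate is indeed the trigger.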