aurora-issues mailing list archives

From "Nathan Howell (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AURORA-470) Tasks get stuck in THROTTLED state on restart or leader change
Date Sat, 24 May 2014 05:34:01 GMT

    [ https://issues.apache.org/jira/browse/AURORA-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008049#comment-14008049 ]

Nathan Howell commented on AURORA-470:
--------------------------------------

There is nothing else related to the job or task in the log, and no entries around 20:53:47.
This is on the newly elected leader, which had become leader only shortly before, at 20:51:30.002.

It looks like the leader election is triggered by ZooKeeper timing out a session. If the
task state is updated through the replicated log, though, I wouldn't think that's related...

This is from the next failure (about an hour later), but this is happening semi-frequently
right now even though the ZK nodes are mostly idle, maybe around 5-10 IOPS. Here are the relevant
log sections:

.err (timestamps are from the file; they look like a mix of UTC and PDT)
{code}
I0523 21:56:34.022322 21919 leveldb.cpp:337] Persisting action (27 bytes) to leveldb took
271691ns
I0523 21:56:34.022356 21919 replica.cpp:671] Persisted action at 1402962
I0523 21:56:34.022377 21919 replica.cpp:656] Replica learned APPEND action at position 1402962
I0524 04:56:38.080 THREAD17 org.apache.zookeeper.ClientCnxn$SendThread.run: Client session
timed out, have not heard from server in 8300ms for sessionid 0x1461589250bea42, closing socket
connection and attempting reconnect
I0523 21:56:38.080940 21911 log.cpp:849] Attempting to append 2352 bytes to the log
I0523 21:56:38.081123 21903 coordinator.cpp:335] Coordinator attempting to write APPEND action
at position 1402962
I0523 21:56:38.081984 21916 replica.cpp:515] Replica received write request for position 1402962
I0523 21:56:38.082059 21916 leveldb.cpp:432] Reading position from leveldb took 23172ns
E0523 21:56:38.082765 21911 log.cpp:866] Failed to append the log: Coordinator demoted
I0524 04:56:38.382 THREAD17 org.apache.zookeeper.ClientCnxn$SendThread.startConnect: Opening
socket connection to server ...zook01.prod...
I0524 04:56:38.383 THREAD17 org.apache.zookeeper.ClientCnxn$SendThread.primeConnection: Socket
connection established to ...zook01.prod..., initiating session
I0524 04:56:38.384 THREAD17 org.apache.zookeeper.ClientCnxn$SendThread.run: Unable to reconnect
to ZooKeeper service, session 0x1461589250bea42 has expired, closing socket connection
I0524 04:56:38.385 THREAD18 org.apache.zookeeper.ZooKeeper.<init>: Initiating client
connection, connectString=...zook01.prod....,...zook02.prod...,...zook03.prod... sessionTimeout=3000
watcher=com.twitter.common.zookeeper.ZooKeeperClient$2@b2c67db
I0524 04:56:38.387 THREAD7720 org.apache.zookeeper.ClientCnxn$SendThread.startConnect: Opening
socket connection to server ...zook02.prod...
I0524 04:56:38.388 THREAD7720 org.apache.zookeeper.ClientCnxn$SendThread.primeConnection:
Socket connection established to ...zook02.prod..., initiating session
I0524 04:56:38.389 THREAD7720 org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult:
Session establishment complete on server ...zook02..., sessionid = 0x24615892536f7ef, negotiated
timeout = 4000
{code}

and the .log:
{code}
I0524 04:56:38.082 THREAD7718 com.twitter.common.application.Lifecycle.shutdown: Shutting
down application
I0524 04:56:38.083 THREAD7718 com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute:
Executing 8 shutdown commands.
I0524 04:56:38.087 THREAD7718 org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute:
Shutdown initiated by: Thread: Thread-7692 (id 7718)
java.lang.Thread.getStackTrace(Thread.java:1588)
  org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute(AppModule.java:141)
  com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute(ShutdownRegistry.java:88)
  com.twitter.common.application.Lifecycle.shutdown(Lifecycle.java:92)
  org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.disableLog(MesosLog.java:354)
  org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.mutate(MesosLog.java:374)
  org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.append(MesosLog.java:318)
  org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.append(MesosLog.java:148)
...
{code}

> Tasks get stuck in THROTTLED state on restart or leader change
> --------------------------------------------------------------
>
>                 Key: AURORA-470
>                 URL: https://issues.apache.org/jira/browse/AURORA-470
>             Project: Aurora
>          Issue Type: Story
>          Components: Scheduler
>    Affects Versions: 0.5.0
>            Reporter: Nathan Howell
>
> We're seeing cases where tasks get stuck in the THROTTLED state indefinitely. From what
I can tell from the logs, this happens if a task is throttled when Aurora is shut down or a
new leader is elected.
> It looks like the timer that changes the state from THROTTLED to PENDING is only set up
on a transition to the THROTTLED state... it seems like there is no way to get these tasks
running again except to restart them manually.
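The suspected failure mode described above can be sketched in a few lines. This is a hypothetical illustration of the pattern, not the actual Aurora scheduler code; all class and method names here (ThrottleSketch, throttle, recoverFromLog, tick) are made up for the example:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the suspected bug: the THROTTLED -> PENDING wakeup is
// armed only inside the transition path, so a task restored from the
// replicated log already in THROTTLED never gets a wakeup and stays stuck.
class ThrottleSketch {
  enum State { PENDING, THROTTLED }

  final Map<String, State> tasks = new HashMap<>();
  final Map<String, Long> wakeups = new HashMap<>(); // taskId -> wakeup time (ms)

  // Normal path: transitioning into THROTTLED arms the back-off timer.
  void throttle(String taskId, long backoffMs, long nowMs) {
    tasks.put(taskId, State.THROTTLED);
    wakeups.put(taskId, nowMs + backoffMs); // timer armed on transition only
  }

  // Restart/failover path: state is restored directly from the log, so no
  // transition fires and no timer is armed -- the suspected bug.
  void recoverFromLog(String taskId, State persisted) {
    tasks.put(taskId, persisted);
  }

  // Timer tick: move tasks whose back-off has expired back to PENDING.
  void tick(long nowMs) {
    wakeups.entrySet().removeIf(e -> {
      if (e.getValue() <= nowMs) {
        tasks.put(e.getKey(), State.PENDING);
        return true;
      }
      return false;
    });
  }
}
```

With this shape, a task throttled through the normal path eventually returns to PENDING, while one recovered from the log in THROTTLED never does. The natural fix would be for the scheduler to re-arm (or immediately fire) wakeups for all THROTTLED tasks during startup recovery.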



--
This message was sent by Atlassian JIRA
(v6.2#6252)
