aurora-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "David McLaughlin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AURORA-1840) Issue with Curator-backed discovery under heavy load
Date Thu, 01 Dec 2016 07:31:58 GMT

    [ https://issues.apache.org/jira/browse/AURORA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15711180#comment-15711180
] 

David McLaughlin commented on AURORA-1840:
------------------------------------------

Would anyone object to reverting the two commits above on master until we can root cause this,
since it will be a noop for anyone using the default settings and who aren't having issues
with curator? cc [~jsirois] [~zmanji] 


> Issue with Curator-backed discovery under heavy load
> ----------------------------------------------------
>
>                 Key: AURORA-1840
>                 URL: https://issues.apache.org/jira/browse/AURORA-1840
>             Project: Aurora
>          Issue Type: Bug
>          Components: Scheduler
>            Reporter: David McLaughlin
>            Priority: Blocker
>
> We've been having some performance issues recently with our production clusters at Twitter.
A side-effect of these are occassional stop-the-world GC pauses for up to 15 seconds. This
has been happening at our scale for quite some time, but previous versions of the Scheduler
were resilient to this and no leadership change would occur. 
> Since we moved to Curator, we are no longer resilient to these GC pauses. The Scheduler
is now failing over any time we see a GC pause, even though these pauses are within the session
timeout. Here is an example pause in the scheduler logs with the associated ZK session timeout
that leads to a failover:
> {code}
> I1118 19:40:16.871801 51800 sched.cpp:1025] Scheduler::statusUpdate took 586236ns
> I1118 19:40:16.902 [TaskGroupBatchWorker, StateMachine$Builder:389] redacted-9f565b4-067e-422f-b641-c6000f9ae2c8
state machine transition PENDING -> ASSIGNED 
> I1118 19:40:16.903 [TaskGroupBatchWorker, TaskStateMachine:474] Adding work command SAVE_STATE
for redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8 
> I1118 19:40:16.903 [TaskGroupBatchWorker, TaskAssigner$TaskAssignerImpl:130] Offer on
agent redacted (id 566ae347-c1b6-44ce-8551-b7a6cda72989-S7579) is being assigned task redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8.

> W1118 19:40:31.744 [main-SendThread(redacted:2181), ClientCnxn$SendThread:1108] Client
session timed out, have not heard from server in 20743ms for sessionid 0x6584fd2b34ede86 
> {code}
> As you can see from the timestamps, there was a 15s GC pause (confirmed in our GC logs
- a CMS promotion failure caused the pause) and this somehow triggers a session timeout of
20s to fire. Note: we have seen GC pauses as little as 7s cause the same behavior. 
> That's problem one. The second problem is that we are using the 'zk_session_timeout'
flag and have the value set to 30s. This does not seem to be passed to the Curator client
as it is getting a value of 20s from somewhere. 
> We have confirmed that the Scheduler no longer fails over when deploying from HEAD with
these two commits reverted and setting zk_use_curator to false:
> https://github.com/apache/aurora/commit/b417be38fe1fcae6b85f7e91cea961ab272adf3f
> https://github.com/apache/aurora/commit/69cba786efc2628eab566201dfea46836a1d9af5
> This is a pretty big blocker for us given how expensive Scheduler failovers are (currently
several minutes for us). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message