aurora-dev mailing list archives

From Maxim Khutornenko <ma...@apache.org>
Subject [PROPOSAL] DB snapshotting
Date Mon, 29 Feb 2016 19:34:53 GMT
(Apologies for the wordy problem statement, but I feel it is necessary
to justify the proposal.)

Over the past two weeks we have been battling a nasty scheduler issue
in production: the scheduler suddenly stops responding to any user
requests and subsequently gets killed by our health monitoring. After
a restart, the new leader may function for only a few seconds before
it hangs again.

The long and painful investigation pointed towards internal H2 table
lock contention that resulted in massive DB write starvation and a
state where a scheduler write lock would *never* be released. This was
relatively easy to replicate in Vagrant by creating a large update
(~4K instances) with a large batch_size (~1K) while bombarding the
scheduler with getJobUpdateDetails() requests for that job. The
scheduler would enter a locked-up state on the very first write op
following the update creation (e.g. a status update for an instance
transition from the first batch) and stay in that state for minutes,
until all getJobUpdateDetails() requests had been served. This
behavior is well explained by the following sentence from [1]:

    "When a lock is released, and multiple connections are waiting for
    it, one of them is picked at random."

What happens here is that when many more read requests are competing
for a shared table lock, the H2 PageStore does nothing to help write
requests, which need an exclusive table lock, to succeed. This leads
to DB write starvation and, eventually, to scheduler native store
write starvation, as there is no timeout on a scheduler write lock.
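
To make the effect of random picking concrete, here is a toy
simulation. It is only a sketch under simplifying assumptions, not H2
or Aurora code: one writer waits alongside a fixed number of readers,
every served reader is immediately replaced by a new read request, and
each lock release picks one waiter uniformly at random. The class and
method names are made up for illustration.

```java
import java.util.Random;

final class RandomLockPick {
    /**
     * Counts lock releases until the single waiting writer is picked.
     * Waiter 0 is the writer; waiters 1..readers are read requests.
     */
    static int roundsUntilWriterPicked(int readers, Random rng) {
        int rounds = 0;
        while (true) {
            rounds++;
            // Uniform random pick among all waiters, as in the quoted sentence.
            if (rng.nextInt(readers + 1) == 0) {
                return rounds;
            }
            // A reader was served; a fresh read request takes its place,
            // so the number of waiting readers stays constant.
        }
    }

    /** Averages the writer's wait over many independent trials. */
    static double averageRounds(int readers, int trials, long seed) {
        Random rng = new Random(seed);
        long total = 0;
        for (int i = 0; i < trials; i++) {
            total += roundsUntilWriterPicked(readers, rng);
        }
        return (double) total / trials;
    }

    public static void main(String[] args) {
        for (int readers : new int[] {0, 10, 100, 1000}) {
            System.out.printf("readers=%d avg rounds=%.1f%n",
                readers, averageRounds(readers, 10_000, 42L));
        }
    }
}
```

In this model the writer's expected wait is readers + 1 lock releases,
so it grows linearly with the read load. The real situation is worse,
since shared locks can be held by several readers at once and the
exclusive lock needs all of them gone.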

We played with the various available H2/MyBatis configuration
settings to mitigate the above, with no noticeable impact. That is,
until we switched to H2 MVStore [2], at which point we were able to
completely eliminate the scheduler lockup without making any other
code changes! So, has the solution finally been found? The answer
would be YES, until you try MVStore-enabled H2 with any reasonably
sized production DB on scheduler restart. There was a reason why we
disabled MVStore in the scheduler [3] in the first place, and that
reason was poor MVStore performance with bulk inserts. Re-populating
an MVStore-enabled H2 DB took at least 2.5 times longer than normal.
This is unacceptable in prod, where every second of scheduler downtime
counts.
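
For reference, H2 1.4.x lets you choose the storage engine per
database with the MV_STORE setting in the JDBC URL. The sketch below
assumes an in-memory database; the class name and database name are
hypothetical, and this is not how DbModule [3] is actually written.

```java
final class H2UrlSketch {
    /**
     * Builds an in-memory H2 JDBC URL with the storage engine selected
     * explicitly. MV_STORE=TRUE picks MVStore, MV_STORE=FALSE picks the
     * older PageStore (assumes H2 1.4.x, where both engines exist).
     */
    static String url(String dbName, boolean mvStore) {
        return "jdbc:h2:mem:" + dbName
            + ";DB_CLOSE_DELAY=-1"  // keep the in-memory DB alive between connections
            + ";MV_STORE=" + (mvStore ? "TRUE" : "FALSE");
    }

    public static void main(String[] args) {
        // "aurora" is an illustrative database name only.
        System.out.println(url("aurora", true));
        System.out.println(url("aurora", false));
    }
}
```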

Back at the drawing board, we tried all relevant settings and
approaches to speed up MVStore inserts on restart, but nothing really
helped. In the end, the only reasonable way forward was to eliminate
the point of slowness altogether, namely to remove the thrift-to-sql
migration on restart. Fortunately, H2 supports an easy-to-use command
that generates the entire DB dump with a single statement [4]. We were
now able to bypass the lengthy DB repopulation on restart by storing
the entire DB dump in the snapshot and replaying it on scheduler
restart.
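
The dump-and-replay idea can be sketched as follows, assuming a plain
JDBC connection to H2. SCRIPT without a TO clause returns the dump as
a result set with one SQL statement per row, and DROP ALL OBJECTS
clears the database before the replay. The class and method names are
hypothetical and this is not the POC code; a real implementation would
also have to decide how the statements are stored in the snapshot.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

final class DbDumpSketch {
    /** Collects the full DB dump as a list of SQL statements. */
    static List<String> dump(Connection conn) throws SQLException {
        List<String> script = new ArrayList<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SCRIPT")) {
            while (rs.next()) {
                script.add(rs.getString(1));
            }
        }
        return script;
    }

    /** Wipes the DB and replays a previously captured dump. */
    static void restore(Connection conn, List<String> script) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute("DROP ALL OBJECTS");
            for (String sql : script) {
                // The dump may contain comment-only rows; skip them.
                if (sql.startsWith("--")) {
                    continue;
                }
                st.execute(sql);
            }
        }
    }
}
```

On snapshot creation the result of dump() would be written alongside
the rest of the snapshot; on restart, restore() replaces the
row-by-row thrift-to-sql repopulation.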


Now, the proposal. Given that MVStore vastly outperforms the PageStore
we currently use, I suggest we move our H2 to it AND adopt db
snapshotting instead of thrift snapshotting to speed up scheduler
restarts. A rough POC is available here [5]. We have been running a
version of this build in production since last week and were able to
completely eliminate scheduler lockups. As a welcome side effect, we
also observed faster scheduler restart times due to eliminating
thrift-to-sql chattiness. Depending on the snapshot freshness, the
observed failover downtimes were reduced by ~40%.

Moving to db snapshotting will require us to rethink our DB schema
versioning and thrift deprecation/removal policy. We will have to move
to pre-/post-snapshot-restore SQL migration scripts to handle any
schema changes, which is a common industry pattern but something we
have not tried yet. The upside, though, is that we can get an early
start here, as we will have to adopt strict SQL migration rules anyway
when we move to persistent DB storage. Also, given that migrating to
H2 TaskStore will likely further degrade scheduler restart times,
having a better-performing DB snapshotting solution in place will
definitely aid the migration.

Thanks,
Maxim

[1] - http://www.h2database.com/html/advanced.html?#transaction_isolation
[2] - http://www.h2database.com/html/mvstore.html
[3] - https://github.com/apache/aurora/blob/824e396ab80874cfea98ef47829279126838a3b2/src/main/java/org/apache/aurora/scheduler/storage/db/DbModule.java#L119
[4] - http://www.h2database.com/html/grammar.html#script
[5] - https://github.com/maxim111333/incubator-aurora/blob/mv_store/src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl.java#L317-L370
