aurora-dev mailing list archives

From Maxim Khutornenko <ma...@apache.org>
Subject Re: [PROPOSAL] DB snapshotting
Date Wed, 02 Mar 2016 17:18:25 GMT
Thanks Bill! Filed https://issues.apache.org/jira/browse/AURORA-1627
to track it.

On Mon, Feb 29, 2016 at 11:41 AM, Bill Farner <wfarner@apache.org> wrote:
> Thanks for the detailed write-up and the real-world context!  I
> generally support momentum towards a single task store implementation,
> so +1 on dealing with that.
>
> I anticipated there would be a performance win from straight-to-SQL
> snapshots, so I am a +1 on that as well.
>
> In summary, +1 on all fronts!
>
> On Monday, February 29, 2016, Maxim Khutornenko <maxim@apache.org> wrote:
>
>> (Apologies for the wordy problem statement, but I feel it's really
>> necessary to justify the proposal.)
>>
>> Over the past two weeks we have been battling a nasty scheduler issue
>> in production: the scheduler suddenly stops responding to any user
>> requests and subsequently gets killed by our health monitoring. Upon
>> restart, a leader may function for only a few seconds before it hangs
>> again.
>>
>> The long and painful investigation pointed towards internal H2 table
>> lock contention that resulted in massive DB-write starvation and a
>> state where a scheduler write lock would *never* be released. This was
>> relatively easy to replicate in Vagrant by creating a large update
>> (~4K instances) with a large batch_size (~1K) while bombarding the
>> scheduler with getJobUpdateDetails() requests for that job. The
>> scheduler would enter a locked-up state on the very first write op
>> following the update creation (e.g. a status update for an instance
>> transition from the first batch) and stay in that state for minutes
>> until all getJobUpdateDetails() requests were served. This behavior is
>> well explained by the following sentence from [1]:
>>
>>     "When a lock is released, and multiple connections are waiting for
>> it, one of them is picked at random."
>>
>> What happens here is that when many read requests are competing for
>> the shared table lock, the H2 PageStore does nothing to help a write
>> request that needs the exclusive table lock succeed. This leads to
>> DB-write starvation and, eventually, to scheduler native store write
>> starvation, as there is no timeout on the scheduler write lock.
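>>
>> For illustration, here is a minimal sketch of that contention pattern,
>> assuming plain JDBC against an H2 1.4.x PageStore database
>> (MV_STORE=FALSE); the table name, row counts and thread counts are
>> made up for the example, they are not our production setup:
>>
>>     import java.sql.*;
>>     import java.util.concurrent.*;
>>
>>     // Many readers repeatedly taking the shared table lock can starve a
>>     // single writer that needs the exclusive lock, since lock handoff is
>>     // random [1].
>>     public class PageStoreStarvation {
>>       static final String URL =
>>           "jdbc:h2:mem:contention;MV_STORE=FALSE;DB_CLOSE_DELAY=-1;LOCK_TIMEOUT=60000";
>>
>>       public static void main(String[] args) throws Exception {
>>         try (Connection c = DriverManager.getConnection(URL);
>>              Statement s = c.createStatement()) {
>>           s.execute("CREATE TABLE job_updates(id INT PRIMARY KEY, payload VARCHAR)");
>>           s.execute("INSERT INTO job_updates "
>>               + "SELECT x, SPACE(1000) FROM SYSTEM_RANGE(1, 50000)");
>>         }
>>
>>         // Reader storm, mimicking a flood of getJobUpdateDetails() calls.
>>         ExecutorService readers = Executors.newFixedThreadPool(32);
>>         for (int i = 0; i < 32; i++) {
>>           readers.submit(() -> {
>>             try (Connection c = DriverManager.getConnection(URL);
>>                  Statement s = c.createStatement()) {
>>               for (int j = 0; j < 200; j++) {
>>                 s.executeQuery(
>>                     "SELECT COUNT(*) FROM job_updates WHERE payload IS NOT NULL");
>>               }
>>             } catch (SQLException e) {
>>               e.printStackTrace();
>>             }
>>           });
>>         }
>>
>>         // A single write op competing for the exclusive table lock; with
>>         // PageStore it can stay blocked until the reader storm subsides.
>>         long start = System.nanoTime();
>>         try (Connection c = DriverManager.getConnection(URL);
>>              Statement s = c.createStatement()) {
>>           s.executeUpdate("UPDATE job_updates SET payload = 'x' WHERE id = 1");
>>         }
>>         System.out.printf("writer blocked for %d ms%n",
>>             (System.nanoTime() - start) / 1_000_000);
>>
>>         readers.shutdown();
>>         readers.awaitTermination(5, TimeUnit.MINUTES);
>>       }
>>     }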
>>
>> We played with the various available H2/MyBatis configuration settings
>> to mitigate the above, with no noticeable impact. That is, until we
>> switched to the H2 MVStore [2], at which point we were able to
>> completely eliminate the scheduler lockup without making any other
>> code changes! So, has the solution finally been found? The answer
>> would be YES, until you try MVStore-enabled H2 with a reasonably sized
>> production DB on scheduler restart. There was a reason we disabled
>> MVStore in the scheduler [3] in the first place, and that reason was
>> poor MVStore performance with bulk inserts. Re-populating an
>> MVStore-enabled H2 DB took at least 2.5 times longer than normal. This
>> is unacceptable in prod, where every second of scheduler downtime
>> counts.
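>>
>> For reference, the PageStore/MVStore switch itself is essentially a
>> connection setting; the URLs below are illustrative only, the real one
>> lives in DbModule [3]:
>>
>>     import java.sql.Connection;
>>     import java.sql.DriverManager;
>>
>>     class StoreEngineUrls {
>>       // PageStore (what we run today): fast bulk re-population on
>>       // restart, but prone to the table-lock starvation described above.
>>       static final String PAGE_STORE =
>>           "jdbc:h2:mem:aurora;MV_STORE=FALSE;DB_CLOSE_DELAY=-1";
>>
>>       // MVStore: no more lockups in our testing, but row-by-row
>>       // re-population on restart took at least 2.5x longer.
>>       static final String MV_STORE =
>>           "jdbc:h2:mem:aurora;MV_STORE=TRUE;DB_CLOSE_DELAY=-1";
>>
>>       static Connection open(boolean useMvStore) throws Exception {
>>         return DriverManager.getConnection(useMvStore ? MV_STORE : PAGE_STORE);
>>       }
>>     }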
>>
>> Back to the drawing board, we tried all relevant settings and
>> approaches to speed up MVStore inserts on restart, but nothing really
>> helped. Finally, the only reasonable way forward was to eliminate the
>> point of slowness altogether, namely to remove the thrift-to-SQL
>> migration on restart. Fortunately, H2 supports an easy-to-use command
>> that generates the entire DB dump with a single statement [4]. We were
>> now able to bypass the lengthy DB re-population on restart by storing
>> the entire DB dump in the snapshot and replaying it on scheduler
>> restart.
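>>
>> A minimal sketch of the dump/replay idea with plain JDBC (the actual
>> POC streams the SCRIPT rows into the scheduler snapshot rather than a
>> String list, see [5]; class and method names here are made up):
>>
>>     import java.sql.Connection;
>>     import java.sql.ResultSet;
>>     import java.sql.SQLException;
>>     import java.sql.Statement;
>>     import java.util.ArrayList;
>>     import java.util.List;
>>
>>     final class DbSnapshotSketch {
>>       // On snapshot: H2's SCRIPT command returns the whole schema plus
>>       // data as a result set of SQL statements, one per row [4].
>>       static List<String> dump(Connection db) throws SQLException {
>>         List<String> statements = new ArrayList<>();
>>         try (Statement s = db.createStatement();
>>              ResultSet rs = s.executeQuery("SCRIPT")) {
>>           while (rs.next()) {
>>             statements.add(rs.getString(1));
>>           }
>>         }
>>         return statements;
>>       }
>>
>>       // On restart: replay the saved statements against an empty H2
>>       // instance, skipping the per-row thrift-to-SQL re-population.
>>       // Assumes the schema has not already been applied to 'db'.
>>       static void restore(Connection db, List<String> statements)
>>           throws SQLException {
>>         try (Statement s = db.createStatement()) {
>>           for (String sql : statements) {
>>             s.execute(sql);
>>           }
>>         }
>>       }
>>     }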
>>
>>
>> Now, the proposal. Given that MVStore vastly outperforms the PageStore
>> we currently use, I suggest we move our H2 to it AND adopt DB
>> snapshotting instead of thrift snapshotting to speed up scheduler
>> restarts. A rough POC is available here [5]. We have been running a
>> version of this build in production since last week and have
>> completely eliminated scheduler lockups. As a welcome side effect, we
>> also observed faster scheduler restart times due to the elimination of
>> the thrift-to-SQL chattiness. Depending on snapshot freshness, the
>> observed failover downtimes were reduced by ~40%.
>>
>> Moving to DB snapshotting will require us to rethink DB schema
>> versioning and our thrift deprecation/removal policy. We will have to
>> move to pre-/post-snapshot-restore SQL migration scripts to handle any
>> schema changes, which is a common industry pattern but something we
>> have not tried yet (a rough sketch follows below). The upside, though,
>> is that we can get an early start here, as we will have to adopt
>> strict SQL migration rules anyway when we move to persistent DB
>> storage. Also, given that migrating to the H2 TaskStore will likely
>> further degrade scheduler restart times, having a better-performing DB
>> snapshotting solution in place will definitely aid that migration.
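>>
>> Purely hypothetical sketch of how pre-/post-restore migrations could
>> wrap the replay step (none of these scripts or helpers exist in Aurora
>> today; same imports as the dump/replay sketch above):
>>
>>     // Ordering only: fix up anything the old dump still references
>>     // before replaying it, then migrate the restored data to the
>>     // current schema.
>>     static void restoreWithMigrations(
>>         Connection db,
>>         List<String> dump,            // SCRIPT output from the snapshot
>>         List<String> preRestoreDdl,   // e.g. shims for old dump formats
>>         List<String> postRestoreDdl)  // e.g. ALTER/backfill to current schema
>>         throws SQLException {
>>       try (Statement s = db.createStatement()) {
>>         for (String sql : preRestoreDdl) {
>>           s.execute(sql);
>>         }
>>         for (String sql : dump) {
>>           s.execute(sql);
>>         }
>>         for (String sql : postRestoreDdl) {
>>           s.execute(sql);
>>         }
>>       }
>>     }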
>>
>> Thanks,
>> Maxim
>>
>> [1] - http://www.h2database.com/html/advanced.html?#transaction_isolation
>> [2] - http://www.h2database.com/html/mvstore.html
>> [3] -
>> https://github.com/apache/aurora/blob/824e396ab80874cfea98ef47829279126838a3b2/src/main/java/org/apache/aurora/scheduler/storage/db/DbModule.java#L119
>> [4] - http://www.h2database.com/html/grammar.html#script
>> [5] -
>> https://github.com/maxim111333/incubator-aurora/blob/mv_store/src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl.java#L317-L370
>>
