aurora-dev mailing list archives

From Maxim Khutornenko <>
Subject Re: [PROPOSAL] DB snapshotting
Date Wed, 02 Mar 2016 20:50:37 GMT
Ah, sorry, missed that conversation on IRC.

I have not looked into that. It would be interesting to explore that
route. Given that our ultimate goal is to get rid of the replicated log
altogether, though, it does not strike me as an immediate priority.

On Wed, Mar 2, 2016 at 11:51 AM, Erb, Stephan
<> wrote:
> +1 for the plan and the ticket.
> In addition, for reference a couple of messages from IRC from yesterday:
> 23:42 <serb> mkhutornenko: interesting storage proposal on the mailing list! I only wondered one thing...
> 23:42 <serb> it feels kind of weird that we use H2 as a non-replicated database and build some scaffolding around it in order to distribute its state via the Mesos replicated log
> 23:42 <serb> Have you looked into H2, if it would be possible to replace/subclass their in-process transaction log with a replicated Mesos one?
> 23:43 <serb> Then we would not need the logic that performs simultaneous inserts into the log and the task store, as the backend would handle that by itself
> 23:44 <serb> (I know close to nothing about the storage layer, so that's my perspective from 10,000 feet)
> 00:22 <wfarner> serb: that crossed my mind as well. I have only drilled in a bit, would love to do more
> ________________________________________
> From: Maxim Khutornenko <>
> Sent: Wednesday, March 2, 2016 18:18
> To:
> Subject: Re: [PROPOSAL] DB snapshotting
> Thanks Bill! Filed
> to track it.
> On Mon, Feb 29, 2016 at 11:41 AM, Bill Farner <> wrote:
>> Thanks for the detailed write up and real-world details!  I generally
>> support momentum towards a single task store implementation, so +1
>> on dealing with that.
>> I anticipated there would be a performance win from straight-to-SQL
>> snapshots, so I am a +1 on that as well.
>> In summary, +1 on all fronts!
>> On Monday, February 29, 2016, Maxim Khutornenko <> wrote:
>>> (Apologies for the wordy problem statement, but I feel it's really
>>> necessary to justify the proposal.)
>>> Over the past two weeks we have been battling a nasty scheduler issue
>>> in production: the scheduler suddenly stops responding to any user
>>> requests and subsequently gets killed by our health monitoring. Upon
>>> restart, a leader may only function for a few seconds and almost
>>> immediately hangs again.
>>> The long and painful investigation pointed towards internal H2 table
>>> lock contention that resulted in a massive db-write starvation and a
>>> state where a scheduler write lock would *never* be released. This was
>>> relatively easy to replicate in Vagrant by creating a large update
>>> (~4K instances) with a large batch_size (~1K), while bombarding the
>>> scheduler with getJobUpdateDetails() requests for that job. The
>>> scheduler would enter a locked up state on the very first write op
>>> following the update creation (e.g. a status update for an instance
>>> transition from the first batch) and stay in that state for minutes
>>> until all getJobUpdateDetails() requests are served. This behavior is
>>> well explained by the following sentence from [1]:
>>>     "When a lock is released, and multiple connections are waiting for
>>> it, one of them is picked at random."
>>> What happens here is that when many read requests are competing for
>>> a shared table lock, the H2 PageStore does nothing to help write
>>> requests that require an exclusive table lock succeed. This leads to
>>> db-write starvation and, eventually, scheduler native store write
>>> starvation, as there is no timeout on the scheduler write lock.
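To make the failure mode concrete, here is a toy simulation of the quoted handoff policy (a released lock goes to a uniformly random waiter) with new readers arriving as fast as they are served. This is an illustrative model only, not H2 code, and all names in it are made up:

```java
import java.util.Random;

public class LockStarvationSim {
    // One writer waits for an exclusive lock. Each round the lock is
    // released and handed to one of the current waiters uniformly at
    // random; because reads arrive continuously, `newReadersPerRound`
    // fresh readers join after every handoff. Returns how many rounds
    // the writer waited before being picked.
    static int roundsUntilWriterServed(int initialReaders, int newReadersPerRound, Random rng) {
        int readers = initialReaders;
        int rounds = 0;
        while (true) {
            rounds++;
            // Pick one of (readers + 1 writer) uniformly at random.
            if (rng.nextInt(readers + 1) == 0) {
                return rounds; // the writer finally got the lock
            }
            readers = readers - 1 + newReadersPerRound; // one reader served, new ones arrive
        }
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        long total = 0;
        int trials = 1000;
        for (int i = 0; i < trials; i++) {
            total += roundsUntilWriterServed(50, 1, rng);
        }
        // With 50 steady readers, the writer wins each round with
        // probability 1/51, so its expected wait is ~51 handoffs.
        System.out.println("avg rounds writer waited: " + (total / (double) trials));
    }
}
```

With a steady reader population the writer's wait is geometrically distributed, and if readers arrive faster than they are served its odds shrink every round, matching the minutes-long stalls described above.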
>>> We have played with various available H2/MyBatis configuration
>>> settings to mitigate the above, with no noticeable impact. That is,
>>> until we switched to the H2 MVStore [2], at which point we were able
>>> to completely eliminate the scheduler lockup without making any
>>> other code changes! So, has the solution finally been found? The
>>> answer would be YES, until you restart the scheduler with
>>> MVStore-enabled H2 against any reasonably sized production DB. There
>>> was a reason why we disabled MVStore in the scheduler [3] in the
>>> first place, and that reason was poor MVStore performance with bulk
>>> inserts. Re-populating the MVStore-enabled H2 DB took at least 2.5
>>> times longer than normal. This is unacceptable in prod, where every
>>> second of scheduler downtime counts.
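For reference, H2 selects its storage backend via the MV_STORE connection setting; a sketch of the two JDBC URLs (the database name here is illustrative, not the scheduler's actual URL):

```
# MVStore backend
jdbc:h2:mem:aurora;MV_STORE=TRUE

# PageStore backend, i.e. what disabling MVStore [3] effectively selects
jdbc:h2:mem:aurora;MV_STORE=FALSE
```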
>>> Back to the drawing board, we tried all relevant settings and
>>> approaches to speed up MVStore inserts on restart, but nothing
>>> really helped. Finally, the only reasonable way forward was to
>>> eliminate the point of slowness altogether, namely to remove the
>>> thrift-to-SQL migration on restart. Fortunately, H2 supports an
>>> easy-to-use command that generates the entire DB dump with a single
>>> statement [4]. We were now able to bypass the lengthy DB
>>> repopulation on restart by storing the entire DB dump in the
>>> snapshot and replaying it on scheduler restart.
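For the curious, the dump-and-replay cycle boils down to H2's SCRIPT statement, paired with RUNSCRIPT for replay; a minimal sketch (the file name is illustrative — in the scheduler the dump would presumably be captured into the snapshot blob rather than written to disk, which SCRIPT supports by returning the statements as a result set when the TO clause is omitted):

```sql
-- Serialize the entire DB (schema + data) into a single SQL script.
SCRIPT TO 'snapshot.sql';

-- On restart, rebuild the DB by replaying the script in one pass.
RUNSCRIPT FROM 'snapshot.sql';
```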
>>> Now, the proposal. Given that MVStore vastly outperforms the
>>> PageStore we currently use, I suggest we move our H2 to it AND adopt
>>> DB snapshotting instead of thrift snapshotting to speed up scheduler
>>> restarts. A rough POC is available here [5]. We have been running a
>>> version of this build in production since last week and were able to
>>> completely eliminate scheduler lockups. As a welcome side effect, we
>>> also observed faster scheduler restart times due to eliminating
>>> thrift-to-SQL chattiness. Depending on snapshot freshness, the
>>> observed failover downtime was reduced by ~40%.
>>> Moving to DB snapshotting will require us to rethink DB schema
>>> versioning and the thrift deprecation/removal policy. We will have
>>> to move to pre-/post-snapshot-restore SQL migration scripts to
>>> handle any schema changes, which is a common industry pattern but
>>> something we have not tried yet. The upside, though, is that we can
>>> get an early start here, as we will have to adopt strict SQL
>>> migration rules anyway when we move to persistent DB storage. Also,
>>> given that migrating to an H2 TaskStore will likely further degrade
>>> scheduler restart times, having a better-performing DB snapshotting
>>> solution in place will definitely aid that migration.
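As an illustration of the snapshot-restore migration pattern (the table and column below are hypothetical, not Aurora's actual schema):

```sql
-- Hypothetical post-restore migration: evolve the replayed schema and
-- data to the version the current scheduler expects.
ALTER TABLE task_configs ADD COLUMN IF NOT EXISTS tier VARCHAR(255);
UPDATE task_configs SET tier = 'preferred' WHERE tier IS NULL;
```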
>>> Thanks,
>>> Maxim
>>> [1] -
>>> [2] -
>>> [3] -
>>> [4] -
>>> [5] -
