aurora-dev mailing list archives

From Bill Farner <wfar...@apache.org>
Subject Re: [PROPOSAL] DB snapshotting
Date Wed, 02 Mar 2016 20:52:19 GMT
Seems prudent to explore rather than write off though.  For all we know it
simplifies a lot.

On Wednesday, March 2, 2016, Maxim Khutornenko <maxim@apache.org> wrote:

> Ah, sorry, missed that conversation on IRC.
>
> I have not looked into that. Would be interesting to explore that
> route. Given our ultimate goal is to get rid of the replicated log
> altogether, it does not stand as an immediate priority to me, though.
>
> On Wed, Mar 2, 2016 at 11:51 AM, Erb, Stephan <Stephan.Erb@blue-yonder.com> wrote:
> > +1 for the plan and the ticket.
> >
> > In addition, for reference, here are a couple of messages from IRC yesterday:
> >
> > 23:42 <serb> mkhutornenko:  interesting storage proposal on the mailing list! I only wondered one thing...
> > 23:42 <serb> it feels kind of weird that we use H2 as a non-replicated database and build some scaffolding around it in order to distribute its state via the Mesos replicated log.
> > 23:42 <serb> Have you looked into H2, if it would be possible to replace/subclass their in-process transaction log with a replicated Mesos one?
> > 23:43 <serb> Then we would not need the logic that performs simultaneous inserts into the log and the taskstore, as the backend would handle that by itself
> > 23:44 <serb> (I know close to nothing about the storage layer, so that's like my perspective from 10,000 feet)
> >
> > 00:22 <wfarner> serb: that crossed my mind as well.  I have only drilled in a bit, would love to do more
> >
> > ________________________________________
> > From: Maxim Khutornenko <maxim@apache.org>
> > Sent: Wednesday, March 2, 2016 18:18
> > To: dev@aurora.apache.org
> > Subject: Re: [PROPOSAL] DB snapshotting
> >
> > Thanks Bill! Filed https://issues.apache.org/jira/browse/AURORA-1627
> > to track it.
> >
> >> On Mon, Feb 29, 2016 at 11:41 AM, Bill Farner <wfarner@apache.org> wrote:
> >> Thanks for the detailed write-up and real-world details!  I generally
> >> support momentum towards a single task store implementation, so +1
> >> on dealing with that.
> >>
> >> I anticipated there would be a performance win from straight-to-SQL
> >> snapshots, so I am a +1 on that as well.
> >>
> >> In summary, +1 on all fronts!
> >>
> >> On Monday, February 29, 2016, Maxim Khutornenko <maxim@apache.org> wrote:
> >>
> >>> (Apologies for the wordy problem statement but I feel it's really
> >>> necessary to justify the proposal).
> >>>
> >>> Over the past two weeks we have been battling a nasty scheduler issue
> >>> in production: the scheduler suddenly stops responding to any user
> >>> requests and subsequently gets killed by our health monitoring. Upon
> >>> restart, a leader may function for only a few seconds before it hangs
> >>> again.
> >>>
> >>> The long and painful investigation pointed towards internal H2 table
> >>> lock contention that resulted in massive db-write starvation and a
> >>> state where a scheduler write lock would *never* be released. This was
> >>> relatively easy to replicate in Vagrant by creating a large update
> >>> (~4K instances) with a large batch_size (~1K) while bombarding the
> >>> scheduler with getJobUpdateDetails() requests for that job. The
> >>> scheduler would enter a locked-up state on the very first write op
> >>> following the update creation (e.g. a status update for an instance
> >>> transition from the first batch) and stay in that state for minutes,
> >>> until all getJobUpdateDetails() requests were served. This behavior is
> >>> well explained by the following sentence from [1]:
> >>>
> >>>     "When a lock is released, and multiple connections are waiting for
> >>> it, one of them is picked at random."
> >>>
> >>> What happens here is that when many read requests are competing for a
> >>> shared table lock, the H2 PageStore does nothing to help write requests
> >>> that require an exclusive table lock succeed. This leads to db-write
> >>> starvation, and eventually to scheduler native store write starvation,
> >>> as there is no timeout on a scheduler write lock.
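> >>>
> >>> To make the pattern concrete, below is a minimal, self-contained
> >>> illustration (not Aurora code) of readers starving a writer on a
> >>> PageStore-backed H2 table. It assumes an H2 1.4.x jar on the classpath;
> >>> how badly the writer starves depends on timing and H2 version, but the
> >>> setup mirrors the getJobUpdateDetails() flood described above:
> >>>
> >>> import java.sql.*;
> >>> import java.util.concurrent.*;
> >>>
> >>> public class H2LockContentionDemo {
> >>>   // MV_STORE=FALSE forces the legacy PageStore engine (table-level locks);
> >>>   // a large LOCK_TIMEOUT lets the writer wait instead of failing fast.
> >>>   static final String URL =
> >>>       "jdbc:h2:mem:demo;MV_STORE=FALSE;LOCK_TIMEOUT=60000;DB_CLOSE_DELAY=-1";
> >>>
> >>>   public static void main(String[] args) throws Exception {
> >>>     try (Connection c = DriverManager.getConnection(URL);
> >>>          Statement st = c.createStatement()) {
> >>>       st.execute("CREATE TABLE tasks (id INT PRIMARY KEY, payload VARCHAR)");
> >>>       try (PreparedStatement ps = c.prepareStatement("INSERT INTO tasks VALUES (?, ?)")) {
> >>>         for (int i = 0; i < 50_000; i++) {  // enough rows to make reads non-trivial
> >>>           ps.setInt(1, i);
> >>>           ps.setString(2, "payload-" + i);
> >>>           ps.addBatch();
> >>>         }
> >>>         ps.executeBatch();
> >>>       }
> >>>     }
> >>>
> >>>     // Readers: continuously take the shared table lock, like the flood of
> >>>     // getJobUpdateDetails() calls in the scenario above.
> >>>     ExecutorService readers = Executors.newFixedThreadPool(32);
> >>>     for (int i = 0; i < 32; i++) {
> >>>       readers.submit(() -> {
> >>>         try (Connection c = DriverManager.getConnection(URL);
> >>>              Statement st = c.createStatement()) {
> >>>           while (!Thread.currentThread().isInterrupted()) {
> >>>             try (ResultSet rs = st.executeQuery(
> >>>                 "SELECT COUNT(*) FROM tasks WHERE payload LIKE '%42%'")) {
> >>>               rs.next();
> >>>             }
> >>>           }
> >>>         } catch (SQLException ignored) { /* pool shutdown */ }
> >>>         return null;
> >>>       });
> >>>     }
> >>>
> >>>     // Writer: needs the exclusive table lock; watch its latency climb.
> >>>     try (Connection c = DriverManager.getConnection(URL);
> >>>          PreparedStatement ps = c.prepareStatement("INSERT INTO tasks VALUES (?, 'x')")) {
> >>>       for (int i = 50_000; i < 50_010; i++) {
> >>>         long start = System.nanoTime();
> >>>         ps.setInt(1, i);
> >>>         ps.executeUpdate();
> >>>         System.out.printf("write %d took %d ms%n", i, (System.nanoTime() - start) / 1_000_000);
> >>>       }
> >>>     }
> >>>     readers.shutdownNow();
> >>>   }
> >>> }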
> >>>
> >>> We have played with various available H2/MyBatis configuration
> >>> settings to mitigate the above, with no noticeable impact. That is,
> >>> until we switched to the H2 MVStore [2], at which point we were able to
> >>> completely eliminate the scheduler lockup without making any other
> >>> code changes! So, has the solution finally been found? The answer
> >>> would be YES, until you try MVStore-enabled H2 with a production DB of
> >>> any reasonable size on scheduler restart. There was a reason why we
> >>> disabled MVStore in the scheduler [3] in the first place, and that
> >>> reason was poor MVStore performance with bulk inserts. Re-populating an
> >>> MVStore-enabled H2 DB took at least 2.5 times longer than normal. This
> >>> is unacceptable in prod, where every second of scheduler downtime
> >>> counts.
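> >>>
> >>> For reference, the engine switch itself boils down to a flag in the H2
> >>> JDBC URL. The snippet below is illustrative only (the real URL is built
> >>> in DbModule [3] and carries more options):
> >>>
> >>> import java.sql.Connection;
> >>> import java.sql.DriverManager;
> >>> import java.sql.SQLException;
> >>>
> >>> // Illustrative only; not the actual DbModule code.
> >>> final class H2EngineChoice {
> >>>   static Connection open(boolean mvStore) throws SQLException {
> >>>     // MV_STORE toggles the storage engine: FALSE = legacy PageStore (what
> >>>     // we run today), TRUE = MVStore (no lockups, but slower bulk inserts
> >>>     // on restart).
> >>>     String url = "jdbc:h2:mem:aurora;MV_STORE=" + (mvStore ? "TRUE" : "FALSE")
> >>>         + ";LOCK_TIMEOUT=60000";
> >>>     return DriverManager.getConnection(url, "sa", "");
> >>>   }
> >>> }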
> >>>
> >>> Back to the drawing board, we tried all relevant settings and
> >>> approaches to speed up MVStore inserts on restart, but nothing really
> >>> helped. Finally, the only reasonable way forward was to eliminate the
> >>> point of slowness altogether, namely the thrift-to-sql migration
> >>> on restart. Fortunately, H2 supports an easy-to-use command that
> >>> generates the entire DB dump with a single statement [4]. We were now
> >>> able to bypass the lengthy DB repopulation on restart by storing the
> >>> entire DB dump in the snapshot and replaying it on scheduler restart.
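> >>>
> >>> The core of the idea, in rough form (the actual POC code lives in
> >>> SnapshotStoreImpl [5], linked below):
> >>>
> >>> import java.sql.*;
> >>> import java.util.*;
> >>>
> >>> // A rough sketch only; see the POC [5] for the real implementation.
> >>> final class DbScriptSnapshot {
> >>>   // H2's SCRIPT command returns the full DDL+data dump as rows of SQL text.
> >>>   static List<String> dump(Connection db) throws SQLException {
> >>>     List<String> lines = new ArrayList<>();
> >>>     try (Statement st = db.createStatement();
> >>>          ResultSet rs = st.executeQuery("SCRIPT")) {
> >>>       while (rs.next()) {
> >>>         lines.add(rs.getString(1));
> >>>       }
> >>>     }
> >>>     return lines;  // stored in the snapshot in place of per-store thrift structs
> >>>   }
> >>>
> >>>   // On restart: replay the dump into a fresh H2 instance, skipping the
> >>>   // row-by-row thrift-to-sql re-population entirely.
> >>>   static void restore(Connection freshDb, List<String> lines) throws SQLException {
> >>>     try (Statement st = freshDb.createStatement()) {
> >>>       for (String sql : lines) {
> >>>         st.execute(sql);
> >>>       }
> >>>     }
> >>>   }
> >>> }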
> >>>
> >>>
> >>> Now, the proposal. Given that MVStore vastly outperforms the PageStore
> >>> we currently use, I suggest we move our H2 to it AND adopt db
> >>> snapshotting instead of thrift snapshotting to speed up scheduler
> >>> restarts. The rough POC is available here [5]. We have been running a
> >>> version of this build in production since last week and have completely
> >>> eliminated scheduler lockups. As a welcome side effect, we also
> >>> observed faster scheduler restart times due to eliminating
> >>> thrift-to-sql chattiness. Depending on snapshot freshness, the
> >>> observed failover downtimes were reduced by ~40%.
> >>>
> >>> Moving to db snapshotting will require us to rethink DB schema
> >>> versioning and the thrift deprecation/removal policy. We will have to
> >>> move to pre-/post-snapshot-restore SQL migration scripts to handle any
> >>> schema changes, which is a common industry pattern but something we
> >>> have not tried yet. The upside, though, is that we can get an early
> >>> start here, as we will have to adopt strict SQL migration rules anyway
> >>> when we move to persistent DB storage. Also, given that migrating to
> >>> an H2 TaskStore will likely further degrade scheduler restart times,
> >>> having a better-performing DB snapshotting solution in place will
> >>> definitely aid that migration.
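> >>>
> >>> To make the migration-script idea concrete, here is a purely
> >>> hypothetical example (the schema_version table, task_configs table and
> >>> tier column are made up) of what a post-restore step could look like:
> >>>
> >>> import java.sql.*;
> >>>
> >>> // Hypothetical sketch of a post-snapshot-restore migration; names are
> >>> // illustrative, not actual Aurora schema.
> >>> final class PostRestoreMigrations {
> >>>   static void apply(Connection db) throws SQLException {
> >>>     try (Statement st = db.createStatement()) {
> >>>       int restored;
> >>>       try (ResultSet rs = st.executeQuery("SELECT MAX(version) FROM schema_version")) {
> >>>         rs.next();
> >>>         restored = rs.getInt(1);
> >>>       }
> >>>       if (restored < 2) {
> >>>         // A snapshot written by an older scheduler gets upgraded in place
> >>>         // before the stores start serving reads.
> >>>         st.execute("ALTER TABLE task_configs ADD COLUMN IF NOT EXISTS tier VARCHAR");
> >>>         st.execute("INSERT INTO schema_version (version) VALUES (2)");
> >>>       }
> >>>     }
> >>>   }
> >>> }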
> >>>
> >>> Thanks,
> >>> Maxim
> >>>
> >>> [1] - http://www.h2database.com/html/advanced.html?#transaction_isolation
> >>> [2] - http://www.h2database.com/html/mvstore.html
> >>> [3] - https://github.com/apache/aurora/blob/824e396ab80874cfea98ef47829279126838a3b2/src/main/java/org/apache/aurora/scheduler/storage/db/DbModule.java#L119
> >>> [4] - http://www.h2database.com/html/grammar.html#script
> >>> [5] - https://github.com/maxim111333/incubator-aurora/blob/mv_store/src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl.java#L317-L370
> >>>
>
