aurora-dev mailing list archives

From Benjamin Hindman <>
Subject Re: Evolving the aurora scheduler storage
Date Thu, 17 Apr 2014 20:00:02 GMT
Using an in-memory storage system like H2 makes a lot of sense to me.

As for the replicated log, I'd love to talk about how we could write a
concrete implementation of H2's transaction log using the Mesos replicated
log. I think H2 was moving to a different transaction log implementation
too (MVStore?) which was still log structured. We've been chatting about
adding features to the replicated log, such as read-streaming, which should
make standby masters fast to start up on failover.


On Thu, Apr 17, 2014 at 1:16 PM, Bill Farner <> wrote:

> Over the next quarter, i'd like to embark on some work to improve the
> storage system in the scheduler.  This work can probably be summarized as
> "stop building a database".
> *Background*
> The current storage system uses a replicated write-ahead log [1] (provided
> by libmesos, some details in [2]), and a primary in-memory storage system
> [3].  Most of what i'll discuss relates to MemTaskStore [4], which is by
> far the largest (in terms of memory) and most complex.
> The current storage layout is non-relational.  If callers want to deal with
> things in terms of jobs (collection of instances) and instances (currently
> active tasks in a job), they must do so with an appropriately-scoped task
> query, and group results to get the desired information.  This is
> particularly problematic in places like the web interface, where we must do
> an O(n) walk of all tasks [5] to aggregate role and job statistics.  Over
> time, we have implemented indices on the task store [6] to speed up some
> operations, but a better data layout would yield bigger gains.
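To make the aggregation cost described above concrete, here is a minimal sketch of a flat task list that has to be walked in full to produce per-job counts. The `Task` and `FlatTaskStore` types are hypothetical stand-ins invented for illustration, not Aurora's real classes.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a stored task; not Aurora's real type.
final class Task {
    final String role;
    final String job;
    final int instance;

    Task(String role, String job, int instance) {
        this.role = role;
        this.job = job;
        this.instance = instance;
    }
}

final class FlatTaskStore {
    // With a flat layout, any per-job aggregate has to visit all n tasks
    // and group them by (role, job) on every request.
    static Map<String, Integer> countTasksByJob(List<Task> allTasks) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Task t : allTasks) {
            counts.merge(t.role + "/" + t.job, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Task> tasks = Arrays.asList(
            new Task("www", "frontend", 0),
            new Task("www", "frontend", 1),
            new Task("ads", "indexer", 0));
        // prints {www/frontend=2, ads/indexer=1}
        System.out.println(countTasksByJob(tasks));
    }
}
```

A layout keyed by role and job would let the same question be answered by looking up one job rather than scanning every task.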
> *Hierarchical storage* [7]
> It has become clear that we need to rework the storage implementation and
> APIs to directly support the concepts that we use in practice (Role,
> Environment, Job, Instance, Task).  This would simplify both implementation
> and consumption of the storage APIs.  It would also allow us to consume
> considerably less memory (by normalizing task configurations - the most
> memory-intensive objects in the system).  This also leaves us with a natural
> place to associate auxiliary information with different hierarchy levels
> (e.g. Jobs).
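One way to picture the memory saving from normalizing task configurations is interning: tasks that share an identical configuration point at one canonical object instead of each holding a copy. The sketch below assumes a much-reduced, hypothetical `TaskConfig` and a hypothetical `ConfigInterner`; a relational schema would get a similar effect by storing each configuration in one row and referencing it by key.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical, much-reduced task configuration; a real one has many more fields.
final class TaskConfig {
    final String command;
    final int ramMb;

    TaskConfig(String command, int ramMb) {
        this.command = command;
        this.ramMb = ramMb;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TaskConfig)) {
            return false;
        }
        TaskConfig other = (TaskConfig) o;
        return ramMb == other.ramMb && command.equals(other.command);
    }

    @Override
    public int hashCode() {
        return Objects.hash(command, ramMb);
    }
}

// Keeps one canonical instance per distinct configuration.
final class ConfigInterner {
    private final Map<TaskConfig, TaskConfig> canonical = new HashMap<>();

    // Returns the shared instance equal to the given config, registering it if new.
    TaskConfig intern(TaskConfig config) {
        TaskConfig existing = canonical.putIfAbsent(config, config);
        return existing == null ? config : existing;
    }

    int distinctConfigs() {
        return canonical.size();
    }

    public static void main(String[] args) {
        ConfigInterner interner = new ConfigInterner();
        TaskConfig a = interner.intern(new TaskConfig("run-server", 512));
        TaskConfig b = interner.intern(new TaskConfig("run-server", 512));
        System.out.println(a == b);                     // true: one shared object
        System.out.println(interner.distinctConfigs()); // 1
    }
}
```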
> *Complexity*
> Implementing a storage system is easy to do poorly and difficult to do well.
> Right now, we're probably somewhere in between.  We have bespoke
> implementations of complex things like transactions [8], read-write locking
> [9], and weakly-consistent reads [10].  I'd like to lean on other projects
> that have committed to solving these problems well rather than continue to
> roll our own.
> *SQL*
> If you've read this far, you're probably thinking "use a database, dummy!".
>  Well, that's what i propose we do.  This would allow us to offload much of
> the difficult code that is not our project's core competency.  It will
> allow us to reorganize our storage layout in the future without writing
> complex [Java] code, and avoid homemade implementations of things like data
> relationships and consistency.
> The first step i would like to take is replacing the in-memory storage with
> H2 [11].  I'm leaning towards H2 because we actually used it in Aurora in
> the past with good results, and it is in active development.
> *ORM*
> I'd like to leverage an ORM layer to handle interaction with the database.
> My default choice here would usually be Hibernate, but AFAICT licensing
> prevents that.  Further surveying has led me to MyBatis [12].  I also
> considered Apache Cayenne [13], but prefer MyBatis because it is actively
> releasing code (the latest stable release of Cayenne was 3 years ago) and
> it does not push us to use a code generator.  The latter is a big deal,
> since it allows us to adapt the ORM layer to the existing objects.  This
> will make it easier to limit the ripple effect of a new storage
> implementation on the rest of the code.
> *Future direction*
> The commentary above only describes the in-memory storage, but it could be
> extended to cover the replicated log as well.  While there are nice
> features of the current arrangement (ease of installation, no SPOF), we do
> take on a significant amount of complexity by using the replicated log.
>  For example, we have to implement snapshotting [14], log replay [15],
> backups [16], and backup recovery [17].  We currently lack good tooling for
> offline debugging of log data, or manipulating log contents.  These are all
> things we would get for free with a database server.  We could also form a
> much clearer picture of how to do storage migrations across release
> versions in a way that is resilient to code refactors.
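For readers unfamiliar with what snapshotting and log replay involve, the following is a minimal in-process sketch of the general pattern: writes are appended to a log before being applied, a snapshot records the state at a log position, and recovery loads the snapshot and replays the entries after it. `LogEntry` and `LoggedStore` are hypothetical names, and nothing here reflects the actual Mesos replicated log API or Aurora's implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical log entry: set key to value; a null value deletes the key.
final class LogEntry {
    final String key;
    final String value;

    LogEntry(String key, String value) {
        this.key = key;
        this.value = value;
    }
}

final class LoggedStore {
    private final List<LogEntry> log = new ArrayList<>();
    private final Map<String, String> state = new HashMap<>();
    private Map<String, String> snapshot = new HashMap<>();
    private int snapshotPosition = 0; // number of log entries the snapshot covers

    // Write-ahead: append to the log first, then apply to in-memory state.
    void write(String key, String value) {
        LogEntry entry = new LogEntry(key, value);
        log.add(entry);
        apply(state, entry);
    }

    // Capture current state so that recovery can skip earlier log entries.
    void takeSnapshot() {
        snapshot = new HashMap<>(state);
        snapshotPosition = log.size();
    }

    // Rebuild state from the snapshot plus every entry written after it.
    Map<String, String> recover() {
        Map<String, String> recovered = new HashMap<>(snapshot);
        for (LogEntry entry : log.subList(snapshotPosition, log.size())) {
            apply(recovered, entry);
        }
        return recovered;
    }

    private static void apply(Map<String, String> target, LogEntry entry) {
        if (entry.value == null) {
            target.remove(entry.key);
        } else {
            target.put(entry.key, entry.value);
        }
    }
}
```

Even this toy version has to keep the snapshot position and the log in agreement, which hints at why the real thing carries the complexity described above.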
> I'm interested in what folks think of the thought process here and the
> approach proposed.  Coming back to the thesis -- i feel like a lot of our
> complexity is implementation of a database, and i would love to position
> ourselves to spend more of our effort focusing on building an awesome
> scheduler.
> -=Bill
> [1]
> [2]
> [3]
> [4]
> [5]
> [6]
> [7]
> [8]
> [9]
> [10]
> [11]
> [12]
> [13]
> [14]
> [15]
> [16]
> [17]
