mesos-dev mailing list archives

From Santhosh Kumar Shanmugham <>
Subject Re: [Proposal] Replicated log storage compaction
Date Tue, 03 Jul 2018 03:57:00 GMT
+1. Aurora will hugely benefit from this change.

On Mon, Jul 2, 2018 at 4:49 PM Ilya Pronin <> wrote:

> Hi everyone,
> I'd like to propose adding "manual" LevelDB compaction to the
> replicated log truncation process.
> Motivation
> Mesos Master and Aurora Scheduler use the replicated log to persist
> information about the cluster. This log is periodically truncated to
> prune outdated log entries. However, the replicated log storage itself
> is never compacted and grows without bound. This leads to problems such
> as a simultaneous failover of all master/scheduler replicas because all
> of them ran out of disk space.
> The only time log storage compaction happens is during recovery.
> Because of that, periodic failovers are required to control replicated
> log storage growth, but this workaround is suboptimal. Failovers are
> not instant: e.g., Aurora Scheduler needs to recover its storage,
> which, depending on the cluster, can take several minutes. During this
> downtime tasks cannot be (re-)scheduled and users cannot interact with
> the service.
> Proposal
> In MESOS-184 John Sirois pointed out that our usage pattern doesn't
> work well with LevelDB's background compaction algorithm. Fortunately,
> LevelDB provides a way to force compaction with the
> DB::CompactRange() method. The replicated log storage can trigger it
> after persisting a learned TRUNCATE action and deleting the truncated
> log positions. The compacted range will span from the previous first
> position of the log to the new first position (the one the log was
> truncated up to). A sketch of this follows below.
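> To make the idea concrete, here is a minimal C++ sketch of the hook
> (the encodeKey() helper and the fixed-width key layout are assumptions
> for illustration, not the actual Mesos storage format):
>
>     #include <cstdint>
>     #include <cstdio>
>     #include <string>
>
>     #include <leveldb/db.h>
>
>     // Hypothetical fixed-width hex encoding so that lexicographic key
>     // order matches numeric order of log positions.
>     static std::string encodeKey(uint64_t position) {
>       char buf[17];
>       std::snprintf(buf, sizeof(buf), "%016llx",
>                     static_cast<unsigned long long>(position));
>       return std::string(buf, 16);
>     }
>
>     // Called after a learned TRUNCATE action has deleted positions in
>     // [previousFirst, newFirst): force compaction of exactly that range.
>     void compactTruncatedRange(leveldb::DB* db,
>                                uint64_t previousFirst,
>                                uint64_t newFirst) {
>       const std::string begin = encodeKey(previousFirst);
>       const std::string end = encodeKey(newFirst);
>       leveldb::Slice beginSlice(begin);
>       leveldb::Slice endSlice(end);
>       db->CompactRange(&beginSlice, &endSlice);
>     }
>
> Bounding the compaction to just the truncated range keeps the cost
> proportional to what was deleted instead of rewriting the whole DB.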
> Performance impact
> Mesos Master and Aurora Scheduler have two different replicated log
> usage profiles. For Mesos Master every registry update (agent
> (re-)registration/marking, maintenance schedule update, etc.) induces
> writing a complete snapshot, which depending on the cluster size can
> get pretty big (in a scale test with a fake cluster of 55k agents it
> is ~15MB). Every snapshot is followed by a truncation of all previous
> entries, which doesn't block the registrar and effectively happens in
> the background. In the scale test cluster with 55k agents, compactions
> after such truncations take ~680ms.
> To reduce the performance impact on the Master, compaction can be
> triggered only after more than a configurable number of keys have been
> deleted, as in the sketch below.
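> A hypothetical sketch of that gating logic (the class name and the
> threshold semantics are assumptions; the actual knob would presumably
> be a new configurable flag):
>
>     #include <cstddef>
>
>     // Hypothetical accumulator that defers compaction until enough
>     // keys have been deleted since the last forced compaction.
>     class CompactionGate {
>     public:
>       explicit CompactionGate(size_t threshold)
>         : threshold_(threshold) {}
>
>       // Record a batch of deletions; returns true when the accumulated
>       // count crosses the threshold, resetting it for the next window.
>       bool onDeleted(size_t keysDeleted) {
>         deletedSinceCompaction_ += keysDeleted;
>         if (deletedSinceCompaction_ < threshold_) {
>           return false;
>         }
>         deletedSinceCompaction_ = 0;
>         return true;
>       }
>
>     private:
>       const size_t threshold_;
>       size_t deletedSinceCompaction_ = 0;
>     };
>
> The truncation path would then call CompactRange() only when
> onDeleted() returns true.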
> Aurora Scheduler writes incremental changes of its storage to the
> replicated log. Every hour a storage snapshot is created and persisted
> to the log, followed by a truncation of all entries preceding the
> snapshot. Therefore, storage compactions will be infrequent but will
> deal with a potentially large number of keys. In the scale test
> cluster such compactions took ~425ms each.
> Please let me know what you think about it.
> Thanks!
> --
> Ilya Pronin
