jackrabbit-oak-dev mailing list archives

From Jukka Zitting <jukka.zitt...@gmail.com>
Subject Re: MongoMK^2 design proposal
Date Tue, 29 Jan 2013 12:33:59 GMT

On Tue, Jan 29, 2013 at 2:07 PM, Thomas Mueller <mueller@adobe.com> wrote:
> To better understand the design it would help me if you could list the
> MongoDb collections, and the key / properties / values. I guess the
> segment ids are MongoDB object ids? Or is it (part of) the path?

Here's a quick draft, most likely incomplete.

A segment document would look something like this:

  { "_id": "<Segment UUID>",
    "type": "segment",
    "data": <binary data of the segment>,
    "refs": [ <UUIDs of referenced segments> ],
    "created": <create timestamp, might be useful for GC> }

(The "refs" property might need to be stored in (or copied to) a
separate document to allow the garbage collector to efficiently build
the segment graph.)
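A rough sketch of how a garbage collector could walk the "refs" lists to find live segments, assuming the document shape drafted above. The helper names (make_segment, reachable_segments, garbage) are hypothetical and the segments are kept in a plain Python list rather than in MongoDB:

```python
import time
import uuid


def make_segment(data, refs=()):
    """Build a segment document shaped like the draft (hypothetical helper)."""
    return {
        "_id": str(uuid.uuid4()),   # random segment UUID
        "type": "segment",
        "data": data,               # opaque binary payload
        "refs": list(refs),         # UUIDs of referenced segments
        "created": time.time(),     # create timestamp, might be useful for GC
    }


def reachable_segments(segments, root_ids):
    """Return the ids of all segments reachable from root_ids via "refs"."""
    by_id = {s["_id"]: s for s in segments}
    seen = set()
    stack = list(root_ids)
    while stack:
        sid = stack.pop()
        if sid in seen or sid not in by_id:
            continue
        seen.add(sid)
        stack.extend(by_id[sid]["refs"])
    return seen


def garbage(segments, root_ids):
    """Return the ids of segments that no root can reach."""
    return {s["_id"] for s in segments} - reachable_segments(segments, root_ids)
```

In this sketch the roots would be the segments named by the journal heads. The traversal only reads "_id" and "refs", which is why keeping the refs apart from the binary data could make the walk cheaper.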

And a journal document might look like this:

  { "_id": "<Journal UUID>",
    "type": "journal",
    "parent": "<Parent Journal UUID, not present in root journal>",
    "base": <base revision of branch or other low-level journal, not present in root>,
    "head": "<Segment UUID>:<node record offset>",
    "tail": [ <list of older revisions we want to keep in this journal> ],
    "created": <create timestamp>,
    "updated": <update timestamp>,
    "merge": <merge strategy metadata> }
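As an illustration of the "head" value, here is a minimal sketch that splits "<Segment UUID>:<node record offset>" into its two parts. It assumes the offset is written as a decimal integer, which the draft does not specify; RecordId, parse_head and format_head are hypothetical names:

```python
import uuid
from typing import NamedTuple


class RecordId(NamedTuple):
    segment_id: uuid.UUID   # segment that holds the node record
    offset: int             # position of the record inside the segment


def parse_head(head):
    """Split "<Segment UUID>:<offset>" on the last colon."""
    segment_part, offset_part = head.rsplit(":", 1)
    return RecordId(uuid.UUID(segment_part), int(offset_part))


def format_head(record):
    """Inverse of parse_head (decimal offset is an assumption)."""
    return f"{record.segment_id}:{record.offset}"
```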

>> Segments are immutable, so a commit would create a new segment
> So there are no MongoDB updates, as in the current design?


> A potential problem (depending on the segment id) is that all writes go
> to the same MongoDb shard, or to a random one.

Segment IDs would be random UUIDs to distribute content uniformly
across a sharded backend.
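To illustrate the point about uniform distribution, a small sketch that assigns random UUIDs to shards and counts how many land on each. The modulo rule is only a stand-in for whatever the sharded backend actually does with its shard key; shard_for and distribution are hypothetical helpers:

```python
import uuid
from collections import Counter


def shard_for(segment_id, shard_count):
    """Pick a shard from the UUID's integer value (stand-in for real sharding)."""
    return uuid.UUID(segment_id).int % shard_count


def distribution(segment_ids, shard_count):
    """Return the number of segment ids that fall on each shard."""
    counts = Counter(shard_for(s, shard_count) for s in segment_ids)
    return [counts.get(i, 0) for i in range(shard_count)]
```

With a few thousand ids from uuid.uuid4() the per-shard counts should come out roughly equal, so no single shard takes all the writes.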

> I would actually prefer if the primary key represents (part of) the path,
> so that MongoDb sharding works well (locality of access).

Segments themselves already provide locality of access (related nodes
would usually end up in the same segment), so I don't believe there's
too much need to worry about locality of different segments.

However, the segment identifier could well be anything as long as it's
unique, so we could adjust it if needed.

>>A quick estimate of the size overhead of a minimal
>>commit that updates just a single property is in the order of hundreds
>>of bytes, depending a bit on the content structure.
> I thought the segment size is a few KB. So would you buffer writes to get
> the "hundreds of bytes"? In my view, commits should be written immediately
> so that other cluster nodes can read them. Or does "hundreds of bytes"
> take into account that you can split segments?

I was referring to the extra size overhead that comes from having to
include copies of all parent nodes up to the root when making a commit that
modifies some content. A commit can only be completed once the segment
containing the new revision of the root node has been added to the

> About atomic updates: I thought segments are immutable?

We need atomic updates of the journal documents. They are mutable.
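One way to picture the atomic journal update is a compare-and-set on the "head" field: the update succeeds only if the head is still the revision the committer started from. The sketch below uses an in-memory store guarded by a lock. JournalStore and its methods are hypothetical, and mapping this onto a conditional MongoDB update is an assumption rather than something stated in this thread:

```python
import threading


class JournalStore:
    """In-memory stand-in for the mutable journal documents."""

    def __init__(self):
        self._lock = threading.Lock()
        self._journals = {}

    def create(self, journal_id, head):
        with self._lock:
            self._journals[journal_id] = {
                "_id": journal_id,
                "type": "journal",
                "head": head,
            }

    def head(self, journal_id):
        with self._lock:
            return self._journals[journal_id]["head"]

    def compare_and_set_head(self, journal_id, expected, new_head):
        """Move head to new_head only if it still equals expected."""
        with self._lock:
            doc = self._journals[journal_id]
            if doc["head"] != expected:
                return False   # someone else committed first
            doc["head"] = new_head
            return True
```

A committer whose compare_and_set_head call returns False would need to rebase or merge against the new head and try again. The segments it already wrote stay untouched, since they are immutable.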


Jukka Zitting
