jackrabbit-oak-dev mailing list archives

From Jukka Zitting <jukka.zitt...@gmail.com>
Subject Re: MongoMK^2 design proposal
Date Tue, 29 Jan 2013 10:48:32 GMT
Hi,

On Tue, Jan 29, 2013 at 12:01 PM, Marcel Reutegger <mreutegg@adobe.com> wrote:
>> both internal storage (as seen below) and garbage collection. Segments
>> that are no longer referenced can be efficiently identified by
>> traversing the graph of segment-level references without having to
>> parse or even fetch the contents of each segment.
>
> But it still requires traversing the complete graph, right?

Right. A 10GB repository would likely contain some 1k-100k segments,
which should still be traversable in just minutes. If that isn't good
enough in practice, we could organize the segments in generations to
reduce the number of segments that need to be traversed in one go.
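The mark phase of such a collector only needs the segment-level reference graph, not the record contents. A minimal sketch, assuming each segment header exposes the UUIDs of the segments it references (the reference map and journal-root set below are hypothetical inputs):

```java
import java.util.*;

public class SegmentGc {

    // refs: segment UUID -> UUIDs of the segments it references,
    // read from segment headers without parsing record contents.
    // journalRoots: segments referenced by the journals.
    static Set<UUID> reachable(Map<UUID, Set<UUID>> refs,
                               Set<UUID> journalRoots) {
        Set<UUID> marked = new HashSet<>();
        Deque<UUID> queue = new ArrayDeque<>(journalRoots);
        while (!queue.isEmpty()) {
            UUID id = queue.remove();
            if (marked.add(id)) {
                queue.addAll(refs.getOrDefault(id, Set.of()));
            }
        }
        return marked; // segments outside this set can be collected
    }
}
```

With 1k-100k segments, this traversal touches each segment header at most once, which is what keeps the full sweep in the range of minutes.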

>> Journals are special, atomically updated documents that record the
>> state of the repository as a sequence of references to successive
>> root node records.
>
> when you write 'root node record', do you mean the segment, which
> contains the root node?

The journal would contain a reference to the root node record, i.e. a
combination of the segment UUID and the offset within that segment. In
some cases it might be possible for a single segment to contain
multiple root nodes.
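So the reference stored in a journal document might look like the following sketch; the class and field names are illustrative, not part of the proposal:

```java
import java.util.UUID;

// Hypothetical shape of a journal's head reference: the full UUID of
// the segment plus the byte offset of the root node record within it.
public record RecordId(UUID segmentId, int offset) {

    @Override
    public String toString() {
        // Illustrative rendering only, e.g. "...-002a:1f4".
        return segmentId + ":" + Integer.toHexString(offset);
    }
}
```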

> how do you decide whether it's OK to write to a lower level journal and
> when to write to the root journal?

That can be decided either at deployment time (this client will always
write to this journal) or at runtime (this commit should go to this
journal) if the client application wants more fine-grained control
over commit behavior (cf. write concerns in MongoDB).

> Who merges the journal and when?

A lower level journal could contain metadata to indicate when and how
it should be merged. A client that commits to or reads from such a
journal should interpret that metadata and take care of merging when
needed.

> How do you make sure a client sticks to a given lower level journal? Switching
> journals means you might not see your own writes.

As described above, in a typical scenario the journal would be
selected at deployment time, so the client would always use the same
journal. If the application wants more fine-grained control, it needs
to also deal with such issues as partial visibility of changes.

>> Each record is uniquely addressable by its location within the segment
>> and the UUID of that segment. Assuming that the size of a segment is
>> limited to 16MB (maximum size of a MongoDB document) and that a single
>> segment can contain references to up to 255 other segments, then a
>> reference to any record in any segment can be stored in just 4 bytes
>> (1 byte to identify the segment, 3 bytes for the record offset).
>
> I'm wondering if 255 references are sufficient. You need those in
> various places, e.g. when building higher level structures like lists,
> maps, etc., but also to reference templates.

I did some benchmarking with typical content (CQ5) and in practice the
majority of references would point to the same segment or a small set
(dozens) of other segments. Once the limit of 255 segment references
is reached, you can always start a new segment and thus reset the
counter.

If this limit turns out to be a problem in practice, we could either
limit the maximum size of segments to, say, 4MB, or pad all records to
four-byte boundaries to increase the number of possible external
segment references to 1023 while still needing just 4 bytes per record
reference.
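As a sketch, the 4-byte encoding could pack a 1-byte index into the segment's table of referenced segment UUIDs together with a 3-byte offset; the exact bit layout below is an assumption, not something the proposal pins down:

```java
public class RecordRef {

    // 1 byte selects one of up to 255 referenced segment UUIDs listed
    // in the containing segment's header; 3 bytes give the record
    // offset within the (at most 16MB) target segment.
    static int pack(int segmentIndex, int offset) {
        if (segmentIndex < 0 || segmentIndex > 0xff
                || offset < 0 || offset > 0xffffff) {
            throw new IllegalArgumentException();
        }
        return (segmentIndex << 24) | offset;
    }

    static int segmentIndex(int ref) {
        return ref >>> 24; // unsigned shift: index 255 stays positive
    }

    static int offset(int ref) {
        return ref & 0xffffff;
    }
}
```

The padded variant mentioned above would instead store offsets in units of 4 bytes, freeing two offset bits for the segment index.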

> It seems the requirements for the underlying storage system are minimal,
> which is kind of nice, but makes me wonder why we need MongoDB then?
> Most of the data in MongoDB would consist of segment documents that
> contain binaries, right?

Right. The main benefit of using an underlying storage engine like
MongoDB is that we get distribution, sharding, and atomic updates
essentially for free. As noted, any other database that provides
similar functionality should also work with the proposed design.

>> Template records
>
> I'm a bit skeptical if the added complexity is worth the disk space saving.

I thought so too, but simple benchmarks show that the size savings are
significant (20-50%) especially for small, frequently accessed nodes
like ACL entries or CQ paragraphs that would ideally be kept as close
as possible to the containing parent node.

However, template records are by no means necessary for the design to
work, so we could well try it without the extra complexity.

>> A node that contains more than N properties or M child nodes (exact size
>> TBD, M ~ 1k) is stored differently, using map records for the properties
>> and child nodes. This way a node can become arbitrarily large and still
>> remain reasonably efficient to access and modify. The main downside of
>> this alternative storage layout is that the ordering of child nodes is
>> lost.
>
> There's no requirement to keep the child nodes in insertion order, right?

That's one of the trade-offs we discussed earlier. It seems safe to
drop the requirement of orderability (or even insertion order) for
large nodes. This shouldn't be much of a backwards-compatibility issue
given how poorly large nodes have performed so far, and for safety we
could also add a commit validator that prevents an orderable node from
growing beyond that threshold.
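Such a validator could amount to little more than a threshold check; the class name and the exact value of M below are placeholders:

```java
public class OrderableNodeValidator {

    // Placeholder for the child-node threshold M (~1k) at which the
    // map-based (unordered) storage layout would take over.
    static final int MAX_ORDERABLE_CHILDREN = 1000;

    // Rejects a commit that would grow an orderable node past the
    // point where child-node ordering can no longer be preserved.
    static void validate(boolean orderable, int childCount) {
        if (orderable && childCount > MAX_ORDERABLE_CHILDREN) {
            throw new IllegalStateException(
                    "orderable node would exceed "
                    + MAX_ORDERABLE_CHILDREN + " child nodes");
        }
    }
}
```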

> What is not yet clear to me is, how do you decide what the size of a segment
> should be? This can be quite delicate. On the one hand you want to avoid
> small segments because they are inefficient with all the UUID lookups. But
> larger segments are also not ideal, because they increase the cost of a write.
> I understand the complete segment needs to be re-written in this case,
> or do you have a design in mind where a segment can overlay another one?
> However, this leads to fragmentation.

I was thinking of each commit (or merge) resulting in one or more
segments (more than one only for a large commit). To prevent segments
from remaining too small, a process like Lucene's index merging could
be used in combination with garbage collection to combine them.

For example, if the garbage collector detects a sequence of more than
a few small segments (say < 256kB) referenced by only one journal,
then it could replace them all with one or more larger segments
containing copies of all the records accessible from those smaller
segments. Then it can update the journal to point to the combined
segments and let the next garbage collection round (once clients no
longer refer to the smaller segments) take care of disposing the small
segments that are no longer needed.
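The selection step of that heuristic could be sketched as below; the 256kB threshold and the single-journal condition come from the text, while the run-length cutoff and all names are assumptions:

```java
import java.util.*;

public class SegmentMerger {

    static final int SMALL_SEGMENT_LIMIT = 256 * 1024; // 256kB, from the text
    static final int MIN_RUN_LENGTH = 3; // "more than a few"; placeholder

    // segments: segment UUIDs in journal order; sizes: byte size per
    // segment; journalRefs: how many journals reference each segment.
    // Returns runs of small, single-journal segments worth combining.
    static List<List<UUID>> candidateRuns(List<UUID> segments,
            Map<UUID, Integer> sizes, Map<UUID, Integer> journalRefs) {
        List<List<UUID>> runs = new ArrayList<>();
        List<UUID> run = new ArrayList<>();
        for (UUID id : segments) {
            boolean mergeable = sizes.get(id) < SMALL_SEGMENT_LIMIT
                    && journalRefs.getOrDefault(id, 0) == 1;
            if (mergeable) {
                run.add(id);
            } else {
                if (run.size() >= MIN_RUN_LENGTH) {
                    runs.add(run);
                }
                run = new ArrayList<>();
            }
        }
        if (run.size() >= MIN_RUN_LENGTH) {
            runs.add(run);
        }
        return runs;
    }
}
```

Each returned run would be rewritten as one or more larger segments, the journal repointed, and the originals left for a later garbage collection round.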

BR,

Jukka Zitting
