jackrabbit-oak-dev mailing list archives

From Thomas Mueller <muel...@adobe.com>
Subject Re: [MongoMK] BlobStore garbage collection
Date Mon, 05 Nov 2012 13:04:36 GMT
Hi,

If possible, I would try to avoid having to traverse over the whole
repository. I know it is currently required for the data store garbage
collection, but with an index on references to binary content it wouldn't
be necessary there either (it would only have to traverse over all
references to binary content, which should be much faster).
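
As an illustration of the index idea, here is a minimal sketch in Java. It assumes blob ids are plain strings and that the index can be read as a simple collection of referenced ids; `BlobSweepSketch`, `sweep`, and both parameters are hypothetical names, not Oak or MongoMK APIs. It also ignores blobs that are added while the scan is running, which a real implementation would have to handle.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Sketch only: both collections are hypothetical stand-ins for the blob store and the index. */
final class BlobSweepSketch {

    /**
     * Removes every stored blob id that no indexed reference points to.
     * Only the reference index is read; the repository itself is never traversed.
     * Returns the removed ids in sorted order.
     */
    static Set<String> sweep(Set<String> storedBlobIds, Collection<String> referenceIndex) {
        Set<String> live = new HashSet<>(referenceIndex);
        Set<String> garbage = new TreeSet<>();
        for (String id : storedBlobIds) {
            if (!live.contains(id)) {
                garbage.add(id);
            }
        }
        storedBlobIds.removeAll(garbage);
        return garbage;
    }

    public static void main(String[] args) {
        Set<String> stored = new HashSet<>(Arrays.asList("b1", "b2", "b3"));
        // The index may list the same blob more than once (several nodes share it).
        Set<String> removed = sweep(stored, Arrays.asList("b1", "b3", "b3"));
        System.out.println("removed=" + removed + " remaining=" + new TreeSet<>(stored));
    }
}
```

The cost is proportional to the number of binary references plus the number of stored blobs, not to the number of nodes in the repository.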

To delete all node data, an idea (actually it is from Marcel, I hope I
didn't misunderstand it) is to use the journal. Basically, before deleting
the old journal entries, first read them and see what nodes were deleted
or overwritten at that time. The old versions of those nodes can be
deleted. That way you wouldn't need to traverse.
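
A rough sketch of how this could look, in Java. Everything here is an assumption made for illustration: `JournalEntry`, `NodeVersions`, and `collect` are hypothetical names, revisions are plain numbers, and node data is kept in memory rather than in MongoDB. The sketch also assumes that once a journal entry for revision r is dropped, nobody reads revisions before r anymore, so the versions of the changed paths older than r can go.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class JournalGcSketch {

    /** Hypothetical journal entry: one commit and the paths it deleted or overwrote. */
    static final class JournalEntry {
        final long revision;
        final List<String> changedPaths;

        JournalEntry(long revision, List<String> changedPaths) {
            this.revision = revision;
            this.changedPaths = changedPaths;
        }
    }

    /** Hypothetical node storage: path -> (revision -> node data). */
    static final class NodeVersions {
        private final Map<String, TreeMap<Long, String>> byPath = new HashMap<>();

        void put(String path, long revision, String data) {
            byPath.computeIfAbsent(path, p -> new TreeMap<>()).put(revision, data);
        }

        /** Deletes all versions of the path with a revision strictly below the given one. */
        int removeOlderThan(String path, long revision) {
            TreeMap<Long, String> versions = byPath.get(path);
            if (versions == null) {
                return 0;
            }
            SortedMap<Long, String> older = versions.headMap(revision);
            int count = older.size();
            older.clear(); // headMap is a view, so this removes from the backing map
            return count;
        }

        int versionCount(String path) {
            TreeMap<Long, String> versions = byPath.get(path);
            return versions == null ? 0 : versions.size();
        }
    }

    /**
     * Reads each journal entry up to maxRevision, deletes the superseded versions
     * of the paths it changed, then drops the entry. Returns the number of node
     * versions deleted. No repository traversal is involved.
     */
    static int collect(List<JournalEntry> journal, NodeVersions nodes, long maxRevision) {
        int removed = 0;
        Iterator<JournalEntry> it = journal.iterator();
        while (it.hasNext()) {
            JournalEntry entry = it.next();
            if (entry.revision > maxRevision) {
                continue; // too recent, keep both the entry and the node versions
            }
            for (String path : entry.changedPaths) {
                removed += nodes.removeOlderThan(path, entry.revision);
            }
            it.remove();
        }
        return removed;
    }

    public static void main(String[] args) {
        NodeVersions nodes = new NodeVersions();
        nodes.put("/a", 1, "a1");
        nodes.put("/a", 2, "a2");
        List<JournalEntry> journal = new ArrayList<>();
        journal.add(new JournalEntry(1, Arrays.asList("/a")));
        journal.add(new JournalEntry(2, Arrays.asList("/a")));
        System.out.println("deleted versions: " + collect(journal, nodes, 2));
    }
}
```

The work done is proportional to the number of changes recorded in the old journal entries, which is the point of the idea: unchanged nodes are never visited.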

> What's considered an "old" node or commit?

It would be nice to have a configuration setting where this is defined,
with a reasonable default. Initially, we could pick an arbitrary value,
for example 5 minutes, and then let's see if we run into problems with
that. For some use cases it might make sense to use a much higher value,
such as 1 day. A 'migration' or 'large import' is one such case: you
could then roll back to an old version if there is a problem (assuming
there is a way to roll back).
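
Such a setting could be as small as the following Java sketch. The class name `RevisionRetention`, the method `isOld`, and the use of millisecond timestamps are assumptions for illustration only; the 5 minute default is the arbitrary initial value suggested above.

```java
import java.time.Duration;

/** Sketch only: decides whether a commit is old enough to be garbage collected. */
final class RevisionRetention {

    /** Arbitrary initial default, to be revisited if it causes problems. */
    static final Duration DEFAULT_MAX_AGE = Duration.ofMinutes(5);

    private final Duration maxAge;

    RevisionRetention() {
        this(DEFAULT_MAX_AGE);
    }

    RevisionRetention(Duration maxAge) {
        if (maxAge.isNegative()) {
            throw new IllegalArgumentException("maxAge must not be negative");
        }
        this.maxAge = maxAge;
    }

    /** A commit is "old" once it is strictly older than the configured maximum age. */
    boolean isOld(long commitTimeMillis, long nowMillis) {
        return nowMillis - commitTimeMillis > maxAge.toMillis();
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long tenMinutesAgo = now - Duration.ofMinutes(10).toMillis();
        // Default setting versus a longer window for a migration or large import.
        System.out.println(new RevisionRetention().isOld(tenMinutesAgo, now));
        System.out.println(new RevisionRetention(Duration.ofDays(1)).isOld(tenMinutesAgo, now));
    }
}
```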

Regards,
Thomas






On 11/5/12 12:08 PM, "Mete Atamel" <matamel@adobe.com> wrote:

>On a related note, I think we also need a NodeStore (nodes & commits)
>garbage collection in MongoMK. Otherwise, MongoDB will be full of old node
>and commit data with no real benefit. The basic implementation idea is to
>have a background task to periodically go through old nodes and commits
>and delete them but this raises questions such as:
>
>1- What's considered an "old" node or commit? Technically, anything other
>than the head revision is old but can we remove them right away or do we
>need to retain a number of revisions? If the latter, then how far back do
>we need to retain?
>
>2- How often should the NodeStore GC run and for how long? How should this
>be controlled?
>
>3- Do other MicroKernel implementations handle this, if so how?
>
>If you have any feedback on any of this, I'd like to hear it.
>
>-Mete
>
>On 11/2/12 4:38 PM, "Mete Atamel" <matamel@adobe.com> wrote:
>
>>Thanks. Yes, I also think it's worthwhile to try implementing MongoDB
>>BlobStore based on AbstractBlobStore. Do we have tests somewhere where we
>>can compare different BlobStore implementations?
>>
>>-Mete
>>
>>On 11/2/12 3:50 PM, "Thomas Mueller" <mueller@adobe.com> wrote:
>>
>>>Hi,
>>>
>>>I would definitely at least *try* to implement a MongoDB BlobStore based
>>>on the AbstractBlobStore. It should be quite simple (one class). Then, it
>>>would be interesting to know which implementation is faster: the GridFS
>>>one or an implementation based on AbstractBlobStore :-) Especially if the
>>>difference is big. If GridFS is faster, maybe we could learn something
>>>from them.
>>>
>>>It looks like GridFS uses md5 hashes, which sounds a bit risky to me,
>>>especially if anonymous users can create binaries. An attacker could upload
>>>two files with the same md5 hash, which would at least "confuse" Oak and
>>>maybe GridFS, or maybe worse. I mean, using md5 for your own files is
>>>fine, but it seems problematic for Oak, because it would somewhat limit
>>>the use cases.
>>>
>>>Regards,
>>>Thomas
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>On 11/2/12 10:30 AM, "Mete Atamel" <matamel@adobe.com> wrote:
>>>
>>>>Hi,
>>>>
>>>>One of the things I need to implement for MongoMK is BlobStore garbage
>>>>collection. I see that there's an initial implementation for garbage
>>>>collection in AbstractBlobStore in oak-mk and I also see this bug [0] to
>>>>improve that initial implementation.
>>>>
>>>>MongoMK uses a GridFS based BlobStore, separate from AbstractBlobStore in
>>>>oak-mk. I could potentially come up with my own GC, based on that GridFS
>>>>implementation, or I could try a new AbstractBlobStore implementation for
>>>>MongoMK (not GridFS based). With the second approach, I potentially get
>>>>current and future garbage collection improvements for free.
>>>>
>>>>Not sure which path to follow yet but I wanted to see what others thought
>>>>before starting to work on it.
>>>>
>>>>Thanks,
>>>>Mete
>>>>
>>>>[0] https://issues.apache.org/jira/browse/OAK-377
>>>>
>>>
>>
>

