jackrabbit-oak-dev mailing list archives

From Thomas Mueller <muel...@adobe.com>
Subject Re: [MongoMK] BlobStore garbage collection
Date Wed, 07 Nov 2012 09:48:17 GMT

Didn't we talk once about defining a format for blob id references, so
that a value of the format "bin:{blobId}" (or similar) is a reference?
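
A minimal sketch of what such a convention could look like, assuming the
"bin:" prefix suggested above. The class and method names are hypothetical
and not part of any existing Oak or MicroKernel API:

```java
import java.util.Optional;

// Hypothetical helper: recognizes values of the form "bin:{blobId}".
final class BlobRef {
    // Prefix assumed from the "bin:{blobId}" suggestion; the real format is undecided.
    private static final String PREFIX = "bin:";

    private BlobRef() {}

    /** Returns the blob id if the value looks like "bin:{blobId}", otherwise empty. */
    static Optional<String> parse(String value) {
        if (value == null || !value.startsWith(PREFIX)
                || value.length() == PREFIX.length()) {
            return Optional.empty();
        }
        return Optional.of(value.substring(PREFIX.length()));
    }

    /** Builds the reference string for a given blob id. */
    static String format(String blobId) {
        return PREFIX + blobId;
    }
}
```

With this, BlobRef.parse("bin:abc123") yields the id "abc123", while an
ordinary string value yields an empty result. An ordinary string that
happens to start with "bin:" would be mistaken for a reference, so a real
format would also need an escaping rule.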


On 11/7/12 10:17 AM, "Michael Dürig" <mduerig@apache.org> wrote:

>On a related note: how does the garbage collector even find out whether
>a binary is "referenced"? That is, on the Microkernel level, what does
>it actually mean for a binary to be referenced?
>On 6.11.12 18:45, Michael Marth wrote:
>> this might be a weird question from the leftfield, but are we actually
>>sure that the existing data store concept is worth the trouble? afaiu it
>>saves us from storing the same binary twice, but leads into the DSGC
>>topic. would it be possible to make it optional to store/address
>>binaries by hash (and thus not need DSGC for these configurations)?
>> In any case we should definitely avoid requiring repo traversal for
>>DSGC. This would operationally limit the repo sizes Oak can support.
>> --
>> Michael Marth | Engineering Manager
>> +41 61 226 55 22 | mmarth@adobe.com<mailto:mmarth@adobe.com>
>> Barfüsserplatz 6, CH-4001 Basel, Switzerland
>> On Nov 6, 2012, at 9:24 AM, Thomas Mueller wrote:
>> Hi,
>> 1- What's considered an "old" node or commit? Technically, anything other
>> than the head revision is old, but can we remove them right away or do we
>> need to retain a number of revisions? If the latter, then how far back
>> do we need to retain?
>> we discussed this a while back, no good solution back then[1]
>> Yes. Somebody has to decide which revisions are no longer needed.
>> it doesn't need to be us :-) We might set a default value (10 minutes or
>> so), and then give the user the ability to change that, depending on
>> whether he cares more about disk space or the ability to read old data /
>> roll back to an old state.
>> To free up disk space, BlobStore garbage collection is actually more
>> important, because usually 90% of the disk space is used by the
>> So it would be nice if items (files) in the BlobStore are deleted as soon
>> as possible after deleting old revisions. In Jackrabbit 2.x we have seen
>> that node and data store garbage collection that has to traverse the
>> repository is problematic if the repository is large. So garbage
>> collection can be a scalability issue: if we have to traverse all
>> revisions of all nodes in order to delete unused data, we basically tie
>> garbage collection speed to repository size, unless we find a way to
>> run it in parallel. But running mark & sweep garbage collection
>> in parallel is not easy (is it even possible? if yes, I would have thought
>> modern JVMs would have had it for a long time). So I think if we don't have
>> to traverse the repository to delete old nodes, but just traverse the
>> journal, this would be much less of a problem.
>> Regards,
>> Thomas
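
The retention idea in the quoted message (always keep the head revision,
plus anything younger than a configurable age, default "10 minutes or so")
could be sketched roughly as follows. This is only an illustration: the
class, the method, and the assumption that commit times are available as an
oldest-first list are all hypothetical, not existing MicroKernel API:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: decides which revisions are old enough to remove.
final class RevisionRetention {
    // Default taken from the "10 minutes or so" suggestion in the thread.
    static final Duration DEFAULT_RETENTION = Duration.ofMinutes(10);

    private RevisionRetention() {}

    /**
     * Returns the commit times that may be removed. The list must be ordered
     * oldest first; the last entry is treated as the head and is always kept.
     */
    static List<Instant> removable(List<Instant> commitTimes, Instant now,
                                   Duration retention) {
        List<Instant> result = new ArrayList<>();
        Instant cutoff = now.minus(retention);
        // Stop before the last element: that is the head revision.
        for (int i = 0; i < commitTimes.size() - 1; i++) {
            Instant t = commitTimes.get(i);
            if (t.isBefore(cutoff)) {
                result.add(t);
            }
        }
        return result;
    }
}
```

Only the journal (the list of commits) is inspected here, not the node
tree, which is the property the quoted message argues for.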
