jackrabbit-oak-dev mailing list archives

From Michael Dürig <mdue...@apache.org>
Subject Re: [MongoMK] BlobStore garbage collection
Date Wed, 07 Nov 2012 09:52:04 GMT


On 7.11.12 9:48, Thomas Mueller wrote:
> Hi,
>
> Didn't we talk once about defining a format for blob id references, so
> that a value of the format "bin:{blobId}" (or similar) is a reference?

This is exactly the problem I wanted to pinpoint. There is a conceptual 
leak here: in order for the Microkernel implementation to know that 
something is a reference to a binary, it has to know about the 
interpretation of the items in the repository by the upper layers.

Michael
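
To make the concern concrete, here is a minimal sketch of a mark phase that relies on the "bin:{blobId}" convention mentioned above. Everything in it is an assumption for illustration: the dict-based node layout, the names `BLOB_PREFIX`, `referenced_blob_ids` and `sweep`, and the in-memory blob store are hypothetical and are not Oak or MicroKernel APIs.

```python
# Hypothetical marker; the thread only proposes "bin:{blobId}" (or similar).
BLOB_PREFIX = "bin:"


def referenced_blob_ids(node):
    """Mark phase: collect blob ids from all string property values in a tree.

    A node is assumed to be a dict with optional "properties" and "children"
    mappings. This layout is made up for the sketch.
    """
    found = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for value in current.get("properties", {}).values():
            # The storage layer can only guess from the value's shape.
            if isinstance(value, str) and value.startswith(BLOB_PREFIX):
                found.add(value[len(BLOB_PREFIX):])
        stack.extend(current.get("children", {}).values())
    return found


def sweep(blob_store, live_ids):
    """Sweep phase: delete every blob that was not marked; return removed ids."""
    dead = sorted(set(blob_store) - live_ids)
    for blob_id in dead:
        del blob_store[blob_id]
    return dead
```

Note that an ordinary string property whose text happens to start with "bin:" would be treated as a reference here. That is the leak in miniature: the lower layer has to adopt the upper layer's interpretation of values in order to decide what is referenced.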

>
> Regards,
> Thomas
>
>
>
> On 11/7/12 10:17 AM, "Michael Dürig" <mduerig@apache.org> wrote:
>
>>
>> On a related note: how does the garbage collector even find out whether
>> a binary is "referenced"? That is, on the Microkernel level, what does
>> it actually mean for a binary to be referenced?
>>
>> Michael
>>
>> On 6.11.12 18:45, Michael Marth wrote:
>>> this might be a weird question from left field, but are we actually
>>> sure that the existing data store concept is worth the trouble? afaiu it
>>> saves us from storing the same binary twice, but leads into the DSGC
>>> topic. would it be possible to make it optional to store/address
>>> binaries by hash (and thus not need DSGC for these configurations)?
>>>
>>> In any case we should definitely avoid requiring repo traversal for
>>> DSGC. This would operationally limit the repo sizes Oak can support.
>>>
>>>
>>> --
>>> Michael Marth | Engineering Manager
>>> +41 61 226 55 22 | mmarth@adobe.com
>>> Barfüsserplatz 6, CH-4001 Basel, Switzerland
>>>
>>> On Nov 6, 2012, at 9:24 AM, Thomas Mueller wrote:
>>>
>>> Hi,
>>>
>>> 1- What's considered an "old" node or commit? Technically, anything other
>>> than the head revision is old but can we remove them right away or do we
>>> need to retain a number of revisions? If the latter, then how far back do
>>> we need to retain?
>>>
>>> we discussed this a while back, no good solution back then[1]
>>>
>>> Yes. Somebody has to decide which revisions are no longer needed. Luckily
>>> it doesn't need to be us :-) We might set a default value (10 minutes or
>>> so), and then give the user the ability to change that, depending on
>>> whether he cares more about disk space or the ability to read old data /
>>> roll back to an old state.
>>>
>>> To free up disk space, BlobStore garbage collection is actually more
>>> important, because usually 90% of the disk space is used by the BlobStore.
>>> So it would be nice if items (files) in the BlobStore are deleted as soon
>>> as possible after deleting old revisions. In Jackrabbit 2.x we have seen
>>> that node and data store garbage collection that has to traverse the whole
>>> repository is problematic if the repository is large. So garbage
>>> collection can be a scalability issue: if we have to traverse all
>>> revisions of all nodes in order to delete unused data, we basically tie
>>> garbage collection speed to repository size, unless we find a way to
>>> run it in parallel. But running mark & sweep garbage collection completely
>>> in parallel is not easy (is it even possible? if yes, I would have guessed
>>> modern JVMs would have had it for a long time). So I think if we don't need
>>> to traverse the repository to delete old nodes, but just traverse the
>>> journal, this would be much less of a problem.
>>>
>>> Regards,
>>> Thomas
>>>
>>>
>>>
>
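
The retention idea from the oldest message in the thread (a default window of "10 minutes or so", adjustable by the user) can be sketched as a scan over the journal only. This is a rough illustration under stated assumptions: the journal is modelled as a list of `(revision_id, commit_time)` pairs in commit order with the head last, and the names `DEFAULT_RETENTION` and `collectable_revisions` are hypothetical, not part of any Oak API.

```python
from datetime import datetime, timedelta

# The "10 minutes or so" default floated in the thread; meant to be user-tunable.
DEFAULT_RETENTION = timedelta(minutes=10)


def collectable_revisions(journal, now, retention=DEFAULT_RETENTION):
    """Return ids of revisions older than the retention window.

    journal: list of (revision_id, commit_time) in commit order, head last
             (an assumed shape for this sketch).
    The head revision is never returned, however old it is.
    """
    if not journal:
        return []
    cutoff = now - retention
    # Skip the last entry (head); keep anything committed inside the window.
    return [rev for rev, committed in journal[:-1] if committed < cutoff]
```

The cost of this scan grows with the number of journal entries rather than with the number of nodes in the repository, which is the property the message argues for. How the blobs referenced only by the collected revisions would then be found is left open here, as it is in the thread.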
