jackrabbit-oak-dev mailing list archives

From Stefan Guggisberg <stefan.guggisb...@gmail.com>
Subject Re: [MongoMK] BlobStore garbage collection
Date Wed, 07 Nov 2012 10:24:17 GMT
On Wed, Nov 7, 2012 at 10:52 AM, Michael Dürig <mduerig@apache.org> wrote:
> On 7.11.12 9:48, Thomas Mueller wrote:
>> Hi,
>> Didn't we talk once about defining a format for blob id references, so
>> that a value of the format "bin:{blobId}" (or similar) is a reference?
> This is exactly the problem I wanted to pinpoint. There is a conceptual leak
> here: in order for the Microkernel implementation to know that something is
> a reference to a binary, it has to know about the interpretation of the
> items in the repository by the upper layers.

The format of references to binaries is documented in the MicroKernel Javadoc,
see "Retention Policy for Binaries" [0].


[0] http://svn.apache.org/repos/asf/jackrabbit/oak/trunk/oak-mk-api/src/main/java/org/apache/jackrabbit/mk/api/MicroKernel.java
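
To illustrate the kind of check a garbage collector would need, here is a minimal sketch of recognizing a blob reference from a string property value. It assumes the "bin:{blobId}" format floated earlier in this thread, which is not necessarily the format the Javadoc specifies; the class name BlobRef and its parse method are hypothetical and not part of the MicroKernel API.

```java
import java.util.Optional;

/**
 * Hypothetical helper, not part of the MicroKernel API.
 * ASSUMPTION: references use the "bin:{blobId}" format suggested in this
 * thread; the documented format may differ.
 */
final class BlobRef {
    static final String PREFIX = "bin:";

    private BlobRef() {
    }

    /** Returns the blob id if the value looks like a blob reference, else empty. */
    static Optional<String> parse(String value) {
        if (value == null || !value.startsWith(PREFIX)
                || value.length() == PREFIX.length()) {
            return Optional.empty();
        }
        return Optional.of(value.substring(PREFIX.length()));
    }
}
```

With a convention like this the MicroKernel only inspects the shape of a string value, so it does not need to know how the upper layers interpret the item that holds it.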

> Michael
>> Regards,
>> Thomas
>> On 11/7/12 10:17 AM, "Michael Dürig" <mduerig@apache.org> wrote:
>>> On a related note: how does the garbage collector even find out whether
>>> a binary is "referenced"? That is, on the Microkernel level, what does
>>> it actually mean for a binary to be referenced?
>>> Michael
>>> On 6.11.12 18:45, Michael Marth wrote:
>>>> this might be a weird question from left field, but are we actually
>>>> sure that the existing data store concept is worth the trouble? afaiu it
>>>> saves us from storing the same binary twice, but leads into the DSGC
>>>> topic. would it be possible to make it optional to store/address
>>>> binaries by hash (and thus not need DSGC for these configurations)?
>>>> In any case we should definitely avoid requiring repo traversal for
>>>> DSGC. This would operationally limit the repo sizes Oak can support.
>>>> --
>>>> Michael Marth | Engineering Manager
>>>> +41 61 226 55 22 | mmarth@adobe.com<mailto:mmarth@adobe.com>
>>>> Barfüsserplatz 6, CH-4001 Basel, Switzerland
>>>> On Nov 6, 2012, at 9:24 AM, Thomas Mueller wrote:
>>>> Hi,
>>>> 1- What's considered an "old" node or commit? Technically, anything other
>>>> than the head revision is old but can we remove them right away or do we
>>>> need to retain a number of revisions? If the latter, then how far back do
>>>> we need to retain?
>>>> we discussed this a while back, no good solution back then[1]
>>>> Yes. Somebody has to decide which revisions are no longer needed. Luckily
>>>> it doesn't need to be us :-) We might set a default value (10 minutes or
>>>> so), and then give the user the ability to change that, depending on
>>>> whether he cares more about disk space or the ability to read old data /
>>>> roll back to an old state.
>>>> To free up disk space, BlobStore garbage collection is actually more
>>>> important, because usually 90% of the disk space is used by the BlobStore.
>>>> So it would be nice if items (files) in the BlobStore are deleted as soon
>>>> as possible after deleting old revisions. In Jackrabbit 2.x we have seen
>>>> that node and data store garbage collection that has to traverse the whole
>>>> repository is problematic if the repository is large. So garbage
>>>> collection can be a scalability issue: if we have to traverse all
>>>> revisions of all nodes in order to delete unused data, we basically tie
>>>> garbage collection speed to repository size, unless we find a way to
>>>> run it in parallel. But running mark & sweep garbage collection completely
>>>> in parallel is not easy (is it even possible? if yes, I would have guessed
>>>> modern JVMs would have had it for a long time). So I think if we don't need
>>>> to traverse the repository to delete old nodes, but just traverse the
>>>> journal, this would be much less of a problem.
>>>> Regards,
>>>> Thomas
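
The journal-based idea at the end of the thread can be sketched roughly as follows. This is only an illustration under strong assumptions, not how MongoMK or any Oak component works: it assumes each journal entry records which blob ids a commit started and stopped referencing, and that all revisions at or after a cutoff time (plus the state just before it) must stay readable. The names JournalBlobGc, Commit and findGarbage are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: find unreferenced blobs by replaying the journal only. */
final class JournalBlobGc {

    /** ASSUMPTION: a journal entry lists the blob ids a commit added or dropped. */
    static final class Commit {
        final long timestamp;
        final Set<String> addedRefs;
        final Set<String> removedRefs;

        Commit(long timestamp, Set<String> addedRefs, Set<String> removedRefs) {
            this.timestamp = timestamp;
            this.addedRefs = addedRefs;
            this.removedRefs = removedRefs;
        }
    }

    /**
     * Replays the journal in commit order and returns the stored blobs that no
     * retained revision references. Cost grows with journal length, not with
     * the number of nodes in the repository.
     */
    static Set<String> findGarbage(List<Commit> journal, long cutoff,
                                   Set<String> storedBlobs) {
        Map<String, Integer> refCounts = new HashMap<>();
        Set<String> live = new HashSet<>();
        boolean retaining = false;
        for (Commit c : journal) {
            if (!retaining && c.timestamp >= cutoff) {
                // State just before the first retained commit is still readable.
                live.addAll(refCounts.keySet());
                retaining = true;
            }
            for (String id : c.addedRefs) {
                refCounts.merge(id, 1, Integer::sum);
            }
            for (String id : c.removedRefs) {
                // Drop the entry once the count reaches zero.
                refCounts.computeIfPresent(id, (k, n) -> n > 1 ? n - 1 : null);
            }
            if (retaining) {
                live.addAll(c.addedRefs);
            }
        }
        if (!retaining) {
            // Every commit is older than the cutoff: only the head is retained.
            live.addAll(refCounts.keySet());
        }
        Set<String> garbage = new HashSet<>(storedBlobs);
        garbage.removeAll(live);
        return garbage;
    }
}
```

The cutoff plays the role of the configurable retention period mentioned above (for example, 10 minutes before now). Whether a real journal carries enough information to derive addedRefs and removedRefs without reading node data is an open question this sketch does not answer.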
