jackrabbit-oak-dev mailing list archives

From Michael Dürig <mdue...@apache.org>
Subject Re: [MongoMK] BlobStore garbage collection
Date Wed, 07 Nov 2012 09:17:54 GMT

On a related note: how does the garbage collector even find out whether 
a binary is "referenced"? That is, on the Microkernel level, what does 
it actually mean for a binary to be referenced?


On 6.11.12 18:45, Michael Marth wrote:
> this might be a weird question from left field, but are we actually sure that the
> existing data store concept is worth the trouble? afaiu it saves us from storing the same
> binary twice, but leads into the DSGC topic. would it be possible to make it optional to
> store/address binaries by hash (and thus not need DSGC for these configurations)?
> In any case we should definitely avoid requiring repo traversal for DSGC. This would
> operationally limit the repo sizes Oak can support.
> --
> Michael Marth | Engineering Manager
> +41 61 226 55 22 | mmarth@adobe.com
> Barfüsserplatz 6, CH-4001 Basel, Switzerland
> On Nov 6, 2012, at 9:24 AM, Thomas Mueller wrote:
> Hi,
> 1- What's considered an "old" node or commit? Technically, anything other
> than the head revision is old but can we remove them right away or do we
> need to retain a number of revisions? If the latter, then how far back do
> we need to retain?
> we discussed this a while back, no good solution back then[1]
> Yes. Somebody has to decide which revisions are no longer needed. Luckily
> it doesn't need to be us :-) We might set a default value (10 minutes or
> so), and then give the user the ability to change that, depending on
> whether he cares more about disk space or the ability to read old data /
> roll back to an old state.
> To free up disk space, BlobStore garbage collection is actually more
> important, because usually 90% of the disk space is used by the BlobStore.
> So it would be nice if items (files) in the BlobStore are deleted as soon
> as possible after deleting old revisions. In Jackrabbit 2.x we have seen
> that node and data store garbage collection that has to traverse the whole
> repository is problematic if the repository is large. So garbage
> collection can be a scalability issue: if we have to traverse all
> revisions of all nodes in order to delete unused data, we basically tie
> garbage collection speed to repository size, unless we find a way to
> run it in parallel. But running mark & sweep garbage collection completely
> in parallel is not easy (is it even possible? if yes I would have guessed
> modern JVMs would have had it for a long time). So I think if we don't need
> to traverse the repository to delete old nodes, but just traverse the
> journal, this would be much less of a problem.
> Regards,
> Thomas
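
To make the terms in the quoted discussion concrete, here is a minimal sketch of a hash-addressed blob store with a traversal-based mark & sweep collector and a revision retention window. It is a toy model, not Oak or MongoMK code: `BlobStore`, `collect_garbage`, the `(timestamp, {path: blob_id})` revision layout and the 600-second default (the "10 minutes or so" mentioned above) are all assumptions made for illustration.

```python
import hashlib


class BlobStore:
    """Toy content-addressed store (hypothetical): id = SHA-256 of the bytes."""

    def __init__(self):
        self.blobs = {}

    def put(self, data: bytes) -> str:
        blob_id = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same id, so it is stored only once.
        self.blobs.setdefault(blob_id, data)
        return blob_id


def collect_garbage(store, revisions, now, retention_seconds=600):
    """Mark & sweep over the head and all revisions inside the retention window.

    `revisions` is a list of (timestamp, {path: blob_id}) pairs, oldest first;
    the last entry is the head revision. Returns the sorted ids that were deleted.
    """
    head = revisions[-1]
    retained = [r for r in revisions[:-1] if now - r[0] <= retention_seconds]
    retained.append(head)

    # Mark: visit every node of every retained revision.
    marked = set()
    for _, nodes in retained:
        marked.update(nodes.values())

    # Sweep: delete every blob that no retained revision references.
    garbage = [blob_id for blob_id in store.blobs if blob_id not in marked]
    for blob_id in garbage:
        del store.blobs[blob_id]
    return sorted(garbage)
```

In this model a binary is "referenced" if some node in a retained revision holds its id. The mark phase visits every node of every retained revision, so its cost grows with repository size, which is the scalability concern raised above. The sketch also ignores writers that add blobs between the mark and sweep phases.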
