jackrabbit-dev mailing list archives

From "Thomas Mueller" <thomas.tom.muel...@gmail.com>
Subject Re: Fw: Realtime datastore garbage collector
Date Wed, 28 Nov 2007 16:37:27 GMT

> > There is a problem with this approach: an identifier can be added to
> > multiple properties. Also, it may be used at other places. So you
> > would need to keep a reference count as well. Also, you would need to
> > be sure the reference counts are updated correctly ('transactional').
> Can you provide a test for this scenario?
> but I'm not sure if it's correct.

Yes, that's the main problem for me as well - I couldn't tell if the
implementation is correct because I don't fully understand those
things, and because there are not many tests for the current
Jackrabbit code. For example, on a rollback you would need to
re-insert the identifier. I will write more tests, but even then I
couldn't be sure all scenarios are covered sufficiently.
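The rollback case can be sketched roughly like this (all names here are
hypothetical and only illustrate the hazard, not Jackrabbit's actual
transaction machinery): if a transaction removes a reference and then
aborts, the old count must be restored, or the garbage collector could
later delete data that is still in use.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a reference count per data record identifier.
// If a transaction decrements a count and then rolls back, the
// decrement must be undone ('transactional' updates), otherwise the
// garbage collector may delete a record that is still referenced.
public class RefCountSketch {
    private final Map<String, Integer> counts = new HashMap<>();

    void addReference(String id) {
        counts.merge(id, 1, Integer::sum);
    }

    // Returns the previous count so a rollback can restore it.
    int removeReference(String id) {
        int before = counts.getOrDefault(id, 0);
        counts.put(id, before - 1);
        return before;
    }

    // On rollback, re-insert the identifier's old count.
    void rollback(String id, int before) {
        counts.put(id, before);
    }

    int count(String id) {
        return counts.getOrDefault(id, 0);
    }

    public static void main(String[] args) {
        RefCountSketch store = new RefCountSketch();
        store.addReference("rec1");
        int before = store.removeReference("rec1"); // transaction starts
        store.rollback("rec1", before);             // transaction aborts
        System.out.println(store.count("rec1"));    // prints 1
    }
}
```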

> > A simpler mechanism would be to store back-references: each data
> > record / identifier would know who references it. The garbage
> > collection could then follow the back-references and check if they are
> > still valid (and if not remove them). Items without valid back
> > references could be deleted. This allows to delete very large objects
> > quickly (if they are not used of course).
> Can you elaborate on this? Maybe I can test the idea then.

Sure. For each DataRecord there would be a list of references. This
list is persisted as well (ideally in the same transaction as the data,
or before it; unfortunately there is currently no way to guarantee
that). That means each data record knows who references it:

Reference list:

When somebody else stores the same item the list would be extended:
Reference list:

And so on. Items are only appended to the reference list, but never
removed (even if the reference is deleted), to simplify things. That
means the list could be stale.
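The append-only reference list described above might look something like
this (a minimal sketch; the class and method names are hypothetical, not
an existing Jackrabbit API):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an append-only back-reference list: each data
// record keeps the identifiers of the items that (at some point)
// referenced it. Entries are never removed, even when a reference is
// deleted, so the list may contain stale entries.
public class BackReferenceSketch {
    private final Map<String, List<String>> backRefs = new LinkedHashMap<>();

    // Called whenever another item stores the same record; only appends.
    void recordReference(String recordId, String referrer) {
        backRefs.computeIfAbsent(recordId, k -> new ArrayList<>()).add(referrer);
    }

    List<String> referencesOf(String recordId) {
        return backRefs.getOrDefault(recordId, List.of());
    }

    public static void main(String[] args) {
        BackReferenceSketch ds = new BackReferenceSketch();
        ds.recordReference("rec1", "/node/a/binary");
        ds.recordReference("rec1", "/node/b/binary"); // same data stored again
        System.out.println(ds.referencesOf("rec1").size()); // prints 2
    }
}
```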

Now when running the garbage collection, the reference lists of the
largest objects are scanned and updated. If no more references are
found, the item is deleted.
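The sweep could be sketched as follows (again hypothetical names; the
validity check stands in for whatever lookup the persistence manager
would provide): scan each record's back-references, drop the stale
ones, and delete records left with no valid reference.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch of the garbage-collection sweep: scan and update
// the (possibly stale) reference lists, then delete any record that no
// longer has a valid back-reference.
public class GcSweepSketch {
    static Set<String> sweep(Map<String, List<String>> backRefs,
                             Predicate<String> stillValid) {
        Set<String> deleted = new HashSet<>();
        for (Map.Entry<String, List<String>> e : backRefs.entrySet()) {
            // Update the list: remove references that are no longer valid.
            e.getValue().removeIf(ref -> !stillValid.test(ref));
            if (e.getValue().isEmpty()) {
                deleted.add(e.getKey()); // no valid references: safe to delete
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        Map<String, List<String>> refs = new HashMap<>();
        refs.put("recA", new ArrayList<>(List.of("/live/prop")));
        refs.put("recB", new ArrayList<>(List.of("/deleted/prop")));
        Set<String> gone = sweep(refs, ref -> ref.startsWith("/live"));
        System.out.println(gone); // prints [recB]
    }
}
```

Scanning the largest objects first, as suggested above, is what lets
very large unused records be reclaimed quickly.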

> > But at this time, I would argue it is safer to keep the data store
> > mechanism as is, without trying to add more features (adding more data
> > store implementations is not a problem of course), unless we really
> > fix a bug. I think it makes more sense to spend the time improving the
> > architecture of Jackrabbit before trying to add more complex
> > algorithms to the data store (which are not required afterwards).
> This is not another feature, it's the most useful version of the GC. I think
> it's critical for large repositories to have a GC that periodically reclaims
> unused space.

Yes, GC must be implemented. What I wanted to say was: I think it is
better at this stage if the implementation is defensive.

> Regarding the scenario I presented, what I would like to know is if we
> consider it a risk or not. I'm still not sure about this issue.

I think it is too risky to remove the transient identifier when the
item is stored in the persistence manager.
