jackrabbit-dev mailing list archives

From "Thomas Mueller" <thomas.tom.muel...@gmail.com>
Subject Re: NGP: Value records
Date Thu, 14 Jun 2007 15:43:53 GMT

> mark-and-sweep garbage
> I agree that it's slow and late, but I don't think either is a big
> problem.

I think it is. But it is not very important to decide on a garbage
collection algorithm at this stage; it is still possible to switch
algorithms later on. Granted, that is a bit more work.

> The garbage collection process can be run in the background
> (it doesn't block normal access) so performance isn't essential

It can't run while others are writing to the repository, and I think
that's a problem. Example: let's say the garbage collection algorithm
scans the repository from left to right, and 'A' is a link to 'File
A'. At the beginning of the scan, the node containing 'A' sits in a
part of the repository the scanner has not reached yet. After some
time, the scan passes a certain position. Now somebody moves the node
that contains 'A' to a position the scanner has already visited. The
scan finishes without ever having seen a reference to 'A', so 'File A'
is collected even though it is still referenced.
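The race can be sketched as a small, hypothetical simulation (not Jackrabbit code; the class and its slot model are invented for illustration):

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation of the race: a reference that is moved from the
// unscanned region to the already-scanned region during a mark pass is
// never seen, so the record it points to is swept although still in use.
public class ScanRace {

    // Each slot may hold a reference to the large object "A" (or null).
    // The writer moves the reference from moveFrom to moveTo while the
    // scanner is between the two positions.
    static boolean scanMisses(List<String> slots, int moveFrom, int moveTo) {
        boolean marked = false;
        for (int i = 0; i < slots.size(); i++) {
            if (i == moveTo + 1) {           // writer acts mid-scan
                slots.set(moveTo, slots.get(moveFrom));
                slots.set(moveFrom, null);
            }
            if ("A".equals(slots.get(i))) {
                marked = true;               // scanner saw the reference
            }
        }
        return !marked;                      // true => "A" wrongly collected
    }

    public static void main(String[] args) {
        List<String> repo = Arrays.asList(null, null, null, "A", null);
        // move the reference from slot 3 to slot 0 while the scan is at slot 1
        System.out.println(scanMisses(repo, 3, 0));   // prints: true
    }
}
```

The single-threaded loop stands in for the concurrent writer; the point is only that the reference is live the whole time, yet the scan never observes it.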

> given the amount of space that the approach saves in typical setups
> I'm not too worried about reclaiming unused space later than
> necessary.

That depends on the setup. If you use the repository to manage movies
(no versioning), then I would be worried.

> The main problem I have with reference counting in this case is that
> it would bind the data store into transaction handling

Yes, a little bit.

> and all related issues.

Could you clarify? I would increment the counts early (before
committing) and decrement the counts late (after the commit). Then the
worst case after a crash is a counter that is too high, which only
delays reclamation; it never deletes data that is still referenced.
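A minimal sketch of that ordering (hypothetical names, not the actual data store API): increments are applied before the commit, decrements only after it, so a crash at any point leaves counts that are at worst too high:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hedged sketch: reference counts are incremented before the commit and
// decremented only after it. If the process crashes in between, a count
// can end up too high (the record is kept longer than needed), but never
// too low (a still-referenced record is never deleted).
public class RefCountTx {

    final Map<String, Integer> counts = new HashMap<>();

    void commit(List<String> added, List<String> removed, boolean crash) {
        for (String id : added) {
            counts.merge(id, 1, Integer::sum);       // 1. increment early
        }
        // 2. ... persist and commit the transaction here ...
        if (crash) {
            return;            // simulated crash: decrements are lost
        }
        for (String id : removed) {
            counts.merge(id, -1, Integer::sum);      // 3. decrement late
        }
    }
}
```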

Actually, what about back references: each large object knows who (it
thinks) is pointing to it. Mark and sweep would then be trivial. The
additional space used would be minimal (compared to a large object).
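A sketch of the back-reference idea (illustrative only, no Jackrabbit API): each large record stores the ids of the nodes it believes reference it, and collection just checks whether any of them are still live:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: with back references stored next to each large
// object, mark-and-sweep reduces to checking a handful of node ids per
// record instead of scanning the whole repository.
public class BackRefs {

    // record id -> ids of nodes believed to reference the record
    static final Map<String, Set<String>> backRefs = new HashMap<>();

    // A record is garbage when none of its claimed referrers still exist.
    static boolean isGarbage(String recordId, Set<String> liveNodes) {
        Set<String> refs = backRefs.get(recordId);
        if (refs == null) {
            return true;                                   // nothing points here
        }
        refs.removeIf(node -> !liveNodes.contains(node));  // drop stale ids
        return refs.isEmpty();
    }
}
```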

> It would also introduce locking inside the data store to avoid
> problems with concurrent reference changes.

Manipulating references to large objects is not that common, I think:
moving nodes (maybe) and versioning. I would use a simple 'synchronized'.
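Something like the following would likely do (a sketch, not the real data store interface): one coarse lock on the store, which is cheap precisely because reference changes are rare:

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch: coarse-grained locking for reference changes. Since moves
// and versioning are the only operations that touch these counts, a plain
// 'synchronized' should be enough; no elaborate locking scheme is needed.
public class SimpleDataStore {

    private final Map<String, Integer> refCounts = new HashMap<>();

    public synchronized void addReference(String id) {
        refCounts.merge(id, 1, Integer::sum);
    }

    public synchronized void removeReference(String id) {
        refCounts.merge(id, -1, Integer::sum);
    }

    public synchronized int getReferenceCount(String id) {
        return refCounts.getOrDefault(id, 0);
    }
}
```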

> > why not store large Strings in the global data store
> I was thinking of perhaps adding isString() and getString() methods
> in DataRecord for checking whether a given binary stream is valid UTF-8
> and for retrieving the encoded string value in case it is.

I probably lost you here. The application decides if it wants to use
PropertyType.STRING or PropertyType.BINARY. No need to guess the type
from the byte array. I was thinking about storing large instances of
PropertyType.STRING (java.lang.String) as a file.
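What I mean can be sketched like this (the threshold and names are invented for illustration): since the application already fixed the type, a large STRING value is simply written as its UTF-8 bytes and decoded on read, with no guessing:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: large PropertyType.STRING values stored as files in
// the data store. The type is known up front, so the stored bytes are just
// the UTF-8 encoding of the string; no content sniffing is involved.
public class LargeStringStore {

    static final int THRESHOLD = 1024;   // assumed cutoff, not from the thread

    static boolean storeAsFile(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length > THRESHOLD;
    }

    static byte[] encode(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] stored) {
        return new String(stored, StandardCharsets.UTF_8);
    }
}
```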

> Together with the above inline mechanism we should in fact be able to
> make no distinction between binary and string values in the
> persistence layer.

Yes. You could add a property 'isLarge' to InternalValue, or you could
extend InternalValue. Actually, I think InternalValue is quite memory
intensive: it uses two objects for each INTEGER. I suggest using an
interface, with InternalValueInt, InternalValueString,
InternalValueLong and so on. And/or use a cache for the most commonly
used objects (integers 0-1000, the empty String, boolean true/false).
But that's another discussion. Sorry.
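Roughly what I have in mind (illustrative names, not the existing InternalValue API): one interface, one small implementation per type, and a static cache for the common values, much like Integer.valueOf:

```java
// Hedged sketch: a per-type implementation avoids wrapping every INTEGER in
// two objects, and a static cache serves the most common values (here the
// integers 0-1000) without allocation, similar to Integer.valueOf.
public interface TypedValue {

    Object get();

    final class IntValue implements TypedValue {
        private static final IntValue[] CACHE = new IntValue[1001]; // 0..1000
        static {
            for (int i = 0; i < CACHE.length; i++) {
                CACHE[i] = new IntValue(i);
            }
        }
        private final int value;
        private IntValue(int value) { this.value = value; }
        public static IntValue of(int value) {
            return (value >= 0 && value <= 1000) ? CACHE[value] : new IntValue(value);
        }
        public Object get() { return value; }
    }

    final class StringValue implements TypedValue {
        public static final StringValue EMPTY = new StringValue("");
        private final String value;
        public StringValue(String value) { this.value = value; }
        public Object get() { return value; }
    }
}
```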

