jackrabbit-oak-dev mailing list archives

From Thomas Mueller <muel...@adobe.com>
Subject Re: [MongoMK] BlobStore garbage collection
Date Wed, 07 Nov 2012 08:51:09 GMT
Hi,

>are we actually sure that the existing data store concept is worth the
>trouble?

No, we can't really be sure. There are multiple concepts involved: one is
using a (cryptographic) hash, another is using garbage collection.

As for "garbage collection" versus "delete blob when deleting the
referencing node", the main advantages I see are: simpler creation of
blobs (no state; no delete on rollback), and simpler, fast node copy (no
need to duplicate blobs), and simpler node delete (no need to delete
blobs). The big disadvantage is much slower delete (garbage collection). I
like a lot that the implementation is simpler, but I would like to make
the garbage collection faster. One way to do that is to use an index on
references to binaries. We already have such an index mechanism for
referenceable nodes (index on property types), I believe this should speed
up garbage collection a lot so that is should no be a problem.

As for the second concept, "hash" (versus using a counter or similar):
the main advantages are saving space (data de-duplication) and the
ability to combine / share blob stores across multiple repositories. The
theoretical disadvantage is slower store performance due to calculating
the hash code. I'm not aware that calculating the hash code was ever a
problem for our use case, so I wouldn't change much here.
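
To make the de-duplication point concrete, a minimal sketch of a
hash-based store (in-memory and simplified, just for illustration; the
real data store of course writes to files or a database): storing the
same content twice returns the same identifier and keeps only one copy.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HashMap;
    import java.util.Map;

    class HashBlobStore {
        private final Map<String, byte[]> store = new HashMap<>();

        // The blob id is the SHA-256 hash of the content, so identical
        // content always gets the same id and is physically stored only once.
        String put(byte[] content) throws NoSuchAlgorithmException {
            String id = toHex(MessageDigest.getInstance("SHA-256").digest(content));
            store.putIfAbsent(id, content.clone());
            return id;
        }

        byte[] get(String id) {
            return store.get(id);
        }

        private static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder(bytes.length * 2);
            for (byte b : bytes) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        }

        public static void main(String[] args) throws NoSuchAlgorithmException {
            HashBlobStore store = new HashBlobStore();
            String a = store.put("same content".getBytes(StandardCharsets.UTF_8));
            String b = store.put("same content".getBytes(StandardCharsets.UTF_8));
            System.out.println(a.equals(b)); // true: one copy, two references
        }
    }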

For Jackrabbit 2.x, one disadvantage was that blobs could be arbitrarily
large. This was a problem because temporary files had to be stored,
which is an issue if the file system is sharded. Because of that,
Jackrabbit 2.x doesn't currently support sharded file data stores, which
is a pity. If we implemented it, it would run at half the necessary I/O
speed (files would need to be written twice). It was also a problem for
the database data store, because some databases have problems with
arbitrarily large blobs (for example MySQL). Also, database blob handles
had to be kept open, which was very problematic. For Oak, I implemented
a mechanism to split large blobs into smaller blocks, so all those
problems are solved. By the way, DropBox uses a similar mechanism (it
also splits binaries into blocks of 2 MB).
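
The splitting itself is simple; roughly like this (block size and names
are just for this example, not the actual Oak implementation): the
stream is read in fixed-size blocks, each block is stored under its own
hash, and the blob id is the list of block ids. Large binaries then
never need a temporary file, and identical blocks are shared.

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class BlockSplittingStore {
        private static final int BLOCK_SIZE = 2 * 1024 * 1024; // 2 MB per block

        private final Map<String, byte[]> blocks = new HashMap<>();

        // Read the stream block by block; each block is stored under its
        // hash, and the returned list of block ids identifies the whole blob.
        List<String> put(InputStream in)
                throws IOException, NoSuchAlgorithmException {
            List<String> blockIds = new ArrayList<>();
            byte[] block;
            while ((block = in.readNBytes(BLOCK_SIZE)).length > 0) { // Java 11+
                String id = toHex(MessageDigest.getInstance("SHA-256").digest(block));
                blocks.putIfAbsent(id, block); // identical blocks stored once
                blockIds.add(id);
            }
            return blockIds;
        }

        private static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder(bytes.length * 2);
            for (byte b : bytes) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        }

        public static void main(String[] args) throws Exception {
            BlockSplittingStore store = new BlockSplittingStore();
            byte[] data = new byte[5 * 1024 * 1024]; // 5 MB of zeros
            List<String> ids = store.put(new ByteArrayInputStream(data));
            System.out.println(ids.size()); // 3 blocks: 2 MB + 2 MB + 1 MB
        }
    }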

>In any case we should definitely avoid to require repo traversal for DSGC.

Yes, I fully agree. Using an index on blob references should solve this
(see the sketch above). Any other ideas?

Regards,
Thomas

