jackrabbit-users mailing list archives

From "Seidel. Robert" <Robert.Sei...@aeb.de>
Subject AW: AW: AW: AW: AW: Incremental/deduplicating versioning
Date Thu, 14 Jul 2011 06:27:22 GMT
>> The DataStore is called from Jackrabbit (addRecord) with some stream and
>> has absolutely no idea what the original file was.

>It doesn't need to know.

To store the diff between two versions of a file, it does. That is all I am saying.

Example: You have documents a and b in three versions each: a1, a2, a3 and b1, b2, b3.

If the user calls addRecord with the content of a3, the DataStore does need to know which
record is a2 in order to store the "diff between a2 and a3". It does not need to know that
if you cross-link the documents because that would save more space (maybe the diff between
b3 and a3 is a lot smaller than the one between a2 and a3), but then it is not storing
"the diff of two file versions".
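
To make the a2/a3 example concrete, here is a minimal Java sketch. It is not Jackrabbit code: the class name, the two-argument overload and the trivial "common prefix plus tail" delta are all assumptions made for illustration. The one-argument method mirrors the stream-only call discussed above and can only store the full content; a version diff becomes possible only once the caller names the previous record.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the Jackrabbit DataStore.
class VersionDeltaStore {
    // A record is either full content (baseId == -1) or a delta against baseId.
    private static final class Rec {
        final int baseId;
        final int prefixLen;
        final byte[] tail;
        Rec(int baseId, int prefixLen, byte[] tail) {
            this.baseId = baseId;
            this.prefixLen = prefixLen;
            this.tail = tail;
        }
    }

    private final List<Rec> records = new ArrayList<>();

    // Stream-only call: nothing says which record is the previous version,
    // so the content is stored in full.
    int addRecord(InputStream in) {
        records.add(new Rec(-1, 0, readAll(in)));
        return records.size() - 1;
    }

    // Assumed extension: the caller passes the id of the previous version.
    // The "diff" here is only the length of the shared prefix plus the new tail.
    int addRecord(InputStream in, int baseId) {
        byte[] base = read(baseId);
        byte[] data = readAll(in);
        int n = 0;
        while (n < base.length && n < data.length && base[n] == data[n]) {
            n++;
        }
        records.add(new Rec(baseId, n, Arrays.copyOfRange(data, n, data.length)));
        return records.size() - 1;
    }

    // Rebuilds the content; a delta record cannot be read without its base.
    byte[] read(int id) {
        Rec r = records.get(id);
        if (r.baseId < 0) {
            return r.tail.clone();
        }
        byte[] base = read(r.baseId);
        byte[] out = Arrays.copyOf(base, r.prefixLen + r.tail.length);
        System.arraycopy(r.tail, 0, out, r.prefixLen, r.tail.length);
        return out;
    }

    // Number of bytes physically kept for this record.
    int storedBytes(int id) {
        return records.get(id).tail.length;
    }

    private static byte[] readAll(InputStream in) {
        try {
            return in.readAllBytes(); // requires Java 9 or later
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        VersionDeltaStore store = new VersionDeltaStore();
        int a2 = store.addRecord(stream("hello world"));
        int a3 = store.addRecord(stream("hello there"), a2);
        System.out.println(new String(store.read(a3), StandardCharsets.UTF_8));
        System.out.println(store.storedBytes(a2) + " vs " + store.storedBytes(a3));
    }

    private static InputStream stream(String s) {
        return new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8));
    }
}
```

Note that reading a3 in this sketch requires a2 to be intact, which is the dependency between records that the rest of this message is concerned about.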

The general approach to reducing duplicates also causes problems when cleaning up the
DataStore, and if one file is corrupt, then a lot of data is corrupt.
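
The corruption concern can be illustrated with a small Java sketch of chunk-level deduplication. Everything in it is hypothetical (the class, the fixed 4-character chunk size, the file names); it only shows that once independent files share stored chunks, damage to one chunk reaches every file that references it, and a cleanup has to check all references before deleting anything.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a store that shares identical chunks between files.
class SharedChunkStore {
    private static final int CHUNK = 4; // illustrative chunk size in characters

    private final List<String> chunks = new ArrayList<>();            // chunk id -> content
    private final Map<String, Integer> index = new HashMap<>();       // content -> chunk id
    private final Map<String, List<Integer>> files = new LinkedHashMap<>();

    // Splits the content into fixed-size chunks and reuses chunks already stored.
    void add(String name, String content) {
        List<Integer> refs = new ArrayList<>();
        for (int i = 0; i < content.length(); i += CHUNK) {
            String c = content.substring(i, Math.min(i + CHUNK, content.length()));
            Integer id = index.get(c);
            if (id == null) {
                id = chunks.size();
                chunks.add(c);
                index.put(c, id);
            }
            refs.add(id);
        }
        files.put(name, refs);
    }

    // Every file that becomes unreadable if this one chunk is damaged or deleted.
    List<String> filesUsing(int chunkId) {
        List<String> hit = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : files.entrySet()) {
            if (e.getValue().contains(chunkId)) {
                hit.add(e.getKey());
            }
        }
        return hit;
    }

    int chunkCount() {
        return chunks.size();
    }

    public static void main(String[] args) {
        SharedChunkStore store = new SharedChunkStore();
        store.add("a3", "HEADbodyAAAA");
        store.add("b3", "HEADbodyBBBB");
        System.out.println(store.chunkCount());
        System.out.println(store.filesUsing(0));
    }
}
```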

> It's always possible to create a concrete diff between two files. The question is: how
> large is the diff.

Creating a diff of binaries makes no sense at all. The diff would be larger than simply storing
the binary as it is. 
Analyzing this will take some time, which leads to the next point.

>> Sure it can look for similar files (performance?)

>There are various algorithms on how to search efficiently, depending on
>the requirement (performance, memory, compression ratio).

The fastest way is always to store and retrieve the content as it is; no algorithm can be
faster. So deduplication will always affect performance; even computing the hash sum
currently does.
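
For reference, whole-file deduplication by hash sum can be sketched in Java as follows. This is an illustration under assumptions, not the Jackrabbit implementation: the class name and the choice of SHA-256 are mine. It shows where the extra work comes from: every byte of the stream has to pass through the digest before the store can tell whether the content already exists.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a content-addressed store.
class HashDedupStore {
    private final Map<String, byte[]> blobs = new HashMap<>();

    // Returns the hex digest of the content, which doubles as the record id.
    // Identical content is stored only once.
    String addRecord(InputStream in) {
        try {
            byte[] data = in.readAllBytes(); // requires Java 9 or later
            // SHA-256 is an assumption made for this sketch.
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(data); // the per-byte cost mentioned above
            StringBuilder id = new StringBuilder();
            for (byte b : digest) {
                id.append(String.format("%02x", b));
            }
            blobs.putIfAbsent(id.toString(), data);
            return id.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    int size() {
        return blobs.size();
    }

    public static void main(String[] args) {
        HashDedupStore store = new HashDedupStore();
        byte[] content = "same bytes".getBytes(StandardCharsets.UTF_8);
        String first = store.addRecord(new ByteArrayInputStream(content));
        String second = store.addRecord(new ByteArrayInputStream(content));
        System.out.println(first.equals(second) + " " + store.size());
    }
}
```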

>>but that is maybe not the diff to the previous file from application view.

>Why would you want the diff from the application view?

Why would I want the DataStore to create dependencies between files that are completely
independent? So that I end up with a corrupt A3 whenever B3 is corrupt?

Regards, Robert

-----Original Message-----
From: Thomas Mueller [mailto:mueller@adobe.com]
Sent: Wednesday, 13 July 2011 14:25
To: users@jackrabbit.apache.org
Subject: Re: AW: AW: AW: AW: Incremental/deduplicating versioning

Hi,

>This is not possible.

Well, it is.

> The DataStore is called from Jackrabbit (addRecord) with some stream and
>has absolutely no idea what the original file was.

It doesn't need to know.

>So it can't determine the concrete diff.

It's always possible to create a concrete diff between two files. The
question is: how large is the diff.

> Sure it can look for similar files (performance?)

There are various algorithms on how to search efficiently, depending on
the requirement (performance, memory, compression ratio).


>but that is maybe not the diff to the previous file from application view.

Why would you want the diff from the application view?

Regards,
Thomas

