From: "Seidel. Robert"
To: "users@jackrabbit.apache.org"
Subject: AW: AW: AW: AW: AW: Incremental/deduplicating versioning
Date: Thu, 14 Jul 2011 06:27:22 +0000

>> The DataStore is called from Jackrabbit (addRecord) with some stream and
>> has absolutely no idea what the original file was.
> It doesn't need to know.

To store the diff of two versions of a file, it does. That is all I am saying.

Example: you have documents a and b in three versions each: a1, a2, a3 and
b1, b2, b3.

If the user calls addRecord with the content of a3, the DataStore needs to
know which record is a2 in order to store "the diff between a2 and a3". It
does not need to know that if you cross-link the documents because that saves
more space (maybe the diff between b3 and a3 is a lot smaller than the diff
between a2 and a3), but then it is not storing "the diff of two file
versions".

This general way of reducing duplicates also leads to problems when cleaning
up the DataStore, and if one file is corrupt, then a lot of data is corrupt.

> It's always possible to create a concrete diff between two files. The
> question is: how large is the diff.

Creating a diff of binaries makes no sense at all. The diff would be larger
than simply storing the binary as it is. Analyzing this will take some time,
which leads to the next point.

>> Sure it can look for similar files (performance?)
> There are various algorithms on how to search efficiently, depending on
> the requirement (performance, memory, compression ratio).

The fastest way is always to store and retrieve the content as it is; no
algorithm can be faster. So deduplication will always affect performance;
even computing the hash sum currently does.

>> but that is maybe not the diff to the previous file from application view.
> Why would you want the diff from the application view?

Why would I want the DataStore to create dependencies between files that are
completely independent? So that I retrieve a corrupt A3 if B3 is corrupt?

Regards,
Robert

-----Original Message-----
From: Thomas Mueller [mailto:mueller@adobe.com]
Sent: Wednesday, 13 July 2011 14:25
To: users@jackrabbit.apache.org
Subject: Re: AW: AW: AW: AW: Incremental/deduplicating versioning

Hi,

> This is not possible.

Well, it is.

> The DataStore is called from Jackrabbit (addRecord) with some stream and
> has absolutely no idea what the original file was.

It doesn't need to know.

> So it can't determine the concrete diff.

It's always possible to create a concrete diff between two files. The
question is: how large is the diff.

> Sure it can look for similar files (performance?)

There are various algorithms on how to search efficiently, depending on
the requirement (performance, memory, compression ratio).

> but that is maybe not the diff to the previous file from application view.

Why would you want the diff from the application view?

Regards,
Thomas
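
Editorial illustration of the two positions in this thread: the following is a minimal sketch of a store that receives only a stream, hashes it while reading, and keeps identical content once. It shows that whole-content deduplication works without knowing which file a stream belongs to, and also that every add pays for a hashing pass. The class HashKeyedStore, its in-memory map, and the simplified addRecord signature are assumptions made for this example; this is not Jackrabbit's DataStore API, and it does not store diffs between versions.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory store for illustration only (not Jackrabbit code).
class HashKeyedStore {
    private final Map<String, byte[]> records = new HashMap<>();

    // Reads the whole stream, hashing it on the way, and stores the bytes
    // under the hex SHA-1 digest. Identical content yields the same key,
    // so a second add of the same bytes does not create a second record.
    String addRecord(InputStream in) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        ByteArrayOutputStream content = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                digest.update(buffer, 0, n);   // the extra per-byte cost
                content.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder id = new StringBuilder();
        for (byte b : digest.digest()) {
            id.append(String.format("%02x", b));
        }
        records.putIfAbsent(id.toString(), content.toByteArray());
        return id.toString();
    }

    int size() {
        return records.size();
    }

    public static void main(String[] args) {
        HashKeyedStore store = new HashKeyedStore();
        byte[] a3 = "version three of a".getBytes(StandardCharsets.UTF_8);
        String first = store.addRecord(new ByteArrayInputStream(a3));
        String second = store.addRecord(new ByteArrayInputStream(a3));
        System.out.println(first.equals(second)); // same content, same key
        System.out.println(store.size());
    }
}
```

Note that in this sketch each record is stored in full and independently, so a corrupt record affects only content with that exact hash. A delta-based scheme, as debated above, would add dependencies between records that this sketch does not model.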