Date: Thu, 20 Mar 2014 10:50:22 +0100
Subject: Inefficient backup on TarMK
From: Alex Parvulescu
To: oak-dev@jackrabbit.apache.org

Hi,

I'd like to ask for advice about a problem I've noticed recently with the TarMK backup.

At its core, the TarMK backup relies on a regular content diff. The first backup doesn't find anything on the target and copies all nodes over; the second and later backups diff the content and apply the changes incrementally.

One optimization of the TarMK diff is to check whether the segment ids of 2 node states are the same, which makes for a really fast compareTo method. Combined, these two make for a fast, incremental backup. So far so good.

The problem I ran into appears when there are enough content writes to trigger a segment flush: the same node, even if unchanged, ends up in a different segment and therefore gets a different segment id. The backup then fails to fast-match the node states and falls back to traversing the content to match and apply changes, except there are none.

Over time more and more segments are created and, as far as I can see, nodes that have no changes migrate to different segments. All these migrations are seen as changes and trigger content traversals.

The reason this escalates is that the incremental backup never updates the segment ids on the target instance; it only looks at content. So each incremental backup reports more and more changes and traverses the repository content simply because the segments have been restructured.

Thoughts?
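To make the failure mode concrete, here is a minimal sketch of the fast path and its fallback. It is not Oak's actual code: the class SegmentDiffSketch, the Node type, and the methods sameContent, traversed, rewrittenTo and sampleTree are all hypothetical names made up for illustration, and a plain string stands in for the segment id. It only shows that an unchanged tree whose nodes were rewritten into a new segment forces a full traversal.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SegmentDiffSketch {

    /** Hypothetical stand-in for a persisted node state; not Oak's real API. */
    static final class Node {
        final String segmentId;   // assumed: id of the segment holding this node
        final String value;       // the node's own content
        final Map<String, Node> children = new LinkedHashMap<>();

        Node(String segmentId, String value) {
            this.segmentId = segmentId;
            this.value = value;
        }

        Node child(String name, Node node) {
            children.put(name, node);
            return this;
        }

        /** Same content rewritten into another segment, as happens to unchanged nodes. */
        Node rewrittenTo(String newSegmentId) {
            Node copy = new Node(newSegmentId, value);
            children.forEach((name, c) -> copy.child(name, c.rewrittenTo(newSegmentId)));
            return copy;
        }
    }

    /** True if the contents are equal; visited[0] counts nodes compared the slow way. */
    static boolean sameContent(Node a, Node b, int[] visited) {
        if (a.segmentId.equals(b.segmentId)) {
            return true;          // fast path: ids match, the subtree is skipped
        }
        visited[0]++;             // slow path: compare content, then recurse
        if (!a.value.equals(b.value) || !a.children.keySet().equals(b.children.keySet())) {
            return false;
        }
        for (Map.Entry<String, Node> e : a.children.entrySet()) {
            if (!sameContent(e.getValue(), b.children.get(e.getKey()), visited)) {
                return false;
            }
        }
        return true;
    }

    /** Number of nodes the diff has to traverse to compare the two trees. */
    static int traversed(Node source, Node backup) {
        int[] visited = {0};
        sameContent(source, backup, visited);
        return visited[0];
    }

    /** Small four-node tree stored entirely in one segment. */
    static Node sampleTree(String segmentId) {
        return new Node(segmentId, "root")
                .child("a", new Node(segmentId, "A").child("a1", new Node(segmentId, "A1")))
                .child("b", new Node(segmentId, "B"));
    }

    public static void main(String[] args) {
        Node backup = sampleTree("seg-1");
        Node source = sampleTree("seg-1");
        System.out.println(traversed(source, backup));   // prints 0: fast path hit at the root

        Node flushed = source.rewrittenTo("seg-2");      // content unchanged, new segment
        System.out.println(traversed(flushed, backup));  // prints 4: every node is traversed
    }
}
```

In this toy model the second comparison finds no differences but still visits all four nodes, and because the backup side keeps its old ids, the same traversal would repeat on every later run.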
thanks,
alex