Date: Sat, 6 Dec 2014 11:17:05 +0000
From: Daniel Shahaf
To: Mark Phippard
Cc: Thomas Harold, users@subversion.apache.org
Subject: Re: Efficiency of rep-sharing (deduplication) in 1.8 and later

Mark Phippard wrote on Fri, Sep 12, 2014 at 11:24:43 -0400:
> On Fri, Sep 12, 2014 at 11:17 AM, Thomas Harold wrote:
>
> > I have a question about how efficient SVN is at de-duplication within a
> > repository with regards to files that appear in multiple locations, but
> > which have the same content.
> >
> > I know a small improvement was made in 1.8...
> >
> > http://subversion.apache.org/docs/release-notes/1.8.html#fsfs-enhancements
> >
> > > When representation sharing has been enabled, Subversion 1.8 will now
> > > be able to detect files and properties with identical contents within
> > > the same revision and only store them once. This is a common
> > > situation when you for instance import a non-incremental dump file or
> > > when users apply the same change to multiple branches in a single
> > > commit.
> >
> > #1 - If a commit puts files A, B and C into the repository, and a later
> > commit puts files B, C and D into the repository at a different
> > location, is SVN smart enough to realize that B and C are already stored
> > in the repository?
> >
> > In other words, does it track each individual file separately, even if
> > they were all part of one big revision?
> >
>
> Representation cache is based on the sha of the rep. So it does not matter
> what the filename is or where it is stored. If it has the same sha as an
> existing rep, then it will be shared.
>
> The small improvement in 1.8 was simply to do this for files being added
> within the same revision, but the other scenario was already supported.
>
> I think it is worth pointing out that a rep is not necessarily a "file".
> It is the specific delta that SVN would be storing in the repository DB.

The sha1 of the rep itself doesn't matter. The rep-cache.db file is
a cache of (sha1 of fulltext ↦ location of rep generating that fulltext).

As to the idea of doing the sha1 at chunk level rather than at file
level: I suggest discussing that on dev@. Some backend devs might
otherwise miss the discussion.

Cheers,

Daniel
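
For illustration only, the mapping described above (sha1 of fulltext ↦
location of rep) can be sketched as a toy in-memory model. This is not
Subversion code and does not reflect how FSFS lays out rep-cache.db; the
ToyRepCache class, its store() method, the (revision, offset) location
tuple and the file paths are all hypothetical names made up for the sketch.
It replays the A/B/C then B/C/D scenario from the question.

```python
import hashlib


class ToyRepCache:
    """Toy model: sha1(fulltext) -> location of the rep producing it.

    Hypothetical illustration only; not the real rep-cache.db layout.
    """

    def __init__(self):
        self._locations = {}    # hex sha1 of fulltext -> (revision, offset)
        self._next_offset = {}  # revision -> next free offset (made up)

    def store(self, revision, fulltext):
        """Return (location, shared) for the given fulltext bytes."""
        # The key is the hash of the full file content, not of any delta,
        # and neither the path nor the filename is part of it.
        key = hashlib.sha1(fulltext).hexdigest()
        if key in self._locations:
            # Same content already stored somewhere: reuse that rep.
            return self._locations[key], True
        offset = self._next_offset.get(revision, 0)
        self._next_offset[revision] = offset + len(fulltext)
        self._locations[key] = (revision, offset)
        return (revision, offset), False


cache = ToyRepCache()
r1 = {"trunk/A": b"alpha\n", "trunk/B": b"bravo\n", "trunk/C": b"charlie\n"}
r2 = {"other/B": b"bravo\n", "other/C": b"charlie\n", "other/D": b"delta\n"}

results = {}
for revision, files in ((1, r1), (2, r2)):
    for path, fulltext in files.items():
        results[path] = cache.store(revision, fulltext)

for path, (location, shared) in results.items():
    print(path, location, "shared" if shared else "new")
```

In this toy model, other/B and other/C resolve to the locations recorded
in revision 1, and only other/D gets a new entry in revision 2.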