From: Michael Busch
Date: Tue, 10 Nov 2009 10:18:21 -0800
To: java-dev@lucene.apache.org
Subject: Re: Questions about doc store files (.cfx)
Message-ID: <4AF9AE6D.80507@gmail.com>
In-Reply-To: <9ac0c6aa0911100157i7937b667o3d3baeb769ea7674@mail.gmail.com>

On 11/10/09 1:57 AM, Michael McCandless wrote:
>
>> I think this is exactly what happens? I wrote a small test program that
>> creates a situation like mentioned above in the "expungeDelete" scenario. It
>> ends up with a docstore containing docs from two segments, but after
>> expungeDeletes only one segment references the docstore. The non-deleted
>> docs from the other segment end up in a new segment, so they are twice on
>> disk (once orphaned in the old docstore, once in the new segment).
>> Is that the desired behavior?
>>
> Right this is what happens -- since segment C wasn't merged, it
> remains as the only segment still referencing the shared doc stores,
> and, yes, this does result in duplicate storage for some docs (until C
> is merged away). IFD keeps track of whether a given set of doc stores
> is still referenced.
>

OK, thanks for clarifying!

> I think in practice this should not result in too much duplication.
> If C is large, it's likely to have accumulated deletes as well. If C
> is small, it's likely to get merged away in the course of normal
> merging.
>

I agree - it shouldn't happen very often. I just wasn't sure what the
current behavior in this corner case was and wanted to understand it.

> But, if we are really concerned with it, we could modify the merge
> policy to bias its selection on this ("remove stores that are wasting
> too much space") basis.
>

I'm not too concerned, because I also don't think this should happen
very often.

> I think this makes the parallel index job's simpler, right? Ie, how
> the segments are sharing the stores within their own index does not
> restrict what merging is done.
>

Yes, exactly. It won't prevent us from keeping the parallel indexes
independent in this regard.

Then the compound (.cfx and .cfs) files are rather orthogonal to this.
I talked to Marvin at ApacheCon; in Lucy he wants to have all the
compound file support in the store package, separate from the indexer.
I think that would make sense in Lucene too; there's no real need to
have it tightly integrated into IndexWriter and SegmentMerger. We can
generalize the compound file concept further, so that with parallel
indexes the files can be selected in either direction for inclusion in
a compound file. E.g.
if we separated the inverted index and store, so that they are
logically two parallel index components, then the .cfx file as it works
now would contain files from two parallel index components (term
vectors from the inverted index, stored fields from the store). This is
fine if you don't want to update those components individually, and it
can remain this way for the default IndexWriter implementation. But if
we generalize the compound concept, then people can alter this behavior
to better suit their update requirements. I think this would actually
be a very clean design (even though it might sound complicated here).

> Mike
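
To make the expungeDeletes corner case discussed above concrete, here is
a minimal sketch in Java of reference-counting shared doc stores. It is
a toy model under stated assumptions, not Lucene's actual IFD code: the
class DocStoreRefSketch, its methods, and the segment and store names
(B, C, D, _0, _1) are all hypothetical and chosen only for illustration.
It shows why the old doc store stays on disk as long as segment C still
references it, and becomes deletable once C is merged away.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of shared doc store reference tracking (hypothetical, not Lucene code). */
public class DocStoreRefSketch {
    // doc store name -> number of live segments that reference it
    private final Map<String, Integer> refCounts = new HashMap<>();
    // segment name -> name of the doc store it reads stored fields from
    private final Map<String, String> segmentToStore = new HashMap<>();

    /** Registers a live segment and the doc store it references. */
    public void addSegment(String segment, String docStore) {
        segmentToStore.put(segment, docStore);
        refCounts.merge(docStore, 1, Integer::sum);
    }

    /**
     * Drops a segment, e.g. because it was merged away.
     * Returns true if its doc store is now unreferenced and could be deleted.
     */
    public boolean removeSegment(String segment) {
        String store = segmentToStore.remove(segment);
        if (store == null) {
            throw new IllegalArgumentException("unknown segment: " + segment);
        }
        int remaining = refCounts.merge(store, -1, Integer::sum);
        if (remaining == 0) {
            refCounts.remove(store);
            return true;
        }
        return false;
    }

    /** True while at least one live segment still references the doc store. */
    public boolean isReferenced(String docStore) {
        return refCounts.containsKey(docStore);
    }

    public static void main(String[] args) {
        DocStoreRefSketch tracker = new DocStoreRefSketch();
        // Segments B and C were flushed into the same shared doc store "_0".
        tracker.addSegment("B", "_0");
        tracker.addSegment("C", "_0");

        // expungeDeletes rewrites only B: its live docs are copied into a new
        // segment D with its own store "_1", and B is dropped.
        tracker.addSegment("D", "_1");
        boolean deletable = tracker.removeSegment("B");
        // C still points at "_0", so B's docs stay there as orphaned copies.
        System.out.println("_0 deletable after B is merged: " + deletable);

        // Once C is merged away too, nothing references "_0" any more.
        deletable = tracker.removeSegment("C");
        System.out.println("_0 deletable after C is merged: " + deletable);
    }
}
```

In this model the duplicated docs from B exist on disk twice between the
two removeSegment calls, which is the window described above.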