Message-ID: <4AF830EA.30503@gmail.com>
Date: Mon, 09 Nov 2009 07:10:34 -0800
From: Michael Busch
To: java-dev@lucene.apache.org
Subject: Re: Questions about doc store files (.cfx)
References: <4AF7D019.9020406@gmail.com> <9ac0c6aa0911090256j2fc60ea9k53ac6fe939990ac5@mail.gmail.com>
In-Reply-To: <9ac0c6aa0911090256j2fc60ea9k53ac6fe939990ac5@mail.gmail.com>

On 11/9/09 2:56 AM, Michael McCandless wrote:
> I think you're asking about the benefit of using "shared doc stores" at
> all?
>
> CFX is just the compound format of these shared files; if compound
> file is off, then they are still shared, just as separate (.fdx/t,
> .tvx/d/f) files.

Oh yeah, that's true. I do mean the shared doc stores in general.

> For building up a single large index, I suspect the win is
> sizable, if you store fields and compute term vectors. You save a lot
> of IO not merging these files, within that one IndexWriter session.
>
> That said, the win is probably less than it used to be, now that we
> bulk-copy when merging these files. Previously, without bulk copy, it
> also consumed a lot of CPU to merge the files.
>
> And it's true that the gains only apply within one IW session, so I'd
> expect this means in practice when building a huge index from scratch
> you see sizable gains, but then when rolling smallish updates into the
> index over time, there's no real gain. Though that's something we could
> [alternatively] pursue improving (e.g. if we allowed a single segment to
> reference multiple doc stores).

Ok, thanks for clarifying.

> I do think keeping the IO cost down during merging is important;
> removing shared doc stores would be a step backwards (though,
> I agree, would simplify things).

Well, I was just wondering if you or anyone else had any numbers that
quantify the benefits of the shared stores. If it really helps a lot, I
agree it's a good thing to have them. But they do add a layer of
complexity to the code (and to the way one has to think about segments),
so if the win is smallish this might not be desirable.

Btw: I'm not trying to say it's required to remove them for parallel
indexing. It'd just be simpler without them. You can think about a
segmented parallel index as a matrix of segments, and about the shared
doc stores as merging multiple cells in a single row or column of a
spreadsheet. It'd be a bit easier if that wasn't possible and it always
was a true matrix.

 Michael

> Mike
>
> On Mon, Nov 9, 2009 at 3:17 AM, Michael Busch wrote:
>
>> Hi,
>>
>> I'm wondering about the benefits of having the .cfx files. The main
>> advantage is that you avoid merging (copying) stored fields and TermVectors
>> during segment merge, right? And I think .cfx files are only shared across
>> segments if the same IndexWriter is used to flush multiple segments and then
>> to commit all those segments in a single transaction. Then those segments
>> share the same .cfx file, correct? And in such a case .cfx files are also
>> not merged into .cfs files?
>>
>> How big is the win from using .cfx files, usually?
>> I'm wondering, because the
>> .cfx file is the only one that spans multiple segments and therefore
>> adds more complexity to the code. For parallel indexing it'd be nice to not
>> have those kinds of files that belong to multiple segments, especially when
>> we want to update certain fields.
>>
>> Michael

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org