Date: Wed, 11 Oct 2006 15:53:57 +0900
From: "David Balmain"
To: java-dev@lucene.apache.org
Subject: Re: Ferret's changes
In-Reply-To: <452C8F98.8060301@manawiz.com>
References: <20061010082407.23341.qmail@web50302.mail.yahoo.com> <452C8F98.8060301@manawiz.com>

On 10/11/06, Chuck Williams wrote:
> David Balmain wrote on 10/10/2006 03:56 PM:
> > Actually not using single doc segments was only possible due to the
> > fact that I have constant field numbers so both optimizations stem
> > from this one change. So I'm not sure if it is worth answering your
> > question but I'll try anyway. It obviously depends if you are storing
> > the fields and term-vectors. Most Ferret users are indexing data from
> > a database and are only storing an id field and no term-vectors so the
> > biggest optimization for them is the merge algorithm I'm using for
> > term-infos. On the other hand if you want to highlight the fields,
> > (Ferret has a very accurate highlighting algorithm that actually uses
> > the queries to get the exact terms and phrases matched) then you need
> > to store the field with term-vectors. In this case the merging of
> > fields and term-vectors is going to be a lot more important.
>
> Hi David,
>
> I use a rich global field model and use term vectors for fast accurate
> excerpting in Lucene. Whether or not to store term vectors is the one
> index property that is not fixed in my model. The reason is that my
> collections tend to contain a mix of many small email messages and a
> comparatively small number of much larger documents. Term vectors are a
> significant advantage for excerpting large documents, but add no value
> and unnecessarily bloat the index for all the small emails.
> I use a size threshold to only store term vectors when the body
> content of the field exceeds that threshold.

I personally would always store term vectors since I use a
StandardTokenizer and stemming. In this case highlighting matches in
small documents is not trivial. Ferret's highlighter correctly matches
even sloppy phrase queries and phrases with gaps between the terms. I
couldn't do this without term vectors.

> Would your model in Ferret support that particular field variation? Do
> you have an alternative representation to achieve similar benefits? I
> suppose it would be possible for the single conceptual field 'body' to
> be represented with two physical fields 'smallBody' and 'largeBody'
> where the former stores term vectors and the latter does not.
>
> Chuck

If I really wanted to solve this problem I would use this solution. It
is pretty easy to search multiple fields when I need to. Ferret's query
language even supports it:

    smallBody|largeBody:"phrase to search for"

In the end, I think the benefits of my model far outweigh the costs.
For me at least, anyway.

Dave

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
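
The two-physical-field idea discussed in this message can be sketched in a few lines. This is only an illustration in plain Python, not Ferret or Lucene code: `FieldSpec`, `body_field`, `expand_multi_field` and the threshold value are hypothetical names and numbers. It assumes, following the size-threshold description, that term vectors are kept only for bodies above the threshold. The second helper merely mimics what the `smallBody|largeBody:` prefix means (run the same query against each listed field); it is not Ferret's query parser.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed cutoff in characters; the thread does not give a value.
SIZE_THRESHOLD = 2000


@dataclass
class FieldSpec:
    """Hypothetical description of one physical field to be indexed."""
    name: str
    text: str
    store_term_vector: bool


def body_field(text: str, threshold: int = SIZE_THRESHOLD) -> FieldSpec:
    """Route the conceptual 'body' field to one of two physical fields.

    Bodies longer than the threshold go to 'largeBody' with term vectors;
    everything else goes to 'smallBody' without them (an assumption).
    """
    if len(text) > threshold:
        return FieldSpec("largeBody", text, store_term_vector=True)
    return FieldSpec("smallBody", text, store_term_vector=False)


def expand_multi_field(query: str) -> List[Tuple[str, str]]:
    """Split 'a|b:query' into [(a, query), (b, query)].

    Only the field-list prefix is handled; the rest of the query is
    passed through untouched.
    """
    fields, sep, rest = query.partition(":")
    if not sep:
        raise ValueError("expected 'field[|field...]:query'")
    return [(name, rest) for name in fields.split("|")]
```

With this split every field name keeps a fixed set of index properties, which is the constraint the constant-field-number model imposes, and a search simply targets both physical fields at once.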