Date: Wed, 11 Oct 2006 15:53:57 +0900
From: "David Balmain" <dbalmain.ml@gmail.com>
To: java-dev@lucene.apache.org
Subject: Re: Ferret's changes

On 10/11/06, Chuck Williams <chuck@manawiz.com> wrote:
> David Balmain wrote on 10/10/2006 03:56 PM:
> > Actually not using single doc segments was only possible due to the
> > fact that I have constant field numbers so both optimizations stem
> > from this one change. So I'm not sure if it is worth answering your
> > question but I'll try anyway. It obviously depends if you are storing
> > the fields and term-vectors. Most Ferret users are indexing data from
> > a database and are only storing an id field and no term-vectors so the
> > biggest optimization for them is the merge algorithm I'm using for
> > term-infos. On the other hand if you want to highlight the fields,
> > (Ferret has a very accurate highlighting algorithm that actually uses
> > the queries to get the exact terms and phrases matched) then you need
> > to store the field with term-vectors. In this case the merging of
> > fields and term-vectors is going to be a lot more important.
>
> Hi David,
>
> I use a rich global field model and use term vectors for fast accurate
> excerpting in Lucene.  Whether or not to store term vectors is the one
> index property that is not fixed in my model.  The reason is that my
> collections tend to contain a mix of many small email messages and a
> comparatively small number of much larger documents.  Term vectors are a
> significant advantage for excerpting large documents, but add no value
> and unnecessarily bloat the index for all the small emails.  I use a
> size threshold to only store term vectors when the body content of the
> field exceeds that threshold.

I personally would always store term vectors, since I use a
StandardTokenizer and stemming. With stemming, the indexed terms no
longer match the original text exactly, so highlighting matches is not
trivial even in small documents. Ferret's highlighter correctly matches
even sloppy phrase queries and phrases with gaps between the terms. I
couldn't do this without term vectors.
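
A toy Python sketch of why positions and offsets help here. This is
not Ferret's highlighter: the crude stemmer, the function names, and
the slop rule (extra positions allowed between in-order terms) are all
simplifications invented for the illustration.

```python
import re

def stem(word):
    # Crude stand-in for a real stemmer (assumption, illustration only).
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def build_term_vector(text):
    # Map each stemmed term to (position, start_offset, end_offset) entries.
    vector = {}
    for pos, match in enumerate(re.finditer(r"\w+", text)):
        term = stem(match.group().lower())
        vector.setdefault(term, []).append((pos, match.start(), match.end()))
    return vector

def phrase_spans(vector, phrase, slop=0):
    # Return character spans for in-order matches of the phrase terms,
    # allowing at most `slop` extra positions between first and last term.
    terms = [stem(word.lower()) for word in phrase.split()]
    if not terms:
        return []
    spans = []
    for first in vector.get(terms[0], []):
        chain = [first]
        for term in terms[1:]:
            prev_pos = chain[-1][0]
            # Earliest occurrence after the previous term keeps the match tight.
            nxt = next((e for e in vector.get(term, []) if e[0] > prev_pos), None)
            if nxt is None:
                break
            chain.append(nxt)
        else:
            extra = (chain[-1][0] - chain[0][0]) - (len(terms) - 1)
            if extra <= slop:
                spans.append([(start, end) for _, start, end in chain])
    return spans

text = "Searching for phrases in small documents"
vector = build_term_vector(text)
matches = phrase_spans(vector, "search phrase", slop=1)
```

The query terms "search" and "phrase" never appear literally in the
text, but the stored offsets still point back at "Searching" and
"phrases" in the original string, which is what a highlighter needs.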

> Would your model in Ferret support that particular field variation?  Do
> you have an alternative representation to achieve similar benefits?  I
> suppose it would be possible for the single conceptual field 'body' to
> be represented with two physical fields 'smallBody' and 'largeBody'
> where the latter stores term vectors and the former does not.
>
> Chuck

If I really wanted to solve this problem, I would use this solution.
It is pretty easy to search multiple fields when I need to. Ferret's
query language even supports it:

    smallBody|largeBody:"phrase to search for"
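
A rough Python sketch of how the split could be wired up, following
Chuck's description that only bodies above a size threshold get term
vectors. It calls neither Ferret nor Lucene: the threshold value and
both helper names are hypothetical, and only the field names and the
query form come from this thread.

```python
# Hypothetical cut-off in characters; the thread does not give a value.
TERM_VECTOR_THRESHOLD = 2000

def body_field(body, threshold=TERM_VECTOR_THRESHOLD):
    # Pick the physical field for a body and whether to store term vectors.
    if len(body) > threshold:
        return "largeBody", True
    return "smallBody", False

def multi_field_phrase_query(fields, phrase):
    # Build a query string of the form field1|field2:"phrase".
    # Assumes the phrase itself contains no double quotes.
    return '{}:"{}"'.format("|".join(fields), phrase)

query = multi_field_phrase_query(["smallBody", "largeBody"],
                                 "phrase to search for")
```

The indexing side decides once per document which physical field to
use, and the search side always queries both, so callers can keep
treating 'body' as a single conceptual field.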

In the end, I think the benefits of my model far outweigh the costs,
for me at least.

Dave
