lucene-dev mailing list archives

From "David Balmain" <dbalmain...@gmail.com>
Subject Re: Ferret's changes
Date Wed, 11 Oct 2006 06:53:57 GMT
On 10/11/06, Chuck Williams <chuck@manawiz.com> wrote:
> David Balmain wrote on 10/10/2006 03:56 PM:
> > Actually not using single doc segments was only possible due to the
> > fact that I have constant field numbers so both optimizations stem
> > from this one change. So I'm not sure if it is worth answering your
> > question but I'll try anyway. It obviously depends on whether you are
> > storing the fields and term-vectors. Most Ferret users are indexing data from
> > a database and are only storing an id field and no term-vectors so the
> > biggest optimization for them is the merge algorithm I'm using for
> > term-infos. On the other hand if you want to highlight the fields,
> > (Ferret has a very accurate highlighting algorithm that actually uses
> > the queries to get the exact terms and phrases matched) then you need
> > to store the field with term-vectors. In this case the merging of
> > fields and term-vectors is going to be a lot more important.
>
> Hi David,
>
> I use a rich global field model and use term vectors for fast accurate
> excerpting in Lucene.  Whether or not to store term vectors is the one
> index property that is not fixed in my model.  The reason is that my
> collections tend to contain a mix of many small email messages and a
> comparatively small number of much larger documents.  Term vectors are a
> significant advantage for excerpting large documents, but add no value
> and unnecessarily bloat the index for all the small emails.  I use a
> size threshold to only store term vectors when the body content of the
> field exceeds that threshold.

I personally would always store term vectors since I use a
StandardTokenizer and stemming. In this case, highlighting matches even
in small documents is not trivial. Ferret's highlighter correctly matches
even sloppy phrase queries and phrases with gaps between the terms.
I couldn't do this without the use of term vectors.

> Would your model in Ferret support that particular field variation?  Do
> you have an alternative representation to achieve similar benefits?  I
> suppose it would be possible for the single conceptual field 'body' to
> be represented with two physical fields 'smallBody' and 'largeBody'
> where the latter stores term vectors and the former does not.
>
> Chuck

If I really wanted to solve this problem I would use that solution. It
is pretty easy to search multiple fields when I need to. Ferret's
query language even supports it:

    smallBody|largeBody:"phrase to search for"
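
As a rough sketch of how the two-field split could be wired up on the
indexing side, the helper below routes a document body to one of the two
physical fields by size and builds the combined query shown above. The
class name, method names, and the 4096-character threshold are all
hypothetical; nothing here is Lucene or Ferret API, and the phrase is
assumed to contain no characters that need escaping.

```java
// Hypothetical helper, not part of Lucene or Ferret.
final class BodyFieldRouter {
    static final String SMALL = "smallBody";
    static final String LARGE = "largeBody";

    // Pick the physical field for a body of the given length.
    // Bodies that exceed the threshold go to the large field.
    static String physicalField(int bodyLength, int threshold) {
        return bodyLength > threshold ? LARGE : SMALL;
    }

    // Build a query string that searches both physical fields for one
    // phrase, using the field1|field2:"phrase" form shown above.
    // Assumes the phrase needs no escaping.
    static String phraseQuery(String phrase) {
        return SMALL + "|" + LARGE + ":\"" + phrase + "\"";
    }

    public static void main(String[] args) {
        int threshold = 4096; // assumed value, tune per collection
        System.out.println(physicalField(200, threshold));
        System.out.println(physicalField(50000, threshold));
        System.out.println(phraseQuery("phrase to search for"));
    }
}
```

Only the field that receives the large documents would then need term
vectors, while searches still treat 'body' as one conceptual field.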

In the end, I think the benefits of my model far outweigh the costs.
At least for me.

Dave


