lucene-dev mailing list archives

From "David Balmain" <dbalmain...@gmail.com>
Subject Re: Ferret's changes
Date Wed, 11 Oct 2006 01:56:30 GMT
On 10/11/06, Ning Li <ning.li.li@gmail.com> wrote:
> On 10/10/06, Yonik Seeley <yonik@apache.org> wrote:
> > On 10/10/06, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:
> > > Hi,
> > >
> > > Maybe I missed it, but I was surprised that nobody here wondered about the
> > > algorithm and data structure changes that Dave Balmain made in Ferret, to make
> > > it go faster (than Java Lucene).
> >
> > Not using single doc segments for buffered docs has come up
> > http://www.nabble.com/-jira--Created%3A-%28LUCENE-565%29-Supporting-deleteDocuments-in-IndexWriter-%28Code-and-Performance-Results-Provided%29-tf1580652.html#a6177808
>
> After reading the interview article, I thought not using single doc
> segments contributed most of the indexing performance improvement. But
> in the mailing list discussion on "Global field semantics", Dave
> Balmain mentioned most of the indexing performance benefits come from
> having constant field numbers, which greatly optimizes the merging of
> term vectors and stored fields.
>
> Exactly how much performance improvement each of these two
> optimizations provides will depend on a workload. But in general, is
> one playing a more significant role than the other? What about for the
> benchmark workload Yonik pointed out at
> http://rubyforge.org/forum/forum.php?forum_id=9058 ?
>
> Cheers,
> Ning

Actually, not using single doc segments was only possible because I
have constant field numbers, so both optimizations stem from this one
change. So I'm not sure it is worth answering your question, but I'll
try anyway. It obviously depends on whether you are storing the fields
and term-vectors. Most Ferret users are indexing data from a database
and are only storing an id field and no term-vectors, so the biggest
optimization for them is the merge algorithm I'm using for term-infos.
On the other hand, if you want to highlight the fields (Ferret has a
very accurate highlighting algorithm that actually uses the queries to
get the exact terms and phrases matched), then you need to store the
field with term-vectors. In this case the merging of fields and
term-vectors is going to be a lot more important.
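
To illustrate why constant field numbers can make merging stored fields
cheaper, here is a toy sketch. It is not Ferret's or Lucene's actual code
or file format: the function names and the in-memory layout (a document as
a list of (field_number, value) pairs) are hypothetical and chosen only for
illustration. With per-segment numbering, every stored field has to be
looked up and renumbered during a merge; with index-wide numbering, the
documents can be appended as they are.

```python
# Toy model only: NOT Ferret's or Lucene's real code or on-disk format.
# A stored document is a list of (field_number, value) pairs.

def merge_with_remap(segments):
    """Merge segments whose field numbers are local to each segment.

    Each segment is (field_names, docs), where field_names[i] is the name
    that field number i refers to inside that segment only.
    """
    merged_names = []   # merged field number -> field name
    name_to_num = {}    # field name -> merged field number
    merged_docs = []
    for field_names, docs in segments:
        for doc in docs:
            new_doc = []
            for num, value in doc:
                # Every field of every document is translated by name.
                name = field_names[num]
                if name not in name_to_num:
                    name_to_num[name] = len(merged_names)
                    merged_names.append(name)
                new_doc.append((name_to_num[name], value))
            merged_docs.append(new_doc)
    return merged_names, merged_docs


def merge_with_constant_numbers(field_names, segments):
    """Merge segments that all share one index-wide field numbering.

    Each segment is just a list of docs; nothing needs renumbering.
    """
    merged_docs = []
    for docs in segments:
        merged_docs.extend(docs)  # append documents untouched
    return field_names, merged_docs


# Per-segment numbering: "id" is field 0 in seg_a but field 1 in seg_b.
seg_a = (["id", "body"], [[(0, "1"), (1, "hello")]])
seg_b = (["body", "id"], [[(1, "2"), (0, "world")]])
remapped_names, remapped_docs = merge_with_remap([seg_a, seg_b])

# Constant numbering: "id" is field 0 and "body" is field 1 everywhere.
const_names, const_docs = merge_with_constant_numbers(
    ["id", "body"],
    [[[(0, "1"), (1, "hello")]], [[(0, "2"), (1, "world")]]],
)
```

In a real index the same idea would apply to encoded bytes on disk rather
than Python lists, and the same reasoning would extend to term-vectors, but
those details are beyond this sketch.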

Cheers,
Dave


