lucene-dev mailing list archives

From "Yonik Seeley" <>
Subject Re: Ferret's changes
Date Tue, 10 Oct 2006 17:15:17 GMT
On 10/10/06, David Balmain <> wrote:
> I did set maxBufferedDocs to 1000 and optimized both indices at the
> end, but I didn't use non-compound format. I think it is better to use
> compound file format as it is default in both libraries and the
> penalty will be similar in both cases.

When people care about performance, I always advise using the
non-compound format.  If the number of files gets too large, it's
better (in general) to decrease mergeFactor rather than switch to the
compound format.
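In code, the settings mentioned in this thread would look roughly like this against the Lucene 2.0-era API (a sketch only; the path and analyzer are placeholders, and method names may differ on the current trunk):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

// Configure an IndexWriter for indexing speed, as discussed above.
IndexWriter writer = new IndexWriter(
        FSDirectory.getDirectory("/tmp/bench-index", true),
        new StandardAnalyzer(), true);
writer.setUseCompoundFile(false); // non-compound format for performance
writer.setMaxBufferedDocs(1000);  // buffer more docs in RAM before flushing
writer.setMergeFactor(10);        // lower this if the file count grows too large
```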

Since we are benchmarking performance, we should use the
recommendations we would give to others who are trying to get the best
performance (like use the latest JVM, use -server, use a big enough
heap, increase maxBufferedDocs, etc).

> If you really like I can tell
> you what the difference is for my tests. Please feel free to tell me
> where else I can improve the Lucene benchmarker.

Where is the source?  I'd be most interested in testing it on the
current version of Lucene in the SVN trunk.

> > So is Ferret faster for searching too?  The absence of stats suggests
> > that it's not :-)
> :-) Well, I'd like to think the absence of stats for searching has
> nothing to do with Lucene being faster.

Does that mean it's an unknown?  You haven't tested it?

> For starters, the indexing
> time is a lot more noticeable to the user.

In general, I would have assumed the opposite.  I guess it depends a
lot on the usage patterns, but at CNET, indexing time is relatively
unimportant for most of our collections... as long as it keeps up with
document changes, it isn't too much of an issue.

Searches, on the other hand, are very important.  Some searches are
even done as part of the dynamic generation of page content, so the
latency of the search adds to the latency of the page as a whole!  In
other collections, throughput is most important, as long as most
searches take less than 1 second.  But since our searches are normally
CPU bound, there is normally a rather direct correlation between the
latency of a single request and the throughput of the system as a
whole.

When I've had to do performance work in the past, it's *always* been
on the search side.

> And benchmarking
> searching is a little more difficult. There are numerous Queries,
> Filters and Sorts to test and it's important to test with optimized
> and unoptimized indexes. Anyway, I'll attempt to put a search
> benchmark out tomorrow.

It doesn't have to be all or nothing... we could just start out with
some of the most common:
 - some single term queries
 - some multi term queries
 - some phrase queries
No filters, sort by relevance, take the top 50, don't retrieve stored fields.
Assuming there is no caching, putting these queries in a loop to get a
run that lasts several minutes would be good.
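A timing harness for such a loop might look like the following plain-Java sketch. The `Runnable` stands in for the actual search pass (e.g. running the query set through a Lucene or Ferret searcher); the class and method names here are made up for illustration:

```java
// Minimal benchmark loop: run a fixed set of queries repeatedly until a
// wall-clock deadline, then report how many passes completed.  With no
// caching, passes per unit time gives a throughput comparison.
public class SearchBench {
    public static long run(Runnable oneQueryPass, long durationMillis) {
        long deadline = System.currentTimeMillis() + durationMillis;
        long passes = 0;
        while (System.currentTimeMillis() < deadline) {
            oneQueryPass.run(); // stand-in for the real searches
            passes++;
        }
        return passes;
    }

    public static void main(String[] args) {
        long[] sink = new long[1]; // dummy work in place of real searches
        long passes = run(() -> sink[0]++, 200);
        System.out.println(passes + " passes in 200 ms");
    }
}
```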

In the future, it would be nice to test multiple clients (threads) at
once, since it more closely simulates a server environment.

One could also think about automating the creation of queries... find
the top terms in the corpus and use those terms to create random
queries.  Certainly not as realistic as using a real query log, but
it can be used for any corpus.
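One way to sketch that query generation in plain Java (the term-to-document-frequency map would come from the index, e.g. by walking Lucene's term dictionary; here it is simply passed in, and all names are illustrative):

```java
import java.util.*;
import java.util.stream.*;

// Pick the top-N terms by document frequency, then combine random pairs
// of them into two-term queries for benchmarking.
public class QueryGen {
    public static List<String> topTerms(Map<String, Integer> docFreq, int n) {
        return docFreq.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static String randomTwoTermQuery(List<String> terms, Random rnd) {
        return terms.get(rnd.nextInt(terms.size())) + " "
             + terms.get(rnd.nextInt(terms.size()));
    }
}
```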

-Yonik
Solr, the open-source Lucene search server
