lucene-java-user mailing list archives

From "Kai Gulzau" <>
Subject RE: only getting Hits with score >= threshold
Date Tue, 17 May 2005 15:39:10 GMT
> -----Original Message-----
> From: Otis Gospodnetic []
> You could use HitCollector for this:

After playing around, I'm a bit stuck :-\

I use Lucene as a client/server application with the help of
RemoteSearchable and MultiSearcher.

My first approach was a client-side wrapper around Hits which only
delivers hits with a "good" score.

+ easy to implement
+ works on normalized scores
- poor performance

The test query was:
  (NAME:peter AUTHOR:peter^0.9 NAME_AUTHOR:peter^0.6 SUBTITLE:peter^0.2)

Because of "LANG_PRIO:100^0.0010", Lucene returned ~200,000 hits (~85% of the
documents have LANG_PRIO=100).

In the wrapper class I determine the real length() of Hits (excluding the docs
below myThresh) with a kind of binary search:

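  private int getLength(int nFrom, int nTo) {
    if (nFrom == nTo) return nFrom;
    int nHalf = nFrom + (nTo - nFrom) / 2;
    // Hits are sorted by descending score: if the middle hit is below the
    // threshold, the boundary is in the lower half, otherwise in the upper.
    if (score(nHalf) * 100 < myThresh) {
      return getLength(nFrom, nHalf);
    }
    return getLength(nHalf + 1, nTo);
  }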
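
The same boundary search can also be written as a loop. The sketch below is a stand-in under stated assumptions, not the wrapper's real code: it reads from a plain float[] that is already sorted by descending score (the order in which Hits delivers results) instead of calling score(n), and the names ThresholdLength and lengthAbove are made up for illustration.

```java
// Hypothetical helper: iterative form of the recursive getLength() idea.
class ThresholdLength {
    /**
     * Returns how many leading entries have a score >= thresh.
     * Assumption: scores are sorted in descending order.
     */
    static int lengthAbove(float[] scores, float thresh) {
        int lo = 0;
        int hi = scores.length;
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (scores[mid] < thresh) {
                hi = mid;      // boundary is at mid or before it
            } else {
                lo = mid + 1;  // mid still passes, boundary is after it
            }
        }
        return lo;
    }
}
```

Each probe of scores[mid] corresponds to one score(nHalf) call in the wrapper, so the number of probes grows with the logarithm of the hit count. What a single probe costs on the server is a separate matter.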

On the server side this results in 2 IndexSearcher calls:

  search($BooleanWeight2@4a6cbf, null, 100)
  search: 391ms
  search($BooleanWeight2@2c1e6b, null, 220420)
  search: 813ms

I think "getMoreDocs(int min)" doesn't work well with my queries, because it
prefetches too many TopDocs:

  int n = min * 2;  // double # retrieved

Additionally, "getMoreDocs()" scores all docs again on every call, so some of
the work has already been done in the first call.
It's a bit tricky to know in advance how many docs are needed :-\
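
To make the cost visible, here is a toy model of that doubling. It is not Lucene code. It assumes an initial request of min = 50 (inferred from the first logged call asking for 100 docs), that any access beyond the cached range triggers a fresh search for twice the requested index, and a total of 220420 hits. PrefetchModel and all of its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the doubling prefetch (an assumption-based sketch).
class PrefetchModel {
    private final int totalHits;
    private int cached = 0;
    // Size requested by each simulated server-side search call.
    final List<Integer> searchSizes = new ArrayList<Integer>();

    PrefetchModel(int totalHits, int initialMin) {
        this.totalHits = totalHits;
        fetch(initialMin);
    }

    private void fetch(int min) {
        if (cached > min) {
            min = cached;
        }
        int n = min * 2;  // double # retrieved, as in the line quoted above
        searchSizes.add(n);
        cached = Math.min(n, totalHits);
    }

    // Simulates reading the hit at the given index.
    void access(int index) {
        if (index >= cached) {
            fetch(index);
        }
    }
}
```

Under these assumptions, a single probe at the midpoint (index 110210) already asks the server for 220420 docs. That matches the second call in the log above.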

My second try was to use a ThresholdHitCollector.

When calling, filter, new ThresholdHitCollector(...));

I got the following exception:
java.rmi.MarshalException: error marshalling arguments; nested exception is:
	at sun.rmi.server.UnicastRef.invoke(Unknown Source)
	at Source)

My current approach is to call, filter);

on the client side and to subclass IndexSearcher on the server side.
The class MyIndexSearcher uses the ThresholdHitCollector:

  public TopDocs search(Weight weight, Filter filter, final int nDocs)
   throws IOException {
    // nDocs is ignored. return all TopDocs instead
    Scorer scorer = weight.scorer(getIndexReader());
    if (scorer == null) return new TopDocs(0, new ScoreDoc[0]);

    ThresholdHitCollector hc = new ThresholdHitCollector();


    return new TopDocs(hc.getTotalHits(), hc.getScoreDocs());
  }

  search($BooleanWeight2@1d3c6fd, null, 50)
  search: 234ms
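
For readers unfamiliar with the idea, a threshold collector along these lines might look like the sketch below. The real ThresholdHitCollector is not shown in this mail, so everything here is an assumption. HitCollector is a local stand-in that mirrors the collect(int doc, float score) callback so the block compiles without Lucene. The constructor argument and getDocs()/getScores() accessors are invented for illustration, and getTotalHits() is assumed to count only the docs that were kept.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-in so the sketch compiles without Lucene on the classpath.
// Assumption: it mirrors the collect(int doc, float score) callback.
abstract class HitCollector {
    public abstract void collect(int doc, float score);
}

// Hypothetical collector: keeps only docs whose raw (unnormalized)
// score reaches minScore.
class ThresholdHitCollector extends HitCollector {
    private final float minScore;
    private final List<Integer> docs = new ArrayList<Integer>();
    private final List<Float> scores = new ArrayList<Float>();

    ThresholdHitCollector(float minScore) {
        this.minScore = minScore;
    }

    @Override
    public void collect(int doc, float score) {
        if (score >= minScore) {
            docs.add(doc);
            scores.add(score);
        }
    }

    // Assumption: counts only the docs that passed the threshold.
    int getTotalHits() {
        return docs.size();
    }

    List<Integer> getDocs() {
        return docs;
    }

    List<Float> getScores() {
        return scores;
    }
}
```

Note that the docs come out in collection order rather than sorted by score, and the comparison sees raw scores only.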

Unfortunately this solution has 2 disadvantages:

- threshold works on raw scores
- Lucene has to be patched (access privileges, making Hits an interface, ...)
+ but: good performance (for me)

Is it possible to get normalized scores in HitCollector?
(e.g. via custom Similarity?)

Is it a good idea to patch Lucene for subclassing?

Oh oh, I hope somebody understands my weird mail ;)


	Kai Gulzau
