lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doug Cutting <DCutt...@grandcentral.com>
Subject RE: Efficient doc information retrieval.
Date Thu, 15 Nov 2001 17:23:27 GMT
> From: Winton Davies [mailto:wdavies@overture.com]
> 
>   Not really, all documents have an accountID, but I need to search 
> all the documents
> first, and each document that is returned has an accountID, but I 
> just want one document
> per accountID.
> 
> so:
> 
>   doc1 acc1
>   doc2 acc1
>   doc3 acc1
>   doc4 acc2
>   doc5 acc2
>   doc6 acc2
> 
> Lets say the query "X" returns hits in this order:
> 
>   doc1
>   doc2
>   doc3
>   doc4
>   doc5
> 
> what I want returned is:
> 
>   doc1  (best of acc1)
>   doc4  (best of acc2)

You might try something like:
  construct an int[] array that has an "account number" for each document
  construct a BitSet to keep track of whether you've seen a hit for each
account
  use a HitCollector that uses these as follows:
    if (!seen.get(accounts[doc])) {
      seen.set(accounts[doc]);
      collect(doc);
    }
That should be quite fast.

You'll want to construct the accounts array once when you open the index,
perhaps by enumerating some terms that store this information.  You'll need
to construct a new "seen" BitSet for each query, or clear a cached one.

Doug

--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message