lucene-dev mailing list archives

From Doug Cutting
Subject RE: Getting word count
Date Fri, 19 Oct 2001 19:49:57 GMT
> From: Dmitry Serebrennikov
> If you are referring to the number of documents containing a
> particular term, that is available from IndexReader.termDocs(Term t).
> However, if it is anything more complex than a single term (like a
> phrase or some other query), I think the only way is to actually run a
> search on this query and get the length of the Hits object returned.

That's right.

> Slightly more efficient, but requiring a bit more work, is to create a
> HitCollector that uses a BitVector (see
> org.apache.lucene.util.BitVector) to mark off documents that the
> searcher finds. Afterwards you can get the count from the bit vector.
> This will skip over sorting that is done in the standard HitCollector.

You don't need the bit vector.  You can just count the number of times that
collect() is called.

> You cannot simply count the number of times the method collect() is
> called on your collector because some queries may result in the same
> document being selected more than once and so you'd end up with a
> double-count. (Can anyone confirm that this is the case?)

It should not be the case.  The collect() method should be called at most
once per document.
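
For illustration, a minimal sketch of such a counting collector might look
like the following. The HitCollector class here is a stand-in declared only
so the example compiles without Lucene on the classpath; its
collect(int doc, float score) signature is assumed to mirror
org.apache.lucene.search.HitCollector. The names CountingCollector and
getCount() are hypothetical.

```java
// Stand-in for org.apache.lucene.search.HitCollector (signature assumed),
// declared locally so this sketch compiles on its own.
abstract class HitCollector {
    public abstract void collect(int doc, float score);
}

// Hypothetical collector that counts matching documents by counting
// collect() calls. It keeps no BitVector and does no sorting.
class CountingCollector extends HitCollector {
    private int count = 0;

    @Override
    public void collect(int doc, float score) {
        // Relies on collect() being called at most once per document.
        count++;
    }

    public int getCount() {
        return count;
    }
}
```

With the real Lucene class in place of the stand-in, the collector would
presumably be passed to the searcher, along the lines of
searcher.search(query, collector), and getCount() read afterwards.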

