lucene-dev mailing list archives

Subject Re: Getting word count
Date Fri, 19 Oct 2001 19:56:19 GMT
Hello again, 

Thanks for your answer, Dmitry. Indeed, simple terms would be too easy ;-) I also need to
know the number of occurrences of exact phrases.

The problem is that I do not want to count the number of documents, but the total number of
occurrences across the whole index. For example, I want to know how many times the exact
phrase "personal computer" occurs in all the documents of the index.

Counting the hits is not appropriate for this, because a document counts as a single hit
no matter how many times the phrase occurs in it.
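
To make the difference concrete, here is a minimal sketch in plain Java. It does not use
any Lucene API: the class name, the method names and the naive tokenizer are all
hypothetical, and the documents are held in memory as strings. It only illustrates that
the hit count (documents containing the phrase) and the total occurrence count are two
different numbers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; this is NOT Lucene code.
class PhraseOccurrenceSketch {

    // Naive tokenizer (an assumption): lowercase, split on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Number of positions in one document where the exact phrase starts.
    // Overlapping matches are counted separately.
    static int occurrencesInDoc(String doc, String phrase) {
        List<String> docTokens = tokenize(doc);
        List<String> phraseTokens = tokenize(phrase);
        int n = phraseTokens.size();
        if (n == 0) {
            return 0;
        }
        int count = 0;
        for (int start = 0; start + n <= docTokens.size(); start++) {
            if (docTokens.subList(start, start + n).equals(phraseTokens)) {
                count++;
            }
        }
        return count;
    }

    // What counting hits gives: documents with at least one occurrence.
    static int documentCount(List<String> docs, String phrase) {
        int count = 0;
        for (String doc : docs) {
            if (occurrencesInDoc(doc, phrase) > 0) {
                count++;
            }
        }
        return count;
    }

    // What is actually wanted: occurrences summed over all documents.
    static int totalOccurrences(List<String> docs, String phrase) {
        int total = 0;
        for (String doc : docs) {
            total += occurrencesInDoc(doc, phrase);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "A personal computer is a computer. My personal computer is old.",
            "Personal computer sales rose.",
            "No match here: computer personal.");
        System.out.println(documentCount(docs, "personal computer"));    // prints 2
        System.out.println(totalOccurrences(docs, "personal computer")); // prints 3
    }
}
```

With these three sample documents the hit count is 2 while the phrase occurs 3 times,
which is the gap described above.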

Thanks a lot


> If you are referring to the number of documents containing a particular 
> term, that is available from IndexReader.termDocs(Term t). However, if 
> it is anything more complex than a single term (like a phrase or some 
> other query), I think the only way is to actually run a search on this 
> query and get the length of the Hits object returned. Slightly more 
> efficient, but requiring a bit more work, is to create a HitCollector 
> that uses a BitVector (see org.apache.lucene.util.BitVector) to mark off 
> documents that the searcher finds. Afterwards you can get the count from 
> the bit vector. This will skip over sorting that is done in the standard 
> HitCollector. You cannot simply count the number of times the method 
> collect() is called on your collector because some queries may result in 
> the same document being selected more than once and so you'd end up with 
> a double-count. (Can anyone confirm that this is the case?)
>
> Nioche, Julien wrote:
> >Hello All,
> >
> >I'm trying to get word count information for exact phrases, i.e. to know
> >how many times a given form occurs in the index. Does anyone know how I can
> >do this in a clean way?
> >
> >Does it require modifying the score() methods of the different Scorers? Or
> >is this information already computed somewhere else?
> >
> >Thanks a lot for your help
> >
> >Julien Nioche
> >
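
The bit-vector idea in the quoted reply can be sketched as follows. This is a stand-in
under stated assumptions, not Lucene code: it uses java.util.BitSet in place of
org.apache.lucene.util.BitVector, and the class with its collect(int, float) method is
a hypothetical imitation of a hit collector. Whether a searcher really reports the same
document more than once is left open in the thread; the sketch only shows why marking
bits stays correct if it does. Note that it still counts documents, not occurrences.

```java
import java.util.BitSet;

// Hypothetical stand-in for a collector; this is NOT the Lucene HitCollector.
class DistinctDocCounterSketch {
    private final BitSet seen;      // one bit per document number
    private int collectCalls = 0;   // raw number of collect() invocations

    DistinctDocCounterSketch(int maxDoc) {
        this.seen = new BitSet(maxDoc);
    }

    // Called once per reported match; the score is ignored on purpose.
    void collect(int doc, float score) {
        collectCalls++;
        seen.set(doc);  // setting an already-set bit changes nothing
    }

    // Number of distinct documents marked so far.
    int distinctDocs() {
        return seen.cardinality();
    }

    int collectCalls() {
        return collectCalls;
    }

    public static void main(String[] args) {
        DistinctDocCounterSketch c = new DistinctDocCounterSketch(10);
        c.collect(0, 1.0f);
        c.collect(3, 0.5f);
        c.collect(3, 0.4f);  // same document reported twice
        c.collect(7, 0.2f);
        System.out.println(c.collectCalls());  // prints 4
        System.out.println(c.distinctDocs());  // prints 3
    }
}
```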
