lucene-dev mailing list archives

From Dmitry Serebrennikov <dmit...@earthlink.net>
Subject Re: Getting word count
Date Fri, 19 Oct 2001 20:46:51 GMT
I see. This information is definitely available, but you'll have to 
extract it yourself. The key will be TermPositions enumerations that you 
can get for each term in your phrase. Then you'd walk down each of these 
TermPositions to find documents where all of the terms in your phrase 
occur. Then you'd look at the positions in which these terms occur and 
decide if they form a phrase or not. If so, you count a hit and move on. 
This is essentially what the PhraseQuery does.
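
The walk described above can be sketched in plain Java. This is only an illustration under stated assumptions: it does not use the Lucene API at all, the class PhraseOccurrenceCounter and its methods index, countPhrase and matchesAt are hypothetical names, and a small in-memory map (term -> document -> sorted positions) stands in for what the TermPositions enumerations would supply. Tokenization is naive whitespace splitting.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a positional index; not part of Lucene.
public class PhraseOccurrenceCounter {

    /** Builds term -> (docId -> ascending positions) from whitespace-tokenized text. */
    public static Map<String, Map<Integer, int[]>> index(List<String> docs) {
        Map<String, Map<Integer, int[]>> postings = new HashMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] tokens = docs.get(docId).trim().toLowerCase().split("\\s+");
            for (int pos = 0; pos < tokens.length; pos++) {
                Map<Integer, int[]> byDoc =
                        postings.computeIfAbsent(tokens[pos], t -> new HashMap<>());
                int[] old = byDoc.getOrDefault(docId, new int[0]);
                int[] grown = Arrays.copyOf(old, old.length + 1);
                grown[old.length] = pos; // positions arrive in ascending order
                byDoc.put(docId, grown);
            }
        }
        return postings;
    }

    /** Counts every occurrence of the exact phrase across all documents. */
    public static long countPhrase(Map<String, Map<Integer, int[]>> postings,
                                   String... phrase) {
        if (phrase.length == 0) {
            return 0;
        }
        Map<Integer, int[]> first = postings.get(phrase[0]);
        if (first == null) {
            return 0;
        }
        long total = 0;
        for (Map.Entry<Integer, int[]> entry : first.entrySet()) {
            int docId = entry.getKey();
            // Each position of the first term is a candidate phrase start.
            for (int start : entry.getValue()) {
                if (matchesAt(postings, phrase, docId, start)) {
                    total++;
                }
            }
        }
        return total;
    }

    /** True if term i of the phrase occurs at position start + i in the document. */
    private static boolean matchesAt(Map<String, Map<Integer, int[]>> postings,
                                     String[] phrase, int docId, int start) {
        for (int i = 1; i < phrase.length; i++) {
            Map<Integer, int[]> byDoc = postings.get(phrase[i]);
            if (byDoc == null) {
                return false;
            }
            int[] positions = byDoc.get(docId);
            if (positions == null || Arrays.binarySearch(positions, start + i) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, int[]>> postings = index(Arrays.asList(
                "personal computer sales rose",
                "a personal computer is a personal computer",
                "computer personal"));
        System.out.println(countPhrase(postings, "personal", "computer")); // prints 3
    }
}
```

Note that the total is a count of occurrences, not of documents: the second sample document contributes two hits, and the third contributes none because its terms are in the wrong order.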

Not sure what the best way to approach this would be. I think I would 
start by taking apart the Phrase Query and figuring out how it works. 
Probably the most elegant way to solve this would be to create a 
different type of query (some modification of the Phrase Query) that 
collects the information you want. This might even be useful for others.

You will find that queries must be in the lucene.search package. It is 
actually pretty easy to allow external query objects (you'd have to make 
changes to about 5 or 10 classes to make required methods and classes 
public or protected), but I don't know what this list's thinking will be 
on making this change. Anyone?

-Dmitry

julien.nioche@lingway.com wrote:

>Hello again, 
>
>Thanks for your answer, Dmitry. Indeed, simple terms would be too easy ;-) I also need
>to know the number of occurrences for exact phrases.
>
>The problem is that I do not want to count the number of documents but the total number of
>occurrences in the whole index. For example, I want to know how many times the exact
>phrase "personal computer" occurs across all the documents of the index.
>
>Counting the hits is not appropriate for this.
>
>Thanks a lot
>
>Julien
>
>
>>If you are referring to the number of documents containing a particular 
>>term, that is available from IndexReader.termDocs(Term t). However, if 
>>it is anything more complex than a single term (like a phrase or some 
>>other query), I think the only way is to actually run a search on this 
>>query and get the length of the Hits object returned. Slightly more 
>>efficient, but requiring a bit more work, is to create a HitCollector 
>>that uses a BitVector (see org.apache.lucene.util.BitVector) to mark off 
>>documents that the searcher finds. Afterwards you can get the count from 
>>the bit vector. This will skip over sorting that is done in the standard 
>>HitCollector. You cannot simply count the number of times the method 
>>collect() is called on your collector because some queries may result in 
>>the same document being selected more than once and so you'd end up with 
>>a double-count. (Can anyone confirm that this is the case?)
>>
>>Nioche, Julien wrote:
>>
>>>Hello All,
>>>
>>>I'm trying to get word-count information for exact phrases, i.e. to know
>>>how many times a given form occurs in the index. Does anyone know how I can
>>>do this in a clean way? 
>>>
>>>Does it require modifying the score() methods of the different Scorers? Or
>>>is this information already computed somewhere else?
>>>
>>>Thanks a lot for your help
>>>
>>>Julien Nioche
>>>

