lucene-java-user mailing list archives

From Ype Kingma <>
Subject Re: raw hit count
Date Sun, 30 Nov 2003 21:17:34 GMT
Kent, Erik,

On Saturday 29 November 2003 17:20, Erik Hatcher wrote:
> I enjoy at least attempting to answer questions here, even if I'm half
> wrong, so by all means correct me if I misspeak....

Me too, :)

> On Saturday, November 29, 2003, at 06:37  PM, Kent Gibson wrote:
> > All I would like to know is how many times a query was
> > found in a particular document. I have no problems
> > getting the score from hits.score(). hits.length is
> > the number of times in total that the query was found,
> > however I want the number of times the query was
> > found on a document by document basis. is this
> > possible?

Could you be a bit more precise on what you mean
by 'the number of times the query was found'? For a single
query term, it is straightforward, but what about, e.g., a query for three
optional terms?

> The 'coord' factor used in computing the score is exactly this.  See
> the javadoc for it:
> Similarity.html#coord(int,%20int)

AFAIK, this overlap is the number of terms the document and the query
have in common.
For a query consisting of a single term, the overlap is always one,
and the number of times the query occurs in a document is the term frequency
in the document.
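
To make the difference concrete, here is a small sketch in plain Java that
computes both numbers for an already tokenized document. It does not use
Lucene at all; the class and method names (OverlapVsFrequency, overlap,
termFrequency) are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
class OverlapVsFrequency {
    // Number of distinct query terms that occur at least once in the document.
    static int overlap(List<String> docTokens, List<String> queryTerms) {
        Set<String> present = new HashSet<>(docTokens);
        int n = 0;
        for (String term : new HashSet<>(queryTerms)) {
            if (present.contains(term)) {
                n++;
            }
        }
        return n;
    }

    // Number of times a single term occurs in the document.
    static int termFrequency(List<String> docTokens, String term) {
        int n = 0;
        for (String token : docTokens) {
            if (token.equals(term)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("raw", "hit", "count", "hit", "hit");
        System.out.println(overlap(doc, Arrays.asList("hit")));          // 1
        System.out.println(termFrequency(doc, "hit"));                   // 3
        System.out.println(overlap(doc, Arrays.asList("hit", "score"))); // 1
    }
}
```

So for the single term "hit" the overlap stays at one, while the term
frequency is three.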

> You could implement a custom Similarity to capture the "overlap" or
> adjust the factor depending on what you're trying to accomplish.
> >  The only idea I have is to rerun the search,
> > but I can't even see how to run a search on only one
> > document!
> You could always rerun a search with a Filter with only one bit enabled
> and see if zero or one document is returned - that would be quite
> trivial and fast.
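
The one-bit filter idea can be sketched roughly as follows, assuming the
filter and the query matches are both represented as a java.util.BitSet over
document numbers. This is plain Java, not Lucene code; SingleDocFilterSketch,
onlyDoc and matches are hypothetical names. Note that it only tells you
whether the document matches, not how many times.

```java
import java.util.BitSet;

// Hypothetical illustration of a filter with exactly one bit enabled.
class SingleDocFilterSketch {
    // Build a bit set of size maxDoc in which only docId is allowed.
    static BitSet onlyDoc(int docId, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        bits.set(docId);
        return bits;
    }

    // True if the query matched docId, i.e. the filtered result has one document.
    static boolean matches(BitSet queryMatches, int docId, int maxDoc) {
        BitSet allowed = onlyDoc(docId, maxDoc);
        allowed.and(queryMatches); // keep only documents that are allowed AND matched
        return !allowed.isEmpty();
    }

    public static void main(String[] args) {
        BitSet queryMatches = new BitSet(10);
        queryMatches.set(2);
        queryMatches.set(7);
        System.out.println(matches(queryMatches, 7, 10)); // true
        System.out.println(matches(queryMatches, 3, 10)); // false
    }
}
```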

You could also implement a Similarity that ignores the total number
of terms in the searched document field; see lengthNorm() in the Similarity javadoc.
As lengthNorm() is applied at indexing time, you will have to reindex
for this to work for you.
At query time you can then use a tf() implementation that is linear, instead
of the default square root in DefaultSimilarity, and a constant idf(),
instead of the default log of the inverse document frequency.
You should then get a document score that is proportional
to the number of query terms in the document.
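
A rough sketch of that idea, untested against a real index: the class below
only mirrors the hooks named above (tf, idf, lengthNorm) as plain Java, so it
runs without Lucene on the classpath. The names RawCountScoring and
RawCountDemo, the score() helper and the "contents" field name are made up
for illustration; with Lucene you would instead override the corresponding
methods in your own Similarity and use it at both indexing and query time.

```java
// Hypothetical stand-in for the Similarity hooks discussed above.
// Only the method names tf, idf and lengthNorm come from Lucene.
class RawCountScoring {
    // Linear term frequency instead of the default square root.
    float tf(float freq) {
        return freq;
    }

    // Constant idf instead of the log of the inverse document frequency.
    float idf(int docFreq, int numDocs) {
        return 1.0f;
    }

    // Ignore the total number of terms in the field.
    float lengthNorm(String fieldName, int numTokens) {
        return 1.0f;
    }

    // Simplified per-document score: sum over query terms of tf * idf * norm.
    // Real Lucene scoring also applies coord, the query norm and boosts.
    float score(int[] termFreqs, int[] docFreqs, int numDocs, int numTokens) {
        float norm = lengthNorm("contents", numTokens);
        float sum = 0f;
        for (int i = 0; i < termFreqs.length; i++) {
            sum += tf(termFreqs[i]) * idf(docFreqs[i], numDocs) * norm;
        }
        return sum;
    }
}

class RawCountDemo {
    public static void main(String[] args) {
        RawCountScoring s = new RawCountScoring();
        // Two query terms occurring 3 and 2 times in a 500-token field.
        float score = s.score(new int[] {3, 2}, new int[] {10, 40}, 1000, 500);
        System.out.println(score); // prints 5.0
    }
}
```

With these three factors flattened, the sketch's score depends only on how
often the query terms occur, not on field length or on how rare the terms are.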

Kind regards,
