lucene-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Proposal: extracting term-level stats from query process
Date Thu, 11 Mar 2004 17:56:04 GMT
markharw00d@yahoo.co.uk wrote:
> I think the TermScorer could be used to produce some useful feedback on performance of
> terms used in queries with the addition of some new methods:
> int getNumDocMatches();

Is this just IndexReader#docFreq(Term), or is it the sum of all of the 
TermDocs#freq() values for the term?
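
To make the two readings concrete, here is a minimal sketch in plain Java with no Lucene dependency. The postings map (document number to within-document frequency) and the class name MatchCounts are hypothetical stand-ins for what IndexReader and TermDocs would report.

```java
import java.util.Map;

final class MatchCounts {
    // Reading 1: the number of documents that contain the term,
    // which is what a docFreq-style count reports.
    static int docFreq(Map<Integer, Integer> postings) {
        return postings.size();
    }

    // Reading 2: the total number of occurrences of the term,
    // i.e. the sum of the per-document frequencies.
    static int totalFreq(Map<Integer, Integer> postings) {
        int sum = 0;
        for (int freq : postings.values()) {
            sum += freq;
        }
        return sum;
    }
}
```

For a term that occurs once in two documents and four times in a third, the first reading gives 3 and the second gives 6.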

> float getAverageScore();

Would the average really be that useful?  It could be the same for a term 
which has ten very strong matches and ninety very weak matches as for a 
term that has 100 middling matches.
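
A small illustration of that point, as a sketch only. The score values (0.95, 0.05, 0.14) and the class name ScoreSummary are made up for the example and are not real scorer output.

```java
final class ScoreSummary {
    // Arithmetic mean of the per-document scores for one term.
    static double mean(double[] scores) {
        double sum = 0.0;
        for (double s : scores) {
            sum += s;
        }
        return scores.length == 0 ? 0.0 : sum / scores.length;
    }

    // Highest per-document score; unlike the mean, this separates the two cases.
    static double max(double[] scores) {
        double best = 0.0;
        for (double s : scores) {
            best = Math.max(best, s);
        }
        return best;
    }

    // Hypothetical term A: ten very strong matches and ninety very weak ones.
    static double[] skewed() {
        double[] scores = new double[100];
        for (int i = 0; i < 100; i++) {
            scores[i] = i < 10 ? 0.95 : 0.05;
        }
        return scores;
    }

    // Hypothetical term B: one hundred middling matches.
    static double[] middling() {
        double[] scores = new double[100];
        java.util.Arrays.fill(scores, 0.14);
        return scores;
    }
}
```

Both arrays have a mean of about 0.14, so an average alone cannot tell them apart, while the maximum (or any other measure of spread) can.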

> These could be used in the following scenarios:
> * selecting which terms to offer spelling correction on (when numDocMatches==0)

Would the above be better than IndexReader#docFreq(Term) for this?
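
A sketch of the docFreq-based alternative, in plain Java. The map below is a hypothetical stand-in for term dictionary lookups such as IndexReader#docFreq(Term); the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class SpellingCandidates {
    // Returns the query terms that match no documents, i.e. those whose
    // document frequency is zero or that are missing from the dictionary.
    static List<String> unmatchedTerms(List<String> queryTerms,
                                       Map<String, Integer> docFreqByTerm) {
        List<String> unmatched = new ArrayList<String>();
        for (String term : queryTerms) {
            Integer docFreq = docFreqByTerm.get(term);
            if (docFreq == null || docFreq.intValue() == 0) {
                unmatched.add(term);
            }
        }
        return unmatched;
    }
}
```

Only the terms returned here would be offered for spelling correction, and finding them needs no scoring pass.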

> * influencing the highlighter selections (doc fragments scored based on contained term
> weights)

I don't see how the above would help here.  The ideal way to score 
fragments would be to create an index (e.g., using a RAMDirectory) of 
fragments, then search this with the query to find the top matches.  One 
can approximate this more efficiently by looking for fragments with a 
high density of query terms, perhaps taking idf's into account.
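
The cheaper approximation could look roughly like the sketch below. It is plain Java, not the highlighter's actual code: the tokenization (lowercase, split on non-word characters), the idf weights passed in, and the class name FragmentDensity are all assumptions made for illustration.

```java
import java.util.Map;

final class FragmentDensity {
    // Scores a fragment as the sum of the idf weights of the query terms
    // it contains, divided by the number of tokens in the fragment.
    static double score(String fragment, Map<String, Double> idfByQueryTerm) {
        String[] tokens = fragment.toLowerCase().split("\\W+");
        double weight = 0.0;
        int tokenCount = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split can yield a leading empty string
            }
            tokenCount++;
            Double idf = idfByQueryTerm.get(token);
            if (idf != null) {
                weight += idf.doubleValue();
            }
        }
        return tokenCount == 0 ? 0.0 : weight / tokenCount;
    }
}
```

Fragments are then ranked by this score, so a short fragment packed with rare query terms beats a long one that mentions a common term once.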

> * For "more like this" natural language type queries the highlighter could highlight
> only "significantly" scored terms and
> ignore low-scoring noise words.

The best method to identify significant words is with 
Similarity#idf(Term,Searcher).  Significant words have higher idfs, 
noise words have lower idfs.
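
As a rough illustration, the sketch below computes an idf from a document frequency and the index size. The formula, log(numDocs / (docFreq + 1)) + 1, is the one I recall the default Similarity using; treat that as an assumption and check the Similarity in use. The class name IdfSketch is invented for the example.

```java
final class IdfSketch {
    // Inverse document frequency: rare terms score high, common terms low.
    // Assumed formula: log(numDocs / (docFreq + 1)) + 1.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // True when the first term is rarer, and so more significant, than the second.
    static boolean moreSignificant(int docFreqA, int docFreqB, int numDocs) {
        return idf(docFreqA, numDocs) > idf(docFreqB, numDocs);
    }
}
```

Both inputs come from the term dictionary and the index size, so ranking terms this way does not require reading any postings.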

> I know it would be possible to derive all this information using existing 
> APIs but it would effectively involve another pass of the same index data.

Unless I am mistaken, I think most of what you're after can be 
accomplished with only another access to the term dictionary data, and 
does not require another pass over, e.g., the TermDocs.

Doug
