Date: Thu, 11 Mar 2004 11:37:55 GMT
Message-Id: <200403111137.i2BBbtdA006874@server0027.freedom2surf.net>
From: markharw00d@yahoo.co.uk
To: lucene-dev@jakarta.apache.org
Subject: Proposal: extracting term-level stats from query process

I think the TermScorer could be used to produce some useful feedback on the performance of terms used in queries, with the addition of some new methods:

    int getNumDocMatches();
    float getAverageScore();

These could be used in the following scenarios:

* selecting which terms to offer spelling correction on (when numDocMatches==0)
* influencing the highlighter selections (doc fragments scored based on contained term weights)
* for "more like this" natural-language-type queries, the highlighter could highlight only "significantly"
scored terms and ignore low-scoring noise words.

The stats accumulation code that would need adding to TermScorer would add negligible overhead, but the main issue would be how to expose the TermScorer object to users. I had initially planned to do all of this with a new class that required no Lucene changes. That would have looked like this:

    // wrap normal query in a new query
    ProfilerQuery pq = new ProfilerQuery(anyLuceneQuery);
    // run query as normal
    searcher.search(pq...)
    // analyze results
    ProfiledTermStats[] ts = pq.getTermStats();
    for(int i=0;i
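
To make the "stats accumulation" idea concrete, here is a rough, Lucene-free sketch of what the per-term bookkeeping might look like. Only getNumDocMatches() and getAverageScore() come from the proposal above. The TermStats class, its record() method, and the spellingCandidates() helper are hypothetical names invented for illustration, and how a real scorer would call record() is left open.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-term accumulator; not part of Lucene. */
final class TermStats {
    private final String term;
    private int numDocMatches;
    private float scoreSum;

    TermStats(String term) {
        this.term = term;
    }

    /** Assumed hook: called once for each document the term matches. */
    void record(float score) {
        numDocMatches++;
        scoreSum += score;
    }

    String getTerm() {
        return term;
    }

    int getNumDocMatches() {
        return numDocMatches;
    }

    /** Mean score over matching documents, or 0 if nothing matched. */
    float getAverageScore() {
        return numDocMatches == 0 ? 0f : scoreSum / numDocMatches;
    }
}

final class TermStatsDemo {
    /** Terms that matched no documents: candidates for spelling correction. */
    static List<String> spellingCandidates(List<TermStats> stats) {
        List<String> candidates = new ArrayList<>();
        for (TermStats ts : stats) {
            if (ts.getNumDocMatches() == 0) {
                candidates.add(ts.getTerm());
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        TermStats good = new TermStats("lucene");
        good.record(0.5f);
        good.record(1.5f);
        TermStats typo = new TermStats("lucnee"); // never matched

        List<TermStats> all = new ArrayList<>();
        all.add(good);
        all.add(typo);

        System.out.println(good.getNumDocMatches() + " matches, average " + good.getAverageScore());
        System.out.println("spelling candidates: " + spellingCandidates(all));
    }
}
```

The per-match cost here is one increment and one float addition, which fits the "negligible overhead" expectation, though that is an observation about this sketch and not a measurement of TermScorer.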