Message-ID: <22373362.1185481383611.JavaMail.jira@brutus>
Date: Thu, 26 Jul 2007 13:23:03 -0700 (PDT)
From: "Doug Cutting (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene
In-Reply-To: <821213.1185415411038.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515850 ]

Doug Cutting commented on LUCENE-965:
-------------------------------------

> Did I miss something?

What I meant is that the loops added by this patch to compute average document length per query term could be more efficiently computed once per field in a searcher. They could thus be cached in, e.g., a WeakHashMap>. The cost of computing these is proportional to the size of the norms, which means that it is proportional to the cost of reading the norms. Computing them on demand when a searcher is opened would not be as fast as pre-computing them, but it might not be prohibitively slow either, and would be simple to implement without other changes to Lucene.

By "average norm" I guess I really meant "easily computable from norms". This may not always be possible, since, e.g., with boosting, document lengths may not be recoverable from the norms. But, in many cases, it might suffice.

Does that help?

> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to
> replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf.
> The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene.
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.
> However, many test cases are hand coded to test whether the implementation of the default function is correct. Thus, I also made the modification to many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java,
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3".
> The second document should be more relevant than the first one, because it has more
> occurrences of the query term "w3". But the original test case would require us to rank
> the first document higher than the second one, which is not reasonable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
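
For illustration only, the caching idea in the comment above (compute the average document length once per field per searcher and keep it in a WeakHashMap) might be sketched roughly as follows. This is not code from the patch or from Lucene. NormSource, decodedNorms, and AverageLengthCache are hypothetical names standing in for whatever exposes a field's norms. The sketch also assumes each norm is exactly 1/sqrt(length) with no boost folded in, which, as the comment notes, will not always hold.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

/** Hypothetical stand-in for a searcher/reader that exposes decoded norms per field. */
interface NormSource {
    float[] decodedNorms(String field);
}

/** Caches the average document length per field, keyed weakly on the norm source. */
final class AverageLengthCache {
    // Weak keys: an entry can be collected once its source is no longer referenced,
    // e.g. after a searcher has been closed and dropped.
    private final Map<NormSource, Map<String, Float>> cache = new WeakHashMap<>();

    /** Returns the cached average length, computing it on first request for a field. */
    synchronized float averageLength(NormSource source, String field) {
        Map<String, Float> perField = cache.computeIfAbsent(source, s -> new HashMap<>());
        Float cached = perField.get(field);
        if (cached == null) {
            // One pass over the norms; cost is proportional to the number of documents.
            cached = computeAverageLength(source.decodedNorms(field));
            perField.put(field, cached);
        }
        return cached;
    }

    /** Assumes norm = 1/sqrt(length), so length = 1/norm^2. Boosts would break this. */
    static float computeAverageLength(float[] norms) {
        if (norms.length == 0) {
            return 0f;
        }
        double sum = 0.0;
        for (float n : norms) {
            if (n > 0f) {
                sum += 1.0 / ((double) n * n);
            }
        }
        return (float) (sum / norms.length);
    }
}
```

A query-time scorer would then call averageLength(source, field) instead of looping over the norms for every query term. Only the first call per field pays the cost of the scan.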