lucene-dev mailing list archives

From "Grant Ingersoll (JIRA)"
Subject [jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene
Date Thu, 26 Jul 2007 20:12:04 GMT


Grant Ingersoll commented on LUCENE-965:

I would not be in favor of a special term; I would rather see it integrated into the
file format somehow.  Special terms get deleted, misused, etc.  Plus, the avg. doc length is
going to need to be updated frequently, right?
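
To make the update cost concrete, here is a minimal sketch of how an average document
length could be maintained incrementally, assuming the index keeps a running token total
and a document count alongside its other statistics. The class and method names are
hypothetical and are not part of Lucene.

```java
/** Hypothetical helper, not a Lucene API: keeps a running average document length. */
final class AvgDocLengthTracker {
    private long totalTokens; // sum of the lengths of all live documents
    private long docCount;    // number of live documents

    /** Record a newly added document of the given length (in tokens). */
    void addDocument(int length) {
        if (length < 0) {
            throw new IllegalArgumentException("length must be non-negative");
        }
        totalTokens += length;
        docCount++;
    }

    /** Record the deletion of a document whose length was stored at add time. */
    void removeDocument(int length) {
        if (docCount == 0 || length < 0 || length > totalTokens) {
            throw new IllegalStateException("no matching document to remove");
        }
        totalTokens -= length;
        docCount--;
    }

    /** Average length over live documents, or 0.0 when the index is empty. */
    double average() {
        return docCount == 0 ? 0.0 : (double) totalTokens / docCount;
    }
}
```

With two counters like these, each add or delete is a constant-time update, which is one
reason to store them in the file format rather than in a special term.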

Since we are talking 3.x of Lucene fairly soon anyway (assuming the JDK 1.5 vote passes),
this would allow us to make the file format change as well, as long as we can still read prior

Charlie, as for your question about what users value in Lucene, speed or recall and precision,
the answer is both.  :-)  Some people care more about speed while others care about p/r. 
I think most people that use Lucene have the feeling that the results are good enough in production
environments and that we don't always worry about getting every last bit out of TREC (especially
since we can't, as a group, test against TREC).  That being said, I would bet most users would
be willing to trade off a few percentage points of speed in exchange for the kind of MAP improvements
we are talking here.  Especially since we probably can eventually figure out a way to make
it as fast anyway, or at least find other things we can speed up.

Correct me if I am wrong, but there are other IR strategies that can use the avg. doc. length,
too, right?  So, not to sidetrack too much, but if we do this right, maybe we can also open
up the door to other scoring strategies as well without much downside.  Just something to

> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>                 Key: LUCENE-965
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>         Attachments: axiomaticFunction.patch
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval
> function, to replace the default similarity function in Lucene. We compared the performance
> of these two functions and reported the results at
>
> The report shows that the performance of the axiomatic retrieval function is much better
> than the default function. The axiomatic retrieval function is able to find more relevant
> documents and users can see more relevant documents in the top-ranked documents. Incorporating
> such a state-of-the-art retrieval function could improve the search performance of all the
> applications which were built upon Lucene.
> Most changes related to the implementation are made in AXSimilarity, TermScorer and
> However, many test cases are hand coded to test whether the implementation of the default
> function is correct. Thus, I also made the modification to many test files to make the new
> retrieval function pass those cases. In fact, we found that some old test cases are not
> reasonable. For example, in the testQueries02 of,
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy
> The second document should be more relevant than the first one, because it has more
> occurrences of the query term "w3". But the original test case would require us to rank
> the first document higher than the second one, which is not reasonable.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
