lucene-dev mailing list archives

From "Charlie Zhao (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene
Date Sat, 28 Jul 2007 21:46:52 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516185 ]

Charlie Zhao commented on LUCENE-965:
-------------------------------------


Regarding how avgDL is computed, the patch does this:

+    float avgDL=0.0f;
+    for (int i=0; i<norms.length;i++) {
+        avgDL += normDecoder[norms[i] & 0xFF];
+    }
+    avgDL /= norms.length * 1.0f;

May I suggest an alternative?

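      // Comments assume norm = 1/sqrt(|D|), as in the toy example below.
      float CL = 0.0f;     // collection length: sum of recovered doc lengths
      float avgDL = 0.0f;
      float aDL = 0.0f;    // recovered length of the current doc
      for (int i=0; i<norms.length;i++) {
        aDL = 1.0f / normDecoder[norms[i] & 0xFF];  // 1/norm = sqrt(|D|)
        aDL *= aDL;                                 // (1/norm)^2 = |D|
        CL += aDL;
      }
      avgDL = CL / norms.length;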

Let us see a toy example:

2 docs in index

      |D|   avgD   |D|/avgD   norm   avgNorm   norm/avgNorm
D1    4     10     2/5        1/2    3/8       4/3
D2    16    10     8/5        1/4    3/8       2/3

norm/avgNorm is what the patch code computes, and it gives D1 > D2.

|D|/avgD is what the suggested alternative computes, and it gives D1 < D2.

The two approaches reverse the relative order of D1 and D2.
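
The toy example can be checked with a small standalone sketch. It assumes the norm is exactly 1/sqrt(|D|), with no boost and no loss from the byte encoding; the class and method names are made up for illustration and are not part of the patch or of Lucene.

```java
public class AvgDlToy {
    // Patch approach: average the norms themselves.
    public static double avgNorm(double[] norms) {
        double sum = 0.0;
        for (double n : norms) {
            sum += n;
        }
        return sum / norms.length;
    }

    // Suggested alternative: recover each length as (1/norm)^2, then average.
    public static double avgLengthFromNorms(double[] norms) {
        double sum = 0.0;
        for (double n : norms) {
            double inv = 1.0 / n;
            sum += inv * inv;
        }
        return sum / norms.length;
    }

    public static void main(String[] args) {
        // D1 has 4 terms, D2 has 16 terms (assumed norm = 1/sqrt(|D|)).
        double[] norms = { 1.0 / Math.sqrt(4), 1.0 / Math.sqrt(16) };
        double avgNorm = avgNorm(norms);
        double avgDL = avgLengthFromNorms(norms);
        for (double n : norms) {
            double len = 1.0 / (n * n);
            System.out.println("len=" + len
                + " norm/avgNorm=" + (n / avgNorm)
                + " |D|/avgD=" + (len / avgDL));
        }
    }
}
```

Under these assumptions, the first ratio is larger for the shorter document and the second is larger for the longer one, which is the reversal shown in the table.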

My impression of the Axiomatic Retrieval Function is that it still tries to penalize longer docs.
So maybe the alternative code is what we need?

By the same token, |D| != Similarity.decodeNorm(fieldNorms[doc]); the decoded norm would
have to be inverted and squared in the same way.

Note: since we are recovering them from the norm, avgDL and DL are not equal to their original
absolute values. But they suffice for scoring purposes.

Based on Doug's previous comment, I totally agree that avgDL should be pre-computed and cached
in the searcher, before the rubber meets the road. The cost might be invisible if
we warm up the searcher first. Thanks for explaining.
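
A minimal sketch of that caching idea, assuming one average per field, computed once and reused. AvgLengthCache is a hypothetical helper, not an existing Lucene class; the norms array and the normDecoder table are assumed to have the same shape as in the patch code above. Docs with a zero norm contribute nothing to the sum but still count in the divisor.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper: caches the recovered average doc length per field.
public class AvgLengthCache {
    private final ConcurrentMap<String, Float> cache =
        new ConcurrentHashMap<String, Float>();

    // Returns the cached average for the field, computing it on first use.
    public float get(String field, byte[] norms, float[] normDecoder) {
        Float cached = cache.get(field);
        if (cached != null) {
            return cached.floatValue();
        }
        float computed = compute(norms, normDecoder);
        Float previous = cache.putIfAbsent(field, Float.valueOf(computed));
        return previous != null ? previous.floatValue() : computed;
    }

    // Recovers each length as (1/norm)^2 and averages over all docs.
    static float compute(byte[] norms, float[] normDecoder) {
        if (norms.length == 0) {
            return 0.0f;
        }
        float total = 0.0f;
        for (int i = 0; i < norms.length; i++) {
            float norm = normDecoder[norms[i] & 0xFF];
            if (norm > 0.0f) {  // skip zero norms to avoid dividing by zero
                total += 1.0f / (norm * norm);
            }
        }
        return total / norms.length;
    }
}
```

Warming up the searcher would then amount to calling get once per searched field before the first query is scored.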

I am not sure where Doron implemented 1 / sqrt((1 - Slope) * Pivot + (Slope) * Doclen). Since
LUCENE-836 looks like it will be committed soon, I am really excited to see which similarity
function will prevail in this era.

BTW, would anyone like to share how to read Lucene patches more efficiently? I have a
hard time making sense of those +s and -s in isolation from their source files. Is there
a way to apply a patch to my local source repository, so I can diff with my favorite diff
tool? Thanks in advance.


> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval
> function, to replace the default similarity function in Lucene. We compared the performance
> of these two functions and reported the results at
> http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf.
> The report shows that the performance of the axiomatic retrieval function is much better
> than the default function. The axiomatic retrieval function is able to find more relevant
> documents and users can see more relevant documents in the top-ranked documents. Incorporating
> such a state-of-the-art retrieval function could improve the search performance of all the
> applications which were built upon Lucene.
> Most changes related to the implementation are made in AXSimilarity, TermScorer and
> TermQuery.java. However, many test cases are hand coded to test whether the implementation
> of the default function is correct. Thus, I also made the modification to many test files
> to make the new retrieval function pass those cases. In fact, we found that some old test
> cases are not reasonable. For example, in the testQueries02 of TestBoolean2.java,
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3".
> The second document should be more relevant than the first one, because it has more
> occurrences of the query term "w3". But the original test case would require us to rank
> the first document higher than the second one, which is not reasonable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

