lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Doron Cohen (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene
Date Mon, 30 Jul 2007 21:42:52 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516547
] 

Doron Cohen commented on LUCENE-965:
------------------------------------

> Is there a way to plug in a patch into my local source repository, so I can diff with
my favorite diff tool?
: patch -p 0 < foo.patch  

Try with --dry-run first.
Another convenient way in case you are using Eclipse is the Subclipse plugin that lets you
visually diff patches just before applying them.

> But may I suggest the alternative? 

I think you have a valid point here. I too don't understand the proposed "Axiomatic Retrieval
Function" (ARF) in this regard: in Lucene, the norm value stored for a document (assuming
all boosts are 1) is
    norm(D) = 1 / sqrt(numTerms(D))
This value is ready to use at scoring time, multiplying it with  
    tf(t in d)  -   idf(t)^^2   
as described in http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/Similarity.html

Now, the ARF paper in http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf describes Lucene
scoring using |D| in place of norm(D) above, and describes ARF scoring using |D| again, the
same as it seems to be implemented in this patch e.g. in TermScorer. However, the paper defines
|D| as "the length of D". I find this confusing. Usually "|D|" really means the number of
words in a document, and "avgdl" would mean the average of all |D|'s in the collection (see
for instance "Okapi BM25" in Wikipedia). 

Now, your proposed change is something I can understand - it first translates back norm(D)
into Length(D) (ignoring boosts), and only then averaging it. 

In any case - I mean if either this is fixed or I am wrong and an explanation shows why no
fix is needed - I have to admit I still don't understand the logic behind ARF, intuitively,
why would it be better? Guess provable search quality results can help in persuading...  (LUCENE-836
is resolved btw).

> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval
function, to 
> replace the default similarity function in Lucene. We compared the performance of these
two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf.

> The report shows that the performance of the axiomatic retrieval function is much better
than the default function. The axiomatic retrieval function is able to find more relevant
documents and users can see more relevant documents in the top-ranked documents. Incorporating
such a state-of-the-art retrieval function could improve the search performance of all the
applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, TermScorer and TermQuery.java.
 However, many test cases are hand coded to test whether the implementation of the default
function is correct. Thus, I also made the modification to many test files to make the new
retrieval function pass those cases. In fact, we found that some old test cases are not reasonable.
For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 xx w2 yy
w3". 
> The second document should be more relevant than the first one, because it has more 
> occurrences of the query term "w3". But the original test case would require us to rank

> the first document higher than the second one, which is not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message