lucene-dev mailing list archives

From "Doron Cohen (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect
Date Sun, 13 Sep 2009 18:47:57 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754743#action_12754743 ]

Doron Cohen commented on LUCENE-1908:
-------------------------------------

Thanks for reviewing this, Ted.

{quote}
the new text seems to say things like "the scoring function is like this (formula) except
that it isn't because it is really like this (other-formula) but that isn't really right either
because it is like this (still-another-formula) which actually isn't right because of fields
and <mumble>".
{quote}

I see what you mean. 

I tried to take the reader from the VSM to the actual elements computed and aggregated
in Lucene's scoring code. This would also answer a question asked several times on the lists:
"but what is the scoring model of Lucene?" It is not that straightforward to tell why a certain
method is called during scoring.

But I think you have a good point - the reader is told "this is the scoring formula" just
to discover 20 lines later that in fact "that is the formula", and then the same thing
happens again in another paragraph.

I think all 3 formulas are required; only the gluing text needs to improve. Better English
than mine might have helped here, but I'll give it a try. I think I know how to write
it better in this sense.

{quote}
There are also many small errors such as claiming that tf is proportional to term frequency
and idf is proportional to inverse of document frequency. Proportional means that there is
a linear relationship which is definitely not the case here. It would be better to say tf
usually increases with increasing term frequency, although occasionally a constant might be
used. IDF, on the other hand, decreases with increasing document frequency.
{quote}

I agree. "Proportional" is wrong. Thanks for catching this! In fact it also appears wrongly in
two other places in Similarity: idf() and idfExplain(). In these two places I think
replacing it with "related" would be correct, i.e. like this:

{noformat}
Note that Searcher.maxDoc() is used instead of
org.apache.lucene.index.IndexReader.numDocs() 
because it is related to Searcher.docFreq(Term) , 
i.e., when one is inaccurate, so is the other, and 
in the same direction.
{noformat}

For tf and idf I think this will do: (?)

{noformat}
Tf and Idf are described in more detail below, 
but for now, for completeness, let's just say that 
for given term t and document (or query) x, 
Tf(t,x) is related to the number of occurrences of 
term t in x - when one increases so does 
the other - and Idf(t) is similarly related to the 
inverse of the number of index documents 
containing term t. 
{noformat}
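
To make the difference between "proportional" and "related" concrete, here is a minimal standalone sketch. It assumes a square-root tf and a logarithmic idf, modeled on the defaults in DefaultSimilarity; the class name TfIdfSketch is hypothetical and the code does not use Lucene itself. It only shows that tf grows with term frequency and idf shrinks with document frequency, without either relationship being linear.

```java
// Standalone sketch, not Lucene code. The class name is hypothetical;
// the formulas are assumed to match DefaultSimilarity's defaults.
final class TfIdfSketch {

    // tf grows with the term frequency, but as a square root, not linearly.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // idf shrinks as more documents contain the term. numDocs plays the role
    // of the collection size (Searcher.maxDoc() in the javadoc text above).
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // Quadrupling the frequency only doubles tf: related, not proportional.
        System.out.println("tf(1) = " + tf(1f) + ", tf(4) = " + tf(4f));

        // A tenfold increase in docFreq lowers idf, but not by a factor of ten.
        System.out.println("idf(10, 1000)  = " + idf(10, 1000));
        System.out.println("idf(100, 1000) = " + idf(100, 1000));
    }
}
```

With these assumed formulas, tf(4) is twice tf(1) rather than four times, which is why "related" (monotonic) describes the behavior and "proportional" (linear) does not.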


> Similarity javadocs for scoring function to relate more tightly to scoring models in effect
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1908
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1908
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1908.patch, LUCENE-1908.patch, LUCENE-1908.patch
>
>
> See discussion in the related issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

