Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Message-ID: <1855049494.1252858677545.JavaMail.jira@brutus>
Date: Sun, 13 Sep 2009 09:17:57 -0700 (PDT)
From: "Doron Cohen (JIRA)" <jira@apache.org>
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-1908) Similarity javadocs for scoring
 function to relate more tightly to scoring models in effect
In-Reply-To: <1529898827.1252698057805.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754716#action_12754716 ] 

Doron Cohen commented on LUCENE-1908:
-------------------------------------

{quote}
The intro to ir book appears to break it down so that you can explain it with the math (why going into the unit vector space favors longer docs) - but other work I am seeing says the math tells you no such thing, and its just comparing it to the computed relevancy curve that tells you its not great.
{quote}

To my (current) understanding it goes like this: normalizing all V(d)'s to unit vector is losing all information about lengths of documents. For a large document made by duplicating a smaller one this is probably ok. For a large document which just contains lots of "unique" text this is probably wrong. To solve this, a different normalization is sometimes preferred, one that would not normalize V(d) to the unit vector. (very much in line with what you (Mark) wrote above, finally I understand this...).

The pivoted length normalization which you mentioned is one different such normalization. Juru in fact is using this document length normalization. In our TREC experiments with Lucene we tried this approach (we modified Lucene indexing such that all require components were indexed as stored/cached fields and at search time we could try various scoring algorithms). It is interesting that pivoted length normalization did not work well - by our experiments - with Lucene for TREC.

The document length normalization of Lucene's DefaultSimilarity (DS) now seems to me - intuitively - not so good - at least for the previously mentioned two edge cases, where doc1 is made of N distinct terms, and doc2 is made of same N distinct terms, but its length is 2N because each term appears twice. For doc1 DS will normalize to the unit vector same as EN, and for doc2 DS will normalize to a vector larger than the unit vector. However I think the desired behavior is the other way around - for doc2 to be the same as EN, and for doc1 to be normalized to a vector larger than the unit vector.

Back to the documentation patch, again it seems wrong presenting as if both EU and some additional doc length normalization are required - fixed patch to follow...

> Similarity javadocs for scoring function to relate more tightly to scoring models in effect
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1908
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1908
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1908.patch, LUCENE-1908.patch
>
>
> See discussion in the related issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org