lucene-dev mailing list archives

From "Doron Cohen (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect
Date Sun, 13 Sep 2009 16:17:57 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754716#action_12754716 ]

Doron Cohen commented on LUCENE-1908:
-------------------------------------

{quote}
The intro to IR book appears to break it down so that you can explain it with the math (why
going into the unit vector space favors longer docs) - but other work I am seeing says the
math tells you no such thing, and it's just comparing it to the computed relevancy curve that
tells you it's not great.
{quote}

To my (current) understanding it goes like this: normalizing all V(d)'s to unit vectors loses
all information about the lengths of documents. For a large document made by duplicating a
smaller one this is probably OK. For a large document which simply contains lots of "unique"
text this is probably wrong. To solve this, a different normalization is sometimes preferred,
one that does not normalize V(d) to the unit vector. (Very much in line with what you (Mark)
wrote above; I finally understand this...)
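A tiny sketch (mine, not from the patch) of why unit-vector normalization discards length:
taking V(d) to be the raw term-frequency vector, a document and a copy of it with every term
duplicated collapse to the identical unit vector.

```python
import math

def unit_normalize(tf):
    """Scale a term-frequency vector to Euclidean length 1."""
    length = math.sqrt(sum(f * f for f in tf.values()))
    return {t: f / length for t, f in tf.items()}

doc = {"a": 1, "b": 1, "c": 1}
doubled = {t: 2 * f for t, f in doc.items()}  # the same doc, duplicated

# Both normalize to the same unit vector: the length signal is gone.
same = all(math.isclose(unit_normalize(doc)[t], unit_normalize(doubled)[t])
           for t in doc)
print(same)  # True
```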

The pivoted length normalization which you mentioned is one such alternative normalization.
Juru in fact uses this document length normalization. In our TREC experiments with Lucene
we tried this approach (we modified Lucene indexing so that all required components were
indexed as stored/cached fields, and at search time we could try various scoring algorithms).
Interestingly, in our experiments pivoted length normalization did not work well with Lucene
for TREC.
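For reference, a sketch of the two normalization factors being compared: pivoted length
normalization interpolates between the pivot (e.g. the average document length) and the
actual length, while DefaultSimilarity's classic lengthNorm is 1/sqrt(numTerms). The slope
and pivot values here are illustrative only, not the ones Juru or our TREC runs used.

```python
import math

def pivoted_norm(doc_len, pivot, slope=0.3):
    """Pivoted length normalization: slope=1 fully normalizes by
    length, slope=0 ignores length entirely; in between, long docs
    are penalized less steeply than by full normalization."""
    return 1.0 / ((1.0 - slope) * pivot + slope * doc_len)

def lucene_length_norm(num_terms):
    """Lucene DefaultSimilarity's classic lengthNorm factor."""
    return 1.0 / math.sqrt(num_terms)

avg_len = 100  # pivot at an assumed average document length
for n in (50, 100, 400):
    print(n, pivoted_norm(n, pivot=avg_len), lucene_length_norm(n))
```

Note how pivoted normalization decays much more gently with length than 1/sqrt(numTerms)
does, which is exactly the "partial" length penalty it was designed to give.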

The document length normalization of Lucene's DefaultSimilarity (DS) now seems to me,
intuitively, not so good, at least for the two previously mentioned edge cases: doc1 is made
of N distinct terms, and doc2 is made of the same N distinct terms, but its length is 2N
because each term appears twice. For doc1, DS normalizes to the unit vector, the same as
Euclidean normalization (EN); for doc2, DS normalizes to a vector larger than the unit
vector. However, I think the desired behavior is the other way around: doc2 should come out
the same as EN, and doc1 should be normalized to a vector larger than the unit vector.
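The two edge cases can be checked numerically. In this sketch V(d) is taken as the raw
term-frequency vector (an assumption; the exact form of V(d) is up to the Similarity), DS's
lengthNorm is the classic 1/sqrt(numTerms), and Euclidean normalization divides by |V(d)|.

```python
import math

N = 10
doc1 = {f"t{i}": 1 for i in range(N)}  # N distinct terms, once each (length N)
doc2 = {f"t{i}": 2 for i in range(N)}  # same terms, twice each (length 2N)

def euclidean_len(tf):
    """|V(d)| for a raw term-frequency vector."""
    return math.sqrt(sum(f * f for f in tf.values()))

def ds_normalized_len(tf):
    """Length of V(d) after scaling by DS's 1/sqrt(numTerms),
    where numTerms counts tokens including repeats."""
    num_terms = sum(tf.values())
    return euclidean_len(tf) / math.sqrt(num_terms)

print(ds_normalized_len(doc1))  # 1.0: the unit vector, same as EN
print(ds_normalized_len(doc2))  # sqrt(2) ~ 1.414: larger than the unit vector
```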

Back to the documentation patch: again, it seems wrong to present it as if both Euclidean
normalization and some additional doc length normalization are required. Fixed patch to
follow...

> Similarity javadocs for scoring function to relate more tightly to scoring models in effect
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1908
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1908
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1908.patch, LUCENE-1908.patch
>
>
> See discussion in the related issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



