Mailing-List: contact general-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: general@lucene.apache.org
Received-SPF: pass (athena.apache.org: domain of nikita.zhiltsov@gmail.com
 designates 209.85.219.48 as permitted sender)
MIME-Version: 1.0
Date: Tue, 11 Dec 2012 15:27:14 -0500
Message-ID: 
 <CAOpaOQpEfvbH=O76joDHFO7TCc754_EWwyLJ7Ys_hE3v8DzV_g@mail.gmail.com>
Subject: LMDirichletSimilarity for multiple fields
From: Nikita Zhiltsov <nikita.zhiltsov@gmail.com>
To: general@lucene.apache.org
Content-Type: multipart/alternative; boundary=e89a8fb1f674e355a004d099814e

--e89a8fb1f674e355a004d099814e
Content-Type: text/plain; charset=UTF-8

Hi all,

I'm implementing a mixture of language models approach in Lucene 4.0.0.

Here is a little math to be precise:

The ranking score for a query q, taken over its terms t:

p(q | \theta) = \prod_{t \in q} p(t | \theta)

where

p(t | \theta) = \sum_f \alpha_f p(t | \theta^f)

and

p(t | \theta^f) = (freq(t) + \mu_f p(t | \theta_c^f)) / (length(f) + \mu_f)

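Here \mu_f is the Dirichlet prior for field f, \alpha_f is the mixture
weight of field f, and \theta_c^f is the collection language model for
field f.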
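
As a rough illustration of the three formulas above, here is a minimal
Lucene-free sketch in plain Java. The class MixtureLmSketch, the holder
FieldTermStats and all method names are hypothetical and exist only for
this example. The product over query terms is accumulated in log space to
avoid underflow, which is an assumption of the sketch and not part of the
formulas.

```java
import java.util.List;

/** Illustrative only: not part of Lucene. */
final class MixtureLmSketch {

    /** Hypothetical holder for the statistics of one term in one field of one document. */
    static final class FieldTermStats {
        final double alpha;          // mixture weight \alpha_f
        final double freq;           // freq(t) in field f; 0 if the term is absent
        final double fieldLength;    // length(f)
        final double mu;             // Dirichlet prior \mu_f
        final double collectionProb; // p(t | \theta_c^f)

        FieldTermStats(double alpha, double freq, double fieldLength,
                       double mu, double collectionProb) {
            this.alpha = alpha;
            this.freq = freq;
            this.fieldLength = fieldLength;
            this.mu = mu;
            this.collectionProb = collectionProb;
        }
    }

    /** p(t | \theta^f): Dirichlet-smoothed probability of the term in one field. */
    static double fieldProbability(FieldTermStats s) {
        return (s.freq + s.mu * s.collectionProb) / (s.fieldLength + s.mu);
    }

    /** p(t | \theta): weighted sum over ALL fields, including those with freq = 0. */
    static double termProbability(List<FieldTermStats> fields) {
        double p = 0.0;
        for (FieldTermStats s : fields) {
            p += s.alpha * fieldProbability(s);
        }
        return p;
    }

    /** log p(q | \theta): sum of log term probabilities over the query terms. */
    static double logQueryLikelihood(List<List<FieldTermStats>> queryTerms) {
        double logP = 0.0;
        for (List<FieldTermStats> fields : queryTerms) {
            logP += Math.log(termProbability(fields));
        }
        return logP;
    }
}
```

Each inner list holds one entry per field for a single query term, so a
field without the term still contributes through its smoothing component.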

I've enhanced LMDirichletSimilarity to work with per-field priors:

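import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.index.Norm;
import org.apache.lucene.search.similarities.BasicStats;
import org.apache.lucene.search.similarities.LMDirichletSimilarity;

public class LMPerFieldDirichletSimilarity extends LMDirichletSimilarity {
    @Override
    protected float score(BasicStats stats, float freq, float docLen) {
        // Per-field prior: the average length of this field serves as \mu_f.
        float mu = stats.getAvgFieldLength();
        float collectionProbability = ((LMStats) stats).getCollectionProbability();
        // Dirichlet-smoothed p(t | \theta^f), returned as a raw probability.
        return (freq + mu * collectionProbability) / (docLen + mu);
    }

    @Override
    public void computeNorm(FieldInvertState state, Norm norm) {
        // Store the raw field length in the norm byte.
        // Caveat: lengths above 127 do not fit in a byte and wrap around.
        norm.setByte((byte) state.getLength());
    }

    @Override
    protected float decodeNormValue(byte norm) {
        // Return the stored byte as the field length.
        return norm;
    }
}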

With this class I can combine CustomScoreQuery, BooleanQuery and FieldsQuery
to retrieve relevant documents and compute the ranking function (the first
probability above). However, my current solution omits the p(t | \theta^f)
values for fields that contain no occurrences of a given term t. Those
values should be computed by LMPerFieldDirichletSimilarity.score with
freq=0.
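
To spell out what is lost: setting freq(t) = 0 in the field-level formula
leaves only the smoothing component, which is nonzero whenever the
collection probability for that field is nonzero:

```latex
p(t \mid \theta^f) \Big|_{freq(t) = 0}
  = \frac{\mu_f \, p(t \mid \theta_c^f)}{length(f) + \mu_f}
```

So dropping such a field from the sum lowers p(t | \theta) by \alpha_f
times this amount.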

The problem apparently comes from the fact that, by default, Lucene does
not retrieve postings for fields in which the term does not occur. This is
not so severe for LMDirichletSimilarity with a single-field approach, since
documents without the term are simply irrelevant. With multi-field
documents, however, one cannot omit those values if the document contains
at least one occurrence of the term, no matter in which field.

How would you add these values while scoring?


I've already sent this email to the 'lucene-java-user' mailing list,
but haven't got any reply yet.


-- 

Nikita Zhiltsov

Visiting Graduate Student
Emory University
Intelligent Information Access Lab
E500 Emerson Hall, Atlanta, Georgia, USA
Phone: (404) 834-5364
E-mail: znikita@emory.edu


---------------------------------------------------------------------
Graduate Student, Research Fellow
Kazan Federal University
Computational Linguistics Laboratory
Russia, 420008
Kazan, Prof. Nuzhina Str., 1/37 room 117
Skype: nickita.jhiltsov
Personal page: http://cll.niimm.ksu.ru/~nzhiltsov
E-mail: nikita.zhiltsov@gmail.com

---------------------------------------------------------------------

--e89a8fb1f674e355a004d099814e--