Hi all,
I'm implementing a mixture-of-language-models approach in Lucene 4.0.0.
Here is the math, to be precise.
The ranking score for a query q, taken as a product over its terms t:
p(q | \theta) = \prod_{t \in q} p(t | \theta)
where
p(t | \theta) = \sum_f \alpha_f p(t | \theta^f)
and
p(t | \theta^f) = (freq(t) + \mu_f p(t | \theta_c^f)) / (length(f) + \mu_f)
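Here freq(t) is the frequency of t in field f of the document, length(f) is
the length of that field, p(t | \theta_c^f) is the collection probability of
t for field f, \alpha_f is the mixture weight of field f, and \mu_f is the
Dirichlet prior for field f.
I've extended LMDirichletSimilarity to work with per-field priors: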
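To make the formulas concrete, here is a minimal sketch in plain Java, without any Lucene classes. The class name MixtureLmSketch, the method names and the numbers in main are made up for illustration; only the two formulas come from the definitions above.

```java
final class MixtureLmSketch {

    /** Dirichlet-smoothed p(t | theta^f) for a single field f. */
    static double fieldProbability(double freq, double fieldLength,
                                   double mu, double collectionProbability) {
        return (freq + mu * collectionProbability) / (fieldLength + mu);
    }

    /** p(t | theta) = sum_f alpha_f * p(t | theta^f); all arrays are indexed by field. */
    static double termProbability(double[] alpha, double[] freq, double[] fieldLength,
                                  double[] mu, double[] collectionProbability) {
        double p = 0.0;
        for (int f = 0; f < alpha.length; f++) {
            // Fields with freq[f] == 0 still contribute their smoothing mass.
            p += alpha[f] * fieldProbability(freq[f], fieldLength[f], mu[f],
                                             collectionProbability[f]);
        }
        return p;
    }

    public static void main(String[] args) {
        // Hypothetical two-field document: the term occurs twice in field 0
        // and not at all in field 1.
        double[] alpha = {0.5, 0.5};
        double[] freq = {2.0, 0.0};
        double[] fieldLength = {10.0, 90.0};
        double[] mu = {10.0, 10.0};
        double[] collectionProbability = {0.1, 0.01};

        System.out.println(termProbability(alpha, freq, fieldLength, mu,
                                           collectionProbability));
    }
}
```

With these made-up numbers, field 0 contributes about 0.15 and field 1 about 0.001 before weighting, so the field without any occurrence of the term still adds a non-zero amount to the mixture.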
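import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.index.Norm;
import org.apache.lucene.search.similarities.BasicStats;
import org.apache.lucene.search.similarities.LMDirichletSimilarity;

public class LMPerFieldDirichletSimilarity extends LMDirichletSimilarity {

    @Override
    protected float score(BasicStats stats, float freq, float docLen) {
        // Per-field prior: the average length of this field is used as mu_f.
        float mu = stats.getAvgFieldLength();
        float collectionProbability = ((LMStats) stats).getCollectionProbability();
        // p(t | theta^f), returned as a plain probability (no log).
        return (freq + mu * collectionProbability) / (docLen + mu);
    }

    @Override
    public void computeNorm(FieldInvertState state, Norm norm) {
        // Store the raw field length in the norm byte.
        // Note: a byte holds -128..127, so lengths above 127 wrap around.
        norm.setByte((byte) state.getLength());
    }

    @Override
    protected float decodeNormValue(byte norm) {
        // The norm byte is the field length itself, so no decoding table is needed.
        return (float) norm;
    }
}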
With this similarity I can combine CustomScoreQuery, BooleanQuery and
FieldsQuery to retrieve the relevant documents and compute the ranking
function (the first probability above). However, my current solution omits
the p(t | \theta^f) values for fields that contain no occurrences of a given
term t. Those values should be computed by
LMPerFieldDirichletSimilarity.score with freq=0.
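Written out with the definitions above, the sum over fields splits into two parts, and only the first one is currently scored:
p(t | \theta) = \sum_{f : freq(t) > 0} \alpha_f (freq(t) + \mu_f p(t | \theta_c^f)) / (length(f) + \mu_f)
              + \sum_{f : freq(t) = 0} \alpha_f \mu_f p(t | \theta_c^f) / (length(f) + \mu_f)
The second sum is what score returns for freq=0, weighted by \alpha_f. It is small but not zero.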
Presumably the problem is that Lucene, by default, does not visit a
(document, field) pair in which the term has no postings, so score is never
called for it. This is not severe for LMDirichletSimilarity with a single
field, since documents without the term are simply irrelevant. With
multi-field documents, however, those values cannot be omitted as soon as
the document contains at least one occurrence of the term, no matter in
which field.
How would you add these values during scoring?
