Hi all,

I'm implementing a mixture-of-language-models approach in Lucene 4.0.0. Here is a little math to be precise. The ranking score for a query q, taken over its terms t, is

    p(q | \theta) = \prod_{t \in q} p(t | \theta)

where

    p(t | \theta) = \sum_f \alpha_f p(t | \theta^f)

and

    p(t | \theta^f) = (freq(t) + \mu_f p(t | \theta_c^f)) / (length(f) + \mu_f)

Here freq(t) is the number of occurrences of t in field f of the document, length(f) is the length of that field, and \mu_f is the Dirichlet prior for field f.

I've extended LMDirichletSimilarity to work with per-field priors:

    import org.apache.lucene.index.FieldInvertState;
    import org.apache.lucene.index.Norm;
    import org.apache.lucene.search.similarities.BasicStats;
    import org.apache.lucene.search.similarities.LMDirichletSimilarity;

    public class LMPerFieldDirichletSimilarity extends LMDirichletSimilarity {

        @Override
        protected float score(BasicStats stats, float freq, float docLen) {
            // Use the average field length as the per-field Dirichlet prior mu_f.
            float mu = stats.getAvgFieldLength();
            float collectionProbability = ((LMStats) stats).getCollectionProbability();
            // Smoothed p(t | theta^f), returned as a raw probability (no log).
            return (freq + mu * collectionProbability) / (docLen + mu);
        }

        @Override
        public void computeNorm(FieldInvertState state, Norm norm) {
            // Store the raw field length in the norm byte.
            // Note: lengths above 127 overflow the byte.
            norm.setByte((byte) state.getLength());
        }

        @Override
        protected float decodeNormValue(byte norm) {
            // The norm byte holds the field length itself.
            return norm;
        }
    }

With this I can combine CustomScoreQuery, BooleanQuery and FieldsQuery to retrieve relevant documents and compute the ranking function (the first probability above).

However, my current solution omits the p(t | \theta^f) values for fields that contain no occurrences of a given term t. Those values should be computed by LMPerFieldDirichletSimilarity.score with freq=0. Presumably the problem is that Lucene, by default, does not visit fields of a document in which the term does not occur, so score is never called for them.

This is not a severe problem for LMDirichletSimilarity in the single-field case, since such documents are simply irrelevant. With multi-field documents, however, these values cannot be omitted whenever the document contains at least one occurrence of the term, no matter in which field.

How would you add these values while scoring?
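To make the missing contribution concrete, here is a minimal sketch in plain Java (no Lucene API involved) of the per-term mixture I am after. The class and method names are hypothetical and the numbers are made up for illustration; it only contrasts summing over all fields with summing over the fields where the term actually occurs, which is what I effectively get now.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
final class MixtureLmSketch {

    /** Statistics of one term in one field of one document (illustrative values). */
    static final class FieldStats {
        final double alpha;          // mixture weight alpha_f
        final double mu;             // Dirichlet prior mu_f
        final double collectionProb; // p(t | theta_c^f)
        final double freq;           // occurrences of t in field f (may be 0)
        final double length;         // length of field f in the document

        FieldStats(double alpha, double mu, double collectionProb,
                   double freq, double length) {
            this.alpha = alpha;
            this.mu = mu;
            this.collectionProb = collectionProb;
            this.freq = freq;
            this.length = length;
        }
    }

    /** Dirichlet-smoothed p(t | theta^f); well defined for freq == 0. */
    static double fieldProbability(FieldStats s) {
        return (s.freq + s.mu * s.collectionProb) / (s.length + s.mu);
    }

    /**
     * p(t | theta) = sum_f alpha_f * p(t | theta^f).
     * With includeZeroFreq == false, fields without the term are skipped,
     * which mimics the behaviour described in this message.
     */
    static double termProbability(List<FieldStats> fields, boolean includeZeroFreq) {
        double p = 0.0;
        for (FieldStats s : fields) {
            if (!includeZeroFreq && s.freq == 0.0) {
                continue; // contribution lost when the field has no occurrences
            }
            p += s.alpha * fieldProbability(s);
        }
        return p;
    }

    public static void main(String[] args) {
        List<FieldStats> fields = Arrays.asList(
                new FieldStats(0.6, 10, 0.01, 0, 5),      // e.g. a title without the term
                new FieldStats(0.4, 100, 0.002, 3, 200)); // e.g. a body with 3 occurrences
        System.out.println("all fields:     " + termProbability(fields, true));
        System.out.println("matched fields: " + termProbability(fields, false));
    }
}
```

The difference between the two printed values is exactly the alpha_f-weighted freq=0 term of the field that does not contain t, i.e. the part I need Lucene to add during scoring.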
--
Nikita Zhiltsov
Visiting Graduate Student
Emory University
Intelligent Information Access Lab
E500 Emerson Hall, Atlanta, Georgia, USA
Phone: (404) 834-5364
E-mail: znikita@emory.edu
---------------------------------------------------------------------
Graduate Student, Research Fellow
Kazan Federal University
Computational Linguistics Laboratory
Russia, 420008 Kazan, Prof. Nuzhina Str., 1/37 room 117
Skype: nickita.jhiltsov
Personal page: http://cll.niimm.ksu.ru/~nzhiltsov
E-mail: nikita.zhiltsov@gmail.com
---------------------------------------------------------------------