Date: Tue, 11 Dec 2012 15:27:14 -0500
Subject: LMDirichletSimilarity for multiple fields
From: Nikita Zhiltsov
To: general@lucene.apache.org

Hi all,

I'm implementing a mixture of language models in Lucene 4.0.0. Here is a little math to be precise.

The ranking score for a query q consisting of terms t is

    p(q | \theta) = \prod_{t \in q} p(t | \theta)

where

    p(t | \theta) = \sum_f \alpha_f p(t | \theta^f)

and

    p(t | \theta^f) = (freq(t) + \mu_f p(t | \theta_c^f)) / (length(f) + \mu_f)

Here freq(t) is the number of occurrences of t in field f, \alpha_f is the mixture weight of field f, \theta_c^f is the collection language model for field f, and \mu_f is the Dirichlet prior for field f.

I've enhanced LMDirichletSimilarity to work with per-field priors:

    import org.apache.lucene.index.FieldInvertState;
    import org.apache.lucene.index.Norm;
    import org.apache.lucene.search.similarities.BasicStats;
    import org.apache.lucene.search.similarities.LMDirichletSimilarity;

    public class LMPerFieldDirichletSimilarity extends LMDirichletSimilarity {

      // Returns p(t | \theta^f) directly, using the average field length as \mu_f.
      @Override
      protected float score(BasicStats stats, float freq, float docLen) {
        float mu = stats.getAvgFieldLength();
        float collectionProbability = ((LMStats) stats).getCollectionProbability();
        return (freq + mu * collectionProbability) / (docLen + mu);
      }

      // Stores the raw field length in the one-byte norm.
      // Note: the cast wraps around for lengths above 127.
      @Override
      public void computeNorm(FieldInvertState state, Norm norm) {
        norm.setByte((byte) state.getLength());
      }

      // Reads the stored length back as a float (inverse of computeNorm).
      @Override
      protected float decodeNormValue(byte norm) {
        return norm;
      }
    }

With this class I can mix CustomScoreQuery, BooleanQuery and FieldsQuery to get relevant documents and compute the ranking function (the first probability above). However, my current solution omits the p(t | \theta^f) values for fields that do not contain any occurrence of a given term t. Those values should be computed by LMPerFieldDirichletSimilarity.score with freq=0.
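To make the intended computation concrete, here is a minimal standalone sketch in plain Java that evaluates the formulas above by hand, including the freq=0 case. It does not use Lucene at all: the class FieldMixtureLM, the holder FieldStats and all of their method names are hypothetical and exist only for illustration. It also works in log space, which is my assumption about how one would avoid underflow in the product over query terms.

```java
import java.util.List;

/** Hypothetical helper, not a Lucene class: evaluates the field mixture by hand. */
public final class FieldMixtureLM {

    /** Statistics of one term in one field of one document (illustrative names). */
    public static final class FieldStats {
        final double alpha;          // mixture weight alpha_f
        final double freq;           // occurrences of the term in this field, may be 0
        final double fieldLength;    // length(f)
        final double mu;             // Dirichlet prior mu_f
        final double collectionProb; // p(t | theta_c^f)

        public FieldStats(double alpha, double freq, double fieldLength,
                          double mu, double collectionProb) {
            this.alpha = alpha;
            this.freq = freq;
            this.fieldLength = fieldLength;
            this.mu = mu;
            this.collectionProb = collectionProb;
        }
    }

    private FieldMixtureLM() {}

    /** p(t | theta^f) = (freq + mu_f * p(t | theta_c^f)) / (length(f) + mu_f). */
    public static double fieldProbability(double freq, double fieldLength,
                                          double mu, double collectionProb) {
        return (freq + mu * collectionProb) / (fieldLength + mu);
    }

    /** p(t | theta) = sum over fields of alpha_f * p(t | theta^f). */
    public static double termProbability(List<FieldStats> fields) {
        double p = 0.0;
        for (FieldStats f : fields) {
            // Fields with freq == 0 still contribute their smoothed probability.
            p += f.alpha * fieldProbability(f.freq, f.fieldLength, f.mu, f.collectionProb);
        }
        return p;
    }

    /** log p(q | theta) = sum over query terms of log p(t | theta). */
    public static double logQueryLikelihood(List<List<FieldStats>> terms) {
        double logP = 0.0;
        for (List<FieldStats> fields : terms) {
            logP += Math.log(termProbability(fields));
        }
        return logP;
    }
}
```

For a document with two fields where the term occurs only in the first, termProbability still adds the second field's term alpha_f * \mu_f p(t | \theta_c^f) / (length(f) + \mu_f). That is exactly the contribution my Lucene-based solution currently drops.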
The problem comes from the fact that Lucene, by default, does not visit a field in which the term has no occurrences, so score is never invoked for it. This is not so severe for LMDirichletSimilarity in the one-field case, since such documents are simply irrelevant. For multi-field documents, however, these values cannot be omitted as long as the document contains at least one occurrence of the term, no matter in which field.

How would you add these values while scoring?

I've already sent this email to the 'lucene-java-user' mailing list, but haven't received any reply yet.

-- 
Nikita Zhiltsov
Visiting Graduate Student
Emory University
Intelligent Information Access Lab
E500 Emerson Hall, Atlanta, Georgia, USA
Phone: (404) 834-5364
E-mail: znikita@emory.edu
---------------------------------------------------------------------
Graduate Student, Research Fellow
Kazan Federal University
Computational Linguistics Laboratory
Russia, 420008 Kazan, Prof. Nuzhina Str., 1/37 room 117
Skype: nickita.jhiltsov
Personal page: http://cll.niimm.ksu.ru/~nzhiltsov
E-mail: nikita.zhiltsov@gmail.com
---------------------------------------------------------------------