Subject: Re: Scoring a document using LDA topics
From: Sujit Pal
Reply-To: sujit.pal@comcast.net
To: java-user@lucene.apache.org
Date: Mon, 28 Nov 2011 10:51:13 -0800
Message-Id: <1322506273.19456.9.camel@lysdexic.healthline.com>

Hi Stephen,

We are doing something similar. We store the topics as a multifield,
with each document holding (d,z) pairs: each d (topic) is indexed as a
term and its z (score) is stored as that term's payload. We had to build
a custom Similarity which implements the scorePayload function. To find
docs for a given d (topic), we do a simple PayloadTermQuery and the docs
come back in descending order of z. Simple boolean term queries also
work. We turn off norms (in the ctor for the PayloadTermQuery) to get
scores that are identical to the z values.

I wrote about this some time back; maybe it will help you.
http://sujitpal.blogspot.com/2011/01/payloads-with-solr.html

-sujit

On Mon, 2011-11-28 at 12:29 -0500, Stephen Thomas wrote:
> List,
>
> I am trying to incorporate the Latent Dirichlet Allocation (LDA) topic
> model into Lucene. Briefly, the LDA model extracts topics
> (distributions over words) from a set of documents, and then represents
> each document with topic vectors. For example, documents could be
> represented as:
>
> d1 = (0, 0.5, 0, 0.5)
>
> d2 = (1, 0, 0, 0)
>
> This means that document d1 contains topics 2 and 4, and document d2
> contains topic 1. I.e.,
>
> P(z1, d1) = 0
> P(z2, d1) = 0.5
> P(z3, d1) = 0
> P(z4, d1) = 0.5
> P(z1, d2) = 1
> P(z2, d2) = 0
> ...
>
> Also, topics are represented by the probability that a term appears in
> that topic, so we also have a set of vectors:
>
> z1 = (0, 0, .02, ...)
>
> meaning that topic z1 does not contain terms 1 or 2, but does contain
> term 3. I.e.,
>
> P(t1, z1) = 0
> P(t2, z1) = 0
> P(t3, z1) = .02
> ...
>
> Then, the similarity between a query and a document is computed as:
>
> Sim (query q, doc d) = sum_{t in q} sum_{z} P(t, z) * P(z, d)
>
> Basically, for each term in the query, and each topic in existence,
> see how relevant that term is in that topic, and how relevant that
> topic is in the document.
>
>
> I've been thinking about how to do this in Lucene. Assume I already
> have the topics and the topic vectors for each document. I know that I
> need to write my own Similarity class that extends DefaultSimilarity.
> I need to override tf(), queryNorm(), coord(), and computeNorm() to
> all return a constant 1, so that they have no effect. Then, I can
> override idf() to compute the Sim equation above. Seems simple enough.
> However, I have a few practical issues:
>
>
> - Storing the topic vectors for each document. Can I store this in the
> index somehow? If so, how do I retrieve it later in my
> CustomSimilarity class?
>
> - Changing the Boolean model. Instead of only computing the similarity
> on documents that contain any of the terms in the query (the default
> behavior), I need to compute the similarity on all of the documents.
> (This is the whole idea behind LDA: you don't need an exact term
> match for there to be a similarity.) I understand that this will
> result in a performance hit, but I do not see a way around it.
>
> - Turning off fieldNorm(). How can I set the field norm for each doc
> to a constant 1?
>
>
> Any help is greatly appreciated.
>
> Steve
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
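
For reference, the Sim formula quoted in this thread can be sketched in
plain Java, outside of Lucene. This is only an illustration of the
arithmetic, not the Similarity or payload implementation discussed above.
The class name LdaScoringSketch, the method sim(), and the array layout
are hypothetical. The d1 and d2 vectors and the z1 column (0, 0, .02) are
taken from the message; every other probability in the table is a made-up
value.

```java
// Plain-Java sketch of Sim(q, d) = sum_{t in q} sum_{z} P(t, z) * P(z, d).
// No Lucene classes are used; all names here are hypothetical.
final class LdaScoringSketch {

    /**
     * @param queryTerms indices of the query terms (0-based)
     * @param termTopic  termTopic[t][z] = P(t, z)
     * @param docTopic   docTopic[z] = P(z, d) for one document
     * @return the double sum over query terms and topics
     */
    static double sim(int[] queryTerms, double[][] termTopic, double[] docTopic) {
        double score = 0.0;
        for (int t : queryTerms) {
            for (int z = 0; z < docTopic.length; z++) {
                score += termTopic[t][z] * docTopic[z];
            }
        }
        return score;
    }

    public static void main(String[] args) {
        // Document-topic vectors d1 and d2 from the message.
        double[] d1 = {0.0, 0.5, 0.0, 0.5};
        double[] d2 = {1.0, 0.0, 0.0, 0.0};

        // Rows are terms t1..t3, columns are topics z1..z4.
        // Column z1 = (0, 0, .02) comes from the message; the other
        // columns are made-up values for illustration only.
        double[][] termTopic = {
            {0.00, 0.10, 0.00, 0.00},
            {0.00, 0.00, 0.05, 0.00},
            {0.02, 0.00, 0.00, 0.06},
        };

        int[] query = {0, 2}; // a query containing t1 and t3

        System.out.println("Sim(q, d1) = " + sim(query, termTopic, d1));
        System.out.println("Sim(q, d2) = " + sim(query, termTopic, d2));
    }
}
```

Because the inner loop runs over every topic, each document gets a score
whether or not it contains a query term, which is the "score all
documents" behavior asked about in the original question.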