lucene-java-user mailing list archives

From Sujit Pal <sujit....@comcast.net>
Subject Re: Scoring a document using LDA topics
Date Mon, 28 Nov 2011 18:51:13 GMT
Hi Stephen,

We are doing something similar. Each document gets a multi-valued field
of (d, z) pairs, where d is a topic and z is its score, and we store
each z as a payload on its d. We had to build a custom Similarity that
implements the scorePayload function. To find docs for a given d
(topic), we run a simple PayloadTermQuery and the docs come back in
descending order of z. Simple boolean term queries also work. We turn
off norms (in the ctor for the PayloadTermQuery) to get scores that are
identical to the z values.
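
To make the payload idea concrete, here is a minimal sketch in plain
Java. It is not Lucene code: TopicPayloadSketch and its methods are
hypothetical names, and the 4-byte float encoding is only an assumption
about how a score might be packed into a payload. It shows the round
trip (score to bytes and back, which is what a scorePayload
implementation would undo) and the resulting ordering by descending
score.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a payload-scored topic field; not Lucene code.
class TopicPayloadSketch {
    // Pack a topic score into 4 bytes, as a float payload might be stored.
    static byte[] encode(float score) {
        return ByteBuffer.allocate(4).putFloat(score).array();
    }

    // Recover the score from the payload bytes.
    static float decode(byte[] payload) {
        return ByteBuffer.wrap(payload).getFloat();
    }

    // payloads[doc] holds the payload for one topic, or null when the doc
    // has no posting for that topic. Returns matching doc ids ordered by
    // descending score.
    static List<Integer> rank(byte[][] payloads) {
        List<Integer> hits = new ArrayList<>();
        for (int doc = 0; doc < payloads.length; doc++) {
            if (payloads[doc] != null) {
                hits.add(doc);
            }
        }
        hits.sort(Comparator.comparingDouble((Integer doc) -> -decode(payloads[doc])));
        return hits;
    }
}
```

With payloads for scores 0.5 (doc 0) and 1.0 (doc 2) and no posting for
doc 1, rank returns docs 2 and 0 in that order, and doc 1 is not
returned at all.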

I wrote about this some time back; maybe it will help you:
http://sujitpal.blogspot.com/2011/01/payloads-with-solr.html

-sujit

On Mon, 2011-11-28 at 12:29 -0500, Stephen Thomas wrote:
> List,
> 
> I am trying to incorporate the Latent Dirichlet Allocation (LDA) topic
> model into Lucene. Briefly, the LDA model extracts topics
> (distribution over words) from a set of documents, and then represents
> each document with topic vectors. For example, documents could be
> represented as:
> 
> d1 = (0, 0.5, 0, 0.5)
> 
> d2 = (1, 0, 0, 0)
> 
> This means that document d1 contains topics 2 and 4, and document d2
> contains topic 1. I.e.,
> 
> P(z1, d1) = 0
> P(z2, d1) = 0.5
> P(z3, d1) = 0
> P(z4, d1) = 0.5
> P(z1, d2) = 1
> P(z2, d2) = 0
> ...
> 
> Also, topics are represented by the probability that a term appears in
> that topic, so we also have a set of vectors:
> 
> z1 = (0, 0, .02, ...)
> 
> meaning that topic z1 does not contain terms 1 or 2, but does contain
> term 3. I.e.,
> 
> P(t1, z1) = 0
> P(t2, z1) = 0
> P(t3, z1) = .02
> ...
> 
> Then, the similarity between a query and a document is computed as:
> 
> Sim (query q, doc d) = sum_{t in q} sum_{z} P(t, z) * P(z, d)
> 
> Basically, for each term in the query, and each topic in existence,
> see how relevant that term is in that topic, and how relevant that
> topic is in the document.
> 
> 
> I've been thinking about how to do this in Lucene. Assume I already
> have the topics and the topic vectors for each document. I know that I
> need to write my own Similarity class that extends DefaultSimilarity.
> I need to override tf(), queryNorm(), coord(), and computeNorm() to
> all return a constant 1, so that they have no effect. Then, I can
> override idf() to compute the Sim equation above. Seems simple enough.
> However, I have a few practical issues:
> 
> 
> - Storing the topic vectors for each document. Can I store this in the
> index somehow? If so, how do I retrieve it later in my
> CustomSimilarity class?
> 
> - Changing the Boolean model. Instead of only computing the similarity
> on documents that contain any of the terms in the query (the default
> behavior), I need to compute the similarity on all of the documents.
> (This is the whole idea behind LDA: you don't need an exact term
> match for there to be a similarity.) I understand that this will
> result in a performance hit, but I do not see a way around it.
> 
> - Turning off fieldNorm(). How can I set the field norm for each doc
> to a constant 1?
> 
> 
> Any help is greatly appreciated.
> 
> Steve
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
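
For reference, the Sim formula in the quoted message can be sketched
outside Lucene as a double loop over query terms and topics. This is
only an illustration: LdaSimSketch, sim, termTopic and topicDoc are
hypothetical names, and the matrices are assumed to be dense arrays
indexed as termTopic[t][z] = P(t, z) and topicDoc[z][d] = P(z, d).

```java
// Hypothetical helper; plain Java, no Lucene dependency.
class LdaSimSketch {
    // Sim(q, d) = sum over t in q, sum over z, of P(t, z) * P(z, d).
    // queryTerms holds term indices; doc is a document index.
    static double sim(int[] queryTerms, int doc,
                      double[][] termTopic, double[][] topicDoc) {
        double score = 0.0;
        for (int t : queryTerms) {
            for (int z = 0; z < topicDoc.length; z++) {
                score += termTopic[t][z] * topicDoc[z][doc];
            }
        }
        return score;
    }
}
```

Using the document vectors from the message (d1 = (0, 0.5, 0, 0.5),
d2 = (1, 0, 0, 0)), the given P(t3, z1) = .02, and a made-up
P(t3, z2) = 0.1 with all other term probabilities set to 0, a query
containing only t3 scores about 0.05 against d1 and 0.02 against d2.
Every document gets a score, whether or not it contains t3, which is
why the sum cannot be restricted to documents that match a query term.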

