lucene-java-user mailing list archives

From Stephen Thomas <>
Subject Re: Scoring a document using LDA topics
Date Tue, 29 Nov 2011 15:50:28 GMT

Thanks for your reply, and the link to your blog post, which was
helpful and got me thinking about Payloads.

I still have one more question. I need to be able to compute the
Sim(query q, doc d) similarity function, which is defined below:

Sim (query q, doc d) = sum_{t in q} sum_{z} P(t, z) * P(z, d)
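For concreteness, that double sum can be sketched in plain Java, independent of Lucene, using the example vectors from the earlier message (the P(t,z) entries other than z1's 0.02 are made-up numbers, purely for illustration):

```java
public class SimExample {
    // Sim(q, d) = sum_{t in q} sum_z P(t, z) * P(z, d)
    static double sim(double[][] pTermTopic, int[] queryTerms, double[] pTopicDoc) {
        double s = 0.0;
        for (int t : queryTerms) {
            for (int z = 0; z < pTopicDoc.length; z++) {
                s += pTermTopic[t][z] * pTopicDoc[z];
            }
        }
        return s;
    }

    public static void main(String[] args) {
        // P(t, z): rows are terms, columns are topics. Term 3's weight on
        // z1 (0.02) is from the post; the other entries are hypothetical.
        double[][] pTermTopic = {
            {0.00, 0.10, 0.00, 0.00},  // term 1
            {0.00, 0.00, 0.20, 0.00},  // term 2
            {0.02, 0.00, 0.00, 0.30},  // term 3
        };
        double[] d1 = {0.0, 0.5, 0.0, 0.5};  // P(z, d1) from the post
        int[] query = {0, 2};                // query contains terms 1 and 3
        System.out.println(sim(pTermTopic, query, d1));
    }
}
```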

So I'm guessing that the only way to do this is the following:

- At index time, store the (flattened) topics as a payload for each
document, as you suggest in your blog

- At query time, find out which topics are in the query
- Construct a BooleanQuery, consisting of one PayloadTermQuery per
topic in the query
- Search on the BooleanQuery. This essentially tells me which
documents have the topics in the query
- Iterate over the TopDocs returned by the search. For each doc, get
the full payload, unflatten it, and use it to compute Sim(query q, doc d)
- Reorder the results based on the Sim(query q, doc d) values.
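For the first step, one way to flatten a per-document topic vector into payload bytes (and recover it at query time) is a simple fixed-width float encoding; this is only a sketch of the round trip, not Lucene's own payload encoding:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class TopicPayloadCodec {
    // Flatten a per-document topic vector P(z, d) into bytes for a payload.
    public static byte[] flatten(float[] topicWeights) {
        ByteBuffer buf = ByteBuffer.allocate(4 * topicWeights.length);
        for (float w : topicWeights) {
            buf.putFloat(w);
        }
        return buf.array();
    }

    // Recover the topic vector from the stored payload bytes.
    public static float[] unflatten(byte[] payload) {
        ByteBuffer buf = ByteBuffer.wrap(payload);
        float[] weights = new float[payload.length / 4];
        for (int i = 0; i < weights.length; i++) {
            weights[i] = buf.getFloat();
        }
        return weights;
    }

    public static void main(String[] args) {
        float[] d1 = {0f, 0.5f, 0f, 0.5f};
        float[] roundTrip = unflatten(flatten(d1));
        System.out.println(Arrays.equals(d1, roundTrip)); // prints true
    }
}
```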

Is this the best way? I can't see a way to compute the Sim() metric at
any other time, because in scorePayload(), we don't have access to the
full payload, nor to the query.
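The last two steps above (compute Sim from the unflattened payloads, then reorder) can be sketched outside Lucene as well; the query topic weights and doc ids here are invented for illustration, with the query's per-topic weight assumed to be precomputed as sum_{t in q} P(t, z):

```java
import java.util.*;

public class LdaReRanker {
    // Sim(q, d) = sum_z [ sum_{t in q} P(t, z) ] * P(z, d)
    static double sim(double[] queryTopicWeights, float[] docTopicWeights) {
        double s = 0.0;
        for (int z = 0; z < docTopicWeights.length; z++) {
            s += queryTopicWeights[z] * docTopicWeights[z];
        }
        return s;
    }

    // Reorder doc ids by descending Sim, given each doc's unflattened payload.
    static List<Integer> rerank(double[] queryTopicWeights,
                                Map<Integer, float[]> docTopics) {
        List<Integer> ids = new ArrayList<>(docTopics.keySet());
        ids.sort((a, b) -> Double.compare(
                sim(queryTopicWeights, docTopics.get(b)),
                sim(queryTopicWeights, docTopics.get(a))));
        return ids;
    }

    public static void main(String[] args) {
        // d1 = (0, 0.5, 0, 0.5), d2 = (1, 0, 0, 0) from the original post.
        Map<Integer, float[]> docs = new HashMap<>();
        docs.put(1, new float[] {0f, 0.5f, 0f, 0.5f});
        docs.put(2, new float[] {1f, 0f, 0f, 0f});
        // A hypothetical query whose terms load mostly on topic 2.
        double[] q = {0.1, 0.8, 0.0, 0.1};
        System.out.println(rerank(q, docs)); // prints [1, 2]
    }
}
```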

Thanks again,

On Mon, Nov 28, 2011 at 1:51 PM, Sujit Pal <> wrote:
> Hi Stephen,
> We are doing something similar: we store a multivalued field per
> document of (d,z) pairs, where we store the z's (scores) as payloads for
> each d (topic). We have had to build a custom Similarity which
> implements the scorePayload function. So to find docs for a given d
> (topic), we do a simple PayloadTermQuery and the docs come back in
> descending order of z. Simple boolean term queries also work. We turn
> off norms (in the ctor for the PayloadTermQuery) to get scores that are
> identical to the z values.
> I wrote about this sometime back...maybe this would help you.
> -sujit
> On Mon, 2011-11-28 at 12:29 -0500, Stephen Thomas wrote:
>> List,
>> I am trying to incorporate the Latent Dirichlet Allocation (LDA) topic
>> model into Lucene. Briefly, the LDA model extracts topics
>> (distribution over words) from a set of documents, and then represents
>> each document with topic vectors. For example, documents could be
>> represented as:
>> d1 = (0, 0.5, 0, 0.5)
>> d2 = (1, 0, 0, 0)
>> This means that document d1 contains topics 2 and 4, and document d2
>> contains topic 1. I.e.,
>> P(z1, d1) = 0
>> P(z2, d1) = 0.5
>> P(z3, d1) = 0
>> P(z4, d1) = 0.5
>> P(z1, d2) = 1
>> P(z2, d2) = 0
>> ...
>> Also, topics are represented by the probability that a term appears in
>> that topic, so we also have a set of vectors:
>> z1 = (0, 0, .02, ...)
>> meaning that topic z1 does not contain terms 1 or 2, but does contain
>> term 3. I.e.,
>> P(t1, z1) = 0
>> P(t2, z1) = 0
>> P(t3, z1) = .02
>> ...
>> Then, the similarity between a query and a document is computed as:
>> Sim (query q, doc d) = sum_{t in q} sum_{z} P(t, z) * P(z, d)
>> Basically, for each term in the query, and each topic in existence,
>> see how relevant that term is in that topic, and how relevant that
>> topic is in the document.
>> I've been thinking about how to do this in Lucene. Assume I already
>> have the topics and the topic vectors for each document. I know that I
>> need to write my own Similarity class that extends DefaultSimilarity.
>> I need to override tf(), queryNorm(), coord(), and computeNorm() to
>> all return a constant 1, so that they have no effect. Then, I can
>> override idf() to compute the Sim equation above. Seems simple enough.
>> However, I have a few practical issues:
>> - Storing the topic vectors for each document. Can I store this in the
>> index somehow? If so, how do I retrieve it later in my
>> CustomSimilarity class?
>> - Changing the Boolean model. Instead of only computing the similarity
>> on documents that contain any of the terms in the query (the default
>> behavior), I need to compute the similarity on all of the documents.
>> (This is the whole idea behind LDA: you don't need an exact term
>> match for there to be a similarity.) I understand that this will
>> result in a performance hit, but I do not see a way around it.
>> - Turning off fieldNorm(). How can I set the field norm for each doc
>> to a constant 1?
>> Any help is greatly appreciated.
>> Steve
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
