lucene-java-user mailing list archives

From Stephen Thomas <>
Subject Scoring a document using LDA topics
Date Mon, 28 Nov 2011 17:29:42 GMT

I am trying to incorporate the Latent Dirichlet Allocation (LDA) topic
model into Lucene. Briefly, the LDA model extracts topics
(distributions over words) from a set of documents, and then
represents each document as a vector of topic weights. For example,
with four topics, two documents could be represented as:

d1 = (0, 0.5, 0, 0.5)

d2 = (1, 0, 0, 0)

This means that document d1 contains topics 2 and 4, and document d2
contains topic 1. I.e.,

P(z1, d1) = 0
P(z2, d1) = 0.5
P(z3, d1) = 0
P(z4, d1) = 0.5
P(z1, d2) = 1
P(z2, d2) = 0
P(z3, d2) = 0
P(z4, d2) = 0

Also, topics are represented by the probability that a term appears in
that topic, so we also have a set of vectors:

z1 = (0, 0, .02, ...)

meaning that topic z1 does not contain terms 1 or 2, but does contain
term 3. I.e.,

P(t1, z1) = 0
P(t2, z1) = 0
P(t3, z1) = .02

Then, the similarity between a query and a document is computed as:

Sim (query q, doc d) = sum_{t in q} sum_{z} P(t, z) * P(z, d)

Basically, for each term in the query, and each topic in existence,
see how relevant that term is in that topic, and how relevant that
topic is in the document.
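
To make the formula concrete, here is a minimal sketch of it in plain
Java, outside of Lucene. The class name, method name, and array layout
are my own assumptions for illustration: termGivenTopic[z][t] holds
P(t, z) and topicGivenDoc[z] holds P(z, d), with query terms given as
term indices.

```java
/** Hypothetical helper, not part of Lucene: evaluates the Sim formula directly. */
final class LdaSimilaritySketch {

    /**
     * Sim(q, d) = sum over query terms t, sum over topics z, of P(t, z) * P(z, d).
     *
     * @param queryTerms     indices of the query terms (assumed layout)
     * @param termGivenTopic termGivenTopic[z][t] = P(t, z)
     * @param topicGivenDoc  topicGivenDoc[z] = P(z, d) for one document
     */
    static double sim(int[] queryTerms, double[][] termGivenTopic, double[] topicGivenDoc) {
        double score = 0.0;
        for (int t : queryTerms) {
            for (int z = 0; z < topicGivenDoc.length; z++) {
                // How relevant term t is in topic z, times how relevant topic z is in d.
                score += termGivenTopic[z][t] * topicGivenDoc[z];
            }
        }
        return score;
    }
}
```

With d2 = (1, 0, 0, 0) and P(t3, z1) = .02, a query containing only t3
scores .02 against d2, because z1 is the only topic with weight in d2.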

I've been thinking about how to do this in Lucene. Assume I already
have the topics and the topic vectors for each document. I know that I
need to write my own Similarity class that extends DefaultSimilarity.
I need to override tf(), queryNorm(), coord(), and computeNorm() to
all return a constant 1, so that they have no effect. Then, I can
override idf() to compute the Sim equation above. Seems simple enough.
However, I have a few practical issues:

- Storing the topic vectors for each document. Can I store this in the
index somehow? If so, how do I retrieve it later in my
CustomSimilarity class?

- Changing the Boolean model. Instead of only computing the similarity
on documents that contain any of the terms in the query (the default
behavior), I need to compute the similarity on all of the documents.
(This is the whole idea behind LDA: you don't need an exact term
match for there to be a similarity.) I understand that this will
result in a performance hit, but I do not see a way around it.

- Turning off fieldNorm(). How can I set the field norm for each doc
to a constant 1?
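
To illustrate what I mean by scoring every document, here is a rough
sketch in plain Java that does not use any Lucene API. It assumes the
topic vectors are already in memory as arrays (docTopics[d][z] =
P(z, d), termGivenTopic[z][t] = P(t, z)); the class and method names
are made up. Since the double sum can be reordered, the per-topic
query weight sum_{t in q} P(t, z) is computed once, and each document
then costs one dot product over the topics.

```java
import java.util.Arrays;

/** Hypothetical brute-force ranker, not part of Lucene. */
final class LdaRankAllSketch {

    /** w[z] = sum over query terms t of P(t, z); computed once per query. */
    static double[] queryTopicWeights(int[] queryTerms, double[][] termGivenTopic) {
        double[] w = new double[termGivenTopic.length];
        for (int z = 0; z < termGivenTopic.length; z++) {
            for (int t : queryTerms) {
                w[z] += termGivenTopic[z][t];
            }
        }
        return w;
    }

    /** Returns document indices ordered by descending Sim(q, d), scoring ALL documents. */
    static Integer[] rank(int[] queryTerms, double[][] termGivenTopic, double[][] docTopics) {
        double[] w = queryTopicWeights(queryTerms, termGivenTopic);
        double[] scores = new double[docTopics.length];
        for (int d = 0; d < docTopics.length; d++) {
            for (int z = 0; z < w.length; z++) {
                scores[d] += w[z] * docTopics[d][z]; // sum_z w[z] * P(z, d)
            }
        }
        Integer[] order = new Integer[docTopics.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Highest score first; no document is skipped for lacking a query term.
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));
        return order;
    }
}
```

This is linear in the number of documents for every query, which is
the performance hit I mentioned above.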

Any help is greatly appreciated.

