mahout-user mailing list archives

From Jake Mannix
Subject Re: Using LDA CVB results to match a new document to topics?
Date Wed, 30 May 2012 03:12:20 GMT
Hi Timothy,

On Tue, May 29, 2012 at 5:03 PM, Runkel, Timothy J wrote:

> Jake,
> We've run the new LDA CVB implementation (using the RowID job to format
> docs as you noted in other email) and have complete results.

Great!  Were you able to inspect the model files easily using the
vectordump utility?  That doesn't seem to be very well documented (mea culpa),
but the new LDA doesn't require any of the old "LDAPrintTopics" stuff
anymore, and not everyone knows that...

> Now, given the topic by terms association vectors, how can we take a new
> document (in Term Frequency format using the same dictionary as the trained
> documents and ignoring any few terms not found) and query the model to rank
> its top topic matches?  Academic papers seem to gloss over this task.

Yeah, this is totally straightforward, but you're right, it's not usually
described well.  Note, although LDA doesn't technically follow the correct
"theory" if you use TF-IDF vectors, in "practice" you can often get better
results this way (i.e., train with it, and then do the final topic ->
document association step with this as well), in that you get less of the
crappy super-common terms at the top of all of your topics.

> An input TF vector is not the same thing as the topic by terms association
> vectors, but hoping it was analogous enough, I tried several similarity or
> distance measures between some trained doc TF vectors and the topic by term
> association vectors, but the calculated top topic matches did not even rank
> in approximately the same order as the model output of the doc by topic
> membership rankings.  So that approach seems even less likely to match a
> new doc TF vector to model topics.

Yeah, don't do this; it won't work at all.

> Tracing the logic through the CVB classes seems to show the final model
> training iteration happens in the TopicModel.trainDocTopicModel(Vector
> original, Vector topics, Matrix docTopicModel)  method, but several of its
> steps use class values not obviously accessible in model results and its
> modifications to the docTopicModel matrix seem more like tuning than a
> simple look up.

This is exactly the place to look: CVB0DocInferenceMapper does exactly what
you want.  To paraphrase what it's doing:

    Vector doc = getOriginalDocument();
    // If you have a prior guess as to what the topic distribution should be,
    // you can start with it here, instead of the uniform distribution.
    Vector docTopics = new DenseVector(new double[numTopics]).assign(1.0 / numTopics);
    // This is an empty matrix, just for holding intermediate data - I've got a
    // branch where this gets hidden away.
    Matrix docModel = new SparseRowMatrix(numTopics, doc.size());
    int maxIters = getMaxIters();
    for (int i = 0; i < maxIters; i++) {
      topicModel.trainDocTopicModel(doc, docTopics, docModel);
    }
    // And now the vector docTopics contains your desired document -> topic
    // distribution.  Take the top-K elements of this vector, by magnitude,
    // and you'll have your top-K topics and their probabilities for the
    // current document.
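
The top-K step can be done in plain Java once the loop has finished.  Below
is a minimal sketch, assuming the docTopics values have been copied into a
double[] (for example by reading each entry with Vector.get(i)).  The class
name TopTopics and the method topK are made up for illustration and are not
part of Mahout.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not a Mahout class.
class TopTopics {

    /** Returns the indices of the k largest entries of probs, largest first. */
    static List<Integer> topK(double[] probs, int k) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < probs.length; i++) {
            indices.add(i);
        }
        // Sort topic ids by descending probability.  List.sort is stable,
        // so tied topics keep their original index order.
        indices.sort(Comparator.comparingDouble((Integer i) -> probs[i]).reversed());
        // Copy the leading k ids (or all of them if there are fewer than k topics).
        return new ArrayList<>(indices.subList(0, Math.min(k, indices.size())));
    }

    public static void main(String[] args) {
        // Assumed example values standing in for the inferred docTopics vector.
        double[] docTopics = {0.05, 0.60, 0.10, 0.25};
        for (int topic : topK(docTopics, 2)) {
            System.out.println("topic " + topic + " p=" + docTopics[topic]);
        }
    }
}
```

With the example values above, the two best matches are topics 1 and 3.
Sorting all the ids is fine for a few hundred topics; a bounded heap would
only matter if numTopics were very large.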

> Any help or pointers will be greatly appreciated.  Thank you!


