mahout-user mailing list archives

From Neal Richter <nrich...@gmail.com>
Subject Re: LDA in Mahout
Date Thu, 06 Jan 2011 21:33:17 GMT
Yes, that is a good way to normalize and differentiate/interpret the dot
product or a simple intersection-set count.

Have you looked at transductive learning (an approach within
semi-supervised learning)?

IMO it would be very interesting to see to what degree a bit of
human-labeled data would improve LDA topic extraction.

Essentially one can take a large body of unlabeled documents and augment
with a smaller set of labeled documents. Under certain conditions the
addition can greatly boost the accuracy of the assigned labels/topics.
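
As a rough illustration of the idea (not the algorithm from the review
linked below), here is a minimal label-propagation sketch in plain Python.
It assumes each document is already represented by a non-negative vector,
for example its LDA topic proportions. The names `cosine` and
`propagate_labels` and the toy data are made up for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def propagate_labels(vectors, seed_labels, n_iter=20):
    """Spread labels from a small labeled subset to all documents.

    vectors:     {doc_id: non-negative vector} for ALL documents
    seed_labels: {doc_id: label} for the human-labeled subset
    Returns {doc_id: label} for every document.
    """
    labels = sorted(set(seed_labels.values()))
    # Every document holds one score per label; seeds start one-hot.
    scores = {
        d: {l: 1.0 if seed_labels.get(d) == l else 0.0 for l in labels}
        for d in vectors
    }
    for _ in range(n_iter):
        updated = {}
        for d, vec in vectors.items():
            if d in seed_labels:
                updated[d] = scores[d]  # clamp human labels
                continue
            acc = {l: 0.0 for l in labels}
            for other, other_vec in vectors.items():
                if other == d:
                    continue
                w = cosine(vec, other_vec)  # similarity-weighted vote
                for l in labels:
                    acc[l] += w * scores[other][l]
            total = sum(acc.values())
            updated[d] = {l: (acc[l] / total if total else 0.0) for l in labels}
        scores = updated
    return {d: max(labels, key=lambda l: scores[d][l]) for d in vectors}

# Toy data (hypothetical): two labeled and two unlabeled documents.
vectors = {"a": [0.9, 0.1], "b": [0.8, 0.2], "c": [0.1, 0.9], "d": [0.2, 0.8]}
seeds = {"a": "sports", "c": "politics"}
result = propagate_labels(vectors, seeds)
print(result)
```

The labeled documents keep their human labels on every pass, and each
unlabeled document takes a similarity-weighted vote from all the others.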

See the algorithm pseudocode here:
http://aicoder.blogspot.com/2010/10/review-of-to-rank-with-partially.html

-Neal

On Thu, Jan 6, 2011 at 2:20 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> The only reasonable quick and dirty test of this sort is to look at the
> terms most related to each topic and heuristically assign human tags.
>
> Unless...
>
> Perhaps I misunderstood you in the first place.  If you include tags in the
> LDA training, then you can look at distance (aka 1 - dot product) in LDA
> space between tags as a predictor of how often the tags co-occur.
> Alternatively, you can look at the dot product between test documents and
> the tags that are on the test document.  Then you can define AUC as the
> probability that tags that are actually present have a higher dot product
> than randomly selected tags.  Higher AUC is good.
>
> On Thu, Jan 6, 2011 at 1:03 PM, Neal Richter <nrichter@gmail.com> wrote:
>
> > I did not intend to propose a theoretically sound way to test LDA as an
> > extractor/labeler of human tags.  The intent was a simple suggestion
> > towards doing a quick-n-dirty test to see what the overlap is between
> > LDA-extracted topics and human tags on a well-tagged document set.
> >
>
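
For reference, the AUC test Ted describes in the quoted message above can be
sketched roughly as follows. This is only a sketch under assumptions: it
supposes that tags were included in LDA training so that every tag and every
test document has a vector in LDA space, and it enumerates all
present/absent tag pairs per document instead of sampling random tags. The
function name `tag_auc` and the toy vectors are made up for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def tag_auc(doc_vecs, doc_tags, tag_vecs):
    """Probability that a tag actually on a document scores a higher dot
    product with that document than a tag that is not on it.
    Ties count as half a win.

    doc_vecs: {doc_id: vector in LDA space}
    doc_tags: {doc_id: set of tags present on the document}
    tag_vecs: {tag: vector in LDA space}
    """
    wins = 0.0
    pairs = 0
    for doc_id, doc_vec in doc_vecs.items():
        present = doc_tags.get(doc_id, set())
        scores = {tag: dot(doc_vec, vec) for tag, vec in tag_vecs.items()}
        pos = [s for t, s in scores.items() if t in present]
        neg = [s for t, s in scores.items() if t not in present]
        for p in pos:
            for n in neg:
                if p > n:
                    wins += 1.0
                elif p == n:
                    wins += 0.5
                pairs += 1
    if pairs == 0:
        raise ValueError("need at least one present/absent tag pair")
    return wins / pairs

# Toy data (hypothetical): two topics, three tags, two test documents.
tag_vecs = {"sports": [0.9, 0.1], "politics": [0.1, 0.9], "weather": [0.5, 0.5]}
doc_vecs = {"d1": [0.8, 0.2], "d2": [0.3, 0.7]}
doc_tags = {"d1": {"sports"}, "d2": {"politics"}}
print(tag_auc(doc_vecs, doc_tags, tag_vecs))  # 1.0 on this toy data
```

A value near 0.5 would mean the LDA space says nothing about which tags
belong on a document. Values near 1.0 mean present tags reliably outscore
absent ones.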
