mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: questions on the results of running lda and ldatopics, thanks
Date Fri, 01 Jul 2011 16:11:27 GMT
On Fri, Jul 1, 2011 at 6:42 AM, wine lover <winecoding@gmail.com> wrote:

> Yes, Jake, you are right. I also noticed the existence of "docTopics", which
> is a folder. I do not know how to view it or transfer its included files
> into readable format. It seems to me that the command of ldatopics does not
> do anything on "docTopics". Any suggestion will be highly appreciated.
>

It's a regular SequenceFile. The keys are whatever the keys of the input
corpus rows are, and the values are VectorWritable, with entries of the form
{topic, p(topic | document)}.
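
For the clustering use case raised earlier in this thread, one simple option is
to assign each document to its most probable topic. Here is a minimal sketch in
plain Java, assuming the per-document vector has already been read into a
double[] indexed by topic. The class and method names are hypothetical and are
not part of Mahout.

```java
public class DocTopicAssignment {

    /**
     * Returns the index of the largest entry, i.e. the topic that
     * maximizes p(topic | document) for one document.
     */
    static int mostLikelyTopic(double[] topicProbs) {
        if (topicProbs.length == 0) {
            throw new IllegalArgumentException("empty topic vector");
        }
        int best = 0;
        for (int t = 1; t < topicProbs.length; t++) {
            if (topicProbs[t] > topicProbs[best]) {
                best = t;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up probabilities for one document over three topics.
        double[] doc = {0.1, 0.7, 0.2};
        System.out.println("cluster = topic_" + mostLikelyTopic(doc));
    }
}
```

This is hard assignment only. If you want soft clusters, keep the whole vector
instead of taking the maximum.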

Try using the "vectordump" utility to look at this sequence file.  If you
want to see which *terms* are considered representative for each document,
you'll need to write your own (simple) map-reduce job to join the dictionary
with the model (the way ldatopics does) and join *that* with the docTopics
output.
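
To make the join concrete, here is a rough in-memory sketch in plain Java. It
is not the map-reduce job itself and uses no Mahout classes. It assumes the
three inputs have already been loaded into arrays, and it scores each term by
sum over topics of p(topic | document) * p(term | topic), which is one
reasonable choice rather than something the current code does. All names are
hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DocTermJoin {

    /**
     * docTopics[t]     = p(topic t | document)     (from docTopics)
     * topicTerms[t][w] = p(term w | topic t)       (from the model)
     * dictionary[w]    = the string for term id w  (from the dictionary)
     * Returns the n highest-scoring terms for this document.
     */
    static List<String> topTerms(double[] docTopics, double[][] topicTerms,
                                 String[] dictionary, int n) {
        // Join model with docTopics: mix the topic-term rows by topic weight.
        double[] score = new double[dictionary.length];
        for (int t = 0; t < docTopics.length; t++) {
            for (int w = 0; w < dictionary.length; w++) {
                score[w] += docTopics[t] * topicTerms[t][w];
            }
        }
        // Sort term ids by descending score.
        Integer[] ids = new Integer[dictionary.length];
        for (int w = 0; w < ids.length; w++) {
            ids[w] = w;
        }
        Arrays.sort(ids, (a, b) -> Double.compare(score[b], score[a]));
        // Join with the dictionary: map the top ids back to term strings.
        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(n, ids.length); i++) {
            out.add(dictionary[ids[i]]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up numbers, for illustration only.
        String[] dictionary = {"model", "tissues", "wine"};
        double[][] topicTerms = {{0.6, 0.3, 0.1}, {0.1, 0.2, 0.7}};
        double[] docTopics = {0.9, 0.1};
        System.out.println(topTerms(docTopics, topicTerms, dictionary, 2));
    }
}
```

In a real job the dictionary and model would be small enough to load on each
mapper, with the docTopics rows streamed through as the map input.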

That would be a nice contribution to the process, if you could do it.

  -jake


>
> On Fri, Jul 1, 2011 at 1:04 AM, Jake Mannix <jake.mannix@gmail.com> wrote:
>
> > On Thu, Jun 30, 2011 at 12:02 PM, wine lover <winecoding@gmail.com> wrote:
> >
> > > Thanks, Hector, you are right, the exact meaning of topic_i is not
> > > necessary for unsupervised clustering.
> > >
> > > However, in order to cluster a set of documents, I still need to know the
> > > probabilistic relationship between topic and each document. I am not very
> > > clear how to get this kind of information from the generated result.
> > >
> > > For instance, model [p(model|topic_0) = 0.010358664102351409  Here, model
> > > is a word, but the result does not tell me anything between this word and
> > > a given document? Thanks.
> > >
> >
> > The current release of Mahout does produce the p(topic | document)
> > probabilities. They get emitted after the final iteration, and are in a
> > sequence file in the same directory as the model outputs.  I think it's
> > called "docTopics" or something like that?
> >
> >  -jake
> >
> >
> > >
> > > On Thu, Jun 30, 2011 at 2:08 PM, wine lover <winecoding@gmail.com> wrote:
> > >
> > > > Hello Everyone,
> > > >
> > > > I have two questions on the LDA analysis.
> > > >
> > > > After running the command of lda, under the generated directory of
> > > > "testdata-lda", there have several folders: docTopics  state-0  state-1
> > > > ....
> > > >
> > > > It seems to me that those folders of "state-x" will be transferred into
> > > > readable format after running "ldatopics". But what does the folder of
> > > > "docTopics" stand for? How can I view it?
> > > >
> > > > Running the command of ldatopics generates 20 files, (topic_0, topic_1,
> > > > etc), in total. For instance, in the file of topic_0, I get information
> > > > such as follows:
> > > > model [p(model|topic_0) = 0.010358664102351409
> > > > tissues [p(tissues|topic_0) = 0.008870984984037485
> > > >
> > > > How can I tell what does topic_0 stand for? Where to find this kind of
> > > > information?  Moreover, is there any other procedures existed to generate
> > > > the clustering result based on these topic_x files.
> > > >
> > > >
> > > > Thank you very much for the help.
> > > >
> > > > Wenyia
> > > >
> > >
> >
>
