mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Interpretating doc-topic output of cvb
Date Mon, 24 Jun 2013 16:16:21 GMT
What do you get out, and what exactly is your commandline invocation?


On Mon, Jun 24, 2013 at 6:58 AM, Mark Wicks <mawicks@gmail.com> wrote:

> As a slight correction to my earlier post on running cvb from the
> trunk, the NaN values were my mistake.  However, I still haven't had
> any success getting it to write document/topic inferences.
>
> On Sat, Jun 22, 2013 at 7:21 AM, Mark Wicks <mawicks@gmail.com> wrote:
> > I tried with cvb from trunk and ran into several problems:
> >
> > 1) The topic/term distributions were all NaN.
> > 2) The initial perplexity was NaN.
> > 3) It never wrote the document/topic inferences.
> > 4) It exited with an exception stating that the topic/term
> > distribution output directory already exists, after successfully
> > creating it and writing to it.  It did not exist before running cvb.
> >
> >
> > On Thu, Jun 20, 2013 at 10:18 PM, Jake Mannix <jake.mannix@gmail.com> wrote:
> >> There was a bug in Mahout 0.7 regarding the doc/topic outputs,
> >> can you try your little test on trunk, and see if you get a more
> >> sensible / interpretable result?
> >>
> >>
> >> On Thu, Jun 20, 2013 at 10:17 AM, Mark Wicks <mawicks@gmail.com> wrote:
> >>
> >>> I apologize for posting this again.  I sent it during the weekend and
> >>> didn't get any response (which seems unusual for this list :)).
> >>> I am hoping that someone with some LDA/cvb experience who can help
> >>> might have missed it over the weekend.
> >>> Can someone tell me (1) if the document-topic distribution below makes
> >>> sense for the term frequencies shown and (2) how I should interpret
> >>> it.
> >>>
> >>> Mark Wicks
> >>>
> >>> On Sat, Jun 15, 2013 at 9:22 AM, Mark Wicks <mawicks@gmail.com> wrote:
> >>> > I am having trouble interpreting the "doc-topic" distribution produced
> >>> > by the cvb implementation of LDA in Mahout 0.7. Here's the
> >>> > term-frequency matrix for a simple test case (shown here as the output
> >>> > of mahout seqdumper):
> >>> >
> >>> > Key: /d01: Value: /d01:{0:30.0,1:10.0}
> >>> > Key: /d02: Value: /d02:{0:60.0,1:20.0}
> >>> > Key: /d03: Value: /d03:{0:30.0,1:10.0}
> >>> > Key: /d04: Value: /d04:{0:60.0,1:20.0}
> >>> > Key: /x01: Value: /x01:{2:30.0,3:10.0}
> >>> > Key: /x02: Value: /x02:{2:60.0,3:20.0}
> >>> > Key: /x03: Value: /x03:{2:30.0,3:10.0}
> >>> > Count: 7
> >>> >
> >>> > The intent here was that the d01 through d04 documents would consist
> >>> > almost entirely of one topic represented almost entirely by terms 0 and 1
> >>> > with a topic-term distribution of [0.75, 0.25, epsilon, epsilon] and that
> >>> > the x01 through x03 documents would consist almost entirely of a second
> >>> > topic represented almost entirely by terms 2 and 3 with a topic-term
> >>> > distribution of [epsilon, epsilon, 0.75, 0.25]. Since the "d" documents
> >>> > do not contain terms 2 or 3 and the "x" documents do not contain
> >>> > terms 0 or 1, I expected to see document topic distributions that were
> >>> > approximately equal to
> >>> >
> >>> > d01: 1 0
> >>> > d02: 1 0
> >>> > d03: 1 0
> >>> > d04: 1 0
> >>> > x01: 0 1
> >>> > x02: 0 1
> >>> > x03: 0 1
> >>> >
> >>> > I ran the following command (where the simplelda/sparse/matrix directory
> >>> > contained the previous term frequency matrix). The algorithm ran to
> >>> > completion (meaning that it converged before the maximum number of
> >>> > iterations was exceeded).
> >>> >
> >>> > mahout cvb \
> >>> >    -i simplelda/sparse/matrix \
> >>> >    -dict simplelda/sparse/dictionary.file-0 \
> >>> >    -ow -o simplelda/cvb-topics \
> >>> >    -dt simplelda/cvb-classifications \
> >>> >    -tf 0.25 \
> >>> >    -block 4 \
> >>> >    -x 20 \
> >>> >    -cd 1e-10 \
> >>> >    -k 2 \
> >>> >    --tempDir simplelda/temp-k2 \
> >>> >    -seed 6956
> >>> >
> >>> > The topic-term frequencies written to simplelda/cvb-topics were
> >>> > accurate and as expected:
> >>> >
> >>> > {0:0.7499999999895863,1:0.2499999999548601,2:2.7776873636508568E-11,3:2.777682733874987E-11}
> >>> >
> >>> > {0:9.375466996550278E-11,1:9.375456577819702E-11,2:0.7499999998802006,3:0.24999999993229008}
> >>> >
> >>> > However, the document-topic distribution output written to
> >>> > simplelda/cvb-classifications was not at all what I expected:
> >>> >
> >>> > Key: 0: Value: {0:0.05705773500297721,1:0.9429422649970228}
> >>> > Key: 1: Value: {0:0.05705773500297721,1:0.9429422649970228}
> >>> > Key: 2: Value: {0:0.05705773500297721,1:0.9429422649970228}
> >>> > Key: 3: Value: {0:0.05705773500297721,1:0.9429422649970228}
> >>> > Key: 4: Value: {0:0.4335650246424872,1:0.5664349753575127}
> >>> > Key: 5: Value: {0:0.4335650246424872,1:0.5664349753575127}
> >>> > Key: 6: Value: {0:0.4335650246424872,1:0.5664349753575127}
> >>> > Count: 7
> >>> >
> >>> > These are called "doc-topic distributions" in the help output, so I
> >>> > interpreted this to mean that the estimator concluded the "d" document
> >>> > terms were most likely all drawn from the second topic. But the "d"
> >>> > documents contain no terms from the second topic! Likewise, the "x"
> >>> > documents contain no terms from the first topic, so why is there a
> >>> > relatively large value (0.4335) in the first column? If this
> >>> > document-topic distribution produced by cvb is correct, what does it
> >>> > represent?
> >>>
> >>
> >>
> >>
> >> --
> >>
> >>   -jake
>



-- 

  -jake
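
The expectation stated in the quoted original post (the "d" documents
almost entirely one topic, the "x" documents almost entirely the other)
can be checked independently of Mahout. The following is only a minimal
sketch, not Mahout's cvb code: it holds the reported topic-term
distributions fixed (rounded here) and estimates each document's topic
weights by plain maximum-likelihood EM. It assumes no Dirichlet smoothing
on the document-topic weights, which real LDA inference would include, and
the function name infer_doc_topics is hypothetical.

```python
# Topic-term distributions as reported in the quoted post (rounded).
topics = [
    {0: 0.75, 1: 0.25, 2: 2.78e-11, 3: 2.78e-11},
    {0: 9.38e-11, 1: 9.38e-11, 2: 0.75, 3: 0.25},
]

# Term-frequency rows from the seqdumper output in the quoted post.
docs = {
    "/d01": {0: 30.0, 1: 10.0},
    "/d02": {0: 60.0, 1: 20.0},
    "/d03": {0: 30.0, 1: 10.0},
    "/d04": {0: 60.0, 1: 20.0},
    "/x01": {2: 30.0, 3: 10.0},
    "/x02": {2: 60.0, 3: 20.0},
    "/x03": {2: 30.0, 3: 10.0},
}


def infer_doc_topics(counts, topics, iterations=50):
    """Estimate one document's topic weights with the topics held fixed.

    Plain EM for mixture weights (an assumption made for illustration;
    this is not how Mahout's cvb computes its doc-topic output).
    """
    k = len(topics)
    theta = [1.0 / k] * k          # start from a uniform mixture
    total = sum(counts.values())   # number of tokens in the document
    for _ in range(iterations):
        expected = [0.0] * k
        for term, n in counts.items():
            # E-step: responsibility of each topic for this term.
            weights = [theta[t] * topics[t][term] for t in range(k)]
            norm = sum(weights)
            for t in range(k):
                expected[t] += n * weights[t] / norm
        # M-step: renormalize expected token counts into weights.
        theta = [v / total for v in expected]
    return theta


if __name__ == "__main__":
    for name, counts in docs.items():
        theta = infer_doc_topics(counts, topics)
        print(name, [round(p, 4) for p in theta])
```

Under these assumptions the "d" rows put nearly all their weight on the
first topic and the "x" rows on the second, which matches what the
original post expected rather than the doc-topic output it reported.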
