mahout-user mailing list archives

From 何一峰 <rchyf0...@gmail.com>
Subject Issue about MAHOUT CVB output
Date Mon, 28 Oct 2013 13:55:51 GMT
Dear friends,
       I am trying to do some text mining with topic models in Mahout.
I ran the LDA example in cluster-reuters.sh and got the output, but I
have trouble understanding the data in the output text, which looks as
follows:
0
 {0.05:0.504213930694442,0.03:0.28336545784823736,0.04:0.06645718081598415,0.046:0.060773760333127425,0.02:0.031139584926057114,0.006913:0.029014261897489655,0.057:0.01547053618634471,0.06:0.003032734756446454,0.055:0.0022445187679753908,0.01:0.0014910071438064025}
1
 {0:0.30350431300481306,0.07:0.3011858883397385,0.01:0.09381126246920836,0.003:0.07946306754428638,0.007050:0.06458652890539684,0.073:0.0584050608750753,0.057:0.03649121277960022,0.02:0.030451076132133135,0.077:0.010152469632734204,0.025:0.0068854519627228675}
2
 {0.06:0.8746019308889609,0.10:0.0518303525782281,0.007050:0.04239632840003365,0.006913:0.020680612837271954,0.003:0.006158596984308525,0.04:0.00187390844569758,0.02:0.0013634569377490335,0.077:3.422856602202767E-4,0.1:2.289123340741643E-4,0.046:1.4685288138721977E-4}

      My understanding is that this is the doc-term distribution, namely
each doc's probability of tending toward the topic words, and that the
number before each colon (such as 0.06, 0.10, 0.007050 in doc 2) is the
index of the original word in the dictionary that was built when invoking
seq2sparse. Is this right? If so, how can I translate the index back into
the original word, so that the output is easier to understand and use
further? If not, could you explain this output data for me? Many thanks!
     By the way, any relevant advice is appreciated!
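
     To make the question concrete, here is a minimal sketch of how I would
read one line of the output, assuming each line has the shape
{key:weight,key:weight,...}. The function name parse_topic_line is my own
and is not part of Mahout. It keeps the part before the colon as a plain
string, because both 0.10 and 0.1 appear in row 2 and I do not want to
merge them by converting to a number.

```python
def parse_topic_line(line):
    """Parse '{key:weight,key:weight,...}' into a list of (key, weight) pairs.

    Assumption: keys contain no commas, and the weight is whatever follows
    the last colon of each item.
    """
    body = line.strip()
    if body.startswith("{") and body.endswith("}"):
        body = body[1:-1]
    pairs = []
    for item in body.split(","):
        if not item:
            continue  # skip empty items, e.g. from an empty "{}" line
        # Split on the last colon so the key stays an opaque string.
        key, _, weight = item.rpartition(":")
        pairs.append((key, float(weight)))
    return pairs


# Three entries copied from row 2 of the output above.
sample = "{0.06:0.8746019308889609,0.10:0.0518303525782281,0.077:3.422856602202767E-4}"
for key, weight in parse_topic_line(sample):
    print(key, weight)
```

What I still do not know is how to interpret the key of each pair once it
has been read this way.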

-- 
From Norlan
