mahout-user mailing list archives

From Charly Lizarralde <charly.lizarra...@gmail.com>
Subject Re: lda + vector dump
Date Fri, 23 Aug 2013 15:45:15 GMT
I think I am doing it on the cvb output (1 record per topic), so the
dictionary is used to output each topic's most relevant terms... but I'll
check!
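
A minimal sketch of the distinction discussed in the quoted message below,
using made-up data rather than Mahout's actual output format. The names
`dictionary`, `topic_term`, `doc_topic`, and `top_terms` are hypothetical and
no Mahout API is called. It assumes the topic output is keyed by term index
and the per-document output is keyed by topic id:

```python
# Hypothetical data: nothing here is read from real Mahout output.
# Term index -> word, as a dictionary file would provide.
dictionary = {0: "apple", 1: "banana", 2: "cluster", 3: "hadoop"}

# Topic output: one record per topic, mapping term index -> weight.
topic_term = {
    0: {0: 0.05, 1: 0.02, 2: 0.50, 3: 0.43},
    1: {0: 0.60, 1: 0.30, 2: 0.06, 3: 0.04},
}


def top_terms(term_weights, dictionary, n=2):
    """Return the n highest-weighted terms, translated through the dictionary."""
    ranked = sorted(term_weights.items(), key=lambda kv: kv[1], reverse=True)
    return [(dictionary[index], weight) for index, weight in ranked[:n]]


# Here the keys really are term indices, so the dictionary lookup is meaningful.
top_by_topic = {topic: top_terms(weights, dictionary)
                for topic, weights in topic_term.items()}

# Per-document output: keys are topic ids, NOT term indices.
doc_topic = {"doc-1": {0: 0.9, 1: 0.1}}

# Pushing topic ids through the term dictionary relabels topic 0 as "apple"
# and topic 1 as "banana", which is the nonsensical result described below.
mislabeled = {dictionary[key]: weight
              for key, weight in doc_topic["doc-1"].items()}

print(top_by_topic)
print(mislabeled)
```

The lookup itself is identical in both cases; only the meaning of the keys
differs, which is why the dictionary belongs with the topic output only.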


On Fri, Aug 23, 2013 at 12:37 PM, Liz Merkhofer <
lmerkhofer@bericotechnologies.com> wrote:

> Hi Charly,
>
> I've been playing around with cvb, too. I have a few thoughts on b,
> vectordump:
>
> What are you doing vectordump on? If you're doing it on your cvb output,
> you're getting something like a dictionary per topic, with
> <input-word-key>:<probability-it's-in-this-cluster>. If you're doing it on
> cvb-topics output, for each document, you're getting the likelihood that it
> belongs to each of your topics.
>
> I wonder if your problem is that you read the same book I did, "Hadoop
> MapReduce Cookbook," which advised using vectordump with the dictionary
> flag set to your dictionary from s2s. Don't do that - it translates your
> document or topic keys as if they were your vocab keys, and the result is
> completely nonsensical.
>
> Best,
> Liz Merkhofer
>
>
>
> On Fri, Aug 23, 2013 at 11:18 AM, Charly Lizarralde <
> charly.lizarralde@gmail.com> wrote:
>
> > Hi everyone, I am experimenting with the cvb algorithm and I have a few
> > questions...
> >
> > a) Is there any updated documentation? I have been collecting info from
> > mailing lists, blogs, etc. I have been writing a small beginner's
> > tutorial; if you like, I'll send it.
> >
> > b) Should I remove "stop-words" before building the feature vectors? I am
> > having some trouble "reading" the results...
> >
> > c) Vectordump is not sorting well... is this a reported bug? (I am
> > building mahout from trunk now.)
> >
> > d) Any considerations on performance? It took 10 hours on a 5 node
> > cluster and I've set 20 iterations on fewer than 10,000 docs and it took
> >
> > Thanks!
> > Charly
> >
>
