mahout-user mailing list archives

From Henry Lee <honesthe...@gmail.com>
Subject Re: Has anyone implemented "true" L-LDA out of Mahout?
Date Wed, 18 Sep 2013 07:42:12 GMT
It seems that I have to feed LDA both topic-terms and doc-topics.

Can anyone tell me how to build a seed model to begin with?

Given this labeled corpus example,

0 | 1 | 2 | 3 | labels
- | - | - | - | ------
3 | 1 | 0 | 0 | 0
6 | 2 | 0 | 0 | 0
0 | 0 | 3 | 1 | 1
0 | 0 | 6 | 2 | 1
3 | 1 | 3 | 1 | 0, 1
6 | 2 | 6 | 2 | 0, 1

2 labels/topics x 4 terms (0 - 3 in the column headers)
6 documents (0 - 5 in 6 rows)

(doc 0 - 1: label 0, i.e. topic 0
 doc 2 - 3: label 1, i.e. topic 1
 doc 4 - 5: labels 0 and 1, i.e. both topics)

The seed doc-topics should mirror the labeling.

e.g. doc 0: {1.0, 0.0}, doc 2: {0.0, 1.0}, and doc 4: {0.5, 0.5}
-- this is my understanding of the labeled LDA idea.
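
To make the seeding rule above concrete, here is a minimal sketch in plain Python. It assumes each document's seed weight is spread uniformly over its labels, as in the {0.5, 0.5} example. The names `doc_labels` and `seed_doc_topics` are made up for illustration and are not part of Mahout.

```python
# Labels per document, copied from the example corpus (docs 0 - 5).
doc_labels = [[0], [0], [1], [1], [0, 1], [0, 1]]
num_topics = 2

def seed_doc_topics(labels, num_topics):
    """Assumption: uniform weight over the document's labels, zero elsewhere."""
    row = [0.0] * num_topics
    for topic in labels:
        row[topic] = 1.0 / len(labels)
    return row

doc_topics = [seed_doc_topics(labels, num_topics) for labels in doc_labels]

for doc_id, row in enumerate(doc_topics):
    print(doc_id, row)
```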

Should the seed topic-terms look like this?

topic 0: {0: 3+6+3+6, 1: 1+2+1+2, 2: 3+6, 3: 1+2}
topic 1: {0: 3+6, 1: 1+2, 2: 3+6+3+6, 3: 1+2+1+2}
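
The sums for this first option can be checked with a small plain-Python sketch. It assumes every labeled topic receives the full term counts of each of its documents. `corpus` and `topic_terms` are hypothetical names, not Mahout identifiers.

```python
# (term counts, labels) per document, copied from the example corpus.
corpus = [
    ([3, 1, 0, 0], [0]),
    ([6, 2, 0, 0], [0]),
    ([0, 0, 3, 1], [1]),
    ([0, 0, 6, 2], [1]),
    ([3, 1, 3, 1], [0, 1]),
    ([6, 2, 6, 2], [0, 1]),
]
num_topics, num_terms = 2, 4

# Assumption: a document's counts are added in full to each of its labels.
topic_terms = [[0] * num_terms for _ in range(num_topics)]
for counts, labels in corpus:
    for topic in labels:
        for term, count in enumerate(counts):
            topic_terms[topic][term] += count

print(topic_terms)  # [[18, 6, 9, 3], [9, 3, 18, 6]]
```

Note that under this option the tokens of the two-label documents are counted once per label, so the seed totals 72 counts although the corpus holds 48.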

Or should I divide the counts from the multi-label documents by the # of topics?

topic 0: {0: 3+6+(3+6)/2, 1: 1+2+(1+2)/2, 2: (3+6)/2, 3: (1+2)/2 }
topic 1: {0: (3+6)/2, 1: (1+2)/2, 2: 3+6+(3+6)/2, 3: 1+2+(1+2)/2 }
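
The second option can be sketched the same way. This version assumes each document's counts are split evenly across that document's own labels, which equals dividing by the number of topics here because the multi-label documents carry both labels. Again, `corpus` and `topic_terms` are illustrative names only.

```python
# (term counts, labels) per document, copied from the example corpus.
corpus = [
    ([3, 1, 0, 0], [0]),
    ([6, 2, 0, 0], [0]),
    ([0, 0, 3, 1], [1]),
    ([0, 0, 6, 2], [1]),
    ([3, 1, 3, 1], [0, 1]),
    ([6, 2, 6, 2], [0, 1]),
]
num_topics, num_terms = 2, 4

# Assumption: a document's counts are shared equally among its labels.
topic_terms = [[0.0] * num_terms for _ in range(num_topics)]
for counts, labels in corpus:
    share = 1.0 / len(labels)
    for topic in labels:
        for term, count in enumerate(counts):
            topic_terms[topic][term] += count * share

print(topic_terms)  # [[13.5, 4.5, 4.5, 1.5], [4.5, 1.5, 13.5, 4.5]]
```

With this split the seed counts add up to 48, the same as the number of term occurrences in the corpus.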

Any advice will be highly appreciated.

Thanks,
Henry Lee.


On Thu, Sep 5, 2013 at 6:45 PM, Henry Lee <honesthenry@gmail.com> wrote:

> Thanks for your help in advance.
>
> I will have such a good data set within 2 weeks or so.
> I may have a working impl. by the end of next week or so.
>
> Thanks,
> Henry Lee.
>
>
> On Thu, Sep 5, 2013 at 1:50 PM, Jake Mannix <jake.mannix@gmail.com> wrote:
>
>> Nobody's talked to me about it either.
>>
>> I'm happy to review your code when you try this out, however.  Do you
>> have a good data set you're planning on using for training?  Ideally you
>> want a supervised label set in which training data has multiple labels per
>> document.
>>
>>
>> On Thu, Sep 5, 2013 at 9:44 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>
>>> I haven't seen any discussion of this other than what you reference.
>>>
>>>
>>> On Thu, Sep 5, 2013 at 7:59 AM, Henry Lee <honesthenry@gmail.com> wrote:
>>>
>>> > I am about to implement Jake Mannix's suggestion out of Twitter fork.
>>> >
>>> > Has anyone already implemented "true" L-LDA out of Mahout?
>>> >
>>> > http://markmail.org/message/cm2a6rnxblj5azuh
>>> >
>>> > over this fork?
>>> >
>>> > https://github.com/twitter/mahout/blob/master/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0PriorMapper.java
>>> >
>>> > Thanks,
>>> > Henry Lee
>>> >
>>>
>>
>>
>>
>> --
>>
>>   -jake
>>
>
>
