They are using PLSI, which we already tried to implement in
https://issues.apache.org/jira/browse/MAHOUT-106. We didn't get it
to scale. As far as I remember the paper, they use a nasty trick
when sending data to the reducers in a certain step, so that they only
have to load a certain portion of the data into memory. I'm not sure
this can be replicated in Hadoop (would love to be proven wrong, though).
They are also using LSH to cluster users by Jaccard coefficient. Don't
we already have code for this in org.apache.mahout.clustering.minhash?
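
To make the LSH part concrete, here is a minimal sketch of min-hash
clustering of users by their item sets. It is not the Mahout code and
not the paper's implementation: the function names, the MD5-based hash
family and the default of two hash functions are all assumptions made
for illustration. Two users share a single min-hash value with
probability equal to the Jaccard coefficient of their item sets, so
users that land in the same bucket are likely to be similar.

```python
import hashlib
from collections import defaultdict


def seeded_hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed (illustrative hash family)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=2):
    """For each seeded hash function, keep the minimum hash over the user's items."""
    return tuple(
        min(seeded_hash(seed, item) for item in items)
        for seed in range(num_hashes)
    )


def jaccard(a, b):
    """Exact Jaccard coefficient |A & B| / |A | B|, for comparison with the buckets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_users(user_items, num_hashes=2):
    """Group users whose min-hash signatures agree on every hash function."""
    clusters = defaultdict(list)
    for user, items in user_items.items():
        if items:  # users without items have no signature and are skipped
            clusters[minhash_signature(items, num_hashes)].append(user)
    return dict(clusters)


if __name__ == "__main__":
    users = {
        "u1": {"a", "b", "c"},
        "u2": {"a", "b", "c"},
        "u3": {"x", "y"},
    }
    for signature, members in cluster_users(users).items():
        print(signature, sorted(members))
```

More hash functions per signature make the buckets stricter (fewer
false positives, more missed neighbours); a real job would typically
repeat the bucketing with several independent signatures.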
--sebastian
On 13.04.2011 10:49, Sean Owen wrote:
> One of the three approaches that they combine is latent semantic indexing --
> that is what I was referring to.
>
> On Wed, Apr 13, 2011 at 8:33 AM, Ted Dunning<ted.dunning@gmail.com> wrote:
>
>> Sean,
>>
>> Do you mean LSI (latent semantic indexing)? Or LSH (locality sensitive
>> hashing)?
>>
>> (are you a victim of aggressive error correction?)
>>
>> (or am I the victim of too little?)
>>
>>