mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: canopy cluster size
Date Wed, 14 Mar 2012 13:52:02 GMT
YW, you might also try Dirichlet with a 
DistanceMeasureClusterDistribution on a CosineDistanceMeasure. See 
DirichletClusterer or the wiki for an explanation of why this might also 
be an attractive approach. With enough initial models (maybe -k 50 or 
100 in your case) it is essentially non-parametric. You can also use k 
reducers with Dirichlet (also k-means, btw) to improve scalability. See 
TestL1ModelClustering for an example of this approach.
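
For concreteness, a hypothetical invocation along these lines (the -md flag, the model-distribution package path, and the output path are assumptions made by analogy with the other drivers, not verified options; check mahout dirichlet --help on your version before running):

mahout dirichlet -i /mahout/sparse/test/tfidf-vectors \
  -o /mahout/dirichlet-clusters/test -k 50 -x 10 \
  -md org.apache.mahout.clustering.dirichlet.models.DistanceMeasureClusterDistribution \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure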

On 3/14/12 7:30 AM, Baoqiang Cao wrote:
> Appreciate!
>
> It helped a lot in clarifying canopy for me. After all these adventures,
> I guess kmeans is the inevitable solution for my problem. Ironically,
> I went to canopy in the hope of getting better results out of kmeans.
>
> Thanks again.
>
> Baoqiang
>
>
> On Tue, Mar 13, 2012 at 5:01 PM, Jeff Eastman
> <jdog@windwardsolutions.com>  wrote:
>> No, Canopy only uses a single reducer, so what's happening is many mappers
>> are munching your data in parallel and then the poor little reducer has to
>> combine them all. It is slow going and a problem with Canopy that I don't
>> know how to fix. It is complicated by the fact that all the canopy centers
>> become very dense vectors in this process, consuming memory and cpu. You
>> might play with the t3 and t4 parameters, which set different T1/T2 values
>> for the reduce step. That could improve reducer performance.
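
To make the bottleneck concrete, here is a simplified in-memory sketch of the canopy pass (plain Java, an illustration of the algorithm rather than Mahout's actual MapReduce code): every point within T1 of a center is folded into that center's running mean, which is why centers over sparse text vectors densify, and the lone reducer has to repeat this loop over every canopy the mappers emit.

import java.util.ArrayList;
import java.util.List;

// Simplified single-pass canopy sketch (illustration only, not Mahout's code).
// dist() stands in for the configured DistanceMeasure.
class CanopySketch {

  static class Canopy {
    final double[] sum;           // running sum of every observed point
    long n;
    Canopy(double[] p) { sum = p.clone(); n = 1; }
    void observe(double[] p) {    // union of sparse points -> increasingly dense center
      for (int i = 0; i < p.length; i++) sum[i] += p[i];
      n++;
    }
    double[] center() {
      double[] c = new double[sum.length];
      for (int i = 0; i < sum.length; i++) c[i] = sum[i] / n;
      return c;
    }
  }

  static List<Canopy> run(List<double[]> points, double t1, double t2) {
    List<Canopy> canopies = new ArrayList<>();
    for (double[] p : points) {
      boolean stronglyBound = false;
      for (Canopy c : canopies) {
        double d = dist(c.center(), p);
        if (d < t1) c.observe(p);          // p contributes to this center
        if (d < t2) stronglyBound = true;  // p may not seed a new canopy
      }
      if (!stronglyBound) canopies.add(new Canopy(p));
    }
    return canopies;  // the single reducer reruns this over all mapper canopies
  }

  static double dist(double[] a, double[] b) {  // Euclidean, for the sketch
    double s = 0;
    for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
    return Math.sqrt(s);
  }
}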
>>
>> Suggest you try k-means. With it you can specify the number of clusters you
>> want and use that many reducers to improve scalability.
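
A sketch of that, in the same style as the canopy command further down (the output paths are placeholders; with -k given, k random initial clusters are sampled and written to the -c path, and the Hadoop generic option sets the reducer count; verify the flags with mahout kmeans --help on your version):

mahout kmeans -Dmapred.reduce.tasks=50 \
  -i /mahout/sparse/test/tfidf-vectors \
  -c /mahout/kmeans/initial-clusters -o /mahout/kmeans/test \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -k 50 -x 10 -cl -ow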
>>
>>
>>
>> On 3/13/12 2:51 PM, Baoqiang Cao wrote:
>>> Thanks Jeff!
>>>
>>> After posting the email, I did try CosineDistance; the problem is that
>>> the reducer part takes too long, it almost stalls. The T2 values I tried
>>> with Cosine are 0.8, 0.5, 0.2, 0.1, 0.08, 0.0008. In every case, the
>>> reducer quickly passed 67%, then progressed very, very slowly; for
>>> example, it took several minutes to finish 1% more.
>>>
>>> Is there something wrong with my data?
>>>
>>> Best
>>> Baoqiang
>>>
>>>
>>> On Tue, Mar 13, 2012 at 3:08 PM, Jeff Eastman
>>> <jdog@windwardsolutions.com>    wrote:
>>>> EuclideanDistance is not a great choice for document clustering, especially
>>>> with a lot of terms. Suggest you try CosineDistance, which will give you all
>>>> distances between 0 and 1. If you still end up with only one canopy it is
>>>> because T2 is too large. T1 has no effect upon the number of canopies
>>>> produced. Once you make T2 small enough you should see more canopies.
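
The 0-to-1 bound follows directly from the definition; a minimal sketch in plain Java (equivalent in spirit to Mahout's CosineDistanceMeasure, not its actual source): tf-idf components are non-negative, so the dot product and thus the similarity lie in [0, 1], and the distance does too.

// Cosine distance: 1 - (a.b / (|a| * |b|)). With non-negative tf-idf
// vectors the similarity is in [0, 1], so the distance is too.
class CosineSketch {
  static double cosineDistance(double[] a, double[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na  += a[i] * a[i];
      nb  += b[i] * b[i];
    }
    if (na == 0 || nb == 0) return 1.0;  // treat an all-zero vector as maximally far
    return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
  }
}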
>>>>
>>>> You might also try k-means, sampling maybe k=50 initial clusters from
>>>> your
>>>> dataset. Then you can tune k to see how that affects your clusters.
>>>>
>>>>
>>>>
>>>>
>>>> On 3/13/12 12:44 PM, Baoqiang Cao wrote:
>>>>> Hi,
>>>>>
>>>>> I'm trying to use canopy clustering on about 2 million documents. What I
>>>>> did is:
>>>>>
>>>>> mahout seq2sparse -i /mahout/input/chris_ce_20120120 -o
>>>>> /mahout/sparse/test -wt tfidf -nr 100 -x 50 -md 34000 -n 2 -s 5
>>>>>
>>>>> And canopy clustering:
>>>>>
>>>>> mahout canopy -i /mahout/sparse/test/tfidf-vectors -o
>>>>> /mahout/canopy-clusters/test -dm
>>>>> org.apache.mahout.common.distance.EuclideanDistanceMeasure -t1 20 -t2
>>>>> 1.5 -ow -cl
>>>>>
>>>>> at last:
>>>>>
>>>>> mahout clusterdump -s /mahout/canopy-clusters/test/clusters-0-final
>>>>> -dt sequencefile  -o foo
>>>>>
>>>>> In "foo", there is only one line staring with "C-0{n=100 c=[",
>>>>> regardless t1 and t2 values I used.
>>>>>
>>>>> I tried "-t1 2000 -t2 1500", ..., "-t1 2 -t2 1.5". Always, there is one
>>>>> line in the final output from clusterdump. I'm not expecting a single
>>>>> cluster; can anyone help me find out why I got only one cluster?
>>>>>
>>>>> Thanks.
>>>>> Baoqiang
>>>>>
>>>>>
>

