mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: kmeans not returning k clusters
Date Wed, 09 May 2012 14:49:35 GMT
Paritosh is correct in his analysis. K-means can work itself into a 
situation where there are some empty clusters if the initial cluster 
centers are too closely spaced or if the data really doesn't support k 
clusters. This is because it assigns each vector to the most likely 
(closest) cluster. If two prior clusters are very close together this 
can cause one of them to become empty.
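
A minimal sketch of that effect, assuming plain Euclidean distance and a hypothetical toy data set (neither comes from this thread, and this is not Mahout's code). It runs only the assignment step: every point goes to its closest center, so a seed placed next to a better-positioned one can end up with no points at all.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(points, centers):
    """Count how many points land on each center.

    Each point goes to its closest center; ties go to the lowest index.
    """
    counts = [0] * len(centers)
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: euclidean(p, centers[i]))
        counts[nearest] += 1
    return counts


# Hypothetical toy data: two tight groups of points.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]

# Three seeds (k = 3), two of them closely spaced.
centers = [(0.0, 0.0), (5.0, 5.0), (5.5, 5.5)]

counts = assign(points, centers)
non_empty = sum(1 for c in counts if c > 0)
print(counts)     # [3, 3, 0]
print(non_empty)  # 2
```

With k = 3 requested, only two clusters receive any points, which is the "asked for k, got fewer" symptom on a small scale.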

Have you tried priming k-means with canopy instead of the random sampler?
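
For readers unfamiliar with canopy seeding, the idea can be sketched roughly as below. This is a simplified single-pass version under stated assumptions: Euclidean distance, hypothetical thresholds `t1` and `t2`, and hypothetical toy data. It is not Mahout's canopy implementation. Because any point within `t2` of a seed is removed from the candidate list, later seeds are picked at least `t2` away from earlier ones, which avoids closely spaced starting centers.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(points, t1, t2, distance=euclidean):
    """Return one centroid per canopy (simplified sketch, requires t1 > t2)."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining.pop(0)          # next unclaimed point starts a canopy
        members = [seed]
        still_candidates = []
        for p in remaining:
            d = distance(seed, p)
            if d < t1:                   # loosely close: belongs to this canopy
                members.append(p)
            if d >= t2:                  # not tightly close: may seed a later canopy
                still_candidates.append(p)
        remaining = still_candidates
        dim = len(seed)
        centers.append(tuple(sum(m[i] for m in members) / len(members)
                             for i in range(dim)))
    return centers


# Hypothetical toy data and thresholds, chosen only for illustration.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
seeds = canopy_centers(points, t1=3.0, t2=1.0)
print(len(seeds))  # 2
```

The number of seeds then comes from the data and the thresholds rather than from a fixed k. If I recall the kmeans option help correctly, passing -k makes the driver sample random seeds into the -c path, so -k would be left off when the -c path already holds canopy output.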

On 5/9/12 10:35 AM, Pat Ferrel wrote:
> I suspect you are right, Paritosh. I ran the random seed with kmeans 
> several times on the supplied data set and always got 28 rather than 
> 30 clusters. I don't care so much about the number but it might mean 
> that some clusters are thrown out and without looking you couldn't 
> tell if they were important ones or not. Just upping k to 32 doesn't 
> really work if you still get some thrown out.
>
> At least I think the issue is repeatable with this data.
>
> On 5/9/12 1:14 AM, Paritosh Ranjan wrote:
>> Printouts of Mahout vectors show only the non-zero elements.
>> So, the centers are not empty, rather they are zero.
>>
>> Prima facie, I suspect that you are getting a lot of empty clusters. 
>> This might be occurring due to the combination of distance measure, 
>> convergence threshold and distances between vectors.
>> Can you try to analyze and change/play around with these parameters?
>>
>> I will try to look into how the Random Cluster Initialization is 
>> working. I will log a Jira if I find an issue. However, I think 
>> that there will be no problem in the cluster initialization part.
>>
>> On 09-05-2012 03:21, Danfeng Li wrote:
>>> I got the same issue. What I found is that many of the initial 
>>> centers are empty, and the final number of clusters is decided by 
>>> the number of nonempty centers.
>>>
>>> Here are some examples from my case:
>>>
>>> ...
>>> CL-34358205{n=0 c=[] r=[]}
>>> CL-34358207{n=0 c=[] r=[]}
>>> CL-34358209{n=0 c=[] r=[]}
>>> CL-34358213{n=0 c=[0:1.000] r=[]}
>>> CL-34358215{n=0 c=[] r=[]}
>>> CL-34358216{n=0 c=[] r=[]}
>>> CL-34358217{n=0 c=[] r=[]}
>>> CL-34358220{n=0 c=[] r=[]}
>>> CL-34358221{n=0 c=[] r=[]}
>>> CL-34358222{n=0 c=[] r=[]}
>>> CL-34358223{n=0 c=[] r=[]}
>>> CL-34358224{n=0 c=[] r=[]}
>>> CL-34358227{n=0 c=[0:1.000] r=[]}
>>> CL-34358228{n=0 c=[] r=[]}
>>> CL-34358229{n=0 c=[] r=[]}
>>> ...
>>>
>>> Is it the case that there is a bug in initialization?
>>>
>>> Thanks.
>>> Dan
>>>
>>> -----Original Message-----
>>> From: Pat Ferrel [mailto:pat@occamsmachete.com]
>>> Sent: Tuesday, May 08, 2012 9:13 AM
>>> To: user@mahout.apache.org
>>> Subject: Re: kmeans not returning k clusters
>>>
>>> Here is a sample data set. In this case I asked for 30 and got 28, 
>>> but in other cases the discrepancy has been greater: ask for 200 
>>> and get 38, though that was for a much larger data set.
>>>
>>> Running on my Mac laptop in a single-node pseudo-cluster, Hadoop 
>>> 0.20.205, Mahout 0.6
>>>
>>> command line:
>>>
>>> mahout kmeans \
>>>       -i b2/bixo-vectors/tfidf-vectors/ \
>>>       -c b2/bixo-kmeans-centroids \
>>>       -cl \
>>>       -o b2/bixo-kmeans-clusters \
>>>       -k 30 \
>>>       -ow \
>>>       -cd 0.01 \
>>>       -x 20 \
>>>       -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure
>>>
>>> Find the data here:
>>> http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740
>>>
>>>
>>> BTW when I run rowsimilarity asking for 20 similar docs I get a max of
>>> 20 but sometimes many fewer. Shouldn't this always return the 
>>> requested number? I'll post this question again for the attention 
>>> of the right person.
>>>
>>> On 5/8/12 6:15 AM, Paritosh Ranjan wrote:
>>>> I looked at the 0.6 version's code but was not able to find any 
>>>> reason.
>>>> If possible, can you share the data you are trying to cluster along
>>>> with the execution parameters?
>>>>
>>>> You can also open a Jira for this and provide the info there.
>>>>
>>>> On 07-05-2012 19:45, Pat Ferrel wrote:
>>>>> 0.6
>>>>>
>>>>> I take it this is not expected behavior? I could be doing something
>>>>> stupid. I only look in the "final" directory. Looking in the others
>>>>> with clusterdump shows the same number of clusters and I assumed they
>>>>> were iterations.
>>>>>
>>>>> On 5/7/12 1:21 AM, Paritosh Ranjan wrote:
>>>>>> Which version are you using? 0.6 or the current 0.7-snapshot?
>>>>>>
>>>>>> On 07-05-2012 02:19, Pat Ferrel wrote:
>>>>>>> What would cause kmeans to not return k clusters? As I tweak
>>>>>>> parameters I get different numbers of clusters but it's usually
>>>>>>> less than the k I pass in. Since I am not using canopies at present
>>>>>>> I would expect k to always be honored but the quality of the
>>>>>>> clusters would depend on the convergence amount and number of
>>>>>>> iterations allowed. No?
>>>>>>
>>>>>>
>>>>
>>>>
>>
>>
>>
>
>
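
Paritosh's remark in the quoted thread, that `c=[]` is an all-zero center rather than a missing one, can be illustrated with a small sketch. It assumes a sparse-style formatter that prints only non-zero entries and the textbook Tanimoto distance, 1 - a·b / (|a|² + |b|² - a·b). Both functions and all vectors are written here for illustration and are not Mahout's classes or data. Under that formula an all-zero center is at distance 1 from every non-zero vector, so it would lose to any center that shares a term with the document, which may be why such seeds end up with n=0.

```python
def format_sparse(vec):
    """Render only the non-zero entries as index:value pairs."""
    return "[" + ", ".join(f"{i}:{v:.3f}" for i, v in enumerate(vec) if v != 0) + "]"


def tanimoto_distance(a, b):
    """Textbook Tanimoto distance: 1 - a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    if denom <= 0:
        return 0.0  # both vectors are zero; convention chosen for this sketch
    return 1.0 - dot / denom


# Hypothetical 4-term vectors.
zero_center = [0.0, 0.0, 0.0, 0.0]   # prints as an "empty" center
one_term = [1.0, 0.0, 0.0, 0.0]      # a center with a single non-zero term
real_center = [0.0, 0.6, 0.3, 0.0]   # a center that overlaps the document
doc = [0.0, 0.7, 0.2, 0.0]           # a document vector

print(format_sparse(zero_center))           # []
print(format_sparse(one_term))              # [0:1.000]
print(tanimoto_distance(doc, zero_center))  # 1.0
print(tanimoto_distance(doc, one_term))     # 1.0
print(tanimoto_distance(doc, real_center) < 1.0)  # True
```

In this toy case the document is assigned to `real_center`, and both the zero center and the non-overlapping single-term center receive nothing.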

