mahout-user mailing list archives

From Danfeng Li <...@operasolutions.com>
Subject RE: kmeans not returning k clusters
Date Tue, 08 May 2012 21:51:17 GMT
I ran into the same issue. What I found is that many of the initial centers are empty, and the final
number of clusters is determined by the number of non-empty centers.

Here are some examples from my case:

...
CL-34358205{n=0 c=[] r=[]}
CL-34358207{n=0 c=[] r=[]}
CL-34358209{n=0 c=[] r=[]}
CL-34358213{n=0 c=[0:1.000] r=[]}
CL-34358215{n=0 c=[] r=[]}
CL-34358216{n=0 c=[] r=[]}
CL-34358217{n=0 c=[] r=[]}
CL-34358220{n=0 c=[] r=[]}
CL-34358221{n=0 c=[] r=[]}
CL-34358222{n=0 c=[] r=[]}
CL-34358223{n=0 c=[] r=[]}
CL-34358224{n=0 c=[] r=[]}
CL-34358227{n=0 c=[0:1.000] r=[]}
CL-34358228{n=0 c=[] r=[]}
CL-34358229{n=0 c=[] r=[]}
...
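
To tally how many centers in a dump like the one above are empty, something along these lines could work. This is only a minimal sketch: the function name count_centers and the regular expression are my own, and it assumes every cluster line has the same shape as the lines shown above (an ID, then n=, c=[...] and r=[...]).

```python
import re

# Matches lines shaped like: CL-34358213{n=0 c=[0:1.000] r=[]}
# (pattern is an assumption based only on the sample lines in this message)
CLUSTER_LINE = re.compile(r"^(?P<id>[A-Z]+-\d+)\{n=(?P<n>\d+) c=\[(?P<c>[^\]]*)\]")


def count_centers(lines):
    """Return (total, empty) over cluster lines; 'empty' means c=[]."""
    total = 0
    empty = 0
    for line in lines:
        match = CLUSTER_LINE.match(line.strip())
        if match is None:
            continue  # skip "..." and anything that is not a cluster line
        total += 1
        if not match.group("c").strip():
            empty += 1
    return total, empty


sample = [
    "CL-34358209{n=0 c=[] r=[]}",
    "CL-34358213{n=0 c=[0:1.000] r=[]}",
    "CL-34358215{n=0 c=[] r=[]}",
]
total, empty = count_centers(sample)
print(f"{total} centers, {empty} empty, {total - empty} non-empty")
```

Comparing the non-empty count against the number of clusters in the final output would show whether the two really match.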

Could this be a bug in the initialization?

Thanks.
Dan

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com] 
Sent: Tuesday, May 08, 2012 9:13 AM
To: user@mahout.apache.org
Subject: Re: kmeans not returning k clusters

Here is a sample data set. In this case I asked for 30 clusters and got 28, but in other cases the
discrepancy has been greater: for a much larger data set I asked for 200 and got 38.

Running on my Mac laptop in a single-node pseudo-cluster: Hadoop 0.20.205, Mahout 0.6.

command line:

mahout kmeans \
     -i b2/bixo-vectors/tfidf-vectors/ \
     -c b2/bixo-kmeans-centroids \
     -cl \
     -o b2/bixo-kmeans-clusters \
     -k 30 \
     -ow \
     -cd 0.01 \
     -x 20 \
     -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure

Find the data here:
http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740
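
Since the command above uses a Tanimoto distance measure, here is a rough sketch of how the textbook Tanimoto distance, 1 - a.b / (|a|^2 + |b|^2 - a.b), behaves when one of the vectors is empty. This is not Mahout's code: the function tanimoto_distance, the dict-based sparse vectors, and the choice to return 0 for two empty vectors are all assumptions made for illustration.

```python
def tanimoto_distance(a, b):
    """Textbook Tanimoto distance between sparse vectors given as {index: weight}."""
    dot = sum(w * b[i] for i, w in a.items() if i in b)
    denom = sum(w * w for w in a.values()) + sum(w * w for w in b.values()) - dot
    if denom <= 0:
        return 0.0  # both vectors empty: treated as identical (assumption)
    return 1.0 - dot / denom


point = {0: 1.0, 3: 2.0}       # hypothetical tf-idf vector
empty_center = {}              # a center with no terms
one_term_center = {0: 1.0}     # a center sharing one term with the point

print(tanimoto_distance(point, empty_center))
print(tanimoto_distance(point, one_term_center))
```

Under this formula an empty center is at distance 1, the maximum, from every non-empty point, so any center that shares even one term with a point is closer. That may be one way centers end up with no points assigned, but it is only a guess about what is happening in the actual run.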

BTW, when I run rowsimilarity asking for 20 similar docs I get a max of
20 but sometimes far fewer. Shouldn't this always return the requested number? I'll post this
question again separately to get the attention of the right person.

On 5/8/12 6:15 AM, Paritosh Ranjan wrote:
> I looked at the 0.6 version's code but was not able to find any reason.
> If possible, can you share the data you are trying to cluster along 
> with the execution parameters?
>
> You can also open a Jira for this and provide the info there.
>
> On 07-05-2012 19:45, Pat Ferrel wrote:
>> 0.6
>>
>> I take it this is not expected behavior? I could be doing something 
>> stupid. I only look in the "final" directory. Looking in the others 
>> with clusterdump shows the same number of clusters and I assumed they 
>> were iterations.
>>
>> On 5/7/12 1:21 AM, Paritosh Ranjan wrote:
>>> Which version are you using? 0.6 or the current 0.7-snapshot?
>>>
>>> On 07-05-2012 02:19, Pat Ferrel wrote:
>>>> What would cause kmeans to not return k clusters? As I tweak 
>>>> parameters I get different numbers of clusters but it's usually 
>>>> less than the k I pass in. Since I am not using canopies at present 
>>>> I would expect k to always be honored but the quality of the 
>>>> clusters would depend on the convergence amount and number of 
>>>> iterations allowed. No?
>>>
>>>
>>>
>
>
>
