mahout-user mailing list archives

From "Fernando O." <fot...@gmail.com>
Subject Re: Clustering Question (from a newbie)
Date Tue, 22 Nov 2011 11:43:04 GMT
In clustersIn I had #Categories clusters, each with an arbitrary vector as
its initial centroid (I was using the first #Categories vectors that I got).

I realized that since I had percentages I could create arbitrary centroids,
each with a value of 0.5 on the corresponding category and 0 on the others.
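
A minimal sketch of that seeding idea in plain Java, with no Mahout
dependency. The class and method names (SeedCentroids, build,
tanimotoDistance, nearestSeed) are hypothetical, and the distance uses the
usual Tanimoto definition, 1 - a.b / (|a|^2 + |b|^2 - a.b), which I assume
but have not verified to match Mahout's TanimotoDistanceMeasure in every
edge case. The three-category rows are the toy example from the original
question.

```java
public class SeedCentroids {

    // One seed per category: 0.5 on that category, 0 on all the others.
    public static double[][] build(int categories) {
        double[][] seeds = new double[categories][categories];
        for (int i = 0; i < categories; i++) {
            seeds[i][i] = 0.5;
        }
        return seeds;
    }

    // Tanimoto distance (assumed definition): 1 - a.b / (|a|^2 + |b|^2 - a.b).
    public static double tanimotoDistance(double[] a, double[] b) {
        double dot = 0.0, aa = 0.0, bb = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            aa += a[i] * a[i];
            bb += b[i] * b[i];
        }
        double denom = aa + bb - dot;
        // Two all-zero vectors are treated as identical.
        return denom == 0.0 ? 0.0 : 1.0 - dot / denom;
    }

    // Index of the seed closest to v (first one wins on ties).
    public static int nearestSeed(double[] v, double[][] seeds) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < seeds.length; i++) {
            double d = tanimotoDistance(v, seeds[i]);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] seeds = build(3);
        double[] r1 = {0.80, 0.20, 0.00}; // R1 from the toy table
        double[] r3 = {0.50, 0.20, 0.30}; // R3 from the toy table
        System.out.println("R1 -> seed " + nearestSeed(r1, seeds));
        System.out.println("R3 -> seed " + nearestSeed(r3, seeds));
    }
}
```

With seeds like these, each region initially lands in the cluster of its
dominant category, so a region dominated by a rare category should start in
a cluster of its own.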

Turns out that it works really well :D I still have to take a closer look,
but it looks correct.

Now I'm wondering if there is any paper that supports my assumption.

On Tue, Nov 22, 2011 at 7:50 AM, Paritosh Ranjan <pranjan@xebia.com> wrote:

> public static void run(Configuration conf,
>                         Path input,
>                         Path clustersIn,
>                         Path output,...
>
> The second parameter is clustersIn. What are you providing there?
>
> I suggest that you first use CanopyClustering to find the appropriate
> number of clusters present, and then give those clusters as the input in
> clustersIn. You might be giving the wrong clustersIn, which can create
> problems.
>
> Paritosh
>
>
> On 22-11-2011 16:12, Fernando O. wrote:
>
>> Hi all,
>>     Disclaimer: I'm a total newbie in data mining / clustering / AI and
>> all the areas around them. My knowledge of clustering is basically what I
>> learned in my regular CS courses; I never did research or work with this
>> before.
>>
>> Any reading recommendations would be much appreciated :D
>>
>> I'm trying to understand a large set of data: I have a set of Geographical
>> regions, and for each region I have N characteristics or categories, let's
>> say the measure that I have is something like an indicator of the
>> importance of that characteristic in that region.
>>
>> So I have a table something like this:
>>        C1     C2     C3
>> R1    80%    20%     0%
>> R2    75%    25%     0%
>> R3    50%    20%    30%
>>
>> From what I read, Kmeans works pretty well for most cases, so I chose to
>> use that clustering technique.
>> Then I used the Tanimoto Distance because I wanted to measure the
>> correlation between categories.
>>
>> Right now I have a small set: 148 Regions and 13 Categories. From those
>> 148 Regions only one has more than 1% in Cn, and it has in fact 36%.
>>
>> So I would expect that if I set the number of clusters to something
>> relatively large (15 or 20) I would get a cluster with only that region
>> having Cn=36%
>>
>> My problem is that I couldn't make that happen, and I'm not sure why.
>> In fact, I get some empty clusters.
>> R1    58,30%   1,10%   0,00%   5,66%   5,55%   2,24%   1,42%   3,20%   1,12%  14,75%   6,23%   0,25%   0,01%   0,16%
>> R2    37,08%   1,95%   0,00%  26,27%   4,86%   0,11%   0,00%   0,00%   0,76%   7,78%  18,16%   0,00%   0,00%   0,00%
>> R3    48,86%   3,03%   6,14%   5,98%   7,91%   1,85%   1,69%   3,55%   0,43%  15,63%   4,83%   0,09%   0,00%   0,00%
>> *R4*   8,86%   0,59%   6,60%   2,46%   2,06%   1,26%   0,26%   1,71%   0,47%   6,11%   7,43%   0,03%  61,96%   0,21%
>> R5    51,56%   2,55%   0,00%  16,08%   7,29%   0,49%   3,31%   1,22%   0,47%  13,49%   3,53%   0,01%   0,00%   0,00%
>> R6    40,15%   6,26%   0,00%   8,07%   5,25%   0,20%   0,45%  13,29%   1,28%  12,85%  11,64%   0,00%   0,00%   0,55%
>>
>>
>> Running Kmeans like this:
>> KMeansDriver.run(conf,
>>     new Path("mahoutTest/regions"),
>>     new Path("testdata/clusters"),
>>     new Path("output"),
>>     new TanimotoDistanceMeasure(), 0.001, 1000, true, false);
>>
>> The vectors for each Region are scaled by 1/100 (so that 8,86% is 0.0886).
>>
>> Any idea what I might be doing wrong? (Please don't say everything! :D)
>>
>> Thanks a lot!
