# mahout-user mailing list archives

From "Fernando O." <fot...@gmail.com>
Subject Re: Clustering Question (from a newbie)
Date Wed, 23 Nov 2011 08:44:20 GMT
Hi Ted!
Thanks a lot for your answer. At first I used the original counts, expecting
that the resulting clusters would make some sense. Then I realized that most
of the distance measures I was experimenting with work component by
component, something like this: for 2 vectors v and e => some_calculationOn(v_i, e_i).

After looking at my results I went back to think about the problem and I
realized that if I want to look at category weight then I would need to
express the weight of each category in each row.

I'll look into Kullback-Leibler divergence, and thanks a lot for pointing out
the \delta. I do need it, in fact!
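
For reference, here is a minimal Python sketch of the smoothing formula quoted below, p_i = (k_i + \delta) / \sum_j (k_j + \delta), followed by a Kullback-Leibler divergence between two smoothed rows. The function names, the value \delta = 0.5, and the example counts are all assumptions made for illustration; none of them come from the thread.

```python
import math


def smooth(counts, delta=0.5):
    """Estimate p_i = (k_i + delta) / sum_j (k_j + delta).

    delta is the prior pseudo-count; 0.5 is an arbitrary illustrative choice.
    """
    total = sum(counts) + delta * len(counts)
    return [(k + delta) / total for k in counts]


def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i).

    Assumes every entry of p and q is strictly positive, which the
    smoothing above guarantees whenever delta > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical raw counts for two rows (invented for this example).
r1 = [80, 20, 0]
r3 = [50, 20, 30]

p = smooth(r1)
q = smooth(r3)
d = kl_divergence(p, q)
print(p, q, d)
```

Note that KL divergence is not symmetric, so kl_divergence(p, q) and kl_divergence(q, p) generally differ; if a symmetric dissimilarity is needed, one common option is to average the two directions.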

Cheers,
Fernando

On Tue, Nov 22, 2011 at 9:11 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> I would recommend that you work with the original counts instead of
> percentages.  That allows you to use statistical similarity measures based
> on the multinomial distribution.  The important thing that the counts
> provide over percentages is an understanding of how certain the
> distribution really is.
>
> If you move forward with using the percentages, I would consider using
> something like Kullback-Leibler divergence as a measure of dissimilarity.
>  You would need to smooth the probabilities when you derive them from the
> counts.  The simplest method for this is to introduce a simple prior into
> your estimates.  Then, if the count for each category i is k_i, you would
> estimate the percentage p_i as
>
>    p_i = (k_i + \delta) / \sum_j (k_j + \delta)
>
> This prevents you from ever estimating either 0 or 1 for these percentages
> and thus helps avoid log 0.  It also will tend to give you better results
> in a variety of ways.
>
> On Tue, Nov 22, 2011 at 1:46 PM, Fernando O. <fotero@gmail.com> wrote:
>
> > It's 148 now b/c I'm doing initial tests :D
> >
> > Yes, values add up to 1.
> >
> > For this example the percentages are precalculated: basically I get a
> > total number for each category and then convert it to percentages.
> >
> >
> > On Tue, Nov 22, 2011 at 6:10 PM, Ted Dunning <ted.dunning@gmail.com>
> > wrote:
> >
> > > Do the category values add up to 1 for every row?
> > >
> > > Where do these percentages come from?
> > >
> > > At 148 rows, I would use R instead of Mahout.
> > >
> > > On Tue, Nov 22, 2011 at 2:42 AM, Fernando O. <fotero@gmail.com> wrote:
> > >
> > > > > So I have a table something like this:
> > > >       C1      C2       C3
> > > > R1   80%   20%      0%
> > > > R2   75%   25%      0%
> > > > R3   50%   20%     30%
> > > >
> > > > > From what I read, K-means works pretty well for most cases, so I
> > > > > chose to use that clustering technique.
> > > > Then I used the Tanimoto Distance because I wanted to measure the
> > > > correlation between categories.
> > > >
> > > > > Right now I have a small set: 148 Regions and 13 Categories. Of
> > > > > those 148 Regions, only one has more than 1% in Cn, and it in fact
> > > > > has 36%.
> > > >
> > >
> >
>

