Hi Ted!
Thanks a lot for your answer. At first I used the original counts,
expecting that the resulting clusters would have some logic. Then I
realized that most of the distance measures I was experimenting with do
something like this: for 2 vectors v and e => some_calculationOn(v_i, e_i).
After looking at my results I went back to think about the problem, and I
realized that if I want to look at category weight, then I need to
express the weight of each category in each row.
I'll look into Kullback-Leibler divergence, and thanks a lot for noticing
the \delta; I do need it, in fact!
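Just to check that I understood, here is a quick Python sketch of the
smoothed estimate and the KL divergence you described (the delta value
and the function names are placeholders of mine, not anything standard):

```python
import math

def smoothed_probs(counts, delta=0.5):
    """Turn raw category counts into smoothed probability estimates.

    Adds a small prior delta to every count, so no estimated
    probability is ever exactly 0 or 1 (which avoids log(0) in the
    KL divergence): p_i = (k_i + delta) / sum_j (k_j + delta).
    """
    total = sum(c + delta for c in counts)
    return [(c + delta) / total for c in counts]

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) of two distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Two example rows of raw category counts (made-up numbers):
p = smoothed_probs([80, 20, 0])
q = smoothed_probs([50, 20, 30])
distance = kl_divergence(p, q)
```

(KL divergence is asymmetric, so for clustering one would either pick a
fixed direction or symmetrize it, e.g. D(p||q) + D(q||p).)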
Cheers,
Fernando
On Tue, Nov 22, 2011 at 9:11 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> I would recommend that you work with the original counts instead of
> percentages. That allows you to use statistical similarity measures based
> on the multinomial distribution. The important thing that the counts
> provide over percentages is an understanding of how certain the
> distribution really is.
>
> If you move forward with using the percentages, I would consider using
> something like Kullback-Leibler divergence as a measure of dissimilarity.
> You would need to smooth the probabilities when you derive them from the
> counts. The simplest method for this is to introduce a simple prior into
> your estimates. Then, if the count for each category i is k_i, you would
> estimate the percentage p_i as
>
> p_i = (k_i + \delta) / \sum_j (k_j + \delta)
>
> This prevents you from ever estimating either 0 or 1 for these percentages
> and thus helps avoid log 0. It also will tend to give you better results
> in a variety of ways.
>
> On Tue, Nov 22, 2011 at 1:46 PM, Fernando O. <fotero@gmail.com> wrote:
>
> > It's 148 not b/c I'm doing initial tests :D
> >
> > Yes, values add up to 1.
> >
> > For this example percentages are precalculated basically I get a total
> > number for each category and then convert it to percentages.
> >
> >
> > On Tue, Nov 22, 2011 at 6:10 PM, Ted Dunning <ted.dunning@gmail.com>
> > wrote:
> >
> > > Do the category values add up to 1 for every row?
> > >
> > > Where do these percentages come from?
> > >
> > > At 148 rows, I would use R instead of Mahout.
> > >
> > > On Tue, Nov 22, 2011 at 2:42 AM, Fernando O. <fotero@gmail.com> wrote:
> > >
> > > > So I have a table something like this
> > > > C1 C2 C3
> > > > R1 80% 20% 0%
> > > > R2 75% 25% 0%
> > > > R3 50% 20% 30%
> > > >
> > > > From what I read, KMeans works pretty well for most cases, so I
> > > > chose to use that clustering technique.
> > > > Then I used the Tanimoto Distance because I wanted to measure the
> > > > correlation between categories.
> > > >
> > > > Right now I have a small set: 148 Regions and 13 Categories. From
> > > > those 148 Regions only one has more than 1% in Cn, and it has in
> > > > fact 36%.
> > > >
> > >
> >
>
