mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Question Regarding Entropy calculation in Mahout
Date Fri, 23 May 2014 18:31:20 GMT
Yash,

I am not sure how your suggestion will work.

The problem is that clustering algorithms tend to make hard assignments.  Thus,
if you try to compute entropy relative to some reference probability
distribution (cross-entropy, the logarithm of perplexity [1]) then a reference
clustering will provide 1 or 0 as the probability.  Any item that gets
classified into a different cluster will cause the entropy to include a term
-1 log 0, which is infinite.
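
To make the infinite term concrete, the quantity can be written out as
follows. The notation here is an assumption for illustration, not something
defined elsewhere in this thread: N is the number of items, c_i is the cluster
that item i was assigned to, and q_i is the distribution that the reference
clustering gives for item i.

```latex
H = -\frac{1}{N} \sum_{i=1}^{N} \log q_i(c_i),
\qquad
q_i(c) =
\begin{cases}
  1 & \text{if the reference clustering puts item } i \text{ in cluster } c \\
  0 & \text{otherwise}
\end{cases}
```

An item on which the two clusterings agree contributes -log 1 = 0.  A single
item on which they disagree contributes -log 0, which is +infinity, so the
whole sum is infinite.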

One way to deal with this is to assign probability 1-\epsilon to the
cluster an item is in and \epsilon/(k-1) to each of the other k-1 clusters.
You then have the issue of finding a good value of \epsilon, which seems to
me to be out of scope for the original question.
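
A minimal sketch of that smoothing, in plain Python rather than Mahout. The
function name and arguments are hypothetical, and it assumes the cluster ids
of the two clusterings have already been matched up so that equal ids mean
"the same cluster". It returns the mean per-item value in nats.

```python
import math


def smoothed_cross_entropy(assigned, reference, k, epsilon):
    """Mean per-item cross-entropy (nats) of `assigned` against `reference`.

    Assumption: the reference gives probability 1 - epsilon to its own
    cluster for an item and epsilon / (k - 1) to each of the other clusters.
    """
    if k < 2:
        raise ValueError("need at least two clusters")
    if not 0.0 <= epsilon < 1.0:
        raise ValueError("epsilon must be in [0, 1)")
    if len(assigned) != len(reference) or not assigned:
        raise ValueError("need two non-empty label lists of equal length")

    total = 0.0
    for a, r in zip(assigned, reference):
        # Probability the smoothed reference gives to the assigned cluster.
        q = 1.0 - epsilon if a == r else epsilon / (k - 1)
        # With epsilon == 0 a disagreement has q == 0, i.e. the -1 log 0 term.
        total += math.inf if q == 0.0 else -math.log(q)
    return total / len(assigned)


assigned = [0, 0, 1, 2]
reference = [0, 1, 1, 2]  # the second item disagrees
hard = smoothed_cross_entropy(assigned, reference, k=3, epsilon=0.0)  # infinite
soft = smoothed_cross_entropy(assigned, reference, k=3, epsilon=0.1)  # finite
```

The finite value depends directly on the choice of \epsilon, which is exactly
the difficulty mentioned above.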

Computing entropy relative to the fraction of documents in each cluster is
easier to compute, but much harder to understand.  Computing mutual
information (not entropy) on the confusion matrix between two clusterings
can also be done, but that also seems beyond the original question.
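
For reference, both quantities in this paragraph can be sketched in a few
lines of plain Python. This is not Mahout code; the function names are
hypothetical, and the confusion matrix is held as sparse pair counts instead
of a dense matrix. Results are in nats.

```python
import math
from collections import Counter


def cluster_size_entropy(labels):
    """Entropy of the fraction of items that falls in each cluster."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(labels_a, labels_b):
    """Mutual information between two hard clusterings of the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty label lists of equal length")

    joint = Counter(zip(labels_a, labels_b))  # confusion matrix cell counts
    count_a = Counter(labels_a)               # row totals
    count_b = Counter(labels_b)               # column totals

    mi = 0.0
    for (a, b), n_ab in joint.items():
        # p(a, b) * log(p(a, b) / (p(a) * p(b))), written with raw counts
        mi += (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
    return mi


same = mutual_information([0, 0, 1, 1], [1, 1, 0, 0])       # relabelled copy
unrelated = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])  # independent split
```

Only non-empty cells of the confusion matrix are visited, so no log 0 term
arises, and the cluster ids of the two clusterings do not need to be matched
up first.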

As such, I think that the burden is on the original questioner to describe
the problem more accurately.



On Fri, May 23, 2014 at 11:21 AM, Yash Sharma <yash360@gmail.com> wrote:

> Hi Darshan,
> What I understand from your problem is that:
> - You have clustered a few documents
> - You want to verify the accuracy of your clustering, and you want to use
> entropy for that
> - You are not sure what the input for the entropy calculation should be.
>
> Possible solution:
> The entropy calculation would expect a String[] in order to measure the
> information contained in the data/sequence.
> The simplest way is to keep all the documents labelled with categories.
> - Cluster the docs as you usually do.
> - For the entropy calculation, create a String[] for every cluster, each
> array containing all the labels of the docs in that cluster.
> cluster1 = {"sports", "tech", "tech", "tech", "book", ..}
> cluster2 = {"sports", "drama", "sports", "sports"...}
> etc
>
> - Calculate the entropy of each cluster.
> Entropy measures the degree of randomness of a system. High entropy
> means there is a high degree of randomness in the system.
> Lower entropy is desirable when validating the accuracy of your clustering
> technique.
>
> P.S. You can use the Entropy.java class for your validation purpose, but
> it's deprecated now.
>
> Having said that, kindly be patient while asking questions and provide
> more info on what work you have done so far, along with your findings. It
> would enable all of us to answer quickly & correctly :)
>
> Hope it was helpful. Other approaches are welcome!
>
> Peace,
> Yash
>
>
> On Fri, May 23, 2014 at 10:55 PM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
> > I am sorry, but I don't understand your questions or needs sufficiently
> > to answer.
> >
> >
> >
> >
> > On Wed, Apr 23, 2014 at 12:21 PM, Darshan Sonagara <
> > darshan.sonagara@gmail.com> wrote:
> >
> > > Sir, please reply as soon as possible.
> > > Thanks in advance.
> > >
> > >
> > > On Tue, Apr 22, 2014 at 11:50 PM, Darshan Sonagara <
> > > darshan.sonagara@gmail.com> wrote:
> > >
> > > > Waiting for the reply, sir.
> > > >
> > > >
> > > > On Tue, Apr 22, 2014 at 7:13 PM, Darshan Sonagara <
> > > > darshan.sonagara@gmail.com> wrote:
> > > >
> > > > >> Thanks for the reply, sir.
> > > > >>
> > > > >> Actually, I am doing clustering to gather similar kinds of documents
> > > > >> in the same cluster as much as possible.
> > > > >> I can see the result in the cluster dump output by observing the top terms.
> > > > >> I also found that the result differs when I vary the distance measure.
> > > > >> But I want some mathematical proof that one technique is better than
> > > > >> another.
> > > > >> For that I need to calculate the entropy and purity of the clusters,
> > > > >> but I am not able to find any command in Mahout that gives me entropy
> > > > >> as a result.
> > > > >> I found Entropy.java under Mahout's common math statistics package, but
> > > > >> I don't know what I should give it as input to find the entropy or
> > > > >> other parameters, so that I can tell how good or bad a cluster is.
> > > >>
> > > >>
> > > >>
> > > > >> On Tue, Apr 22, 2014 at 7:01 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> > > >>
> > > >>> On Tue, Apr 22, 2014 at 12:11 AM, Darshan Sonagara <
> > > >>> darshan.sonagara@gmail.com> wrote:
> > > >>>
> > > > >>> > But the problem is that I want to check whether my clustering is
> > > > >>> > good or bad. For that I need to calculate the entropy value. I do
> > > > >>> > not have any idea how to calculate entropy in Mahout or by another
> > > > >>> > technique.
> > > > >>> > By finding the entropy I can reach a good conclusion.
> > > > >>> > So please, can anyone help me with this?
> > > >>> >
> > > >>>
> > > > >>> Actually, the way to tell whether your clustering is good is to see
> > > > >>> if it works for its intended use.
> > > >>>
> > > >>> What do you want to use clustering for?
> > > >>>
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >>
> > > >> *Regards From:*
> > > >>
> > > >> *Darshan  Sonagara*
> > > >> *Collaborative Platform lead,** SSN Team | Gujarat Section.*
> > > >>
> > > >> *Vice-Chairperson | **GCET IEEE SB.*
> > > >>
> > > >> (: +*91* 9408002452
> > > >>
> > > >>
> > > >>
> > > >>  : Darshan Sonagara<
> > > http://www.linkedin.com/pub/darshan-sonagara/64/11a/b54>
> > > >>   : Darshan Sonagara <http://www.facebook.com/darshansonagara>
> > > >>
> > > >>
> > > >
> > > >
> > > > --
> > > >
> > > > *Regards From:*
> > > >
> > > > *Darshan  Sonagara*
> > > > *Collaborative Platform lead,** SSN Team | Gujarat Section.*
> > > >
> > > > *Vice-Chairperson | **GCET IEEE SB.*
> > > >
> > > > (: +*91* 9408002452
> > > >
> > > >
> > > >
> > > >  : Darshan Sonagara<
> > > http://www.linkedin.com/pub/darshan-sonagara/64/11a/b54>
> > > >   : Darshan Sonagara <http://www.facebook.com/darshansonagara>
> > > >
> > > >
> > >
> > >
> >
>
