mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Incorrect calculation of pdf
Date Mon, 27 Jun 2011 18:03:54 GMT
Actually, pdf() should always be a pdf(), not a logPdf().  Many algorithms
want one or the other.  Some don't much care because log is monotonic.  But
we should do what the name implies.
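
To illustrate the distinction, here is a minimal sketch, assuming an
axis-aligned (diagonal covariance) Gaussian. The class LogPdfSketch and its
methods are hypothetical names for illustration and are not Mahout's
GaussianCluster API. It keeps pdf() as a true density and adds a separate
logPdf() that sums per-dimension log terms instead of multiplying densities.

```java
// Hypothetical sketch; this is NOT Mahout's GaussianCluster implementation.
// Assumption: independent dimensions, each with its own mean and std deviation.
final class LogPdfSketch {
    private static final double LOG_SQRT_2PI = 0.5 * Math.log(2.0 * Math.PI);

    /** True density: a product of per-dimension terms. Can underflow or overflow. */
    static double pdf(double[] x, double[] mean, double[] sd) {
        double p = 1.0;
        for (int i = 0; i < x.length; i++) {
            double z = (x[i] - mean[i]) / sd[i];
            p *= Math.exp(-0.5 * z * z) / (sd[i] * Math.sqrt(2.0 * Math.PI));
        }
        return p;
    }

    /** Log density: a sum of per-dimension log terms, which stays in range. */
    static double logPdf(double[] x, double[] mean, double[] sd) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double z = (x[i] - mean[i]) / sd[i];
            sum += -0.5 * z * z - Math.log(sd[i]) - LOG_SQRT_2PI;
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 50000;                 // dimensionality mentioned in the thread
        double[] x = new double[n];    // point at the cluster center (all zeros)
        double[] mean = new double[n];
        double[] sd = new double[n];
        for (int i = 0; i < n; i++) {
            sd[i] = 1.0;               // assumed unit variance in every dimension
        }
        System.out.println("pdf    = " + pdf(x, mean, sd));
        System.out.println("logPdf = " + logPdf(x, mean, sd));
    }
}
```

With unit variance, each factor in pdf() is about 0.399, so the product over
50000 dimensions falls below the smallest representable double and comes out
as zero, while logPdf() returns a finite value near -45947. Callers that only
compare clusters can use the log form directly, since log is monotonic.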

On Mon, Jun 27, 2011 at 10:15 AM, Jeff Eastman <jeastman@narus.com> wrote:

> A better approach would be to create a new Model and ModelDistribution that
> uses log arithmetic of your choosing. The initial models are very
> simple-minded and are likely not adequate for real applications.
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Monday, June 27, 2011 7:51 AM
> To: user@mahout.apache.org
> Subject: Re: Incorrect calculation of pdf
>
> There should not be a change to an existing method.
>
> It would be fine to add another method, perhaps called logPdf, that does
> what you suggest.  This loss of precision is common with the normal
> distribution in high dimensions.
>
> On Mon, Jun 27, 2011 at 1:49 AM, Vasil Vasilev <vavasilev@gmail.com>
> wrote:
>
> > Hi,
> >
> > Recently I wanted to use Dirichlet clustering algorithm to cluster vectors
> > directly taken out of vectorized text, whose dimensionality was around
> > 50000. In this situation the algorithm fails to calculate the pdf of a
> > vector corresponding to cluster center due to problems with numerical
> > precision during multiplication.
> >
> > In this regard, what do you think of modifying the GaussianCluster.pdf()
> > method in such way that it works with logarithmic probabilities?
> >
> > Regards, Vasil
> >
>
