mahout-dev mailing list archives

From Derek O'Callaghan <>
Subject Re: Standard Deviation of a Set of Vectors
Date Wed, 29 Sep 2010 12:02:58 GMT
Hi Jeff,

FYI, I checked the problem I was having in CDbwEvaluator using the same 
dataset from the ClusterEvaluator thread. The problem occurs in the std 
calculation in CDbwEvaluator.computeStd(): 
s2.times(s0).minus(s1.times(s1)) generates negative values, which then 
produce NaN in the subsequent SquareRootFunction(). This in turn sets the 
average std to NaN later on in intraClusterDensity(). It happens for 
the cluster I have with the almost-identical points.

It's the same symptom as the problem last week, where this was happening 
when s0 was 1. Is the solution to ignore these clusters, like the s0 = 1 
clusters? Or to add a small prior std as was done for the similar issue 
in NormalModel.pdf()?
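
To make the two options concrete, here is a minimal sketch on plain arrays 
rather than Mahout Vectors. The class RunningStd, its method names and the 
minStd parameter are hypothetical and not part of Mahout; the sums in main() 
are synthetic values chosen to mimic the rounding seen with near-identical 
points. It clamps the value under the square root at zero and optionally 
applies a small floor to the result:

```java
import java.util.Arrays;

// Hypothetical helper, not Mahout code: per-term std from running sums
// s0 (count), s1 (sum of x), s2 (sum of x^2).
final class RunningStd {

    // Same formula as in computeStd(): sqrt(s2*s0 - s1*s1) / s0, no guard.
    static double[] stdUnguarded(double s0, double[] s1, double[] s2) {
        double[] std = new double[s1.length];
        for (int i = 0; i < s1.length; i++) {
            double radicand = s2[i] * s0 - s1[i] * s1[i];
            std[i] = Math.sqrt(radicand) / s0; // NaN when radicand < 0
        }
        return std;
    }

    // Guarded variant: clamp the radicand at zero, then apply a floor.
    // minStd is an assumed "small prior"; pass 0.0 to clamp only.
    static double[] stdGuarded(double s0, double[] s1, double[] s2, double minStd) {
        double[] std = new double[s1.length];
        for (int i = 0; i < s1.length; i++) {
            double radicand = Math.max(0.0, s2[i] * s0 - s1[i] * s1[i]);
            std[i] = Math.max(minStd, Math.sqrt(radicand) / s0);
        }
        return std;
    }

    // Scalar average of the vector terms, like std.zSum() / std.size().
    static double average(double[] v) {
        return Arrays.stream(v).sum() / v.length;
    }

    public static void main(String[] args) {
        // Synthetic sums whose radicand is slightly negative.
        double s0 = 2.0;
        double[] s1 = {2.0};
        double[] s2 = {1.9999999};
        System.out.println(average(stdUnguarded(s0, s1, s2)));
        System.out.println(average(stdGuarded(s0, s1, s2, 0.0)));
    }
}
```

With clamping alone such a cluster gets a std of zero, so whether that or a 
non-zero floor is preferable depends on how intraClusterDensity() uses the 
value afterwards.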



On 28/09/10 20:28, Jeff Eastman wrote:
>  Hi Ted,
> The clustering code computes this value for cluster radius. Currently, 
> it is done with a running sums approach (s^0, s^1, s^2) that computes 
> the std of each vector term using:
> Vector std = s2.times(s0).minus(s1.times(s1))
>     .assign(new SquareRootFunction()).divide(s0);
> For CDbw, they need a scalar, average std value, and this is currently 
> computed by averaging the vector terms:
> double d = std.zSum() / std.size();
> The more I read about it, however, the less confident I am about this 
> approach. The paper itself seems to indicate a covariance approach, 
> but I am lost in their notation. See page 5, just above Definition 1.
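
For reference, the running-sums expression quoted above can be checked 
against the usual definition of the (population) standard deviation. The 
symbols n, x_i, D and sigma_j below are notation introduced here for the 
check, assuming s0 is the point count, s1 the per-term sum and s2 the 
per-term sum of squares:

```latex
\begin{aligned}
s_0 &= n, \qquad s_1 = \sum_{i=1}^{n} x_i, \qquad s_2 = \sum_{i=1}^{n} x_i^2 \\
\sigma^2 &= \frac{s_2}{s_0} - \left(\frac{s_1}{s_0}\right)^2
          = \frac{s_0 s_2 - s_1^2}{s_0^2} \\
\sigma &= \frac{\sqrt{s_0 s_2 - s_1^2}}{s_0} \\
d &= \frac{1}{D} \sum_{j=1}^{D} \sigma_j
\end{aligned}
```

In exact arithmetic s_0 s_2 - s_1^2 is never negative (Cauchy-Schwarz), so 
any negative value is a floating-point rounding effect. The last line is the 
scalar average over the D vector terms, as in std.zSum() / std.size().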

