mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: kMeans Help
Date Sat, 27 Jun 2009 02:59:50 GMT
Success!  Woo hoo.

On Jun 26, 2009, at 10:42 PM, Grant Ingersoll wrote:

> So, the problem I'm having lies in the RandomSeedGenerator: it
> writes out a Cluster, which calls Cluster.write(), which does:
> AbstractVector.writeVector(out, computeCentroid());
>
> Now computeCentroid() does:
>    if (numPoints == 0)
>      return pointTotal;
>    else if (centroid == null) {
>      // lazy compute new centroid
>      centroid = pointTotal.divide(numPoints);
>      Vector stds = pointSquaredTotal.times(numPoints).minus(
>          pointTotal.times(pointTotal)).assign(new SquareRootFunction())
>          .divide(numPoints);
>      std = stds.zSum() / 2;
>    }
>    return centroid;
>
> In the case of the RandomSeedGenerator, numPoints is always == 0  
> because the Cluster doesn't have any points added to it.   
> Furthermore, pointTotal is an empty Vector of the same size as the  
> center, due to the Cluster constructor:
>    super();
>    this.id = nextClusterId++;
>    this.center = center;
>    this.numPoints = 0;
>    this.pointTotal = center.like();
>    this.pointSquaredTotal = center.like();
>
> The semantics of constructing a Cluster are odd to me.  Do I always  
> have to immediately add a point to the Cluster in order for it to be  
> "real", despite the fact that I added a Center?  Isn't adding a  
> Center effectively giving the Cluster one point?
>
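To make the behaviour described above concrete, here is a minimal sketch in plain Java. It does not use the Mahout classes: SeedClusterSketch, computeCentroidAsQuoted() and computeCentroidWithFallback() are hypothetical names made up for illustration, and plain double[] arrays stand in for Vector. The first method mirrors the quoted numPoints == 0 branch; the second shows one possible alternative (falling back to the center), which is an assumption and not necessarily the change that was made in trunk.

```java
import java.util.Arrays;

// Simplified stand-in for a cluster; NOT the Mahout Cluster class.
class SeedClusterSketch {
    final double[] center;
    final double[] pointTotal;
    int numPoints;

    SeedClusterSketch(double[] center) {
        this.center = center.clone();
        // Stands in for center.like(): same size as the center, all zeros.
        this.pointTotal = new double[center.length];
        this.numPoints = 0;
    }

    void addPoint(double[] point) {
        for (int i = 0; i < pointTotal.length; i++) {
            pointTotal[i] += point[i];
        }
        numPoints++;
    }

    // Mirrors the quoted logic: with no points, the (all-zero) total is returned.
    double[] computeCentroidAsQuoted() {
        if (numPoints == 0) {
            return pointTotal.clone();
        }
        return mean();
    }

    // Hypothetical alternative: with no points, fall back to the given center.
    double[] computeCentroidWithFallback() {
        if (numPoints == 0) {
            return center.clone();
        }
        return mean();
    }

    private double[] mean() {
        double[] result = new double[pointTotal.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = pointTotal[i] / numPoints;
        }
        return result;
    }

    public static void main(String[] args) {
        // A seed cluster that was given a center but never had a point added.
        SeedClusterSketch seed = new SeedClusterSketch(new double[] {0.6, 0.8});
        System.out.println(Arrays.toString(seed.computeCentroidAsQuoted()));     // [0.0, 0.0]
        System.out.println(Arrays.toString(seed.computeCentroidWithFallback())); // [0.6, 0.8]
    }
}
```

In this sketch a freshly seeded cluster writes out an all-zero vector under the quoted logic, which would be consistent with the all-zero vectors reported in clusters-0 and clusters-1.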
>
> On Jun 26, 2009, at 8:45 PM, Grant Ingersoll wrote:
>
>> Still no dice.
>>
>> On Jun 26, 2009, at 7:59 PM, Grant Ingersoll wrote:
>>
>>> Then we need to handle that separately from the various jobs.
>>> That was one of the things that was different about the
>>> KMeansJob call.
>>>
>>> On Jun 26, 2009, at 7:45 PM, Jeff Eastman wrote:
>>>
>>>> Found the call in the syntheticcontrol/kmeans.Job had true for  
>>>> the overwrite output flag. Don't think that was your problem, but  
>>>> something similar must be at work.
>>>>
>>>>
>>>>
>>>> Jeff Eastman wrote:
>>>>> Running the latest trunk, I get a file not found exception  
>>>>> running synthetic control on the $output/data file. Looks like  
>>>>> output got deleted somewhere but have not discovered where yet.  
>>>>> Perhaps Canopy is broken or KMeans is purging output?
>>>>>
>>>>>
>>>>> Grant Ingersoll wrote:
>>>>>> I'm running trunk.  Using the data at
>>>>>> http://people.apache.org/wikipedia/n2.tar.gz
>>>>>> (a dump of 2302 documents from a Lucene index of Wikipedia.
>>>>>> The chunks file in that same directory contains the original  
>>>>>> files).  Vectors are normalized using L2.
>>>>>>
>>>>>> When I run K-Means on it via:
>>>>>> org.apache.mahout.clustering.kmeans.KMeansDriver
>>>>>>   --input /Users/grantingersoll/projects/lucene/solr/wikipedia/devWorks/n2/part-full.txt
>>>>>>   --clusters /Users/grantingersoll/projects/lucene/solr/wikipedia/devWorks/n2/clusters
>>>>>>   --k 10
>>>>>>   --output /Users/grantingersoll/projects/lucene/solr/wikipedia/devWorks/n2/k-output
>>>>>>   --distance org.apache.mahout.utils.CosineDistanceMeasure
>>>>>>
>>>>>> I get the two directories seen in n2-output.  The clusters-0  
>>>>>> and clusters-1 files both contain a single vector which is all 0.
>>>>>>
>>>>>> I've also tried SquaredEuclidean, but to no avail.
>>>>>>
>>>>>> Any insight into what I'm doing wrong would be appreciated.
>>>>>>
>>>>>> Thanks,
>>>>>> Grant
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
>>> using Solr/Lucene:
>>> http://www.lucidimagination.com/search
>>>
>>
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

