mahout-user mailing list archives

From Grant Ingersoll
Subject Re: Some basic introductory questions
Date Thu, 17 Sep 2009 12:57:05 GMT

On Sep 17, 2009, at 12:36 AM, Aleksander Stensby wrote:

> Hi all,
> I've been following the development of Mahout for quite a while now and
> figured it was time for me to get my hands dirty :)
> I've gone through the examples and Grant's excellent IBM article (great
> work on that Grant!).


> So, now I'm at the point where I want to figure out where I go next.
> Specifically, I'm a bit fuzzed about common practices when it comes to
> utilizing Mahout in my own applications...
> Case scenario:
> I have my own project, add the dependencies to Mahout (through maven), and
> make my own little kMeans test class.
> I guess my question is a bit stupid, but how would you go about using
> Mahout out of the box?
> Ideally (or maybe not?), I figured that I could just take care of
> providing the Vectors -> push it into mahout and run the kMeans
> clustering...
> But when I started looking at the kMeans clustering example, I notice
> that there is actually a lot of implementation in the example itself...
> Is it really necessary for me to implement all of those methods in every
> project where I want to do kMeans? Can't they be reused? The methods I
> talk about are for instance:
>  static List<Canopy> populateCanopies(DistanceMeasure measure,
>      List<Vector> points, double t1, double t2)

Yeah, this one is a bit weird here.

>  private static void referenceKmeans(List<Vector> points,
> List<List<Cluster>> clusters, DistanceMeasure measure, int maxIter)

I think that is for testing purposes, but I don't have the code open at
the moment.

>  private static boolean iterateReference(List<Vector> points,
>      List<Cluster> clusters, DistanceMeasure measure)
> In my narrow minded head I would think that input would be the
> List<Vector> and that the output would be List<List<Cluster>> of some
> general kMeans method that did all the internals for me... Or am I
> missing something? Or do I have to use the KMeansDriver.runJob and read
> input from serialized vector files?

I think the missing piece is that these algorithms are designed to scale
and use Hadoop.  Imagine passing around 5+ million dense vectors with
large cardinality.
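
For small data sets that do fit in memory, the "List<Vector> in,
List<List<Cluster>> out" shape described in the question can be sketched in
plain Java. This is only an illustrative sketch and not Mahout's API: the
class InMemoryKMeans and its methods are hypothetical, points are plain
double[] arrays rather than Mahout's Vector, the distance is hard-coded to
squared Euclidean instead of a pluggable DistanceMeasure, and the initial
centroids are assumed to be supplied by the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory k-means sketch; not part of Mahout.
final class InMemoryKMeans {

    /** Squared Euclidean distance between two points of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the centroid closest to the given point. */
    static int nearest(double[] point, List<double[]> centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.size(); c++) {
            double d = squaredDistance(point, centroids.get(c));
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    /**
     * Runs Lloyd-style iterations and returns the centroids produced by each
     * iteration. Stops early once an iteration leaves every centroid unchanged.
     */
    static List<List<double[]>> cluster(List<double[]> points,
                                        List<double[]> initial,
                                        int maxIter) {
        List<List<double[]>> history = new ArrayList<>();
        List<double[]> current = new ArrayList<>();
        for (double[] c : initial) {
            current.add(c.clone());
        }
        for (int iter = 0; iter < maxIter; iter++) {
            int k = current.size();
            int dim = current.get(0).length;
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            // Assignment step: add each point to the sum of its nearest centroid.
            for (double[] p : points) {
                int c = nearest(p, current);
                counts[c]++;
                for (int i = 0; i < dim; i++) {
                    sums[c][i] += p[i];
                }
            }
            // Update step: each centroid becomes the mean of its assigned points.
            List<double[]> next = new ArrayList<>();
            boolean changed = false;
            for (int c = 0; c < k; c++) {
                double[] centroid = current.get(c).clone();
                if (counts[c] > 0) { // an empty cluster keeps its old centroid
                    for (int i = 0; i < dim; i++) {
                        centroid[i] = sums[c][i] / counts[c];
                    }
                }
                if (!Arrays.equals(centroid, current.get(c))) {
                    changed = true;
                }
                next.add(centroid);
            }
            history.add(next);
            current = next;
            if (!changed) {
                break;
            }
        }
        return history;
    }
}
```

A caller would pass the points, k starting centroids, and an iteration cap,
e.g. InMemoryKMeans.cluster(points, initial, 10), and read the last entry of
the returned list as the final centroids. Everything here lives on one heap,
which is where it stops working at the scale mentioned above: 5 million
dense vectors with, say, 10,000 dimensions each (an assumed figure) at 8
bytes per double come to roughly 400 GB.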
