mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: [Canopy] Picking t1 and t2 was Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors
Date Wed, 17 Jun 2009 13:22:11 GMT

On Jun 17, 2009, at 9:05 AM, Benson Margulies wrote:

> All I know is what I learned from reading the paper. However, I continue to
> think, from reading the paper, that you may be trying to make Canopy do
> something it was not intended to do.
>
> As I read the paper, the idea here is to get a rough partitioning that is
> used to optimize various downstream algorithms, not to tune for a precise
> partitioning. The number of canopies doesn't need, as I read it, to be
> particularly close to the number of eventual partitions to be useful.
>
> Thus the extended discussion of how to start up and run various other
> algorithms (e.g. k-means).

Makes sense.

>
> Now, still, you need to get some useful number of partitions. The paper has
> a classic toss-off line, 'we used cross-validation,' without any details
> about exactly what the authors did. Presumably, that means that the authors
> ran many possible values and hand-examined the results. The paper reports no
> general results about how sensitive the T values are to particular input
> data sets. A pessimist would fear that, for any new input, you're going to
> need to go through a lengthy process to find good values for T1 and T2.
>
> This leads me to wonder, ignorantly, why this project is so focused on
> Canopy. The paper describes it as a tool for speeding up various other
> things. Since you're hadooping all those other things, how much does it
> help?

I don't think anyone is solely focused on it, but it is something that
we have available in our arsenal of clustering tools, so it warrants
documentation and an understanding of when and how to use it.
Personally, it's just something I could easily run to work on
MAHOUT-121.

At any rate, this kind of write-up is exactly the advice that we need
to be able to give people.  Care to add it to the wiki page at
http://cwiki.apache.org/confluence/display/MAHOUT/ClusteringYourData ?
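
For anyone following the T1/T2 discussion, here is a minimal Python sketch of canopy selection plus the kind of "run many values and look at the results" sweep described above. It is not Mahout's implementation and not necessarily what the paper's authors did. The names canopy and sweep_t2, the Euclidean metric, taking the first remaining point as the center (the paper picks one at random), and tying T1 to T2 by a fixed ratio are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy(points, t1, t2, distance=euclidean):
    """Group points into canopies; returns a list of (center, members).

    t1 is the loose threshold and t2 the tight one, so t1 must exceed t2.
    Assumption: the first remaining point is used as the next center so
    that runs are repeatable; a random choice is the usual formulation.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_candidates = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                # Within the loose threshold: p belongs to this canopy.
                members.append(p)
            if d >= t2:
                # Outside the tight threshold: p may still seed or join
                # other canopies.
                still_candidates.append(p)
        remaining = still_candidates
        canopies.append((center, members))
    return canopies


def sweep_t2(points, t2_values, ratio=2.0):
    """Map each candidate t2 to the number of canopies it produces.

    Assumption: t1 is fixed at ratio * t2, purely to keep the sweep
    one-dimensional.
    """
    return {t2: len(canopy(points, ratio * t2, t2)) for t2 in t2_values}
```

Calling sweep_t2 on a sample of the input with a handful of candidate T2 values gives a quick feel for how sharply the canopy count reacts to the thresholds, which is the sensitivity the paper does not report.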


>
> Anyway, I expect that my ignorance is on comprehensive display here.
>

Funny, I feel like my ignorance is the one on display, but that is  
something I got over a long time ago in open source.  Which is why I  
just come out and ask the questions!  One of my goals for Mahout is to  
make it a place where people can come and learn about Machine Learning  
and get practical advice and not be afraid to ask basic questions.   
Machine learning is so shrouded in mystery it almost seems like a Dark  
Art.  I'm thankful every day on this project that smarter people than  
me show up and answer questions.  So, please, keep 'em coming!

-Grant
