mahout-dev mailing list archives

From "Gary Bradski" <garybrad...@gmail.com>
Subject Re: MapReduce, machine learning, and introductions
Date Fri, 04 Apr 2008 17:19:56 GMT
Random forests, though not developed that way, are an example of Kleinberg's
Stochastic Discrimination, which builds optimal classifiers based on the
Law of Large Numbers. Such classifiers are built out of large collections
of simple classifiers, but are distinct from boosting. For this, the
classifiers have to meet three conditions:

   1. Encouragement: The simple classifiers must weakly separate one
   class from another.
   2. Generalization: This is data dependent, which is why you can in fact
   build an optimal classifier FOR THAT DATA. Your decision functions that
   work on the training set must also work on the test set.
   3. Fairness: You cannot have any statistical bias. Thus, square
   classifiers cannot be used to classify squares, for example, because the
   edges and corners of the squares would have different statistics.

Like most things in life, you cannot actually meet all the conditions for
such a classifier. Usually you get the first two and fudge the third, either
by post-processing as Kleinberg does (which breaks the parallelization) or
by random functions that tend to diffuse the bias, as in Random Forests.
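A toy sketch of why such collections work at all (my own illustration, not
Kleinberg's actual construction): suppose each simple classifier is
independently right only 60% of the time. By the Law of Large Numbers, a
majority vote over many of them is almost always right.

```python
import random

random.seed(42)

def weak_vote(p_correct=0.6):
    """One weak classifier's vote: +1 if it is right, -1 if it is wrong."""
    return 1 if random.random() < p_correct else -1

def majority_accuracy(n_models, n_trials=2000):
    """Fraction of trials in which a majority vote of n_models is right."""
    wins = sum(1 for _ in range(n_trials)
               if sum(weak_vote() for _ in range(n_models)) > 0)
    return wins / n_trials

print(majority_accuracy(1))    # roughly 0.6: a single weak model
print(majority_accuracy(101))  # near 1.0: the ensemble
```

The catch, of course, is the independence assumption: real weak classifiers
share errors, which is exactly where the fairness/bias condition bites.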


   - Kleinberg's site is worth a look: http://kappa.math.buffalo.edu/  But
   he can be rather obscure, and others have implemented his theory; see,
   for example, http://tinyurl.com/6hvfvs
   - For Random Forests, see Leo Breiman's site (RIP):
   http://www.stat.berkeley.edu/users/breiman/  Breiman was one of the key
   inventors (arguably the key inventor) of decision trees.
   - But I'd also look at very simple implementations, such as the one by
   Zisserman:

Bosch, A., Zisserman, A. and Munoz, X.
"Image Classification using Random Forests and Ferns"
Proceedings of the 11th International Conference on Computer Vision, Rio de
Janeiro, Brazil (2007)
Bibtex: http://www.robots.ox.ac.uk/~vgg/publications/html/bosch07a-bibtex.html
Abstract: http://www.robots.ox.ac.uk/~vgg/publications/html/bosch07a-abstract.html
ps.gz: http://www.robots.ox.ac.uk/~vgg/publications/papers/bosch07a.ps.gz
PDF: http://www.robots.ox.ac.uk/~vgg/publications/papers/bosch07a.pdf

Stochastic Discrimination classifiers have nice properties:

   - They never overtrain, unlike boosting. Because of the Law of Large
   Numbers, they just get better with more data.
   - They are innately parallel/independent.
   - It is easy to use them for variable selection via techniques such as
   those discussed by Breiman (see his "Black Box" lecture on his site).
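
A minimal sketch of the permutation idea behind that kind of variable
selection (my own toy illustration, with a hypothetical fixed threshold rule
standing in for a trained forest): permute one variable's column and measure
how far accuracy falls. Important variables produce a large drop.

```python
import random

random.seed(1)

# Invented toy data: 500 rows, 2 features; only feature 0 decides the class.
data = [[random.random() for _ in range(2)] for _ in range(500)]
labels = [int(x[0] > 0.5) for x in data]

def accuracy(xs, ys):
    """Accuracy of the (hypothetical) fitted rule: predict x[0] > 0.5."""
    return sum(int(x[0] > 0.5) == y for x, y in zip(xs, ys)) / len(ys)

def importance(feature, xs, ys):
    """Accuracy drop after randomly permuting one feature's column."""
    shuffled = [row[:] for row in xs]
    column = [row[feature] for row in shuffled]
    random.shuffle(column)
    for row, value in zip(shuffled, column):
        row[feature] = value
    return accuracy(xs, ys) - accuracy(shuffled, ys)

print(importance(0, data, labels))  # big drop: feature 0 carries the signal
print(importance(1, data, labels))  # 0.0: the rule never looks at feature 1
```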

When built out of decision trees, they have principled ways of handling
missing data, mixed data types, and data at very different scales, as often
occurs with real data but seldom in computer vision.
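For concreteness, here is a deliberately tiny sketch of the Random Forest
recipe (my own illustration on invented toy data, with one-level stumps
standing in for full trees): each tree gets a bootstrap sample of the rows
and a random subset of the features, the trees train independently of one
another (hence the easy parallelism), and prediction is a majority vote.

```python
import random

random.seed(0)

# Toy data: three features, but only feature 0 determines the class.
DATA = []
for _ in range(150):
    x = [random.random() for _ in range(3)]
    DATA.append((x, int(x[0] > 0.5)))

def stump_error(sample, f, t, flip):
    """Misclassifications of the stump 'x[f] > t' (optionally flipped)."""
    return sum(((x[f] > t) != flip) != y for x, y in sample)

def train_stump(sample, feature_ids):
    """Exhaustive search for the best (feature, threshold, polarity)."""
    _, f, t, flip = min(
        (stump_error(sample, f, t, flip), f, t, flip)
        for f in feature_ids
        for t in sorted({x[f] for x, _ in sample})
        for flip in (False, True))
    return lambda x: int((x[f] > t) != flip)

def train_forest(data, n_trees=45, n_feats=2):
    """Each 'tree' is independent, so this loop could run on many machines."""
    forest = []
    for _ in range(n_trees):
        boot = [random.choice(data) for _ in data]              # bootstrap rows
        feats = random.sample(range(len(data[0][0])), n_feats)  # random features
        forest.append(train_stump(boot, feats))
    return forest

def predict(forest, x):
    return int(sum(tree(x) for tree in forest) * 2 > len(forest))

forest = train_forest(DATA)
accuracy = sum(predict(forest, x) == y for x, y in DATA) / len(DATA)
print(accuracy)  # high, even though many stumps saw only noise features
```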

By the way, OpenCV has a full implementation of Random Forests that is free,
open and under a BSD license.

Gary

On Thu, Apr 3, 2008 at 5:04 PM, Jeff Eastman <jeff@windwardsolutions.com>
wrote:

>  Hi Gary,
>
>
>
> Thanks for your suggestion on Random Forests. I've cc'd this thread to the
> Mahout dev list just in case you would like to continue it there. We have
> received a lot of interest from students in conjunction with the Google
> Summer of Code project and others looking to contribute to our mission. We
> are not restricted at all to the 10 original NIPS algorithms; they were just
> a natural starting point and a way to "prime the pump". Perhaps some more
> information on your experiences using it on real manufacturing data would
> motivate an implementation.
>
>
>
> Jeff
>
>
>   ------------------------------
>
> *From:* Gary Bradski [mailto:garybradski@gmail.com]
> *Sent:* Thursday, April 03, 2008 4:46 PM
> *To:* Jeff Eastman
> *Cc:* Andrew Y. Ng; Dubey, Pradeep; Jimmy Lin
>
> *Subject:* Re: MapReduce, machine learning, and introductions
>
>
>
> One of the things I'd like to see parallelized is Random forests.  Though
> there is no "best" algorithm for classification, when I ran it on Intel
> manufacturing data sets it was almost always beating boosting, SVM, and
> MART. Zisserman claimed it worked best on keypoint recognition in vision and
> his version was the simplest one I've heard of.
>
> This is one of those "brain dead" parallelizations -- just parcel out the
> learning of trees on randomly selected subsets of the data.  In learning,
> each tree randomly selects from a subset of the features at each node.
>
> It has nice techniques for doing feature selection as well.
>
> Gary
>
> On Thu, Apr 3, 2008 at 4:27 PM, Jeff Eastman <jeff@windwardsolutions.com>
> wrote:
>
> Well, it has been a couple of years. Thanks for the response and
> retransmission. Good luck in your current endeavors.
>
>
>
> Jeff
>
>
>   ------------------------------
>
> *From:* Gary Bradski [mailto:garybradski@gmail.com]
> *Sent:* Thursday, April 03, 2008 4:23 PM
> *To:* Andrew Y. Ng; Dubey, Pradeep
> *Cc:* Jeff Eastman; Jimmy Lin
> *Subject:* Re: MapReduce, machine learning, and introductions
>
>
>
> Re: Parallel Machine learning project mahout
> http://lucene.apache.org/mahout
>
> When I was at Intel, I began carving out a parallel Machine learning niche
> since it was something interesting that Intel would also be interested in.
>
> But that was two companies ago for me and I haven't touched it since.  I'm
> now focused on sensor guided manipulation and revamping the computer vision
> library I started, OpenCV.
>
> About all I can do is send the last known working version of the code that
> I had.  I've CC'd Pradeep Dubey, an Intel Fellow with whom I worked on some
> of the parallel machine learning issues; his team also studied that code.  I
> don't know what happened since, but parallel machine learning might still be
> one of his active areas, and maybe there's some synergy there.
>
> Gary
>
> On Thu, Apr 3, 2008 at 3:38 PM, Andrew Y. Ng <ang@cs.stanford.edu> wrote:
>
> Hi Jeff,
>
> I'd been hearing increasing amounts of buzz on Mahout and am excited
> about it, but unfortunately am no longer working in this space.
> Gary Bradski, CC-ed above, would be a great person to talk to about
> Map-Reduce and machine learning, though!
>
> Andrew
>
>
> On Thu, 3 Apr 2008, Jeff Eastman wrote:
>
> > Hi Andrew,
> >
> > I'm a committer on the new Mahout project. As Jimmy indicated, we are
> > setting out to implement versions of the NIPS paper algorithms on top of
> > Hadoop. So far, we have committed versions of only k-means and canopy
> but
> > have a number of other algorithms in various stages of implementation. I
> > don't have any immediate questions but I live in Los Altos and so it
> would
> > be convenient to visit if you or your colleagues do have questions about
> > Mahout.
> >
> > In any case I thought it would be nice to introduce myself.
> >
> > Jeff
> >
> > http://lucene.apache.org/mahout
> >
> >
> > Jeff Eastman, Ph.D.
> > Windward Solutions Inc.
> > +1.415.298.0023
> > http://windwardsolutions.com
> > http://jeffeastman.blogspot.com
> >
> >
> > > -----Original Message-----
> > > From: Jimmy Lin [mailto:jimmylin@umd.edu]
> > > Sent: Saturday, March 29, 2008 8:37 PM
> > > To: ang@cs.stanford.edu
> > > Cc: Jeff Eastman
> > > Subject: MapReduce, machine learning, and introductions
> > >
> > > Hi Andrew,
> > >
> > > How are things going?  Haven't seen you in a while... hope everything
> > > is going well at Stanford.
> > >
> > > I was recently in the bay area attending the Yahoo Hadoop summit---
> > > I've been using MapReduce in teaching and research recently (stat MT,
> > > IR, etc.), so I was there talking about that.
> > >
> > > Are you aware of the Apache Mahout project?  They are putting together
> > > an open-source MR toolkit for machine-learning-ish things; one of the
> > > things they're working on is implementing the various algorithms in
> > > your NIPS paper.  Jeff Eastman is involved in the project, cc'ed
> > > here.  I thought I'd put the two of you in touch...
> > >
> > > Best,
> > > Jimmy
> >
> >
> >
> >
>
>
>
>
>
