mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marty Kube <marty.kube.apa...@gmail.com>
Subject Re: Out-of-core random forest implementation
Date Fri, 22 Feb 2013 01:47:16 GMT
On the algorithm choice, Parallel Boosted Regression Trees looks 
interesting as well.

http://research.engineering.wustl.edu/~tyrees/Publications_files/fr819-tyreeA.pdf

After reading that paper I had a question about the Streaming Parallel 
Decision Trees.  The SPDT paper is for building one decision tree.  I 
was thinking about how to extend that to a decision forest.  I could see 
how one can build many trees in the master and use a random choice of 
the splitting variables (as opposed to checking all variables and using 
the best variable). However, it seems like the idea of sampling the 
dataset with replacement for each tree gets lost.  Is that okay?

On 02/21/2013 03:19 AM, Ted Dunning wrote:
> For quantile estimation, consider also streamlib at 
> https://github.com/clearspring/stream-lib
>
> The bigmlcom implementation looks more directly applicable, actually.
>
> On Wed, Feb 20, 2013 at 5:01 PM, Andy Twigg <andy.twigg@gmail.com 
> <mailto:andy.twigg@gmail.com>> wrote:
>
>     Even better, there is already a good implementation of the histograms:
>     https://github.com/bigmlcom/histogram
>
>     -Andy
>
>
>     On 20 February 2013 22:50, Marty Kube <marty.kube.apache@gmail.com
>     <mailto:marty.kube.apache@gmail.com>> wrote:
>     > That's a winner...
>     > Out of all of the algorithms I've looked at the Ben-Haim/SPDT
>     looks most
>     > likely.  In batch mode it uses one pass over the data set, it
>     can be used in
>     > a streaming mode, and has constant space and time requirements.
>      That seems
>     > like the kind of scalable algorithm we're after.
>     > I'm in!
>     >
>     >
>     > On 02/20/2013 10:09 AM, Andy Twigg wrote:
>     >>
>     >> Alternatively, the algorithm described in [1] is more
>     straightforward,
>     >> efficient, hadoop-compatible (using only mappers communicating to a
>     >> master) and satisfies all our requirements so far. I would like to
>     >> take a pass at implementing that, if anyone else is interested?
>     >>
>     >> [1]
>     http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf
>     >>
>     >>
>     >> On 20 February 2013 14:27, Andy Twigg <andy.twigg@gmail.com
>     <mailto:andy.twigg@gmail.com>> wrote:
>     >>>
>     >>> Why don't we start from
>     >>>
>     >>> https://github.com/ashenfad/hadooptree ?
>     >>>
>     >>> On 20 February 2013 13:25, Marty Kube
>     <marty.kube.apache@gmail.com <mailto:marty.kube.apache@gmail.com>>
>     >>> wrote:
>     >>>>
>     >>>> Hi Lorenz,
>     >>>>
>     >>>> Very interesting, that's what I was asking for when I
>     mentioned non-MR
>     >>>> implementations :-)
>     >>>>
>     >>>> I have not looked at spark before, interesting that it uses
>     Mesos for
>     >>>> clustering.   I'll check it out.
>     >>>>
>     >>>>
>     >>>> On 02/19/2013 09:32 PM, Lorenz Knies wrote:
>     >>>>>
>     >>>>> Hi Marty,
>     >>>>>
>     >>>>> i am currently working on a PLANET-like implementation on
>     top of spark:
>     >>>>> http://spark-project.org
>     >>>>>
>     >>>>> I think this framework is a nice fit for the problem.
>     >>>>> If the input data fits into the "total cluster memory" you
>     benefit from
>     >>>>> the caching of the RDD's.
>     >>>>>
>     >>>>> regards,
>     >>>>>
>     >>>>> lorenz
>     >>>>>
>     >>>>>
>     >>>>> On Feb 20, 2013, at 2:42 AM, Marty Kube
>     <marty.kube.apache@gmail.com <mailto:marty.kube.apache@gmail.com>>
>     >>>>> wrote:
>     >>>>>
>     >>>>>> You had mentioned other "resource management" platforms
>     like Giraph or
>     >>>>>> Mesos.  I haven't looked at those yet.  I guess I was think
>     of other
>     >>>>>> parallelization frameworks.
>     >>>>>>
>     >>>>>> It's interesting that the planet folks thought it was really
>     >>>>>> worthwhile
>     >>>>>> working on top of map reduce for all of the resource
>     management that
>     >>>>>> is
>     >>>>>> built in.
>     >>>>>>
>     >>>>>>
>     >>>>>> On 02/19/2013 08:04 PM, Ted Dunning wrote:
>     >>>>>>>
>     >>>>>>> If non-MR means map-only job with communicating mappers
>     and a state
>     >>>>>>> store,
>     >>>>>>> I am down with that.
>     >>>>>>>
>     >>>>>>> What did you mean?
>     >>>>>>>
>     >>>>>>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <
>     >>>>>>> martykube@beavercreekconsulting.com
>     <mailto:martykube@beavercreekconsulting.com>> wrote:
>     >>>>>>>
>     >>>>>>>> Right now I'd lean towards the planet model, or
maybe a
>     non-MR
>     >>>>>>>> implementation.  Anyone have a good idea for a non-MR
>     solution?
>     >>>>>>>>
>     >>>
>     >>>
>     >>> --
>     >>> Dr Andy Twigg
>     >>> Junior Research Fellow, St Johns College, Oxford
>     >>> Room 351, Department of Computer Science
>     >>> http://www.cs.ox.ac.uk/people/andy.twigg/
>     >>> andy.twigg@cs.ox.ac.uk <mailto:andy.twigg@cs.ox.ac.uk> |
>     +447799647538 <tel:%2B447799647538>
>     >>
>     >>
>     >>
>     >> --
>     >> Dr Andy Twigg
>     >> Junior Research Fellow, St Johns College, Oxford
>     >> Room 351, Department of Computer Science
>     >> http://www.cs.ox.ac.uk/people/andy.twigg/
>     >> andy.twigg@cs.ox.ac.uk <mailto:andy.twigg@cs.ox.ac.uk> |
>     +447799647538 <tel:%2B447799647538>
>     >
>     >
>
>
>
>     --
>     Dr Andy Twigg
>     Junior Research Fellow, St Johns College, Oxford
>     Room 351, Department of Computer Science
>     http://www.cs.ox.ac.uk/people/andy.twigg/
>     andy.twigg@cs.ox.ac.uk <mailto:andy.twigg@cs.ox.ac.uk> |
>     +447799647538 <tel:%2B447799647538>
>
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message