mahout-dev mailing list archives

From Meng Qinghan <qinghan.m...@gmail.com>
Subject Re: Out-of-core random forest implementation
Date Thu, 07 Mar 2013 02:31:13 GMT
You can read the following paper, which I think is useful.

http://link.springer.com/chapter/10.1007%2F978-3-642-30217-6_12?LI=true

2013/3/6 Andy Twigg <andy.twigg@gmail.com>

> Hi guys,
>
> I've created a JIRA issue for this now -
> https://issues.apache.org/jira/browse/MAHOUT-1153
>
> Right now we're working on my fork but will make an early patch once
> we can. Hopefully that will provide a basis for others who are
> interested to add to it.
>
> Andy
>
>
> On 22 February 2013 08:49, Andy Twigg <andy.twigg@gmail.com> wrote:
> > Hi Marty,
> >
> > I would suggest doing each tree independently, one after the other.
> > Have each node select a random sample w/replacement of its data, and
> > let the master choose the random subset of features to split with. I'm
> > sure there are more efficient ways, but we can think of them later.
> >
> > -Andy
> >
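For concreteness, a minimal single-machine Scala sketch of the two randomization steps described above: each worker draws a bootstrap sample (with replacement) from its local partition, and the master picks a random subset of features to consider at a split. The Row type, the helper names, and the sqrt(numFeatures) subset size are illustrative assumptions, not anything fixed in the thread.

    import scala.util.Random

    object PerTreeRandomizationSketch {
      // Hypothetical record type: a feature vector plus a label.
      final case class Row(features: Array[Double], label: Double)

      // Worker side, per tree: a bootstrap sample of the local partition,
      // i.e. sampling with replacement up to the partition size.
      def bootstrap(partition: IndexedSeq[Row], rng: Random): IndexedSeq[Row] =
        IndexedSeq.fill(partition.size)(partition(rng.nextInt(partition.size)))

      // Master side, per split: a random subset of feature indices
      // (sqrt(numFeatures) is a common default for classification forests).
      def featureSubset(numFeatures: Int, rng: Random): Set[Int] = {
        val k = math.max(1, math.round(math.sqrt(numFeatures.toDouble)).toInt)
        rng.shuffle((0 until numFeatures).toList).take(k).toSet
      }

      def main(args: Array[String]): Unit = {
        val rng = new Random(42)
        val partition = IndexedSeq.tabulate(10)(i => Row(Array(i.toDouble, -i.toDouble), (i % 2).toDouble))
        val sample = bootstrap(partition, rng)
        val feats  = featureSubset(numFeatures = 2, rng = rng)
        println(s"bootstrap size = ${sample.size}, feature subset = $feats")
      }
    }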
> >
> > On 22 February 2013 01:47, Marty Kube <marty.kube.apache@gmail.com> wrote:
> >> On the algorithm choice, Parallel Boosted Regression Trees looks
> >> interesting as well.
> >>
> >> http://research.engineering.wustl.edu/~tyrees/Publications_files/fr819-tyreeA.pdf
> >>
> >> After reading that paper I had a question about the Streaming Parallel
> >> Decision Trees.  The SPDT paper is for building one decision tree.  I was
> >> thinking about how to extend that to a decision forest.  I could see how one
> >> can build many trees in the master and use a random choice of the splitting
> >> variables (as opposed to checking all variables and using the best
> >> variable). However, it seems like the idea of sampling the dataset with
> >> replacement for each tree gets lost.  Is that okay?
> >>
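On the worry that per-tree sampling with replacement gets lost: one workaround sometimes used in one-pass or streaming settings is online bagging (Oza and Russell), where each tree weights every incoming example by an independent Poisson(1) draw, which approximates the bootstrap without a second pass over the data. That is not something settled in this thread; a minimal Scala sketch with placeholder names:

    import scala.util.Random

    object OnlineBaggingSketch {
      // Draw from Poisson(1) by Knuth's inversion method (fine for small means).
      def poisson1(rng: Random): Int = {
        val limit = math.exp(-1.0)
        var k = -1
        var p = 1.0
        while (p > limit) { k += 1; p *= rng.nextDouble() }
        k
      }

      def main(args: Array[String]): Unit = {
        val rng = new Random(0)
        val numTrees = 5
        // Each example seen once in the stream is given an independent
        // Poisson(1) weight per tree; weight 0 means "not sampled" for that tree.
        val weights = Array.fill(numTrees)(poisson1(rng))
        println(s"per-tree weights for this example: ${weights.mkString(", ")}")
        // A tree would then add the example to its statistics weights(t) times.
      }
    }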
> >>
> >> On 02/21/2013 03:19 AM, Ted Dunning wrote:
> >>>
> >>> For quantile estimation, consider also streamlib at
> >>> https://github.com/clearspring/stream-lib
> >>>
> >>> The bigmlcom implementation looks more directly applicable, actually.
> >>>
> >>> On Wed, Feb 20, 2013 at 5:01 PM, Andy Twigg <andy.twigg@gmail.com> wrote:
> >>>
> >>>     Even better, there is already a good implementation of the histograms:
> >>>     https://github.com/bigmlcom/histogram
> >>>
> >>>     -Andy
> >>>
> >>>
> >>>     On 20 February 2013 22:50, Marty Kube <marty.kube.apache@gmail.com> wrote:
> >>>     > That's a winner...
> >>>     > Out of all of the algorithms I've looked at, the Ben-Haim/SPDT looks most
> >>>     > likely.  In batch mode it uses one pass over the data set, it can be used in
> >>>     > a streaming mode, and has constant space and time requirements.  That seems
> >>>     > like the kind of scalable algorithm we're after.
> >>>     > I'm in!
> >>>     >
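For reference, the constant-space behaviour comes from the on-line histogram in the Ben-Haim/Yom-Tov paper: keep at most a fixed number of (centroid, count) bins, and whenever a new point pushes the bin count over the limit, merge the two bins with the closest centroids into their weighted average. An illustrative Scala sketch of that idea (a simplification, not the bigmlcom/histogram API):

    // Fixed-size streaming histogram in the spirit of Ben-Haim & Yom-Tov (2010);
    // merging the closest pair of bins keeps memory bounded by maxBins.
    final class StreamingHistogram(maxBins: Int) {
      // Bins sorted by centroid, each a (centroid, count) pair.
      private var bins: Vector[(Double, Long)] = Vector.empty

      def add(p: Double): Unit = {
        bins = (bins :+ ((p, 1L))).sortBy(_._1)
        compress()
      }

      private def compress(): Unit = {
        while (bins.size > maxBins) {
          // Find the adjacent pair with the smallest centroid gap and merge it.
          val i = (0 until bins.size - 1).minBy(j => bins(j + 1)._1 - bins(j)._1)
          val (c1, n1) = bins(i)
          val (c2, n2) = bins(i + 1)
          val merged = ((c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2)
          bins = (bins.take(i) :+ merged) ++ bins.drop(i + 2)
        }
      }

      def snapshot: Vector[(Double, Long)] = bins
    }

    object StreamingHistogramDemo {
      def main(args: Array[String]): Unit = {
        val rng = new scala.util.Random(1)
        val h = new StreamingHistogram(maxBins = 8)
        (1 to 10000).foreach(_ => h.add(rng.nextGaussian()))
        println(h.snapshot.map { case (c, n) => f"$c%.2f:$n" }.mkString("  "))
      }
    }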
> >>>     >
> >>>     > On 02/20/2013 10:09 AM, Andy Twigg wrote:
> >>>     >>
> >>>     >> Alternatively, the algorithm described in [1] is more straightforward,
> >>>     >> efficient, hadoop-compatible (using only mappers communicating to a
> >>>     >> master) and satisfies all our requirements so far. I would like to
> >>>     >> take a pass at implementing that, if anyone else is interested?
> >>>     >>
> >>>     >> [1]
> >>>     >> http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf
> >>>     >>
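A rough Scala sketch of the mappers-to-a-master pattern the paper describes: each mapper compresses its partition of one feature's values into a small histogram, and the master merges those histograms and reads candidate split points off the merged summary. The compression mirrors the bin-merging sketch above, and taking only the approximate median is a simplification of the paper's split search; all names and sizes are placeholders.

    object MapperMasterSketch {
      type Hist = Vector[(Double, Long)] // (bin centroid, count), sorted by centroid

      // Reduce a histogram to at most maxBins bins by repeatedly merging the
      // adjacent pair of bins with the closest centroids (weighted average).
      def compress(h: Hist, maxBins: Int): Hist = {
        var bins = h.sortBy(_._1)
        while (bins.size > maxBins) {
          val i = (0 until bins.size - 1).minBy(j => bins(j + 1)._1 - bins(j)._1)
          val (c1, n1) = bins(i)
          val (c2, n2) = bins(i + 1)
          val merged = ((c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2)
          bins = (bins.take(i) :+ merged) ++ bins.drop(i + 2)
        }
        bins
      }

      // Mapper side: one pass over the local values of one feature
      // (buffered here for brevity; the paper adds points incrementally).
      def workerHistogram(values: Iterator[Double], maxBins: Int): Hist =
        compress(values.map(v => (v, 1L)).toVector, maxBins)

      // Master side: merge all worker histograms, then read an approximate
      // median off the merged summary as a candidate split threshold.
      def masterSplit(hists: Seq[Hist], maxBins: Int): Double = {
        val merged = compress(hists.flatten.toVector, maxBins)
        val total = merged.map(_._2).sum
        var acc = 0L
        merged.find { case (_, n) => acc += n; acc >= total / 2 }.map(_._1).getOrElse(0.0)
      }

      def main(args: Array[String]): Unit = {
        val rng = new scala.util.Random(7)
        val partitions = Seq.fill(4)(Iterator.fill(1000)(rng.nextGaussian()))
        val hists = partitions.map(p => workerHistogram(p, maxBins = 32))
        println(f"approximate median split: ${masterSplit(hists, maxBins = 32)}%.3f")
      }
    }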
> >>>     >>
> >>>     >>> On 20 February 2013 14:27, Andy Twigg <andy.twigg@gmail.com> wrote:
> >>>     >>>
> >>>     >>> Why don't we start from
> >>>     >>>
> >>>     >>> https://github.com/ashenfad/hadooptree ?
> >>>     >>>
> >>>     >>> On 20 February 2013 13:25, Marty Kube <marty.kube.apache@gmail.com> wrote:
> >>>     >>>>
> >>>     >>>> Hi Lorenz,
> >>>     >>>>
> >>>     >>>> Very interesting, that's what I was asking for when I mentioned non-MR
> >>>     >>>> implementations :-)
> >>>     >>>>
> >>>     >>>> I have not looked at spark before, interesting that it uses Mesos for
> >>>     >>>> clustering.  I'll check it out.
> >>>     >>>>
> >>>     >>>>
> >>>     >>>> On 02/19/2013 09:32 PM, Lorenz Knies wrote:
> >>>     >>>>>
> >>>     >>>>> Hi Marty,
> >>>     >>>>>
> >>>     >>>>> I am currently working on a PLANET-like implementation on
> >>>     >>>>> top of spark:
> >>>     >>>>> http://spark-project.org
> >>>     >>>>>
> >>>     >>>>> I think this framework is a nice fit for the problem.
> >>>     >>>>> If the input data fits into the "total cluster memory" you
> >>>     >>>>> benefit from the caching of the RDDs.
> >>>     >>>>>
> >>>     >>>>> regards,
> >>>     >>>>>
> >>>     >>>>> lorenz
> >>>     >>>>>
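A minimal Spark (Scala) sketch of the caching pattern mentioned above: parse the input once, cache() the RDD, and then make repeated passes (for example, one per tree) over the in-memory data. The path, the parsing, and the per-pass work are placeholders.

    import org.apache.spark.SparkContext

    object CachedPassesSketch {
      def main(args: Array[String]): Unit = {
        // Local master for illustration; on a cluster this would be the real master URL.
        val sc = new SparkContext("local[*]", "rf-sketch")

        // Hypothetical input path; each line is a comma-separated feature vector.
        val data = sc.textFile("hdfs:///path/to/training.csv")
          .map(_.split(',').map(_.toDouble))
          .cache() // keep the parsed records in cluster memory across passes

        val numTrees = 10
        for (t <- 0 until numTrees) {
          // Placeholder pass: a real implementation would compute per-node
          // split statistics for tree t here instead of just counting.
          val n = data.count()
          println(s"tree $t: scanned $n cached records")
        }

        sc.stop()
      }
    }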
> >>>     >>>>>
> >>>     >>>>> On Feb 20, 2013, at 2:42 AM, Marty Kube <marty.kube.apache@gmail.com> wrote:
> >>>     >>>>>
> >>>     >>>>>> You had mentioned other "resource management" platforms like Giraph or
> >>>     >>>>>> Mesos.  I haven't looked at those yet.  I guess I was thinking of other
> >>>     >>>>>> parallelization frameworks.
> >>>     >>>>>>
> >>>     >>>>>> It's interesting that the PLANET folks thought it was really worthwhile
> >>>     >>>>>> working on top of map reduce for all of the resource management that is
> >>>     >>>>>> built in.
> >>>     >>>>>>
> >>>     >>>>>>
> >>>     >>>>>> On 02/19/2013 08:04 PM, Ted Dunning wrote:
> >>>     >>>>>>>
> >>>     >>>>>>> If non-MR means map-only job with communicating mappers and a state
> >>>     >>>>>>> store, I am down with that.
> >>>     >>>>>>>
> >>>     >>>>>>> What did you mean?
> >>>     >>>>>>>
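In Hadoop terms, a map-only job is just one configured with zero reduce tasks, so the mappers run independently and any coordination goes through an external state store (files on HDFS, ZooKeeper, etc.) rather than a shuffle. A minimal driver sketch in Scala, assuming the newer mapreduce API and with placeholder mapper and paths:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.{Job, Mapper}
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

    // Placeholder mapper: passes records through; a real one would build local
    // statistics and coordinate through a shared store (e.g. files on HDFS).
    class TreeMapper extends Mapper[LongWritable, Text, LongWritable, Text] {
      override def map(key: LongWritable, value: Text,
                       ctx: Mapper[LongWritable, Text, LongWritable, Text]#Context): Unit =
        ctx.write(key, value)
    }

    object MapOnlyDriver {
      def main(args: Array[String]): Unit = {
        val job = Job.getInstance(new Configuration(), "map-only-pass")
        job.setJarByClass(classOf[TreeMapper])
        job.setMapperClass(classOf[TreeMapper])
        job.setNumReduceTasks(0) // map-only: no shuffle, no reducers
        job.setOutputKeyClass(classOf[LongWritable])
        job.setOutputValueClass(classOf[Text])
        FileInputFormat.addInputPath(job, new Path("hdfs:///path/to/input"))     // hypothetical
        FileOutputFormat.setOutputPath(job, new Path("hdfs:///path/to/output"))  // hypothetical
        System.exit(if (job.waitForCompletion(true)) 0 else 1)
      }
    }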
> >>>     >>>>>>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube
> >>>     >>>>>>> <martykube@beavercreekconsulting.com> wrote:
> >>>     >>>>>>>
> >>>     >>>>>>>> Right now I'd lean towards the PLANET model, or maybe a non-MR
> >>>     >>>>>>>> implementation.  Anyone have a good idea for a non-MR solution?
> >>>     >>>>>>>>
> >>>     >>>
> >>>     >>>
> >>>     >>> --
> >>>     >>> Dr Andy Twigg
> >>>     >>> Junior Research Fellow, St Johns College, Oxford
> >>>     >>> Room 351, Department of Computer Science
> >>>     >>> http://www.cs.ox.ac.uk/people/andy.twigg/
> >>>     >>> andy.twigg@cs.ox.ac.uk | +447799647538
> >>>
> >>>     >>
> >>>     >>
> >>>     >>
> >>>     >> --
> >>>     >> Dr Andy Twigg
> >>>     >> Junior Research Fellow, St Johns College, Oxford
> >>>     >> Room 351, Department of Computer Science
> >>>     >> http://www.cs.ox.ac.uk/people/andy.twigg/
> >>>     >> andy.twigg@cs.ox.ac.uk | +447799647538
> >>>
> >>>     >
> >>>     >
> >>>
> >>>
> >>>
> >>>     --
> >>>     Dr Andy Twigg
> >>>     Junior Research Fellow, St Johns College, Oxford
> >>>     Room 351, Department of Computer Science
> >>>     http://www.cs.ox.ac.uk/people/andy.twigg/
> >>>     andy.twigg@cs.ox.ac.uk | +447799647538
> >>>
> >>>
> >>
> >
> >
> >
> > --
> > Dr Andy Twigg
> > Junior Research Fellow, St Johns College, Oxford
> > Room 351, Department of Computer Science
> > http://www.cs.ox.ac.uk/people/andy.twigg/
> > andy.twigg@cs.ox.ac.uk | +447799647538
>
>
>
> --
> Dr Andy Twigg
> Junior Research Fellow, St Johns College, Oxford
> Room 351, Department of Computer Science
> http://www.cs.ox.ac.uk/people/andy.twigg/
> andy.twigg@cs.ox.ac.uk | +447799647538
>
