mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andy Twigg <andy.tw...@gmail.com>
Subject Re: Out-of-core random forest implementation
Date Sat, 26 Jan 2013 12:39:18 GMT
Hi,

I'd like to grow some decision trees efficiently where the dataset is
too large for memory. The decision tree will almost certainly fit in
memory. I'm a bit surprised that Mahout doesn't have an algorithm to
construct decision trees when the data doesn't fit in memory, given
its focus on large scale!

Can anyone recommend an algorithm that does grow random forests 1)
when the data is out-of-core, 2) with multi-pass streaming access to
data, 3) is distributed, [optionally: 4) is incremental ] ? I am happy
to help contribute such an algorithm to mahout, but would prefer it if
one already exists. Does anyone have familiarity with the algorithm
used by bigML.com ?

Andy




On 26 January 2013 00:29, Marty Kube
<martykube@beavercreekconsulting.com> wrote:
> Hey Andy,
>
> What is the use case that is driving your question?
> Are you looking at the training phase - I didn't realise that one needed to
> keep the data in memory.
> I have a use case where even keeping the trees, much less the data, in
> memory during classification is an issue.
>
> Ted, do you have some references on the algorithms below?  I've done some
> work on using mmap trees during classification.  I'd be happy to contribute
> along those lines.
>
>
> On 01/25/2013 04:52 PM, Ted Dunning wrote:
>>
>> Hey Andy,
>>
>> There are no plans for this.  You are correct that multiple passes aren't
>> too difficult, but they do go against the standard map-reduce paradigm a
>> bit if you want to avoid iterative map-reduce.
>>
>> It definitely would be nice to have a really competitive random forest
>> implementation that uses the global  accumulator style plus long-lived
>> mappers.  The basic idea would be to use the same sort of tricks that
>> Vowpal Wabbit or Giraph use to get a bunch of long-lived mappers and then
>> have them asynchronously talk to a tree repository.
>>
>> On Fri, Jan 25, 2013 at 6:58 PM, Andy Twigg <andy.twigg@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I'm new to this list so I apologise if this is covered elsewhere (but
>>> I couldn't find it..)
>>>
>>> I'm looking at the Random Forests implementations, both mapreduce
>>> ("partial") and non-distributed. Both appear to require the data
>>> loaded into memory. Random forests should be straightforward to
>>> construct with multiple passes through the data without storing the
>>> data in memory. Is there such an implementation in Mahout? If not, is
>>> there a ticket/plan ?
>>>
>>> Thanks,
>>> Andy
>>>
>>>
>>> --
>>> Dr Andy Twigg
>>> Junior Research Fellow, St Johns College, Oxford
>>> Room 351, Department of Computer Science
>>> http://www.cs.ox.ac.uk/people/andy.twigg/
>>> andy.twigg@cs.ox.ac.uk | +447799647538
>>>
>



-- 
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
andy.twigg@cs.ox.ac.uk | +447799647538

Mime
View raw message