mahout-user mailing list archives

From Marty Kube
Subject Re: Decision Forest - Partial implementation
Date Sun, 09 Dec 2012 22:33:12 GMT

On 12/09/2012 04:29 AM, Ted Dunning wrote:
> On Sun, Dec 9, 2012 at 2:12 AM, Marty Kube wrote:
>> ...
>> I've been looking at the mmap suggestion some.  When you said:
>> 1) use shared memory via mmap to store the forest.  This allows multiple
>> mapper threads to access the same forest.  The current Mahout in-memory
>> structure for this is not suitable for shared memory, however.
>> Can you be a little more specific about why the current in-memory
>> structure is not suitable for shared memory?
> Because it uses Java pointers instead of offsets.  The mmap'ed structure
> could be mapped into memory at any address and thus must be position
> independent.
Okay, I think I get the point here.  Instead of representing a tree as
Java objects, one would have a mapped byte array.  You'd have to know
the encoding in order to read and evaluate a decision node.  The
locations of the other nodes in a tree (and of the tree roots) would be
encoded as offsets into the file instead of object references.
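
To make the offset idea concrete, here is a minimal sketch in Java.  The
node layout (20 bytes: feature index, threshold, left offset, right
offset, with a leaf marked by feature index -1) and all names are my own
assumptions for illustration, not Mahout's format.  It uses the standard
FileChannel.map call, so no JNI is involved.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical on-disk tree layout (not Mahout's): fixed 20-byte nodes linked by byte offsets. */
final class MappedTreeSketch {
    static final int LEAF = -1;       // feature index value that marks a leaf
    static final int NODE_SIZE = 20;  // int feature, double threshold, int left, int right

    /** Writes one node at a byte offset; children are stored as offsets, not object references. */
    static void putNode(ByteBuffer buf, int offset, int feature, double threshold,
                        int left, int right) {
        buf.putInt(offset, feature);
        buf.putDouble(offset + 4, threshold);
        buf.putInt(offset + 12, left);   // for a leaf, this slot holds the class label
        buf.putInt(offset + 16, right);
    }

    /** Writes a three-node tree: the root splits on feature 0 at 0.5; leaves carry labels 7 and 9. */
    static void writeTree(Path file) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(3 * NODE_SIZE);
        putNode(buf, 0, 0, 0.5, NODE_SIZE, 2 * NODE_SIZE);
        putNode(buf, NODE_SIZE, LEAF, 0.0, 7, 0);
        putNode(buf, 2 * NODE_SIZE, LEAF, 0.0, 9, 0);
        Files.write(file, buf.array());
    }

    /** Maps the file read-only; the mapping remains valid after the channel is closed. */
    static MappedByteBuffer map(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    /** Walks from the root offset to a leaf by following offsets inside the buffer. */
    static int classify(ByteBuffer buf, int rootOffset, double[] features) {
        int offset = rootOffset;
        while (buf.getInt(offset) != LEAF) {
            int feature = buf.getInt(offset);
            double threshold = buf.getDouble(offset + 4);
            offset = features[feature] < threshold
                    ? buf.getInt(offset + 12)
                    : buf.getInt(offset + 16);
        }
        return buf.getInt(offset + 12);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("tree", ".bin");
        writeTree(file);
        ByteBuffer tree = map(file);
        System.out.println(classify(tree, 0, new double[] {0.2})); // prints 7
        System.out.println(classify(tree, 0, new double[] {0.8})); // prints 9
    }
}
```

Because every link is an offset relative to the start of the buffer, it
does not matter where the OS places the mapping.  As far as I understand
it, several JVMs on one host that map the same file read-only should end
up sharing the same physical pages through the page cache.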
>> I'm finding that Java does not support shared memory so one would need to
>> run the forest cache through JNI in order to use mmap and shared memory.
> Not quite true.  See
>> The other track I came up with is to use a distributed cache like memcache
>> or hazelcast.  To me those solutions seem targeted at cross-host caches, so
>> I worry about performance.  What I really want is a within-host shared
>> cache across JVMs.
> You should definitely worry about performance on these.
> There are two good approaches.  If your shared objects are pretty small,
> then distributed cache can get the objects into the local file system for
> mapping.  If they are larger, then you can use MapR's NFS capabilities to
> present anything in the cluster as a normal file which can then be mapped.
>> On 12/08/2012 03:43 AM, Ted Dunning wrote:
>>> There are several approaches that might help:
>>> 1) use shared memory via mmap to store the forest.  This allows multiple
>>> mapper threads to access the same forest.  The current Mahout in-memory
>>> structure for this is not suitable for shared memory, however.
>>> 2) split the forests across many mappers (as you suggest).  You would have
>>> to tag your outputs cleverly so that they wind up at the right reducer.
>>> Tags would include input data segment and forest segment.  Mahout doesn't
>>> support this, but it should be easily doable.
>>> 3) thin the forests.  There isn't a lot of literature on this, but I am
>>> pretty sure that I have seen some articles where less informative trees in
>>> the random forest were removed.  Another option with a similar effect is to
>>> use the random forest as an oracle so that you can generate a huge amount
>>> of training data for some other technique that may be prone to
>>> over-fitting.  This alternative model can be trained to fit the output of
>>> the random forest very precisely.  Over-fitting isn't an issue because you
>>> can generate as much training data as you like.  This isn't supported in
>>> Mahout.
>>> On Sat, Dec 8, 2012 at 2:03 AM, Marty Kube wrote:
>>>> So here is a better description of the decision forest classification
>>>> implementation I'm working on.  This is for large scale classification
>>>> after training.
>>>> We have many attributes being classified, and each attribute has its own
>>>> forest.  The forests are big enough when loaded into RAM that you get only
>>>> one JVM per host.  But you really want one thread per processor on the
>>>> host, so we ended up threading the mappers.  We have a lot of feature
>>>> vectors so we send the features to the mappers.
>>>> This seems a bit awkward.  I've been thinking about spreading the trees
>>>> out across mappers to reduce the RAM per JVM with the goal of getting
>>>> closer to one JVM per core.  But then we'll need to do a more complex join
>>>> between forests and feature vectors.  Right now we are essentially doing a
>>>> replicated join with the forest being the replicated set.
>>>> Has anyone tried this - Is there support for this in Mahout?
>>>> On 12/06/2012 09:32 PM, Marty Kube wrote:
>>>>> Yes I'm on a project in which we classify a large data set.  We do use
>>>>> mapreduce to do the classification as the data set is much larger than the
>>>>> working memory.  We have a non-mahout implementation...
>>>>> So we put the decision forest in memory via a distributed cache and
>>>>> partition the data set and run it past the models.  The models are getting
>>>>> pretty big and keeping them in memory is a challenge. I guess I was looking
>>>>> for an implementation that doesn't require keeping the decision forest in
>>>>> memory.  I'll have a look at the TestForest implementation.
>>>>> On 12/06/2012 12:06 AM, deneche abdelhakim wrote:
>>>>>> You mean you want to classify a large dataset?
>>>>>> The partial implementation is useful when the training dataset is too
>>>>>> large to fit in memory. If it does fit, then you had better train the
>>>>>> forest using the in-memory implementation.
>>>>>> If you want to classify a large amount of rows then you can add the
>>>>>> parameter -mr to TestForest to classify the data using mapreduce.
>>>>>> An example of this can be found in the wiki:
>>>>>> <https://cwiki.apache.org/MAHOUT/partial-implementation.html>
>>>>>> On Thu, Dec 6, 2012 at 2:45 AM, Marty Kube wrote:
>>>>>>> Hi,
>>>>>>> I'm working on improving classification throughput for a decision
>>>>>>> forest.  I was wondering about the use case for Partial Implementation.
>>>>>>> The quick start guide suggests that Partial Implementation is for
>>>>>>> building forests on large datasets.
>>>>>>> My problem is classification after training. Is Partial Implementation
>>>>>>> helpful for this use case?
