mahout-dev mailing list archives

From "Ted Dunning (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAHOUT-145) PartialData mapreduce Random Forests
Date Sun, 12 Jul 2009 18:53:15 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730129#action_12730129 ]

Ted Dunning commented on MAHOUT-145:
------------------------------------


What do you think about using a normal mapper structure where the map() method reads one line
at a time and stores the record in memory, and the tree building then happens in the close()
method of your mapper?

This trick is used extensively in Hadoop streaming.  If you are using Hadoop 0.18.*, you have
to stash the output collector in an instance variable so that you can still produce output
from close() (or just open a task-specific output file).  In 0.20, I think the new API removes
that need by passing the Context argument to the end-of-task method, which is named cleanup()
there rather than close().  Because producing output in close() is so important to some
applications, you are guaranteed to be able to use the output collector in close().
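
To make the shape of this concrete, here is a minimal sketch in the streaming style, written as
plain Python with no Hadoop dependency. Everything named in it is an assumption made for
illustration: BufferingMapper, run(), the "partition-model" key, comma-separated input with the
label in the last column, and the majority-label "model" that stands in for real tree building.
It is not the MAHOUT-145 code.

```python
import sys
from collections import Counter


class BufferingMapper:
    """Hypothetical stand-in for a mapper that defers its real work to close()."""

    def __init__(self, emit):
        # Stash the output callback in an instance variable, the same way
        # the output collector has to be stashed on 0.18.
        self.emit = emit
        self.records = []  # every record of this split, held in memory

    def map(self, line):
        # Called once per input line: parse and store, produce no output yet.
        line = line.strip()
        if line:
            self.records.append(line.split(","))

    def close(self):
        # Called once after the last line, when the whole partition is in memory.
        if not self.records:
            return
        # Placeholder for tree building (an assumption, not a decision tree):
        # summarize the partition by its majority label, taken from the last column.
        labels = Counter(record[-1] for record in self.records)
        label, _ = labels.most_common(1)[0]
        self.emit("partition-model", f"{label}\t{len(self.records)}")


def run(lines, out=sys.stdout):
    """Drive the mapper the way a framework would: map() per line, then close()."""
    mapper = BufferingMapper(lambda key, value: out.write(f"{key}\t{value}\n"))
    for line in lines:
        mapper.map(line)
    mapper.close()
```

A streaming script built on this would simply call run(sys.stdin). The one design constraint
is that each split has to fit in the task's memory, since nothing is emitted until close().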

> PartialData mapreduce Random Forests
> ------------------------------------
>
>                 Key: MAHOUT-145
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-145
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Deneche A. Hakim
>            Priority: Minor
>
> This implementation is based on a suggestion by Ted:
> "modify the original algorithm to build multiple trees for different portions of the
> data. That loses some of the solidity of the original method, but could actually do better
> if the splits exposed non-stationary behavior."

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.