mahout-dev mailing list archives

From "Ted Dunning (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAHOUT-145) PartialData mapreduce Random Forests
Date Wed, 05 Aug 2009 21:50:14 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739775#action_12739775 ]

Ted Dunning commented on MAHOUT-145:
------------------------------------

Ouch!

|| Num Map Tasks || Num trees || In-Mem build time || Partial build time || In-Mem oob error || Partial oob error ||
| ...|
| 2 | 100 | 0h 0m 57s 641 | 0h 0m 44s 43 | 4.45E-4 | 0.42 |
| ... |
| 10 | 400 | 0h 3m 33s 253 | 0h 1m 8s 29 | 4.45E-4 | 0.23 |

This looks like it runs faster (or at least not much slower), but produces astronomically
worse results.  

What really bugs me is that it is worse with few maps.  Am I interpreting this correctly when
I say that splitting the data in half and building independent forests increases OOB errors
by a factor of 1000?  How could that possibly be?
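
As a quick check of the factor-of-1000 estimate, the ratio can be computed directly from the two rows quoted in the table. This is a minimal Python sketch; the variable and function names are illustrative and not part of Mahout, and only the two rows shown above are used (the elided rows are not guessed at).

```python
# OOB error rates copied from the two table rows quoted above.
in_mem_oob = 4.45e-4                # same value in both rows
partial_oob = {2: 0.42, 10: 0.23}   # keyed by number of map tasks


def oob_ratio(partial, in_mem):
    """Return how many times larger the partial OOB error is."""
    return partial / in_mem


ratios = {maps: oob_ratio(err, in_mem_oob) for maps, err in partial_oob.items()}

for maps, ratio in sorted(ratios.items()):
    print(f"{maps} map tasks: partial OOB error is about {ratio:.0f}x the in-memory error")
```

With these figures the 2-map run comes out a little under 1000 times the in-memory error and the 10-map run roughly half that, which matches the observation that the gap is widest with few maps.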



> PartialData mapreduce Random Forests
> ------------------------------------
>
>                 Key: MAHOUT-145
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-145
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Deneche A. Hakim
>            Priority: Minor
>         Attachments: partial_August_2.patch
>
>
> This implementation is based on a suggestion by Ted:
> "modify the original algorithm to build multiple trees for different portions of the
> data. That loses some of the solidity of the original method, but could actually do better
> if the splits exposed non-stationary behavior."

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

