mahout-dev mailing list archives

From "Deneche A. Hakim (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAHOUT-145) PartialData mapreduce Random Forests
Date Tue, 11 Aug 2009 18:18:15 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742000#action_12742000 ]

Deneche A. Hakim commented on MAHOUT-145:
-----------------------------------------

How the Partial Mapred builder works:
* step 0 (centralized): the main program prepares and launches the builder
* step 1 (mapred job): each mapper builds a set of trees and classifies the oob instances
of its partition. It returns each tree together with the classifications of all the partition's
instances (instances the tree did not classify get -1)
* step 1-2 (centralized): the builder processes the outputs of the job twice:
 ** the first time to compute the partitions' sizes and their respective order
 ** the second time to extract the trees and pass the oob classifications to a callback
 This step is split into two passes to avoid loading all the outputs in memory, which slows
down the program when the data is large
* step 2 (mapred job): each mapper uses all the trees of the other partitions to compute the
classifications for all the instances of its partition. This completes the oob error computation
* step 2-2 (centralized): the builder processes the outputs and passes the oob classifications
to a callback
* step 3 (centralized): the main program receives the decision forest, and its callback has
received all the oob classifications. To compute the oob error it must compare the
oob classifications with the real data labels. Currently this is done by loading the whole data
in memory (ouch!), extracting its labels, then computing the oob error
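
The -1 convention of step 1 and the final comparison of step 3 can be sketched roughly as follows. This is a minimal Python illustration, not the Mahout code: the names `oob_votes` and `oob_error`, the majority-vote aggregation, the choice to skip instances that no tree classified, and the toy data are all assumptions made for the example.

```python
from collections import Counter

UNCLASSIFIED = -1  # sentinel for an instance that a given tree did not classify


def oob_votes(per_tree_predictions):
    """Collect, for every instance, the votes of the trees that classified it.

    per_tree_predictions: one list per tree, all of the same length, where
    entry i is the label predicted for instance i, or UNCLASSIFIED.
    """
    num_instances = len(per_tree_predictions[0])
    votes = [Counter() for _ in range(num_instances)]
    for predictions in per_tree_predictions:
        for i, label in enumerate(predictions):
            if label != UNCLASSIFIED:
                votes[i][label] += 1
    return votes


def oob_error(per_tree_predictions, true_labels):
    """Fraction of voted-on instances whose majority oob vote is wrong.

    Assumption: instances that received no vote at all are ignored.
    """
    wrong = 0
    counted = 0
    for counter, truth in zip(oob_votes(per_tree_predictions), true_labels):
        if not counter:
            continue  # no tree classified this instance
        counted += 1
        predicted = counter.most_common(1)[0][0]
        if predicted != truth:
            wrong += 1
    return wrong / counted if counted else 0.0


# Hypothetical toy data: 3 trees, 4 instances, labels 0/1.
predictions = [
    [0, -1, 0, -1],
    [0, 1, -1, -1],
    [1, 1, 0, -1],
]
labels = [0, 1, 1, 0]
error = oob_error(predictions, labels)
```

Note that only the true labels are needed for the comparison, which is why loading the whole data in memory in step 3 is more than the computation strictly requires.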

In the test results, the build time is the time taken by steps 1, 1-2, 2 and 2-2. Although
step 3 is not included in it, it slows the tests so much that I was not able to try KDD 100%.

In the following results, the build time is computed by the program, and I was able to figure
out the other times using the log of the program.
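
As a quick consistency check of that definition, the step times of the first KDD 10% row (10 map tasks, 100 trees) can be summed and compared with the reported build time. The sketch below assumes that the trailing number in a build time such as `0h 0m 48s 823` is milliseconds; the helper name `parse_build_time` is made up for the example.

```python
import re


def parse_build_time(text):
    """Convert a string such as '0h 0m 48s 823' to seconds.

    Assumption: the last field is milliseconds.
    """
    match = re.fullmatch(r"(\d+)h (\d+)m (\d+)s (\d+)", text.strip())
    if match is None:
        raise ValueError("unexpected build time format: %r" % text)
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000.0


# Step times, in seconds, from the first row of the KDD 10% table.
step_seconds = {"1": 24, "1-2": 2, "2": 15, "2-2": 7}

build = parse_build_time("0h 0m 48s 823")
summed = sum(step_seconds.values())
gap = build - summed  # below one second, since the log times are whole seconds
```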

EC2 10 nodes (c1.medium) cluster
mapred.tasktracker.map.tasks.maximum=2
mapred.child.java.opts=-Xms500m -Xmx1000m
export HADOOP_HEAPSIZE=2000

seed 1, m 1, oob

KDD 10%
|| Num Map Tasks || Num Trees || Oob Error || Build Time || Step 1 || Step 1-2 || Step 2 || Step 2-2 || Step 3 ||
| 10 | 100 | 0.0515 | 0h 0m 48s 823 | 24s | 2s | 15s | 7s | 14s |
| 10 | 200 | 0.0514 | 0h 0m 59s 34 | 27s | 3s | 15s | 14s | 13s |
| 10 | 400 | 0.0513 | 0h 1m 40s 265 | 43s | 7s | 22s | 28s | 13s |
| 20 | 100 | 0.0864 | 0h 0m 37s 366 | 15s | 1s | 14s | 7s | 14s |
| 20 | 200 | 0.1024 | 0h 0m 47s 213 | 14s | 2s | 17s | 14s | 13s |
| 20 | 400 | 0.0903 | 0h 1m 14s 368 | 18s | 4s | 22s | 30s | 13s |
| 50 | 100 | 0.4315 | 0h 0m 37s 657 | 13s | 1s | 16s | 8s | 14s |
| 50 | 200 | 0.4316 | 0h 0m 48s 611 | 15s | 2s | 16s | 15s | 14s |
| 50 | 400 | 0.4316 | 0h 1m 6s 160 | 14s | 2s | 21s | 30s | 12s |

I'll post the KDD 50% and KDD 100% results as soon as I have compiled them; then I can start
explaining these results (or at least I will try).

> PartialData mapreduce Random Forests
> ------------------------------------
>
>                 Key: MAHOUT-145
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-145
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Deneche A. Hakim
>            Priority: Minor
>         Attachments: partial_August_10.patch, partial_August_2.patch, partial_August_9.patch
>
>
> This implementation is based on a suggestion by Ted:
> "modify the original algorithm to build multiple trees for different portions of the
> data. That loses some of the solidity of the original method, but could actually do better
> if the splits exposed non-stationary behavior."

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

