spark-issues mailing list archives

From "Sung Chung (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
Date Mon, 29 Sep 2014 21:42:33 GMT

    [ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152347#comment-14152347 ]

Sung Chung commented on SPARK-3717:
-----------------------------------

I think that this would be great as an alternative option.

1. Partitioning by rows (as currently implemented) scales with the # of rows.
2. Partitioning by features scales with the # of features.

With good modularization, I think much of the tree logic (splitting, building trees) could be shared between the two partitioning schemes, along the lines of the sketch below.
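
A rough sketch of what that shared layer could look like (all names here are hypothetical, not existing MLlib APIs): the level-by-level driver is written once against an abstract strategy, and the row- and feature-partitioned backends differ only in how they compute and apply splits.

{code:scala}
// Hypothetical sketch of shared tree logic over an abstract
// partitioning strategy; none of these names exist in MLlib.
case class Split(feature: Int, threshold: Double, infoGain: Double)

trait PartitioningStrategy {
  /** Best split per active node for the current level (the distributed step). */
  def bestSplits(activeNodes: Seq[Int]): Map[Int, Split]
  /** Update the instance -> node mapping after the splits are chosen. */
  def applySplits(splits: Map[Int, Split]): Unit
}

// Level-by-level training loop, shared by both partitioning schemes.
def trainLevels(strategy: PartitioningStrategy, maxDepth: Int): Unit = {
  var active = Seq(0)                                  // root node
  for (_ <- 0 until maxDepth if active.nonEmpty) {
    val splits = strategy.bestSplits(active)
    strategy.applySplits(splits)
    // Children in a binary heap-style node numbering.
    active = splits.keys.toSeq.flatMap(n => Seq(2 * n + 1, 2 * n + 2))
  }
}
{code}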

> DecisionTree, RandomForest: Partition by feature
> ------------------------------------------------
>
>                 Key: SPARK-3717
>                 URL: https://issues.apache.org/jira/browse/SPARK-3717
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>
> h1. Summary
> Currently, data are partitioned by row/instance for DecisionTree and RandomForest.  This JIRA argues for partitioning by feature for training deep trees.  This is especially relevant for random forests, which are often trained to be deeper than single decision trees.
> h1. Details
> Dataset dimensions and the depth of the tree to be trained are the main problem parameters determining whether it is better to partition features or instances.  For random forests (training many deep trees), partitioning features could be much better.
> Notation:
> * P = # workers
> * N = # instances
> * M = # features
> * D = depth of tree
> h2. Partitioning Features
> Algorithm sketch:
> * Each worker stores:
> ** a subset of columns (i.e., a subset of features).  If a worker stores feature j, then the worker stores the feature value for all instances (i.e., the whole column).
> ** all labels
> * Train one level at a time.
> * Invariants:
> ** Each worker stores a mapping: instance → node in current level
> * On each iteration:
> ** Each worker: For each node in the level, compute (best feature to split, info gain).
> ** Reduce the (P x M) values to M values to find the best split for each node.
> ** Workers whose features are used in the best splits communicate left/right for the relevant instances.  Gather a total of N bits to the master, then broadcast.
> * Total communication:
> ** Depth D iterations
> ** On each iteration, reduce to M values (~8 bytes each), broadcast N values (1 bit each).
> ** Estimate: D * (M * 8 + N)
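> As a concreteness check, here is a minimal sketch of one such iteration (hypothetical names; the gain function stands in for the shared impurity computation over the instances currently at a node):
> {code:scala}
> // Hypothetical sketch of one level of feature-partitioned training.
> // Each worker holds whole feature columns plus all labels and the
> // instance -> node map, as described above.
> case class Candidate(feature: Int, threshold: Double, infoGain: Double)
>
> // Step 1, per worker: best candidate per active node over owned features.
> def localBest(
>     ownedCols: Map[Int, Array[Double]],          // feature -> full column
>     activeNodes: Set[Int],
>     gain: (Int, Double, Int) => Double           // (feature, threshold, node)
> ): Map[Int, Candidate] =
>   activeNodes.iterator.map { node =>
>     val cands = for {
>       (f, col) <- ownedCols.iterator
>       t        <- col.distinct.iterator
>     } yield Candidate(f, t, gain(f, t, node))
>     node -> cands.maxBy(_.infoGain)
>   }.toMap
>
> // Step 2, reduce across the P workers: keep the best candidate per node
> // (the "reduce (P x M) values to M values" step above).
> def merge(a: Map[Int, Candidate], b: Map[Int, Candidate]): Map[Int, Candidate] =
>   (a.keySet ++ b.keySet).iterator.map { n =>
>     n -> (a.get(n) ++ b.get(n)).maxBy(_.infoGain)
>   }.toMap
>
> // Step 3: the worker owning a winning feature encodes left/right per
> // instance as one bit; these are gathered (N bits total) and broadcast
> // so every worker can update its instance -> node map.
> def leftRightBits(col: Array[Double], threshold: Double): Array[Boolean] =
>   col.map(_ <= threshold)
> {code}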
> h2. Partitioning Instances
> Algorithm sketch:
> * Train one group of nodes at a time.
> * Invariants:
> ** Each worker stores a mapping: instance → node
> * On each iteration:
> ** Each worker: For each instance, add to aggregate statistics.
> ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes)
> *** (“# classes” is for classification.  3 for regression)
> ** Reduce aggregate.
> ** Master chooses best split for each node in group and broadcasts.
> * Local training: Once all instances for a node fit on one machine, it can be best to shuffle the data and train subtrees locally.  This can mean shuffling the entire dataset for each tree trained.
> * Summing over all iterations, the total reduced is:
> ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each)
> ** Estimate: 2^D * M * B * C * 8
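> For reference, a minimal sketch of that aggregate's layout (hypothetical; MLlib's actual implementation differs in details):
> {code:scala}
> // Hypothetical sketch of the per-worker statistics aggregate for
> // instance partitioning: counts indexed by (node in group, feature,
> // bin, class), filled from local rows and element-wise summed in the
> // reduce.  Its length is exactly the (# nodes in group) x M x
> // (# bins) x (# classes) figure above.
> class StatsAggregate(numNodes: Int, numFeatures: Int,
>                      numBins: Int, numClasses: Int) {
>   val stats = new Array[Double](numNodes * numFeatures * numBins * numClasses)
>
>   private def offset(n: Int, f: Int, b: Int, c: Int): Int =
>     ((n * numFeatures + f) * numBins + b) * numClasses + c
>
>   // Called once per (instance, feature) pair on this worker's rows.
>   def add(n: Int, f: Int, b: Int, c: Int): Unit =
>     stats(offset(n, f, b, c)) += 1.0
>
>   // Element-wise sum; this is the reduce step.
>   def merge(other: StatsAggregate): this.type = {
>     var i = 0
>     while (i < stats.length) { stats(i) += other.stats(i); i += 1 }
>     this
>   }
> }
> {code}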
> h2. Comparing Partitioning Methods
> Partitioning features cost < partitioning instances cost when:
> * D * (M * 8 + N) < 2^D * M * B * C * 8
> * D * N < 2^D * M * B * C * 8  (assuming D * M * 8 is small compared to the right-hand side)
> * N < [ 2^D * M * B * C * 8 ] / D
> Example (many instances):
> * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 5)
> * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7
> * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8
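> A quick check of the arithmetic (following the example's convention of 6 iterations for the feature estimate and 2^5 nodes for the instance estimate):
> {code:scala}
> val (d, levels, m, n, b, c) = (5, 6, 3500, 2000000, 100, 5)
> val featureCost  = levels.toLong * (m * 8 + n)   // 12,168,000  ~= 1.2e7
> val instanceCost = (1L << d) * m * b * c * 8     // 448,000,000 ~= 4.5e8
> {code}
> So for this problem shape, partitioning features is estimated to communicate roughly 37x less.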


