spark-issues mailing list archives

From "Seth Hendrickson (JIRA)" <>
Subject [jira] [Commented] (SPARK-23704) PySpark access of individual trees in random forest is slow
Date Fri, 22 Jun 2018 22:22:00 GMT


Seth Hendrickson commented on SPARK-23704:

Instead of
Can you try
trees = model.trees
And time only the second line? The first line (trees = model.trees) actually calls into the
JVM and creates new tree objects in Python, so its cost should be kept out of the measurement.
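
A minimal sketch of how the two costs could be timed separately. The helper name
time_single_tree and its arguments are hypothetical and not part of the original report;
it only assumes that model exposes a .trees attribute and that each tree has a
.transform(df) method, as PySpark random forest models do.

```python
import time


def time_single_tree(model, df, index=0):
    """Return (seconds spent reading model.trees, seconds spent in transform).

    Hypothetical helper for illustration; `model` is assumed to be a fitted
    random forest model and `df` a DataFrame it can transform.
    """
    # Read the attribute exactly once and keep the resulting list.
    start = time.perf_counter()
    trees = model.trees
    fetch_seconds = time.perf_counter() - start

    # Time only the transform call on the already-fetched tree.
    # The returned DataFrame is discarded; no action is run on it.
    start = time.perf_counter()
    trees[index].transform(df)
    transform_seconds = time.perf_counter() - start

    return fetch_seconds, transform_seconds
```

If most of the time lands in the first number, the slowdown comes from building the Python
tree objects rather than from the transform itself.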

> PySpark access of individual trees in random forest is slow
> -----------------------------------------------------------
>                 Key: SPARK-23704
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, PySpark
>    Affects Versions: 2.2.1
>         Environment: PySpark 2.2.1 / Windows 10
>            Reporter: Julian King
>            Priority: Minor
> Making predictions from a RandomForestClassifier in PySpark is much faster than making predictions from an individual tree contained within the .trees attribute.
> In fact, the model.transform call without an action is more than 10x slower for an individual tree vs the model.transform call for the random forest model.
> See [] for example with timing.
> Ideally:
>  * Getting a prediction from a single tree should be comparable to or faster than getting predictions from the whole forest
>  * Getting all the predictions from all the individual trees should be comparable in speed to getting the predictions from the random forest

This message was sent by Atlassian JIRA
