spark-user mailing list archives

From Russell Jurney <russell.jur...@gmail.com>
Subject In PySpark ML, how can I interpret the SparseVector returned by a pyspark.ml.classification.RandomForestClassificationModel.featureImportances ?
Date Thu, 22 Dec 2016 00:11:26 GMT
I am debugging problems with a PySpark RandomForestClassificationModel, and
I am trying to use the feature importances to do so. However, the
featureImportances property returns a SparseVector that I can't interpret
on its own. How can I transform the SparseVector into a useful list of
features along with feature type and name?

Some of my features were nominal, so they had to be one-hot-encoded and then
combined with my numeric features. There is no PCA or anything that would
make interpretability hard; I just need to map each item in the SparseVector
back to a feature type/name.

In other words... in practice,
RandomForestClassificationModel.featureImportances isn't useful without
some ability to make it interpretable. Does that ability exist? I've done
this in sklearn, but don't know how to do this with Spark ML.
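
For reference, here is a minimal sketch of the kind of mapping being asked
for. It assumes the assembled features column carries "ml_attr" metadata
whose "attrs" entry groups features by type (for example "numeric" and
"binary"), each with an "idx" and a "name". The helper name
name_importances, the sample metadata, and the feature names are all
hypothetical and are not taken from the notebook. The function is written
in plain Python so it can be tried without a Spark session.

```python
def name_importances(ml_attr, indices, values):
    """Pair each importance with the feature name and type found in the metadata."""
    # Build a lookup from vector index to (name, type).
    by_index = {}
    for attr_type, attrs in ml_attr.get("attrs", {}).items():
        for attr in attrs:
            idx = attr["idx"]
            by_index[idx] = (attr.get("name", "unnamed_%d" % idx), attr_type)

    rows = []
    for idx, value in zip(indices, values):
        idx = int(idx)
        # Indices missing from the metadata are kept, but marked as unknown.
        name, attr_type = by_index.get(idx, ("unknown_%d" % idx, "unknown"))
        rows.append((name, attr_type, float(value)))

    # Most important features first.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Hypothetical metadata, shaped like what an assembled vector column is
# assumed to carry; with a real DataFrame it might be read as
#   ml_attr = df.schema["features"].metadata["ml_attr"]
# where "features" stands in for the actual column name.
example_ml_attr = {
    "attrs": {
        "numeric": [
            {"idx": 0, "name": "Distance"},
            {"idx": 1, "name": "DepDelay"},
        ],
        "binary": [
            {"idx": 2, "name": "Carrier_vec_AA"},
            {"idx": 3, "name": "Carrier_vec_DL"},
        ],
    },
    "num_attrs": 4,
}

# Stand-ins for the SparseVector's .indices and .values attributes.
ranked = name_importances(example_ml_attr, [0, 1, 3], [0.2, 0.7, 0.1])
for name, attr_type, importance in ranked:
    print("%-16s %-8s %.3f" % (name, attr_type, importance))
```

With a fitted model, the last two arguments would presumably come from
model.featureImportances.indices and model.featureImportances.values,
provided the vector is sparse and the metadata survived every pipeline
stage.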

My code is in a Jupyter Notebook on GitHub here
<https://github.com/rjurney/Agile_Data_Code_2/blob/master/ch09/Debugging%20Prediction%20Problems.ipynb>;
skip to the end.

Stack Overflow post:
http://stackoverflow.com/questions/41273893/in-pyspark-ml-how-can-i-interpret-the-sparsevector-returned-by-a-pyspark-ml-cla

Thanks!
-- 
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com relato.io
