spark-issues mailing list archives

From "Joseph K. Bradley (JIRA)" <>
Subject [jira] [Closed] (SPARK-9011) Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent --> Grid search working on LR but not on RF
Date Mon, 13 Jul 2015 23:47:00 GMT


Joseph K. Bradley closed SPARK-9011.
    Resolution: Not A Problem

> Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent --> Grid search working
on LR but not on RF
> --------------------------------------------------------------------------------------------------------
>                 Key: SPARK-9011
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib, PySpark
>    Affects Versions: 1.4.0
>         Environment: Spark 1.4.0 standalone on top of Hadoop 2.3 on single node running
>            Reporter: Shivam Verma
>            Priority: Minor
>              Labels: cross-validation, ml, mllib, pyspark, randomforest, tuning
> Hi,
> I ran into this bug while using CrossValidator on an RF (Random Forest) classifier to
classify a small dataset. (This is a bug because CrossValidator works on LR (Logistic
Regression) but not on RF.)
> Bug:
> There is an issue with how BinaryClassificationEvaluator(self, rawPredictionCol="rawPrediction",
labelCol="label", metricName="areaUnderROC") interprets the 'rawPrediction' column: with LR,
the rawPredictionCol is expected to contain vectors, whereas with RF, the prediction column
contains doubles.
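The type requirement behind the Py4J error below can be illustrated with a small pure-Python sketch of a schema check (an illustration of the behavior only, not Spark's actual implementation; the function and schema names here are hypothetical):

```python
# Hypothetical sketch of the evaluator's column-type requirement
# (illustrative only -- not Spark's actual code).
def require_column_type(schema, col, expected_type):
    """Raise if `schema` maps `col` to a type other than `expected_type`."""
    actual = schema[col]
    if actual != expected_type:
        raise ValueError(
            "requirement failed: Column %s must be of type %s "
            "but was actually %s." % (col, expected_type, actual))

# With RF configured as in the report, the column named "rawPrediction"
# holds plain doubles, so the vector requirement fails:
rf_schema = {"rawPrediction": "DoubleType"}
try:
    require_column_type(rf_schema, "rawPrediction", "VectorUDT")
except ValueError as e:
    print(e)
```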
> Suggested Resolution: Either enable BinaryClassificationEvaluator to work with doubles,
or let RF output a column rawPredictions containing the probability vectors (with probability
of 1 assigned to predicted label, and 0 assigned to the rest).
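The second suggested fix above (a vector with probability 1 at the predicted label and 0 elsewhere) can be sketched in plain Python; the helper name is hypothetical:

```python
# Hypothetical helper: turn a double prediction into a one-hot
# "rawPrediction"-style vector, as the suggested resolution describes.
def prediction_to_raw(label, num_classes=2):
    vec = [0.0] * num_classes
    vec[int(label)] = 1.0  # probability 1 for the predicted label
    return vec

print(prediction_to_raw(1.0))  # -> [0.0, 1.0]
```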
> Detailed Observation:
> While running grid search on an RF classifier to classify a small dataset using the
pyspark.ml.tuning module (specifically the ParamGridBuilder and CrossValidator classes),
I get the following error when I try passing a DataFrame of Features-Labels to CrossValidator:
> {noformat}
> Py4JJavaError: An error occurred while calling o1464.evaluate.
> : java.lang.IllegalArgumentException: requirement failed: Column rawPrediction must be
of type org.apache.spark.mllib.linalg.VectorUDT@1eef but was actually DoubleType.
> {noformat}
> I tried the following code, using the dataset given in Spark's CV documentation for cross-validation.
I also pass the DF through a StringIndexer transformation for the RF:
> {noformat}
> dataset = sqlContext.createDataFrame([(Vectors.dense([0.0]), 0.0),(Vectors.dense([0.4]),
1.0),(Vectors.dense([0.5]), 0.0),(Vectors.dense([0.6]), 1.0),(Vectors.dense([1.0]), 1.0)]
* 10,["features", "label"])
> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
> si_model = stringIndexer.fit(dataset)
> dataset2 = si_model.transform(dataset)
> keep = [dataset2.features, dataset2.indexed]
> dataset3 = dataset2.select(*keep).withColumnRenamed('indexed','label')
> rf = RandomForestClassifier(predictionCol="rawPrediction", featuresCol="features", numTrees=5)
> grid = ParamGridBuilder().addGrid(rf.maxDepth, [4,5,6]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator)
> cvModel = cv.fit(dataset3)
> {noformat}
> Note that the above dataset *works* on logistic regression. I have also tried a larger
dataset with sparse vectors as features (which I was originally trying to fit) but received
the same error on RF.

This message was sent by Atlassian JIRA
