spark-issues mailing list archives

From "Joseph K. Bradley (JIRA)" <>
Subject [jira] [Created] (SPARK-4328) Python serialization updates make Python ML API more brittle to types
Date Mon, 10 Nov 2014 22:43:35 GMT
Joseph K. Bradley created SPARK-4328:

             Summary: Python serialization updates make Python ML API more brittle to types
                 Key: SPARK-4328
             Project: Spark
          Issue Type: Improvement
          Components: MLlib, PySpark
    Affects Versions: 1.2.0
            Reporter: Joseph K. Bradley

In Spark 1.1, you could create a LabeledPoint with an integer label and use it with
LinearRegression.  The Python API serialization updates since then broke this.  E.g.,
this code runs on the 1.1 branch but fails on the current master:

from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
import numpy

# assumes a SparkContext `sc`, e.g. from the pyspark shell
features = numpy.ndarray((3,))                       # 3-element feature vector
data = sc.parallelize([LabeledPoint(1, features)])   # label is a Python int
LinearRegressionWithSGD.train(data)                  # throws the ClassCastException below

Recommendation: Accept Python integers wherever doubles are expected, converting them during serialization.
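Until then, a user-side workaround is to make the label an explicit float, e.g. LabeledPoint(float(1), features).  A minimal sketch (as_double is a hypothetical helper, not part of the MLlib API):

```python
def as_double(label):
    # Hypothetical helper: coerce any numeric label to a Python float so it
    # pickles as a double and unpickles on the JVM as java.lang.Double.
    return float(label)

print(as_double(1))  # 1.0
```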

The error message you get is:
py4j.protocol.Py4JJavaError: An error occurred while calling o55.trainLinearRegressionModelWithSGD.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 3.0 failed
1 times, most recent failure: Lost task 7.0 in stage 3.0 (TID 15, localhost): java.lang.ClassCastException:
java.lang.Integer cannot be cast to java.lang.Double
	at scala.runtime.BoxesRunTime.unboxToDouble(
	at org.apache.spark.mllib.api.python.SerDe$LabeledPointPickler.construct(PythonMLLibAPI.scala:727)
	at net.razorvine.pickle.Unpickler.load_reduce(
	at net.razorvine.pickle.Unpickler.dispatch(
	at net.razorvine.pickle.Unpickler.load(
	at net.razorvine.pickle.Unpickler.loads(
	at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:804)
	at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:803)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1309)
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910)
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910)
	at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223)
	at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.executor.Executor$
	at java.util.concurrent.ThreadPoolExecutor.runWorker(
	at java.util.concurrent.ThreadPoolExecutor$
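For context on the cast failure (an illustration, not part of the report): a Python int and a Python float pickle with different opcodes, so the unpickler on the JVM side reconstructs a java.lang.Integer for an int label, which BoxesRunTime.unboxToDouble cannot unbox to a primitive double.  The difference is visible with the standard pickle module:

```python
import pickle
import pickletools

# Pickle an int label and a float label (protocol pinned to 2 for determinism)
int_ops = [op.name for op, arg, pos in pickletools.genops(pickle.dumps(1, protocol=2))]
float_ops = [op.name for op, arg, pos in pickletools.genops(pickle.dumps(1.0, protocol=2))]

# The int serializes with an integer opcode, the float with a float opcode,
# so each arrives on the JVM side as a different boxed type.
print(int_ops)    # ['PROTO', 'BININT1', 'STOP']
print(float_ops)  # ['PROTO', 'BINFLOAT', 'STOP']
```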

This message was sent by Atlassian JIRA
