spark-issues mailing list archives

From "Michael Dreibelbis (JIRA)"
Subject [jira] [Created] (SPARK-24656) SparkML Transformers and Estimators with multiple columns
Date Mon, 25 Jun 2018 22:10:00 GMT
Michael Dreibelbis created SPARK-24656:

             Summary: SparkML Transformers and Estimators with multiple columns
                 Key: SPARK-24656
             Project: Spark
          Issue Type: New Feature
          Components: ML, MLlib
    Affects Versions: 2.3.1
            Reporter: Michael Dreibelbis

Currently, SparkML Transformers and Estimators operate on single input/output column pairs.
This makes pipelines extremely cumbersome (as well as non-performant) when transformations
need to be applied to multiple columns.


I am proposing to implement ParallelPipelineStage/Transformer/Estimator/Model that would operate
on the input columns in parallel.

    // old way
    val pipeline = new Pipeline().setStages(Array(
      new CountVectorizer().setInputCol("_1").setOutputCol("_1_cv"),
      new CountVectorizer().setInputCol("_2").setOutputCol("_2_cv"),
      new IDF().setInputCol("_1_cv").setOutputCol("_1_idf"),
      new IDF().setInputCol("_2_cv").setOutputCol("_2_idf")
    ))

    // proposed way
    val pipeline2 = new Pipeline().setStages(Array(
      new ParallelCountVectorizer().setInputCols(Array("_1", "_2")).setOutputCols(Array("_1_cv", "_2_cv")),
      new ParallelIDF().setInputCols(Array("_1_cv", "_2_cv")).setOutputCols(Array("_1_idf", "_2_idf"))
    ))
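
For comparison, the boilerplate of the old way can be reduced with the existing single-column API alone. The following is only a minimal sketch, not the proposed feature: `perColumn` and `pipeline3` are hypothetical names that do not exist in Spark, and the snippet assumes it is pasted into spark-shell. It generates one single-column stage per input/output pair, so the stages still run one after another and the performance concern above is not addressed.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{CountVectorizer, IDF}

// Hypothetical helper (not part of Spark): builds one single-column stage
// per (input, output) column pair using the supplied factory function.
def perColumn(inputCols: Seq[String], outputCols: Seq[String])
             (mk: (String, String) => PipelineStage): Seq[PipelineStage] = {
  require(inputCols.length == outputCols.length,
    "inputCols and outputCols must have the same length")
  inputCols.zip(outputCols).map { case (in, out) => mk(in, out) }
}

// Two CountVectorizer stages, one per input column.
val cvStages = perColumn(Seq("_1", "_2"), Seq("_1_cv", "_2_cv")) { (in, out) =>
  new CountVectorizer().setInputCol(in).setOutputCol(out)
}

// Two IDF stages that consume the CountVectorizer outputs.
val idfStages = perColumn(Seq("_1_cv", "_2_cv"), Seq("_1_idf", "_2_idf")) { (in, out) =>
  new IDF().setInputCol(in).setOutputCol(out)
}

// Same four stages as the old way, in the same order.
val pipeline3 = new Pipeline().setStages((cvStages ++ idfStages).toArray)
```

This keeps the column lists in one place, which is the ergonomic half of the request; operating on the columns in parallel within a single stage would still require the new stage types proposed here.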

