spark-issues mailing list archives

From "Liang-Chi Hsieh (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-20542) Add an API into Bucketizer that can bin a lot of columns all at once
Date Mon, 01 May 2017 02:55:04 GMT
Liang-Chi Hsieh created SPARK-20542:
---------------------------------------

             Summary: Add an API into Bucketizer that can bin a lot of columns all at once
                 Key: SPARK-20542
                 URL: https://issues.apache.org/jira/browse/SPARK-20542
             Project: Spark
          Issue Type: New Feature
          Components: ML
    Affects Versions: 2.2.0
            Reporter: Liang-Chi Hsieh


The current ML Bucketizer can only bin a single column of continuous features. If a dataset
has thousands of continuous columns that need to be binned, we end up with thousands of ML
stages, which is very inefficient for query planning and execution.

We should have a type of bucketizer that can bin many columns all at once. It would need
to accept a list of arrays of split points, one array per column to bin, but it could
make things more efficient by replacing thousands of stages with just one.
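
To illustrate the idea only, here is a minimal plain-Python sketch of binning several
columns in one pass from a list of split arrays. It is not the proposed Spark API: the
function name bucketize_row, its parameters, and the "_bucket" output suffix are all
hypothetical. It assumes each split array is strictly increasing and that buckets are
half-open [x, y), with the last bucket also including its upper bound.

```python
from bisect import bisect_right


def bucketize_row(row, input_cols, splits_array):
    """Bin several columns of one row in a single pass.

    Hypothetical helper for illustration; not part of Spark.
    row          -- dict mapping column name to a numeric value
    input_cols   -- names of the columns to bin
    splits_array -- one strictly increasing list of split points per column
    """
    if len(input_cols) != len(splits_array):
        raise ValueError("need exactly one array of splits per input column")

    out = {}
    for col, splits in zip(input_cols, splits_array):
        value = row[col]
        if value < splits[0] or value > splits[-1]:
            raise ValueError(f"value {value} of column {col} is outside the splits")
        # Bucket i covers [splits[i], splits[i + 1]); the min() folds a value
        # equal to the final split point into the last bucket.
        index = min(bisect_right(splits, value) - 1, len(splits) - 2)
        out[col + "_bucket"] = float(index)
    return out


# Two columns binned in one call, each with its own split points.
binned = bucketize_row(
    {"a": 5.0, "b": 20.0},
    ["a", "b"],
    [[0.0, 10.0, 20.0], [0.0, 10.0, 20.0]],
)
```

The point of the sketch is that a single call (one stage) handles every column, rather
than one stage per column.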



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

