spark-issues mailing list archives

From "Nick Pentreath (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-24467) VectorAssemblerEstimator
Date Fri, 08 Jun 2018 18:00:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-24467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16506334#comment-16506334
] 

Nick Pentreath edited comment on SPARK-24467 at 6/8/18 5:59 PM:
----------------------------------------------------------------

Yeah, the estimator would return a {{Model}} from {{fit}}, right? So I don't think a new estimator
could return the existing {{VectorAssembler}}; it would probably need to return a new {{VectorAssemblerModel}}.
Though perhaps the existing one can be made a {{Model}} without breaking things.


was (Author: mlnick):
Yeah the estimator would return a {{Model}} from {{fit}} right? So I don't think a new estimator
could return the existing {{VectorAssembler}} but would probably need to return a new {{VectorAssemblerModel}}

> VectorAssemblerEstimator
> ------------------------
>
>                 Key: SPARK-24467
>                 URL: https://issues.apache.org/jira/browse/SPARK-24467
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 2.4.0
>            Reporter: Joseph K. Bradley
>            Priority: Major
>
> In [SPARK-22346], I believe I made a wrong API decision: I recommended adding `VectorSizeHint`
instead of making `VectorAssembler` into an Estimator since I thought the latter option would
break most workflows.  However, I should have proposed:
> * Add a Param to VectorAssembler for specifying the sizes of Vectors in the inputCols.
 This Param can be optional.  If not given, then VectorAssembler will behave as it does now.
 If given, then VectorAssembler can use that info instead of figuring out the Vector sizes
via metadata or examining Rows in the data (though it could do consistency checks).
> * Add a VectorAssemblerEstimator which gets the Vector lengths from data and produces
a VectorAssembler with the vector lengths Param specified.
> This will not break existing workflows.  Migrating to VectorAssemblerEstimator will be
easier than adding VectorSizeHint since it will not require users to manually input Vector
lengths.
> Note: Even with this Estimator, VectorSizeHint might prove useful for other things in
the future which require vector length metadata, so we could consider keeping it rather than
deprecating it.
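
The two-part proposal in the description can be sketched roughly as follows. This is plain Python, not Spark code: the class names `SketchVectorAssembler` and `SketchVectorAssemblerEstimator`, the `vector_sizes` field, and the dict-of-lists row format are all hypothetical stand-ins chosen for illustration, and inferring sizes from the first row is an assumption, since the issue only says the lengths come "from data".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence

# A row maps a column name to a vector (a plain list of floats here).
Row = Dict[str, List[float]]


@dataclass
class SketchVectorAssembler:
    """Hypothetical stand-in for VectorAssembler with an optional sizes Param."""
    input_cols: Sequence[str]
    # Optional: when None, the assembler concatenates without any size check,
    # mirroring "behave as it does now" in the proposal.
    vector_sizes: Optional[Dict[str, int]] = None

    def transform(self, rows: Sequence[Row]) -> List[List[float]]:
        assembled_rows = []
        for row in rows:
            assembled: List[float] = []
            for col in self.input_cols:
                vec = row[col]
                # Consistency check, only when sizes were supplied.
                if self.vector_sizes is not None and len(vec) != self.vector_sizes[col]:
                    raise ValueError(
                        f"column {col!r}: expected size {self.vector_sizes[col]}, got {len(vec)}"
                    )
                assembled.extend(vec)
            assembled_rows.append(assembled)
        return assembled_rows


@dataclass
class SketchVectorAssemblerEstimator:
    """Hypothetical estimator: learns vector sizes, returns a configured assembler."""
    input_cols: Sequence[str]

    def fit(self, rows: Sequence[Row]) -> SketchVectorAssembler:
        if not rows:
            raise ValueError("cannot infer vector sizes from an empty dataset")
        # Assumption: sizes are taken from the first row only.
        first = rows[0]
        sizes = {col: len(first[col]) for col in self.input_cols}
        return SketchVectorAssembler(self.input_cols, sizes)


rows = [{"a": [1.0, 2.0], "b": [3.0]}, {"a": [4.0, 5.0], "b": [6.0]}]
fitted = SketchVectorAssemblerEstimator(["a", "b"]).fit(rows)
assembled = fitted.transform(rows)
```

In this sketch the user never types the vector lengths by hand: `fit` records them and the returned assembler enforces them, while an assembler built directly without `vector_sizes` keeps the unchecked behavior.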



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

