spark-dev mailing list archives

From Maciej Szymkiewicz
Subject Re: Handling nulls in vector columns is non-trivial
Date Thu, 22 Jun 2017 01:19:15 GMT
Since 2.2 there is Imputer (pyspark.ml.feature.Imputer in PySpark),
which should at least partially address the problem.

On 06/22/2017 03:03 AM, Franklyn D'souza wrote:
> I just wanted to highlight some of the rough edges around using
> vectors in columns in dataframes. 
> If there is a null in a dataframe column containing vectors, PySpark ML
> models like logistic regression will fail outright.
> However, from what I've read, there is no good way to fill in these
> nulls with empty vectors.
> It's not possible to create a literal vector column expression and
> coalesce it with the column from PySpark,
> so we're left with writing a Python UDF which does this coalesce. This
> is really inefficient on large datasets and becomes a bottleneck for
> ML pipelines working with real-world data.
> I'd like to know how other users are dealing with this and what plans
> there are to extend vector support for dataframes.
> Thanks!
> Franklyn

