spark-dev mailing list archives

From "Franklyn D'souza" <franklyn.dso...@shopify.com>
Subject Re: Handling nulls in vector columns is non-trivial
Date Thu, 22 Jun 2017 01:30:20 GMT
The Imputer documentation states that `The input columns should be of
DoubleType or FloatType.`, so I don't think that is what I'm looking for. Also,
in general the API around vectors is highly lacking, especially on the
PySpark side.

Very common vector operations like addition, subtraction, and dot products
can't be performed on vector columns. I'm wondering what the direction is
with vector support in Spark.

On Wed, Jun 21, 2017 at 9:19 PM, Maciej Szymkiewicz <mszymkiewicz@gmail.com>
wrote:

> Since 2.2 there is Imputer:
>
> https://github.com/apache/spark/blob/branch-2.2/examples/src/main/python/ml/imputer_example.py
>
> which should at least partially address the problem.
>
> On 06/22/2017 03:03 AM, Franklyn D'souza wrote:
> > I just wanted to highlight some of the rough edges around using
> > vectors in columns in dataframes.
> >
> > If there is a null in a DataFrame column containing vectors, PySpark ML
> > models like logistic regression will fail completely.
> >
> > However, from what I've read, there is no good way to fill in these
> > nulls with empty vectors.
> >
> > It's not possible to create a literal vector column expression and
> > coalesce it with the column from PySpark.
> >
> > So we're left with writing a Python UDF that does this coalesce. This
> > is really inefficient on large datasets and becomes a bottleneck for
> > ML pipelines working with real-world data.
> >
> > I'd like to know how other users are dealing with this and what plans
> > there are to extend vector support for dataframes.
> >
> > Thanks!
> >
> > Franklyn
>
