spark-issues mailing list archives

From "Franklyn Dsouza (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-21199) Its not possible to impute Vector types
Date Mon, 26 Jun 2017 17:17:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-21199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063416#comment-16063416 ]

Franklyn Dsouza edited comment on SPARK-21199 at 6/26/17 5:16 PM:
------------------------------------------------------------------

For this particular scenario I have a table with two columns: one is a string `document_type`
and the other is an array of tokens for the document. I want to compute TF-IDF on these tokens.
The IDF needs to be computed per `document_type`, so I pivot on `document_type` and then compute
the IDF on the TF vectors.

This pivoting introduces nulls for missing columns, and those nulls need to be imputed. I can't
impute array types either, and fixing it at the token generation step would involve a lot of
left joins to align various data sources.
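
To make the scenario concrete, here is a minimal sketch in plain Python (not the Spark API) of how a pivot on `document_type` leaves gaps and how those gaps could be filled with empty token lists. The grouping key `group_id`, the sample rows, and the helper names `pivot_tokens` and `impute_empty` are all hypothetical and do not come from the issue.

```python
from collections import defaultdict

# Hypothetical sample rows: (group_id, document_type, tokens).
# Group 2 has no "report" row, so the pivot leaves a gap there.
rows = [
    (1, "email", ["spark", "vector"]),
    (1, "report", ["null", "vector", "vector"]),
    (2, "email", ["impute", "spark"]),
]


def pivot_tokens(rows):
    """Pivot to {group_id: {document_type: tokens or None}}."""
    doc_types = sorted({doc_type for _, doc_type, _ in rows})
    # Every group starts with one None cell per document type.
    pivoted = defaultdict(lambda: dict.fromkeys(doc_types))
    for group_id, doc_type, tokens in rows:
        pivoted[group_id][doc_type] = tokens
    return dict(pivoted)


def impute_empty(pivoted):
    """Replace each None cell with an empty token list."""
    return {
        group_id: {
            doc_type: (tokens if tokens is not None else [])
            for doc_type, tokens in cells.items()
        }
        for group_id, cells in pivoted.items()
    }


pivoted = pivot_tokens(rows)
imputed = impute_empty(pivoted)
```

Here `pivoted[2]["report"]` is `None`, which stands in for the null that the pivot produces, and `imputed[2]["report"]` is an empty list that a term-frequency step could consume.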


> Its not possible to impute Vector types
> ---------------------------------------
>
>                 Key: SPARK-21199
>                 URL: https://issues.apache.org/jira/browse/SPARK-21199
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.0, 2.1.1
>            Reporter: Franklyn Dsouza
>
> There are cases where nulls end up in vector columns in dataframes. Currently there is
> no way to fill in these nulls because it's not possible to create a literal vector column
> expression using lit().
> Also, the entire pyspark ml api fails when it encounters nulls, which makes it hard to
> work with the data.
> I think that either vector support should be added to the imputer or vectors should be
> supported in column expressions so they can be used in a coalesce.
> [~mlnick]
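
The coalesce idea requested above can be illustrated with a small plain-Python model. This is only a sketch under stated assumptions, not Spark code: sparse TF vectors are represented as `{index: count}` dicts, a missing vector is `None`, and the helpers `coalesce` and `idf_weights` are hypothetical. The IDF uses the smoothed form log((n + 1) / (df + 1)), which is assumed here to match the formula given in the Spark MLlib documentation.

```python
import math


def coalesce(*values):
    """Return the first argument that is not None, like SQL COALESCE."""
    for value in values:
        if value is not None:
            return value
    return None


# Hypothetical TF column for one document type; None marks a pivot gap.
tf_column = [{0: 2.0, 3: 1.0}, None, {3: 1.0}]

# Fill each gap with a sparse zero vector (a dict with no entries).
filled = [coalesce(vec, {}) for vec in tf_column]


def idf_weights(vectors):
    """Smoothed IDF per feature index: log((n + 1) / (df + 1))."""
    n = len(vectors)
    doc_freq = {}
    for vec in vectors:
        # Assumes every stored entry is non-zero.
        for index in vec:
            doc_freq[index] = doc_freq.get(index, 0) + 1
    return {i: math.log((n + 1) / (df + 1)) for i, df in doc_freq.items()}


weights = idf_weights(filled)
```

With the zero vector in place, the filled row still counts toward `n` but contributes to no document frequency, which is the behaviour one would expect from imputing a missing vector rather than dropping the row.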




