spark-issues mailing list archives

From "yogesh garg (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-23562) RFormula handleInvalid should handle invalid values in non-string columns.
Date Wed, 07 Mar 2018 23:34:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-23562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16390434#comment-16390434 ]

yogesh garg edited comment on SPARK-23562 at 3/7/18 11:33 PM:
--------------------------------------------------------------

The error in question can be reproduced with the following Scala code:

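{code:scala}
import org.apache.spark.ml.feature.RFormula
import spark.implicits._  // needed for toDF; already in scope in spark-shell

// Two rows with a string column (c1) that serves as the label.
val d1 = spark.createDataFrame(Seq(
  (1001, "a"),
  (1002, "b")
)).toDF("id1", "c1")

// Boxed java.lang.Long allows a null in the numeric column id2.
// Long literals (L suffix) so the values convert to java.lang.Long.
val seq: Seq[(java.lang.Long, String)] = Seq(
  (20001L, "x"),
  (20002L, "y"),
  (null, null)
)
val d2 = seq.toDF("id2", "c2")

val dataset = d1.crossJoin(d2)
d1.show()
d2.show()
dataset.show()

// Fit and apply RFormula with the given handleInvalid mode.
def test(mode: String): Unit = {
  val formula = new RFormula()
    .setFormula("c1 ~ id2")
    .setHandleInvalid(mode)

  val model = formula.fit(dataset)
  val output = model.transform(dataset)
  println(model)
  println(mode)
  output.select("features", "label").show(truncate = false)
}

List("skip", "keep", "error").foreach(test)
{code}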



Running it produces the following exception:

{code:java}
org.apache.spark.SparkException: Job aborted due to stage failure: Task ** in stage ** failed ** times, most recent failure: Lost task ** in stage ** (TID **, **, executor **): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$3: (struct<id2_double_rFormula_1b829d1fadd6:double>) => vector)

Caused by: org.apache.spark.SparkException: Values to assemble cannot be null.

{code}
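
The exception above is raised when a null in the numeric column id2 reaches vector assembly. As a minimal sketch of what 'skip' could mean for such a column, assuming it is acceptable to discard the affected rows, the nulls can be dropped with na.drop before fitting. This is a workaround applied outside RFormula, not the behavior of handleInvalid itself; the object and method names below are hypothetical, and a local SparkSession is created only to keep the example self-contained.

```scala
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical names throughout; a workaround sketch, not RFormula's own behavior.
object RFormulaSkipNullsSketch {

  // Rebuilds the dataset from the reproduction above (id2 contains a null).
  def buildDataset(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val d1 = Seq((1001, "a"), (1002, "b")).toDF("id1", "c1")
    val rows: Seq[(java.lang.Long, String)] = Seq(
      (java.lang.Long.valueOf(20001L), "x"),
      (java.lang.Long.valueOf(20002L), "y"),
      (null, null)
    )
    val d2 = rows.toDF("id2", "c2")
    d1.crossJoin(d2)
  }

  // Assumption: "skip" is emulated by dropping rows whose numeric feature
  // column is null before RFormula ever sees them.
  def transformSkippingNulls(spark: SparkSession): DataFrame = {
    val dataset = buildDataset(spark)
    val cleaned = dataset.na.drop(Seq("id2"))
    val model = new RFormula().setFormula("c1 ~ id2").fit(cleaned)
    model.transform(cleaned)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("rformula-skip-nulls-sketch")
      .getOrCreate()
    try {
      transformSkippingNulls(spark).select("features", "label").show(truncate = false)
    } finally {
      spark.stop()
    }
  }
}
```

Dropping rows up front keeps the formula and the fitted model unchanged, but the same filter would also have to be applied to any data passed to transform later, which is the gap the issue asks handleInvalid to cover.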



> RFormula handleInvalid should handle invalid values in non-string columns.
> --------------------------------------------------------------------------
>
>                 Key: SPARK-23562
>                 URL: https://issues.apache.org/jira/browse/SPARK-23562
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Bago Amirbekian
>            Priority: Major
>
> Currently when handleInvalid is set to 'keep' or 'skip' this only applies to String
> fields. Numeric fields that are null will either cause the transformer to fail or might be
> null in the resulting label column.
> I'm not sure what the semantics of keep might be for numeric columns with null values,
> but we should be able to at least support skip for these types.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

