spark-issues mailing list archives

From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-26136) Row.getAs return null value in some condition
Date Wed, 21 Nov 2018 09:09:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-26136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694427#comment-16694427 ]

Hyukjin Kwon commented on SPARK-26136:
--------------------------------------

The type should be specified explicitly, as in {{row.getAs[String]("A")}}; otherwise, the type cannot be inferred correctly.

{code}
-       (row.getAs("A"), colA, row.getString(0), col0, row.getString(1), xyz)
+      (row.getAs[String]("A"), colA, row.getString(0), col0, row.getString(1), xyz)
{code}
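The inference point above can be illustrated without Spark. The following is a minimal sketch, not Spark's implementation: {{FakeRow}} is a hypothetical stand-in that only mirrors the shape of {{Row.getAs[T](fieldName: String): T}}, which is an unchecked cast to {{T}}. When the call has an expected type (a {{String}} variable) or an explicit type parameter, {{T}} resolves to {{String}}; when it sits directly in a tuple with neither, Scala 2 has nothing to constrain {{T}} and infers {{Nothing}}.

```scala
// Hypothetical stand-in for Spark's Row, used only to illustrate type inference.
final class FakeRow(values: Map[String, Any]) {
  // Same shape as Row.getAs[T](fieldName: String): T, i.e. an unchecked cast to T.
  def getAs[T](fieldName: String): T = values(fieldName).asInstanceOf[T]
}

object GetAsInference {
  def main(args: Array[String]): Unit = {
    val row = new FakeRow(Map("A" -> "a1", "B" -> "b1"))

    // The declared type String is the expected type, so T is inferred as String.
    val colA: String = row.getAs("A")

    // An explicit type parameter works in any position, including inside a tuple,
    // where there is no expected type to drive inference.
    val tuple = (row.getAs[String]("A"), colA)

    println(tuple)
  }
}
```

This matches approaches 1 and 2 in the report below: approach 2 assigns to a {{String}} variable first, while approach 1 needs the explicit {{[String]}} to behave the same way.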

> Row.getAs return null value in some condition
> ---------------------------------------------
>
>                 Key: SPARK-26136
>                 URL: https://issues.apache.org/jira/browse/SPARK-26136
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.3.0, 2.3.2, 2.4.0
>         Environment: Windows 10
> JDK 1.8.0_181
> scala 2.11.12
> spark 2.4.0 / 2.3.2 / 2.3.0
>  
>            Reporter: Charlie Feng
>            Priority: Major
>
> {{Row.getAs("fieldName")}} returns null when all of the following conditions are met:
>  * It is used in {{DataFrame.flatMap()}}
>  * There is another {{map()}} call inside {{flatMap}}
>  * {{row.getAs("fieldName")}} is called inside a {{Tuple}}.
> Source code to reproduce the bug:
> {code}
> import org.apache.spark.sql.SparkSession
> object FlatMapGetAsBug {
> def main(args: Array[String]) {
>   val spark = SparkSession.builder.appName("SparkUtil").master("local").getOrCreate
>   import spark.implicits._;
>   val df = Seq(("a1", "b1", "x,y,z")).toDF("A", "B", "XYZ")
>   df.show();
>   val df2 = df.flatMap { row =>
>     row.getAs[String]("XYZ").split(",").map { xyz =>
>       var colA: String = row.getAs("A");
>       var col0: String = row.getString(0);
>       (row.getAs("A"), colA, row.getString(0), col0, row.getString(1), xyz)
>     }
>   }.toDF("ColumnA_API1", "ColumnA_API2", "ColumnA_API3", "ColumnA_API4", "ColumnB", "ColumnXYZ")
>   df2.show();
>   spark.close()
>  }
> }
> {code}
> Console Output:
> {code}
> +---+---+-----+
> |  A|  B|  XYZ|
> +---+---+-----+
> | a1| b1|x,y,z|
> +---+---+-----+
> +------------+------------+------------+------------+-------+---------+
> |ColumnA_API1|ColumnA_API2|ColumnA_API3|ColumnA_API4|ColumnB|ColumnXYZ|
> +------------+------------+------------+------------+-------+---------+
> |        null|          a1|          a1|          a1|     b1|        x|
> |        null|          a1|          a1|          a1|     b1|        y|
> |        null|          a1|          a1|          a1|     b1|        z|
> +------------+------------+------------+------------+-------+---------+
> {code}
> We tried to get column "A" using four approaches:
> 1. call {{row.getAs("A")}} inside a tuple
> 2. call {{row.getAs("A")}}, save the result in a variable "colA", and add the variable to the tuple
> 3. call {{row.getString(0)}} inside a tuple
> 4. call {{row.getString(0)}}, save the result in a variable "col0", and add the variable to the tuple
> We found that approaches 2 to 4 return the value "a1" successfully, but approach 1 returns "null".
> This issue exists in Spark 2.4.0, 2.3.2, and 2.3.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


