spark-issues mailing list archives

From "Takeshi Yamamuro (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-18884) Support Array[_] in ScalaUDF
Date Thu, 15 Dec 2016 15:43:58 GMT

     [ https://issues.apache.org/jira/browse/SPARK-18884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated SPARK-18884:
-------------------------------------
    Description: 
An exception is thrown if we use `Array[_]` as an argument type in `ScalaUDF`:

{code}
scala> import org.apache.spark.sql.execution.debug._
scala> Seq((0, 1)).toDF("a", "b").select(array($"a", $"b").as("ar")).write.mode("overwrite").parquet("/Users/maropu/Desktop/data/")
scala> val df = spark.read.load("/Users/maropu/Desktop/data/")
scala> val df = Seq((0, 1)).toDF("a", "b").select(array($"a", $"b").as("ar"))
scala> val testArrayUdf = udf { (ar: Array[Int]) => ar.sum }
scala> df.select(testArrayUdf($"ar")).show

Caused by: java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [I
  at $anonfun$1.apply(<console>:23)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:89)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:88)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1069)
  ... 99 more
{code}

On the other hand, the query below succeeds:

{code}
scala> val testSeqUdf = udf { (ar: Seq[Int]) => ar.sum }
scala> df.select(testSeqUdf($"ar")).show
+-------+
|UDF(ar)|
+-------+
|      1|
+-------+
{code}
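
The contrast between the two queries can be reproduced without Spark. This is a minimal sketch, assuming (as the stack trace suggests) that the UDF body is invoked through a type-erased function call with a `WrappedArray`, which is a `Seq`, as its argument. `ArrayCastSketch` and `callErased` are hypothetical names used only for illustration; they are not Spark APIs.

```scala
import scala.util.Try

object ArrayCastSketch {
  // Same bodies as the two UDFs in the report.
  val sumSeq: Seq[Int] => Int = _.sum
  val sumArray: Array[Int] => Int = _.sum

  // Hypothetical stand-in for the caller: it only sees an untyped value and
  // an erased function, so a mismatch surfaces at call time, not compile time.
  def callErased[A](f: A => Int, value: Any): Try[Int] = {
    val erased = f.asInstanceOf[Any => Int]
    Try(erased(value))
  }

  def main(args: Array[String]): Unit = {
    val handedOver: Any = Seq(0, 1) // assumption: the engine passes a Seq
    println(callErased(sumSeq, handedOver))             // the Seq-typed body accepts it
    println(callErased(sumArray, handedOver).isFailure) // the Array-typed body rejects it
  }
}
```

The `Array[Int]` body fails only because the value it receives is a `Seq`, not a primitive array; passing a real `Array[Int]` to the same erased function works.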

I'm not sure whether this behaviour is expected. The current implementation queries argument
types (`DataType`) by reflection (`ScalaReflection.schemaFor`) in `sql.functions.udf`, and
then creates type converters (`CatalystTypeConverters`) from those types. `Seq[_]` and `Array[_]`
are both represented as `ArrayType` in `DataType`, and both are handled by `ArrayConverter`. However,
since `DataType` cannot tell the two types apart, it seems to me that it is difficult to
support both array types with the current design. One idea (of course, it is not necessarily
the best) is to create type converters directly from `TypeTag` in `sql.functions.udf`. This
is a prototype (https://github.com/apache/spark/compare/master...maropu:ArrayTypeUdf#diff-89643554d9757dd3e91abff1cc6096c7R740)
to support both array types in `ScalaUDF`. I'm not sure this is acceptable, and I welcome any
suggestions.
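
The idea of choosing a converter from the declared Scala type rather than from `DataType` can be illustrated with a small standalone sketch. It is not the linked prototype: it uses `ClassTag` from the standard library as a simpler stand-in for `TypeTag`, handles only `Int` elements, and `ConverterSketch`, `converterFor`, and `wrap` are hypothetical names.

```scala
import scala.reflect.ClassTag

object ConverterSketch {
  // Pick a converter from the declared argument type A. Assumption: the
  // engine always hands over a Seq[Int]; only Array[Int] needs conversion.
  def converterFor[A](implicit tag: ClassTag[A]): Seq[Int] => Any =
    if (tag.runtimeClass == classOf[Array[Int]]) (s: Seq[Int]) => s.toArray
    else (s: Seq[Int]) => s

  // Wrap a user function so that it receives the collection type it declared.
  def wrap[A: ClassTag](f: A => Int): Seq[Int] => Int = {
    val convert = converterFor[A]
    (s: Seq[Int]) => f(convert(s).asInstanceOf[A])
  }

  def main(args: Array[String]): Unit = {
    val arrayUdf = wrap((ar: Array[Int]) => ar.sum)
    val seqUdf = wrap((ar: Seq[Int]) => ar.sum)
    println(arrayUdf(Seq(0, 1)))
    println(seqUdf(Seq(0, 1)))
  }
}
```

Because the decision is made from the static type of the function, both declarations can be served from the same underlying value, which is the property the `DataType`-based lookup loses.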


> Support Array[_] in ScalaUDF
> ----------------------------
>
>                 Key: SPARK-18884
>                 URL: https://issues.apache.org/jira/browse/SPARK-18884
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Takeshi Yamamuro
>            Priority: Minor



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

