spark-issues mailing list archives

From "Guilherme Braccialli (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-14236) UDAF does not use incomingSchema for update Method
Date Thu, 21 Sep 2017 22:30:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-14236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16175541#comment-16175541 ]

Guilherme Braccialli edited comment on SPARK-14236 at 9/21/17 10:29 PM:
------------------------------------------------------------------------

+1 to implement this.

As a workaround, I'm using the code below to make the UDAF more readable:

{code:java}
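  // Declare the input columns once, so names and types live in a single place.
  // Note: a Scala Map does not guarantee insertion order beyond four entries,
  // so switch to a Seq of pairs or a ListMap if you declare more columns.
  val inputColumns = Map(
    "start" -> TimestampType,
    "end" -> TimestampType
  )
  override def inputSchema = StructType(inputColumns.map { case (name, dataType) => StructField(name, dataType) }.toArray)
  // Resolve each column name to its position once, when the UDAF is constructed.
  val inputColumnsNameId = inputColumns.zipWithIndex.map { case ((name, _), position) => name -> position }
  val inputStart = inputColumnsNameId("start")
  val inputEnd = inputColumnsNameId("end")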
{code}


PS: I ran some tests and saw significant performance overhead when resolving field names
(by looking them up in the inputColumnsNameId map) inside the update function, which is why
I created one val holding the index of each input field. I tested with approximately 1 billion rows.

The same approach applies to bufferSchema.
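
For context, here is a minimal sketch of how the precomputed indices might be used inside a complete UDAF. It assumes the Spark 1.x/2.x `UserDefinedAggregateFunction` API. The class name `TotalDurationMillis`, the buffer column `totalMillis`, and the aggregation itself (summing `end - start` in milliseconds) are hypothetical and only illustrate the pattern. The sketch declares the columns as a `Seq` of pairs rather than a `Map`, so the declaration order is kept for any number of columns.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Hypothetical UDAF: sums (end - start) in milliseconds over all non-null rows.
class TotalDurationMillis extends UserDefinedAggregateFunction {

  // Input columns declared once, in order (a Seq keeps the declaration order).
  private val inputColumns = Seq("start" -> TimestampType, "end" -> TimestampType)
  override def inputSchema: StructType =
    StructType(inputColumns.map { case (name, dataType) => StructField(name, dataType) })

  // Name -> position, resolved once at construction rather than once per row.
  private val inputIndex = inputColumns.map(_._1).zipWithIndex.toMap
  private val inputStart = inputIndex("start")
  private val inputEnd = inputIndex("end")

  // The same pattern applied to the aggregation buffer.
  private val bufferColumns = Seq("totalMillis" -> LongType)
  override def bufferSchema: StructType =
    StructType(bufferColumns.map { case (name, dataType) => StructField(name, dataType) })
  private val bufferIndex = bufferColumns.map(_._1).zipWithIndex.toMap
  private val bufferTotal = bufferIndex("totalMillis")

  override def dataType: DataType = LongType
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(bufferTotal) = 0L
  }

  // Called once per input row: only plain Int vals are read here, no map lookups.
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(inputStart) && !input.isNullAt(inputEnd)) {
      val millis = input.getTimestamp(inputEnd).getTime - input.getTimestamp(inputStart).getTime
      buffer(bufferTotal) = buffer.getLong(bufferTotal) + millis
    }
  }

  // Combines two partial sums.
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(bufferTotal) = buffer1.getLong(bufferTotal) + buffer2.getLong(bufferTotal)
  }

  override def evaluate(buffer: Row): Any = buffer.getLong(bufferTotal)
}
```

With a DataFrame `df` that has timestamp columns `start` and `end`, this could be applied as `df.agg(new TotalDurationMillis()(df("start"), df("end")))`.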



> UDAF does not use incomingSchema for update Method
> --------------------------------------------------
>
>                 Key: SPARK-14236
>                 URL: https://issues.apache.org/jira/browse/SPARK-14236
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Matthias Niehoff
>            Priority: Minor
>
> When I specify a schema for the incoming data in a UDAF, the schema is not applied
> to the incoming row in the update method. I can only access the fields by their numeric
> indices, not by their names. The fields in the row are named input0, input1, ...


