spark-issues mailing list archives

From "Herman van Hovell tot Westerflier (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-4233) Simplify the Aggregation Function implementation
Date Fri, 22 May 2015 18:39:17 GMT

    [ https://issues.apache.org/jira/browse/SPARK-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556588#comment-14556588 ]

Herman van Hovell tot Westerflier commented on SPARK-4233:
----------------------------------------------------------

Hi, 

I have looked through the code in the PR. The new interface doesn't look simpler to me. It
seems to have been designed with Hive UDAFs in mind.

Can you explain to me why the current UDAF implementation is complicated, why it needs to
change, and what is improved if we start to use the proposed implementation?

As for the distinct implementations: why not nest the required aggregation operator in a distinct
operator? For instance:
{code}
case class DistinctifyFunction(
    @transient expr: Seq[Expression],
    @transient aggr: AggregateFunction,
    @transient base: AggregateExpression)
  extends AggregateFunction {

  def this() = this(null, null, null) // Required for serialization.

  val seen = new OpenHashSet[Any]()

  @transient
  val distinctValue = new InterpretedProjection(expr)

  override def update(input: Row): Unit = {
    val evaluatedExpr = distinctValue(input)
    if (!evaluatedExpr.anyNull) {
      seen.add(evaluatedExpr)
    }
  }

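  override def eval(input: Row): Any = {
    // Assume the AggregateFunction input has been rerouted to the distinct value projection.
    // Every value stored in `seen` is a Row produced by distinctValue, so replay each one.
    seen.iterator.foreach(v => aggr.update(v.asInstanceOf[Row]))
    aggr.eval(input)
  }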
}
{code}
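
To show the nesting idea without any Catalyst dependencies, here is a minimal, self-contained sketch. The names `SimpleAggregate`, `SumAggregate` and `DistinctWrapper` are hypothetical and invented for this illustration; they are not Spark APIs, and the sketch assumes `eval` is called once, after all updates.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in for an aggregate function (not a Spark API).
trait SimpleAggregate[A, R] {
  def update(value: A): Unit
  def eval(): R
}

// A plain, non-distinct sum.
final class SumAggregate extends SimpleAggregate[Long, Long] {
  private var total = 0L
  override def update(value: Long): Unit = total += value
  override def eval(): Long = total
}

// Wraps any aggregate: update only records unique inputs; eval replays them
// into the wrapped aggregate and returns its result.
// Assumption: eval is called once, after all updates.
final class DistinctWrapper[A, R](inner: SimpleAggregate[A, R])
  extends SimpleAggregate[A, R] {

  private val seen = mutable.LinkedHashSet.empty[A]

  override def update(value: A): Unit = seen += value

  override def eval(): R = {
    seen.foreach(inner.update)
    inner.eval()
  }
}
```

With this shape, only the non-distinct aggregate has to be written; feeding the values 1, 2, 2, 3, 3 into a `SumAggregate` gives 11, while `new DistinctWrapper(new SumAggregate)` gives 6.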

> Simplify the Aggregation Function implementation
> ------------------------------------------------
>
>                 Key: SPARK-4233
>                 URL: https://issues.apache.org/jira/browse/SPARK-4233
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Cheng Hao
>
> Currently, the UDAF implementation is quite complicated, and we have to provide distinct
> & non-distinct version.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


