spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Reynold Xin <r...@databricks.com>
Subject Re: In Spark-SQL, is there support for distributed execution of native Hive UDAFs?
Date Thu, 23 Apr 2015 08:35:16 GMT
Your understanding is correct -- there is no partial aggregation currently
for Hive UDAF.

However, there is a PR to fix that:
https://github.com/apache/spark/pull/5542



On Thu, Apr 23, 2015 at 1:30 AM, daniel.mescheder <
daniel.mescheder@realimpactanalytics.com> wrote:

> Hi everyone,
>
> I was playing with the integration of Hive UDAFs in Spark-SQL and noticed
> that the terminatePartial and merge methods of custom UDAFs were not
> called. This made me curious as those two methods are the ones responsible
> for distributing the UDAF execution in Hive.
> Looking at the code of HiveUdafFunction which seems to be the wrapper for
> all native Hive functions for which there exists no spark-sql specific
> implementation, I noticed that it
>
> a) extends AggregateFunction and not PartialAggregate
> b) only contains calls to iterate and evaluate, but never to merge of the
> underlying UDAFEvaluator object
>
> My question is thus twofold: Is my observation correct, that to achieve
> distributed execution of a UDAF I have to add a custom implementation at
> the spark-sql layer (like the examples in aggregates.scala)? If that is the
> case, how difficult would it be to use the terminatePartial and merge
> functions provided by the UDAFEvaluator to make Hive UDAFs distributed by
> default?
>
>
> Cheers,
>
> Daniel
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/In-Spark-SQL-is-there-support-for-distributed-execution-of-native-Hive-UDAFs-tp11753.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message