zero323 commented on a change in pull request #27278: [SPARK-30569][SQL][PYSPARK][SPARKR] Add
percentile_approx DSL functions.
URL: https://github.com/apache/spark/pull/27278#discussion_r368969058
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/functions.scala
##########
@@ -652,6 +652,122 @@ object functions {
*/
def min(columnName: String): Column = min(Column(columnName))
+ /**
+ * Aggregate function: Returns an array of the approximate percentile values
+ * of numeric column col at the given percentages.
+ *
+ * Each value of the percentage array must be between 0.0 and 1.0.
+ *
+ * The accuracy parameter is a positive numeric literal
+ * which controls approximation accuracy at the cost of memory.
+ * Higher value of accuracy yields better accuracy, 1.0/accuracy
+ * is the relative error of the approximation.
+ *
+ * @group agg_funcs
+ * @since 3.0.0
+ */
+ def percentile_approx(e: Column, percentage: Array[Double], accuracy: Long): Column = {
+ withAggregateFunction {
+ new ApproximatePercentile(
+ e.expr, typedLit(percentage).expr, lit(accuracy).expr
+ )
+ }
+ }
+
+ /**
+ * Aggregate function: Returns an array of the approximate percentile values
+ * of numeric column col at the given percentages.
+ *
+ * Each value of the percentage array must be between 0.0 and 1.0.
+ *
+ * The accuracy parameter is a positive numeric literal
+ * which controls approximation accuracy at the cost of memory.
+ * Higher value of accuracy yields better accuracy, 1.0/accuracy
+ * is the relative error of the approximation.
+ *
+ * @group agg_funcs
+ * @since 3.0.0
+ */
+ def percentile_approx(columnName: String, percentage: Array[Double], accuracy: Long): Column = {
+ percentile_approx(Column(columnName), percentage, accuracy)
+ }
+
+ /**
+ * Aggregate function: Returns an array of the approximate percentile values
+ * of numeric column col at the given percentages.
+ *
+ * Each value of the percentage array must be between 0.0 and 1.0.
+ *
+ * The accuracy parameter is a positive numeric literal
+ * which controls approximation accuracy at the cost of memory.
+ * Higher value of accuracy yields better accuracy, 1.0/accuracy
+ * is the relative error of the approximation.
+ *
+ * @group agg_funcs
+ * @since 3.0.0
+ */
+ def percentile_approx(e: Column, percentage: Seq[Double], accuracy: Long): Column = {
+ percentile_approx(e, percentage.toArray, accuracy)
+ }
+
+ /**
+ * Aggregate function: Returns an array of the approximate percentile values
+ * of numeric column col at the given percentages.
+ *
+ * Each value of the percentage array must be between 0.0 and 1.0.
+ *
+ * The accuracy parameter is a positive numeric literal
+ * which controls approximation accuracy at the cost of memory.
+ * Higher value of accuracy yields better accuracy, 1.0/accuracy
+ * is the relative error of the approximation.
+ *
+ * @group agg_funcs
+ * @since 3.0.0
+ */
+ def percentile_approx(columnName: String, percentage: Seq[Double], accuracy: Long): Column = {
+ percentile_approx(Column(columnName), percentage.toArray, accuracy)
+ }
+
+ /**
+ * Aggregate function: Returns the approximate percentile value of numeric
+ * column col at the given percentage.
+ *
+ * The value of percentage must be between 0.0 and 1.0.
+ *
+ * The accuracy parameter is a positive numeric literal
+ * which controls approximation accuracy at the cost of memory.
+ * Higher value of accuracy yields better accuracy, 1.0/accuracy
+ * is the relative error of the approximation.
+ *
+ * @group agg_funcs
+ * @since 3.0.0
+ */
+ def percentile_approx(e: Column, percentage: Double, accuracy: Long): Column = {
+ withAggregateFunction {
+ new ApproximatePercentile(
+ e.expr, lit(percentage).expr, lit(accuracy).expr
+ )
+ }
+ }
+
+ /**
+ * Aggregate function: Returns the approximate percentile value of numeric
+ * column col at the given percentage.
+ *
+ * The value of percentage must be between 0.0 and 1.0.
+ *
+ * The accuracy parameter is a positive numeric literal
+ * which controls approximation accuracy at the cost of memory.
+ * Higher value of accuracy yields better accuracy, 1.0/accuracy
+ * is the relative error of the approximation.
+ *
+ * @group agg_funcs
+ * @since 3.0.0
+ */
+ def percentile_approx(columnName: String, percentage: Double, accuracy: Long): Column = {
Review comment:
To be honest, I am not very enthusiastic about it, and I am not even convinced that it is
consistent with the rest of `functions`.
The closest equivalents we have are
- `approx_count_distinct` with `rsd`
- `last` with `ignoreNulls`
and both use external types, not columns. Not to mention that this is still counterintuitive
and painful to use. That said,
> we don't need to duplicate docs with less maintenance.
is a fair point.
- I can easily remove the `Seq` variants, that's for sure, and cut the number of signatures
by two, leaving us with four.
- If it is fine for the JVM API not to have a column-name variant, we can drop the
`(String, _, _) => Column` variant, which brings us to two variants.
- It is also not hard to build `Column` objects transparently for Python and R users to
support `(Column, Column, Column) => Column`. But I am still concerned about confusing
semantics.
If two variants are still too many, we could always have `(Column, Any, Double) =>
Column`; `org.apache.spark.sql.functions` is already quite full of `Any`s. Or, if we're fine
with making Java users miserable, we could use
`(Column, Either[Double, Array[Double]], Double) => Column`,
but this will require additional supporting code for R and Python.
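
To make the `(Column, Any, _) => Column` option concrete, here is a minimal, Spark-free sketch of what a single `Any`-typed entry point could look like. Everything in it (`Expr`, `Ref`, `DoubleLit`, `ArrayLit`, `LongLit`, `ApproxPercentile`, `PercentileSketch`) is a hypothetical stand-in invented for illustration, not the Spark API, and `accuracy` is kept as `Long` to match the diff above:

```scala
// Hypothetical stand-ins for Catalyst expressions; none of these are Spark classes.
sealed trait Expr
final case class Ref(name: String) extends Expr
final case class DoubleLit(value: Double) extends Expr
final case class ArrayLit(values: List[Double]) extends Expr
final case class LongLit(value: Long) extends Expr
final case class ApproxPercentile(child: Expr, percentage: Expr, accuracy: Expr) extends Expr

object PercentileSketch {
  // One entry point instead of per-type overloads: `percentage` is typed as Any
  // and inspected at runtime, so mistakes surface as exceptions, not compile errors.
  def percentileApprox(e: Expr, percentage: Any, accuracy: Long): Expr = {
    val p: Expr = percentage match {
      case d: Double        => DoubleLit(d)
      case a: Array[Double] => ArrayLit(a.toList)
      case s: Seq[_] if s.forall(_.isInstanceOf[Double]) =>
        ArrayLit(s.map(_.asInstanceOf[Double]).toList)
      case other =>
        throw new IllegalArgumentException(s"Unsupported percentage argument: $other")
    }
    ApproxPercentile(e, p, LongLit(accuracy))
  }
}
```

With this shape, `PercentileSketch.percentileApprox(Ref("x"), 0.5, 100L)` and `PercentileSketch.percentileApprox(Ref("x"), Array(0.25, 0.75), 100L)` share one signature, but a `String` or an `Int` such as `1` is only rejected at runtime, which is the trade-off against the typed overloads.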

