spark-reviews mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [spark] zero323 commented on a change in pull request #27278: [SPARK-30569][SQL][PYSPARK][SPARKR] Add percentile_approx DSL functions.
Date Tue, 21 Jan 2020 12:22:59 GMT
zero323 commented on a change in pull request #27278: [SPARK-30569][SQL][PYSPARK][SPARKR] Add
percentile_approx DSL functions.
URL: https://github.com/apache/spark/pull/27278#discussion_r368969058
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/functions.scala
 ##########
 @@ -652,6 +652,122 @@ object functions {
    */
   def min(columnName: String): Column = min(Column(columnName))
 
+  /**
+   * Aggregate function: Returns an array of the approximate percentile values
+   * of numeric column col at the given percentages.
+   *
+   * Each value of the percentage array must be between 0.0 and 1.0.
+   *
+   * The accuracy parameter is a positive numeric literal
+   * which controls approximation accuracy at the cost of memory.
+   * Higher value of accuracy yields better accuracy, 1.0/accuracy
+   * is the relative error of the approximation.
+   *
+   * @group agg_funcs
+   * @since 3.0.0
+   */
+  def percentile_approx(e: Column, percentage: Array[Double], accuracy: Long): Column = {
+    withAggregateFunction {
+      new ApproximatePercentile(
+        e.expr, typedLit(percentage).expr, lit(accuracy).expr
+      )
+    }
+  }
+
+  /**
+   * Aggregate function: Returns an array of the approximate percentile values
+   * of numeric column col at the given percentages.
+   *
+   * Each value of the percentage array must be between 0.0 and 1.0.
+   *
+   * The accuracy parameter is a positive numeric literal
+   * which controls approximation accuracy at the cost of memory.
+   * Higher value of accuracy yields better accuracy, 1.0/accuracy
+   * is the relative error of the approximation.
+   *
+   * @group agg_funcs
+   * @since 3.0.0
+   */
+  def percentile_approx(columnName: String, percentage: Array[Double], accuracy: Long): Column = {
+    percentile_approx(Column(columnName), percentage, accuracy)
+  }
+
+  /**
+   * Aggregate function: Returns an array of the approximate percentile values
+   * of numeric column col at the given percentages.
+   *
+   * Each value of the percentage array must be between 0.0 and 1.0.
+   *
+   * The accuracy parameter is a positive numeric literal
+   * which controls approximation accuracy at the cost of memory.
+   * Higher value of accuracy yields better accuracy, 1.0/accuracy
+   * is the relative error of the approximation.
+   *
+   * @group agg_funcs
+   * @since 3.0.0
+   */
+  def percentile_approx(e: Column, percentage: Seq[Double], accuracy: Long): Column = {
+    percentile_approx(e, percentage.toArray, accuracy)
+  }
+
+  /**
+   * Aggregate function: Returns an array of the approximate percentile values
+   * of numeric column col at the given percentages.
+   *
+   * Each value of the percentage array must be between 0.0 and 1.0.
+   *
+   * The accuracy parameter is a positive numeric literal
+   * which controls approximation accuracy at the cost of memory.
+   * Higher value of accuracy yields better accuracy, 1.0/accuracy
+   * is the relative error of the approximation.
+   *
+   * @group agg_funcs
+   * @since 3.0.0
+   */
+  def percentile_approx(columnName: String, percentage: Seq[Double], accuracy: Long): Column = {
+    percentile_approx(Column(columnName), percentage.toArray, accuracy)
+  }
+
+  /**
+   * Aggregate function: Returns the approximate percentile value of numeric
+   * column col at the given percentage.
+   *
+   * The value of percentage must be between 0.0 and 1.0.
+   *
+   * The accuracy parameter is a positive numeric literal
+   * which controls approximation accuracy at the cost of memory.
+   * Higher value of accuracy yields better accuracy, 1.0/accuracy
+   * is the relative error of the approximation.
+   *
+   * @group agg_funcs
+   * @since 3.0.0
+   */
+  def percentile_approx(e: Column, percentage: Double, accuracy: Long): Column = {
+    withAggregateFunction {
+      new ApproximatePercentile(
+        e.expr, lit(percentage).expr, lit(accuracy).expr
+      )
+    }
+  }
+
+  /**
+   * Aggregate function: Returns the approximate percentile value of numeric
+   * column col at the given percentage.
+   *
+   * The value of percentage must be between 0.0 and 1.0.
+   *
+   * The accuracy parameter is a positive numeric literal
+   * which controls approximation accuracy at the cost of memory.
+   * Higher value of accuracy yields better accuracy, 1.0/accuracy
+   * is the relative error of the approximation.
+   *
+   * @group agg_funcs
+   * @since 3.0.0
+   */
+  def percentile_approx(columnName: String, percentage: Double, accuracy: Long): Column = {
 
 Review comment:
   To be honest, I am not very enthusiastic about it and I am not even convinced that it is consistent with the rest of `functions`.
   
   The closest equivalents we have are
   
   - `approx_count_distinct` with `rsd`
   - `last` with `ignoreNulls`
   
   and both use external types, not columns. Not to mention that this is still counter-intuitive and painful to use. That said,
   
   > we don't need to duplicate docs with less maintenance.
   
   is a fair point.
   
   - I can certainly remove the `Seq` variants, which cuts the number of signatures by two and leaves us with four.
   - If having only `Column` variants on the JVM is fine, we can drop the `(String, _, _) => Column` variants, which brings us to two.
   - It is also not hard to build `Column` objects transparently for Python and R users to support `(Column, Column, Column) => Column`, but I am still concerned about the confusing semantics.
   
     If two variants are still too many, we could always have `(Column, Any, Double) => Column`; `o.a.sql.functions` is already quite full of `Any`s. Or, if we're fine with making Java users miserable, we could use `(Column, Either[Double, Array[Double]], Double) => Column`, but this would require additional supporting code for R and Python.
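
   To make the `Any` versus `Either` trade-off concrete, here is a rough plain-Scala sketch of how the percentage argument might be normalized in each case. It is not Spark code: `PercentageArg`, `fromAny`, `fromEither` and `validated` are hypothetical names used only for illustration, and the range check simply mirrors the "between 0.0 and 1.0" rule from the docs above.

   ```scala
   // Plain-Scala sketch; every name here is hypothetical and nothing below is Spark API.
   object PercentageArg {
     // Shared range check: each percentage must be between 0.0 and 1.0.
     private def validated(ps: Seq[Double]): Either[String, Seq[Double]] =
       if (ps.forall(p => p >= 0.0 && p <= 1.0)) Right(ps)
       else Left(s"Percentages must be between 0.0 and 1.0, got: ${ps.mkString(", ")}")

     // `Any` variant: accepts anything, so unsupported types surface only at runtime.
     def fromAny(percentage: Any): Either[String, Seq[Double]] = percentage match {
       case d: Double        => validated(Seq(d))
       case a: Array[Double] => validated(a.toSeq)
       case other            => Left(s"Unsupported percentage argument: $other")
     }

     // `Either` variant: the compiler rejects unsupported types, but callers must wrap values.
     def fromEither(percentage: Either[Double, Array[Double]]): Either[String, Seq[Double]] =
       percentage match {
         case Left(d)  => validated(Seq(d))
         case Right(a) => validated(a.toSeq)
       }

     def main(args: Array[String]): Unit = {
       println(fromAny(0.5))                          // scalar accepted
       println(fromAny("0.5"))                        // wrong type, rejected at runtime
       println(fromEither(Right(Array(0.25, 0.75))))  // caller has to wrap the array
     }
   }
   ```

   With `Any`, a call like `fromAny("0.5")` compiles and fails only when it runs; with `Either`, the same mistake does not compile, but every caller (Java callers in particular) has to build `Left(...)` or `Right(...)` by hand.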
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

