spark-reviews mailing list archives

From sethah <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-10641][WIP][SQL] Add Skewness and Kurto...
Date Fri, 23 Oct 2015 17:13:50 GMT
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9003#discussion_r42889596
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala ---
    @@ -930,3 +930,327 @@ object HyperLogLogPlusPlus {
       )
       // scalastyle:on
     }
    +
    +/**
    + * A central moment is the expected value of a specified power of the deviation of a random
    + * variable from the mean. Central moments are often used to characterize the properties of about
    + * the shape of a distribution.
    + *
    + * This class implements online, one-pass algorithms for computing the central moments of a set of
    + * points.
    + *
    + * References:
    + *  - Xiangrui Meng.  "Simpler Online Updates for Arbitrary-Order Central Moments."
    + *      2015. http://arxiv.org/abs/1510.04923
    + *
    + * @see [[https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
    + *     Algorithms for calculating variance (Wikipedia)]]
    + *
    + * @param child to compute central moments of.
    + */
    +abstract class CentralMomentAgg(child: Expression) extends ImperativeAggregate with Serializable {
    +
    +  /**
    +   * The maximum central moment order to be computed.
    +   */
    +  protected def momentOrder: Int
    +
    +  /**
    +   * Array of sufficient moments need to compute the aggregate statistic.
    +   */
    +  protected def sufficientMoments: Array[Int]
    +
    +  override def children: Seq[Expression] = Seq(child)
    +
    +  override def nullable: Boolean = false
    +
    +  override def dataType: DataType = DoubleType
    +
    +  // Expected input data type.
    +  // TODO: Right now, we replace old aggregate functions (based on AggregateExpression1) to the
    +  // new version at planning time (after analysis phase). For now, NullType is added at here
    +  // to make it resolved when we have cases like `select avg(null)`.
    +  // We can use our analyzer to cast NullType to the default data type of the NumericType once
    +  // we remove the old aggregate functions. Then, we will not need NullType at here.
    +  override def inputTypes: Seq[AbstractDataType] = Seq(TypeCollection(NumericType, NullType))
    +
    +  override def aggBufferSchema: StructType = StructType.fromAttributes(aggBufferAttributes)
    +
    +  /**
    +   * The number of central moments to store in the buffer.
    +   */
    +  private[this] val numMoments = 5
    +
    +  override val aggBufferAttributes: Seq[AttributeReference] = Seq.tabulate(numMoments) { i =>
    +    AttributeReference(s"M$i", DoubleType)()
    +  }
    +
    +  // Note: although this simply copies aggBufferAttributes, this common code can not be placed
    +  // in the superclass because that will lead to initialization ordering issues.
    +  // in the superclass because that will lead to initialization ordering issues.
    +  override val inputAggBufferAttributes: Seq[AttributeReference] =
    +    aggBufferAttributes.map(_.newInstance())
    +
    +  /**
    +   * Initialize all moments to zero.
    +   */
    +  override def initialize(buffer: MutableRow): Unit = {
    +    for (aggIndex <- 0 until numMoments) {
    +      buffer.setDouble(mutableAggBufferOffset + aggIndex, 0.0)
    +    }
    +  }
    +
    +  // frequently used values for online updates
    +  private[this] var delta = 0.0
    --- End diff --
    
    done.
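
For readers of the archive, here is a rough sketch of what a one-pass update of the first four central moments can look like. It follows the single-value update given in the Wikipedia article linked in the Scaladoc, not necessarily the formulation in the cited paper or the code in this PR. The class name `OnlineMoments` and all of its members are hypothetical, and it uses plain Scala fields instead of a Spark aggregation buffer. Its five running values (count, mean, M2, M3, M4) are one plausible reading of the five-slot buffer in the diff, but that mapping is an assumption.

```scala
// Hypothetical helper, not part of the PR: plain-Scala sketch of one-pass
// central moment updates (Wikipedia-style single-value update).
final class OnlineMoments {
  private var n = 0.0     // number of values seen so far
  private var mean = 0.0  // running mean
  private var m2 = 0.0    // running sum of (x - mean)^2
  private var m3 = 0.0    // running sum of (x - mean)^3
  private var m4 = 0.0    // running sum of (x - mean)^4

  /** Fold one value into the running moments. */
  def update(x: Double): Unit = {
    val n1 = n
    n += 1.0
    val delta = x - mean
    val deltaN = delta / n
    val deltaN2 = deltaN * deltaN
    val term1 = delta * deltaN * n1
    mean += deltaN
    // Higher orders are updated first because they read the old m2 and m3.
    m4 += term1 * deltaN2 * (n * n - 3.0 * n + 3.0) + 6.0 * deltaN2 * m2 - 4.0 * deltaN * m3
    m3 += term1 * deltaN * (n - 2.0) - 3.0 * deltaN * m2
    m2 += term1
  }

  def count: Double = n
  def average: Double = mean

  // Population statistics; these return NaN when no values were seen or
  // when all values are equal (m2 == 0).
  def variance: Double = m2 / n
  def skewness: Double = math.sqrt(n) * m3 / math.pow(m2, 1.5)
  def kurtosis: Double = n * m4 / (m2 * m2) - 3.0  // excess kurtosis
}
```

Calling `update` once per input value and then reading `skewness` or `kurtosis` gives the statistic in a single pass over the data. Merging two partial results, which a distributed aggregate also needs, is not covered by this sketch.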


