spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Andrew Ray (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-21184) QuantileSummaries implementation is wrong and QuantileSummariesSuite fails with larger n
Date Thu, 22 Jun 2017 20:34:01 GMT
Andrew Ray created SPARK-21184:
----------------------------------

             Summary: QuantileSummaries implementation is wrong and QuantileSummariesSuite
fails with larger n
                 Key: SPARK-21184
                 URL: https://issues.apache.org/jira/browse/SPARK-21184
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.1.1
            Reporter: Andrew Ray


1. QuantileSummaries implementation does not match the paper it is supposed to be based on.

1a. The compress method (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L240)
merges neighboring buckets, but thats not what the paper says to do. The paper (http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf)
describes an implicit tree structure and the compress method deletes selected subtrees.

1b. The paper does not discuss merging these summary data structures at all. The following
comment is in the merge method of QuantileSummaries:

{quote}      // The GK algorithm is a bit unclear about it, but it seems there is no need
to adjust the
      // statistics during the merging: the invariants are still respected after the merge.{quote}

Unless I'm missing something that needs substantiation, it's not clear that that the invariants
hold.

2. QuantileSummariesSuite fails with n = 10000 (and other non trivial values)
https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/QuantileSummariesSuite.scala#L27

One possible solution if these issues can't be resolved would be to move to an algorithm that
explicitly supports merging and is well tested like https://github.com/tdunning/t-digest




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message