spark-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From mengxr <...@git.apache.org>
Subject [GitHub] spark pull request: [MLlib] [SPARK-2885] DIMSUM: All-pairs similar...
Date Tue, 16 Sep 2014 01:42:47 GMT
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1778#discussion_r17579669
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
---
    @@ -390,6 +393,113 @@ class RowMatrix(
         new RowMatrix(AB, nRows, B.numCols)
       }
     
    +  /**
    +   * Compute all cosine similarities between columns of this matrix using the brute-force
    +   * approach of computing normalized dot products.
    +   *
    +   * @return An n x n sparse upper-triangular matrix of cosine similarities between
    +   *         columns of this matrix.
    +   */
    +  def columnSimilarities(): CoordinateMatrix = {
    +    similarColumns(0.0)
    +  }
    +
    +  /**
    +   * Compute all similarities between columns of this matrix using a sampling approach.
    +   *
    +   * The threshold parameter is a trade-off knob between estimate quality and computational
cost.
    +   *
    +   * Setting a threshold of 0 guarantees deterministic correct results, but comes at
exactly
    +   * the same cost as the brute-force approach. Setting the threshold to positive values
    +   * incurs strictly less computational cost than the brute-force approach, however the
    +   * similarities computed will be estimates.
    +   *
    +   * The sampling guarantees relative-error correctness for those pairs of columns that
have
    +   * similarity greater than the given similarity threshold.
    +   *
    +   * To describe the guarantee, we set some notation:
    +   * Let A be the smallest in magnitude non-zero element of this matrix.
    +   * Let B be the largest  in magnitude non-zero element of this matrix.
    +   * Let L be the number of non-zeros per row.
    +   *
    +   * For example, for {0,1} matrices: A=B=1.
    +   * Another example, for the Netflix matrix: A=1, B=5
    +   *
    +   * For those column pairs that are above the threshold,
    +   * the computed similarity is correct to within 20% relative error with probability
    +   * at least 1 - (0.981)^(100/B)
    +   *
    +   * The shuffle size is bounded by the *smaller* of the following two expressions:
    +   *
    +   * O(n log(n) L / (threshold * A))
    +   * O(m L^2)
    +   *
    +   * The latter is the cost of the brute-force approach, so for non-zero thresholds,
    +   * the cost is always cheaper than the brute-force approach.
    +   *
    +   * @param threshold Similarities above this threshold are probably computed correctly.
    --- End diff --
    
    Elaborate more on `probably computed correctly`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Mime
View raw message