spark-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From mgaido91 <...@git.apache.org>
Subject [GitHub] spark pull request #19340: [SPARK-22119][ML] Add cosine distance to KMeans
Date Sun, 14 Jan 2018 08:53:11 GMT
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19340#discussion_r161391180
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
    @@ -573,44 +705,45 @@ object KMeans {
       }
     
       /**
    -   * Returns the K-means cost of a given point against the given cluster centers.
    +   * @return whether a center converged or not, given the epsilon parameter.
        */
    -  private[mllib] def pointCost(
    -      centers: TraversableOnce[VectorWithNorm],
    -      point: VectorWithNorm): Double =
    -    findClosest(centers, point)._2
    +  override def isCenterConverged(
    +      oldCenter: VectorWithNorm,
    +      newCenter: VectorWithNorm,
    +      epsilon: Double): Boolean = {
    +    EuclideanDistanceMeasure.fastSquaredDistance(newCenter, oldCenter) <= epsilon
* epsilon
    --- End diff --
    
    Using `sqrt` would introduce a performance regression. This is the reason why I can't
use only a function to differentiate the to distance measures, because the implementation
for Euclidean distance is highly optimized and this is an optimization: avoiding sqrt can
be a great performance improvement since it is an expensive operation.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Mime
View raw message