Distributed Stochastic PCA

Intro

Mahout has a distributed implementation of Stochastic PCA [1]. This algorithm computes the exact equivalent of Mahout's dssvd(\(\mathbf{A-1\mu^\top}\)) by modifying the dssvd algorithm so as to avoid forming \(\mathbf{A-1\mu^\top}\), which would densify a sparse input. Thus, it is suitable for work with both dense and sparse inputs.
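
To see why forming \(\mathbf{A-1\mu^\top}\) explicitly would be costly, consider the naive alternative sketched below. This is not part of the Mahout sources; it is a minimal sketch that assumes a sparse DrmLike[Int] named drmA and an implicit distributed context are already in scope (the value name drmCentered is illustrative), and it reuses only operations that appear in the implementation further down (colMeans, drmBroadcast, mapBlock). Subtracting the column means turns every structural zero into \(-\mu_j\), so the centered copy is dense; dspca obtains the same factorization without ever materializing it.

// A minimal sketch (not part of Mahout): naive explicit centering of a sparse DRM.
// Assumes `drmA: DrmLike[Int]` is already loaded and an implicit DistributedContext is in scope.
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

val mu = drmA.colMeans          // dense n-vector of column means
val bcastMu = drmBroadcast(mu)  // ship the means to the workers once

// Explicit centering: every structural zero in a sparse block becomes -mu(j),
// so the resulting blocks are effectively dense -- exactly what dspca avoids.
val drmCentered = drmA.mapBlock() { case (keys, block) =>
    val mu: Vector = bcastMu
    for (row <- 0 until block.nrow) block(row, ::) -= mu
    keys -> block
}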

Algorithm

Given an \(m \times n\) matrix \(\mathbf{A}\), a target rank \(k\), an oversampling parameter \(p\), and a number of power iterations \(q\), this procedure computes a rank-\(k\) PCA by finding the unknowns in \(\mathbf{A-1\mu^\top \approx U\Sigma V^\top}\), where \(\mu\) is the vector of column means of \(\mathbf{A}\):

  1. Create seed for random \(n \times (k+p)\) matrix \(\boldsymbol{\Omega}\).
  2. \(\mathbf{s_\Omega \leftarrow \Omega^\top \mu}\).
  3. \(\mathbf{Y_0 \leftarrow A\Omega - 1 {s_\Omega}^\top}\), \(\mathbf{Y} \in \mathbb{R}^{m\times(k+p)}\).
  4. Column-orthonormalize \(\mathbf{Y_0} \rightarrow \mathbf{Q}\) by computing thin decomposition \(\mathbf{Y_0} = \mathbf{QR}\). Also, \(\mathbf{Q}\in\mathbb{R}^{m\times(k+p)},\ \mathbf{R}\in\mathbb{R}^{(k+p)\times(k+p)}\).
  5. \(\mathbf{s_Q \leftarrow Q^\top 1}\).
  6. \(\mathbf{B_0 \leftarrow Q^\top A}\), \(\mathbf{B} \in \mathbb{R}^{(k+p)\times n}\).
  7. \(\mathbf{s_B \leftarrow B_0 \mu}\).
  8. For i in 1..q repeat (power iterations):
     • For j in 1..n apply \(\mathbf{(B_{i-1})_{*j} \leftarrow (B_{i-1})_{*j} - \mu_j s_Q}\).
     • \(\mathbf{Y_i \leftarrow A{B_{i-1}}^\top - 1(s_B - \mu^\top\mu\, s_Q)^\top}\).
     • Column-orthonormalize \(\mathbf{Y_i} \rightarrow \mathbf{Q}\) by computing thin decomposition \(\mathbf{Y_i = QR}\).
     • \(\mathbf{s_Q \leftarrow Q^\top 1}\).
     • \(\mathbf{B_i \leftarrow Q^\top A}\).
     • \(\mathbf{s_B \leftarrow B_i \mu}\).
  9. Let \(\mathbf{C \triangleq s_Q {s_B}^\top}\). \(\mathbf{M \leftarrow B_q {B_q}^\top - C - C^\top + \mu^\top\mu\, s_Q {s_Q}^\top}\) (see the derivation following this list).
  10. Compute an eigensolution of the small symmetric \(\mathbf{M = \hat{U} \Lambda \hat{U}^\top}\), \(\mathbf{M} \in \mathbb{R}^{(k+p)\times(k+p)}\).
  11. The singular values are \(\Sigma = \Lambda^{\circ 0.5}\), or, in other words, \(\sigma_i = \sqrt{\lambda_i}\).
  12. If needed, compute \(\mathbf{U = Q\hat{U}}\).
  13. If needed, compute \(\mathbf{V = {B_q}^\top \hat{U} \Sigma^{-1}}\).
  14. If needed, items converted to the PCA space can be computed as \(\mathbf{U\Sigma}\).
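
To see where the correction terms in step 9 come from, note that \(\mathbf{M}\) is just the projected Gram matrix of the implicitly centered input. A short sketch of the expansion, writing \(\mathbf{B}\) for \(\mathbf{B_q}\) after the power iterations and using \(\mathbf{B = Q^\top A}\), \(\mathbf{s_Q = Q^\top 1}\), \(\mathbf{s_B = B\mu}\) and \(\mathbf{C = s_Q {s_B}^\top}\):

\[
\begin{aligned}
\mathbf{Q^\top(A-1\mu^\top)(A-1\mu^\top)^\top Q}
&= \mathbf{Q^\top A A^\top Q} - \mathbf{Q^\top A \mu\,1^\top Q} - \mathbf{Q^\top 1\,\mu^\top A^\top Q} + (\mathbf{\mu^\top \mu})\,\mathbf{Q^\top 1\,1^\top Q} \\
&= \mathbf{B B^\top} - \mathbf{s_B {s_Q}^\top} - \mathbf{s_Q {s_B}^\top} + (\mathbf{\mu^\top \mu})\,\mathbf{s_Q {s_Q}^\top} \\
&= \mathbf{B B^\top} - \mathbf{C^\top} - \mathbf{C} + (\mathbf{\mu^\top \mu})\,\mathbf{s_Q {s_Q}^\top} = \mathbf{M}.
\end{aligned}
\]

This is why dspca can produce the exact PCA of the centered matrix while only ever multiplying against the sparse \(\mathbf{A}\).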

Implementation

Mahout dspca(...) is implemented in the Mahout math-scala algebraic optimizer, which translates Mahout's R-like linear algebra operators into a physical plan for both Spark and H2O distributed engines.

def dspca[K](drmA: DrmLike[K], k: Int, p: Int = 15, q: Int = 0):
(DrmLike[K], DrmLike[Int], Vector) = {

    // Some mapBlock() calls need it
    implicit val ktag = drmA.keyClassTag

    val drmAcp = drmA.checkpoint()
    implicit val ctx = drmAcp.context

    val m = drmAcp.nrow
    val n = drmAcp.ncol
    assert(k <= (m min n), "k cannot be greater than smaller of m, n.")
    val pfxed = safeToNonNegInt((m min n) - k min p)

    // Actual decomposition rank
    val r = k + pfxed

    // Dataset mean
    val mu = drmAcp.colMeans

    val mtm = mu dot mu

    // We represent Omega by its seed.
    val omegaSeed = RandomUtils.getRandom().nextInt()
    val omega = Matrices.symmetricUniformView(n, r, omegaSeed)

    // This is done up front in a single-threaded fashion for now. Even though it doesn't require
    // any memory beyond what is needed to keep the mean vector around, it might still be
    // parallelized to the backend for significantly big n and r. TODO
    val s_o = omega.t %*% mu

    val bcastS_o = drmBroadcast(s_o)
    val bcastMu = drmBroadcast(mu)

    var drmY = drmAcp.mapBlock(ncol = r) {
        case (keys, blockA) ⇒
            val s_o: Vector = bcastS_o
            val blockY = blockA %*% Matrices.symmetricUniformView(n, r, omegaSeed)
            for (row ← 0 until blockY.nrow) blockY(row, ::) -= s_o
            keys → blockY
    }
            // Checkpoint Y
            .checkpoint()

    var drmQ = dqrThin(drmY, checkRankDeficiency = false)._1.checkpoint()

    var s_q = drmQ.colSums()
    var bcastVarS_q = drmBroadcast(s_q)

    // This actually should be optimized as identically partitioned map-side A'B since A and Q should
    // still be identically partitioned.
    var drmBt = (drmAcp.t %*% drmQ).checkpoint()

    var s_b = (drmBt.t %*% mu).collect(::, 0)
    var bcastVarS_b = drmBroadcast(s_b)

    for (i ← 0 until q) {

        // These closures don't seem to live well with outside-scope vars. This doesn't record closure
        // attributes correctly. So we create an additional set of vals for broadcast vars to properly
        // create read-only closure attributes in this very scope.
        val bcastS_q = bcastVarS_q
        val bcastMuInner = bcastMu

        // Fix Bt as B' -= mu cross s_q
        drmBt = drmBt.mapBlock() {
            case (keys, block) ⇒
                val s_q: Vector = bcastS_q
                val mu: Vector = bcastMuInner
                keys.zipWithIndex.foreach {
                    case (key, idx) ⇒ block(idx, ::) -= s_q * mu(key)
                }
                keys → block
        }

        drmY.uncache()
        drmQ.uncache()

        val bCastSt_b = drmBroadcast(s_b -=: mtm * s_q)

        drmY = (drmAcp %*% drmBt)
            // Fix Y by subtracting st_b from each row of the AB'
            .mapBlock() {
            case (keys, block) ⇒
                val st_b: Vector = bCastSt_b
                block := { (_, c, v) ⇒ v - st_b(c) }
                keys → block
        }
        // Checkpoint Y
        .checkpoint()

        drmQ = dqrThin(drmY, checkRankDeficiency = false)._1.checkpoint()

        s_q = drmQ.colSums()
        bcastVarS_q = drmBroadcast(s_q)

        // This on the other hand should be inner-join-and-map A'B optimization since A and Q_i are not
        // identically partitioned anymore.
        drmBt = (drmAcp.t %*% drmQ).checkpoint()

        s_b = (drmBt.t %*% mu).collect(::, 0)
        bcastVarS_b = drmBroadcast(s_b)
    }

    val c = s_q cross s_b
    val inCoreBBt = (drmBt.t %*% drmBt).checkpoint(CacheHint.NONE).collect -=:
        c -=: c.t +=: mtm *=: (s_q cross s_q)
    val (inCoreUHat, d) = eigen(inCoreBBt)
    val s = d.sqrt

    // Since neither drmU nor drmV are actually computed until actually used, we don't need the flags
    // instructing compute (or not compute) either of the U, V outputs anymore. Neat, isn't it?
    val drmU = drmQ %*% inCoreUHat
    val drmV = drmBt %*% (inCoreUHat %*% diagv(1 / s))

    (drmU(::, 0 until k), drmV(::, 0 until k), s(0 until k))
}

Usage

The scala dspca(...) method can easily be called in any Spark, Flink, or H2O application built with the math-scala library and the corresponding Spark, Flink, or H2O engine module as follows:

import org.apache.mahout.math._
import decompositions._
import drm._

val (drmU, drmV, s) = dspca(drmA, k = 200, q = 1)
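
As a follow-up to step 14 of the algorithm, the training items expressed in the k-dimensional PCA space can be obtained directly from the returned factors. A minimal sketch, assuming the dspca(...) call and imports above (the value name drmPcaSpace is illustrative), reusing diagv from the implementation:

// Sketch: project the (implicitly mean-centered) rows of A into the PCA space as U * Sigma.
// drmU and s are the values returned by dspca(...) above; diagv builds a diagonal matrix from s.
val drmPcaSpace = drmU %*% diagv(s)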

Note that the q parameter is optional and its default value is zero.

References

Distributed Stochastic Singular Value Decomposition

Intro

Mahout has a distributed implementation of Stochastic Singular Value Decomposition [1] using the parallelization strategy comprehensively defined in Nathan Halko's dissertation "Randomized methods for computing low-rank approximations of matrices" [2].

Modified SSVD Algorithm

Given an \(m\times n\) matrix \(\mathbf{A}\), a target rank \(k\in\mathbb{N}_{1}\), an oversampling parameter \(p\in\mathbb{N}_{1}\), and the number of additional power iterations \(q\in\mathbb{N}_{0}\), this procedure computes an \(m\times\left(k+p\right)\) SVD \(\mathbf{A\approx U}\boldsymbol{\Sigma}\mathbf{V}^{\top}\):

  1. Create seed for random \(n\times\left(k+p\right)\) matrix \(\boldsymbol{\Omega}\). The seed defines matrix \(\mathbf{\Omega}\) using Gaussian unit vectors per one of the suggestions in [Halko, Martinsson, Tropp].
  2. \(\mathbf{Y=A\boldsymbol{\Omega}},\,\mathbf{Y}\in\mathbb{R}^{m\times\left(k+p\right)}\).
  3. Column-orthonormalize \(\mathbf{Y}\rightarrow\mathbf{Q}\) by computing thin decomposition \(\mathbf{Y}=\mathbf{Q}\mathbf{R}\). Also, \(\mathbf{Q}\in\mathbb{R}^{m\times\left(k+p\right)},\,\mathbf{R}\in\mathbb{R}^{\left(k+p\right)\times\left(k+p\right)}\); denoted as \(\mathbf{Q}=\mbox{qr}\left(\mathbf{Y}\right).\mathbf{Q}\).
  4. \(\mathbf{B}_{0}=\mathbf{Q}^{\top}\mathbf{A}:\,\,\mathbf{B}\in\mathbb{R}^{\left(k+p\right)\times n}\).
  5. If \(q>0\), repeat for \(i=1..q\): \(\mathbf{B}_{i}^{\top}=\mathbf{A}^{\top}\mbox{qr}\left(\mathbf{A}\mathbf{B}_{i-1}^{\top}\right).\mathbf{Q}\) (power iterations step).
  6. Compute the eigensolution of the small Hermitian \(\mathbf{B}_{q}\mathbf{B}_{q}^{\top}=\mathbf{\hat{U}}\boldsymbol{\Lambda}\mathbf{\hat{U}}^{\top}\), \(\mathbf{B}_{q}\mathbf{B}_{q}^{\top}\in\mathbb{R}^{\left(k+p\right)\times\left(k+p\right)}\) (see the sketch following this list).
  7. Singular values \(\boldsymbol{\Sigma}=\boldsymbol{\Lambda}^{0.5}\), or, in other words, \(\sigma_{i}=\sqrt{\lambda_{i}}\).
  8. If needed, compute \(\mathbf{U}=\mathbf{Q}\hat{\mathbf{U}}\).
  9. If needed, compute \(\mathbf{V}=\mathbf{B}_{q}^{\top}\hat{\mathbf{U}}\boldsymbol{\Sigma}^{-1}\). Another way is \(\mathbf{V}=\mathbf{A}^{\top}\mathbf{U}\boldsymbol{\Sigma}^{-1}\).
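
Why this recovers an SVD: after the projection and power-iteration steps, \(\mathbf{A \approx QQ^\top A = QB_q}\), so the eigensolution of the small matrix \(\mathbf{B_q {B_q}^\top}\) is all that is needed. A brief sketch:

\[
\begin{aligned}
\mathbf{B}_{q}\mathbf{B}_{q}^{\top} = \hat{\mathbf{U}}\boldsymbol{\Lambda}\hat{\mathbf{U}}^{\top}
\;&\Longrightarrow\;
\mathbf{B}_{q} = \hat{\mathbf{U}}\boldsymbol{\Lambda}^{0.5}\mathbf{V}^{\top}
\quad\text{with}\quad
\mathbf{V} = \mathbf{B}_{q}^{\top}\hat{\mathbf{U}}\boldsymbol{\Lambda}^{-0.5}, \\
\mathbf{A} \approx \mathbf{Q}\mathbf{B}_{q}
&= \left(\mathbf{Q}\hat{\mathbf{U}}\right)\boldsymbol{\Lambda}^{0.5}\mathbf{V}^{\top}
= \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top},
\qquad
\mathbf{U} = \mathbf{Q}\hat{\mathbf{U}},\;
\boldsymbol{\Sigma} = \boldsymbol{\Lambda}^{0.5}.
\end{aligned}
\]

Since \(\boldsymbol{\Sigma}^{-1}=\boldsymbol{\Lambda}^{-0.5}\), this is exactly the \(\mathbf{V}\) of step 9.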

Implementation

Mahout dssvd(...) is implemented in the Mahout math-scala algebraic optimizer, which translates Mahout's R-like linear algebra operators into a physical plan for both Spark and H2O distributed engines.

def dssvd[K: ClassTag](drmA: DrmLike[K], k: Int, p: Int = 15, q: Int = 0):
    (DrmLike[K], DrmLike[Int], Vector) = {

    val drmAcp = drmA.checkpoint()

    val m = drmAcp.nrow
    val n = drmAcp.ncol
    assert(k <= (m min n), "k cannot be greater than smaller of m, n.")
    val pfxed = safeToNonNegInt((m min n) - k min p)

    // Actual decomposition rank
    val r = k + pfxed

    // We represent Omega by its seed.
    val omegaSeed = RandomUtils.getRandom().nextInt()

    // Compute Y = A*Omega.
    var drmY = drmAcp.mapBlock(ncol = r) {
        case (keys, blockA) =>
            val blockY = blockA %*% Matrices.symmetricUniformView(n, r, omegaSeed)
            keys -> blockY
    }

    var drmQ = dqrThin(drmY.checkpoint())._1

    // Checkpoint Q if last iteration
    if (q == 0) drmQ = drmQ.checkpoint()

    var drmBt = drmAcp.t %*% drmQ

    // Checkpoint B' if last iteration
    if (q == 0) drmBt = drmBt.checkpoint()

    for (i <- 0 until q) {
        drmY = drmAcp %*% drmBt
        drmQ = dqrThin(drmY.checkpoint())._1

        // Checkpoint Q if last iteration
        if (i == q - 1) drmQ = drmQ.checkpoint()

        drmBt = drmAcp.t %*% drmQ

        // Checkpoint B' if last iteration
        if (i == q - 1) drmBt = drmBt.checkpoint()
    }

    val (inCoreUHat, d) = eigen(drmBt.t %*% drmBt)
    val s = d.sqrt

    // Since neither drmU nor drmV are actually computed until actually used,
    // we don't need the flags instructing compute (or not compute) either of the U, V outputs
    val drmU = drmQ %*% inCoreUHat
    val drmV = drmBt %*% (inCoreUHat %*%: diagv(1 /: s))

    (drmU(::, 0 until k), drmV(::, 0 until k), s(0 until k))
}

Note: As a side effect of checkpointing, U and V values are returned as logical operators (i.e. they are neither checkpointed nor computed). Therefore there is no physical work actually done to compute \(\mathbf{U}\) or \(\mathbf{V}\) until they are used in a subsequent expression.
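
For example (a minimal sketch, assuming drmU is the first factor returned by dssvd(...); the value names drmUtU and inCoreUtU are illustrative), declaring a product of logical operators still does no distributed work; optimization and execution happen only once the result is materialized:

// drmU is still a logical operator at this point -- no distributed job has run to produce it.
val drmUtU = drmU.t %*% drmU

// Work is triggered only when the expression is materialized, e.g. checkpointed and collected.
val inCoreUtU = drmUtU.checkpoint().collect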

Usage

The scala dssvd(...) method can easily be called in any Spark or H2O application built with the math-scala library and the corresponding Spark or H2O engine module as follows:

import org.apache.mahout.math._
import decompositions._
import drm._

val (drmU, drmV, s) = dssvd(drmA, k = 40, q = 1)

References

2. Nathan Halko, "Randomized methods for computing low-rank approximations of matrices" (PhD dissertation), http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf

Copyright © 2014-2016 The Apache Software Foundation, Licensed under the Apache License, Version 2.0.

Apache Mahout, Mahout, Apache, the Apache feather logo, and the elephant rider logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.