spark-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From r...@apache.org
Subject git commit: MLLIB-25: Implicit ALS runs out of memory for moderately large numbers of features
Date Fri, 21 Feb 2014 21:39:43 GMT
Repository: incubator-spark
Updated Branches:
  refs/heads/branch-0.9 b3fff962e -> 998abaecb


MLLIB-25: Implicit ALS runs out of memory for moderately large numbers of features

There's a step in implicit ALS where the matrix `Yt * Y` is computed. It's computed as the
sum of matrices; an f x f matrix is created for each of n user/item rows in a partition. In
`ALS.scala:214`:

```
        factors.flatMapValues{ case factorArray =>
          factorArray.map{ vector =>
            val x = new DoubleMatrix(vector)
            x.mmul(x.transpose())
          }
        }.reduceByKeyLocally((a, b) => a.addi(b))
         .values
         .reduce((a, b) => a.addi(b))
```

Completely correct, but there's a subtle but quite large memory problem here. map() is going
to create all of these matrices in memory at once, when they don't need to ever all exist
at the same time.
For example, if a partition has n = 100000 rows, and f = 200, then this intermediate product
requires 32GB of heap. The computation will never work unless you can cough up workers with
(more than) that much heap.

Fortunately there's a trivial change that fixes it; just add `.view` in there.

Author: Sean Owen <sowen@cloudera.com>

Closes #629 from srowen/ALSMatrixAllocationOptimization and squashes the following commits:

062cda9 [Sean Owen] Update style per review comments
e9a5d63 [Sean Owen] Avoid unnecessary out of memory situation by not simultaneously allocating
lots of matrices

(cherry picked from commit c8a4c9b1f6005815f5a4a331970624d1706b6b13)
Signed-off-by: Reynold Xin <rxin@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/incubator-spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spark/commit/998abaec
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spark/tree/998abaec
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spark/diff/998abaec

Branch: refs/heads/branch-0.9
Commit: 998abaecbc76f0a2b0317350d0ee589e78b0fbb0
Parents: b3fff96
Author: Sean Owen <sowen@cloudera.com>
Authored: Fri Feb 21 12:46:12 2014 -0800
Committer: Reynold Xin <rxin@apache.org>
Committed: Fri Feb 21 13:39:17 2014 -0800

----------------------------------------------------------------------
 .../main/scala/org/apache/spark/mllib/recommendation/ALS.scala   | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spark/blob/998abaec/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala b/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
index 89ee070..3e93402 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
@@ -209,8 +209,8 @@ class ALS private (var numBlocks: Int, var rank: Int, var iterations:
Int, var l
   def computeYtY(factors: RDD[(Int, Array[Array[Double]])]) = {
     if (implicitPrefs) {
       Option(
-        factors.flatMapValues{ case factorArray =>
-          factorArray.map{ vector =>
+        factors.flatMapValues { case factorArray =>
+          factorArray.view.map { vector =>
             val x = new DoubleMatrix(vector)
             x.mmul(x.transpose())
           }


Mime
View raw message