##### Site index · List index
Message view
Top
From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-1695) Create machine learning library
Date Mon, 16 Mar 2015 17:23:40 GMT

]

ASF GitHub Bot commented on FLINK-1695:
---------------------------------------

Github user tillrohrmann commented on a diff in the pull request:

--- Diff: docs/ml/alternating_least_squares.md ---
@@ -0,0 +1,157 @@
+---
+mathjax: include
+title: Alternating Least Squares
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+
+Unless required by applicable law or agreed to in writing,
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+The alternating least squares (ALS) algorithm factorizes a given matrix $R$ into two
factors $U$ and $V$ such that $R \approx U^TV$.
+The unknown row dimension is given as a parameter to the algorithm and is called latent
factors.
+Since matrix factorization can be used in the context of recommendation, the matrices
$U$ and $V$ can be called user and item matrix, respectively.
+The $i$th column of the user matrix is denoted by $u_i$ and the $i$th column of the item
matrix is $v_i$.
+The matrix $R$ can be called the ratings matrix with $$(R)_{i,j} = r_{i,j}$$.
+
+In order to find the user and item matrix, the following problem is solved:
+
+$$\arg\min_{U,V} \sum_{\{i,j\mid r_{i,j} \not= 0\}} \left(r_{i,j} - u_{i}^Tv_{j}\right)^2 + +\lambda \left(\sum_{i} n_{u_i} \left\lVert u_i \right\rVert^2 + \sum_{j} n_{v_j} \left\lVert v_j \right\rVert^2 \right)$$
+
+with $\lambda$ being the regularization factor, $$n_{u_i}$$ being the number of items
the user $i$ has rated and $$n_{v_j}$$ being the number of times the item $j$ has been rated.
+This regularization scheme to avoid overfitting is called weighted-$\lambda$-regularization.
+Details can be found in the work of [Zhou et al.](http://dx.doi.org/10.1007/978-3-540-68880-8_32).
+
+By fixing one of the matrices $U$ or $V$, we obtain a quadratic form which can be solved
directly.
+The solution of the modified problem is guaranteed to monotonically decrease the overall
cost function.
+By applying this step alternately to the matrices $U$ and $V$, we can iteratively improve
the matrix factorization.
+
+The matrix $R$ is given in its sparse representation as a tuple of $(i, j, r)$ where
$i$ denotes the row index, $j$ the column index and $r$ is the matrix value at position $(i,j)$.
+
+
+## Parameters
+
+The alternating least squares implementation can be controlled by the following parameters:
+
+   <table class="table table-bordered">
+      <tr>
+        <th class="text-left" style="width: 20%">Parameters</th>
+        <th class="text-center">Description</th>
+      </tr>
+
+    <tbody>
+      <tr>
+        <td><strong>NumFactors</strong></td>
+        <td>
+          <p>
+            The number of latent factors to use for the underlying model.
+            It is equivalent to the dimension of the calculated user and item vectors.
+            (Default value: <strong>10</strong>)
+          </p>
+        </td>
+      </tr>
+      <tr>
+        <td><strong>Lambda</strong></td>
+        <td>
+          <p>
+            Regularization factor. Tune this value in order to avoid overfitting or poor
performance due to strong generalization.
+            (Default value: <strong>1</strong>)
+          </p>
+        </td>
+      </tr>
+      <tr>
+        <td><strong>Iterations</strong></td>
+        <td>
+          <p>
+            The maximum number of iterations.
+            (Default value: <strong>10</strong>)
+          </p>
+        </td>
+      </tr>
+      <tr>
+        <td><strong>Blocks</strong></td>
+        <td>
+          <p>
+            The number of blocks into which the user and item matrix a grouped.
+            The fewer blocks one uses, the less data is sent redundantly.
+            However, bigger blocks entail bigger update messages which have to be stored
on the heap.
+            If the algorithm fails because of an OutOfMemoryException, then try to increase
the number of blocks.
+            (Default value: '''None''')
+          </p>
+        </td>
+      </tr>
+      <tr>
+        <td><strong>Seed</strong></td>
+        <td>
+          <p>
+            Random seed used to generate the initial item matrix for the algorithm.
+            (Default value: <strong>0</strong>)
+          </p>
+        </td>
+      </tr>
+      <tr>
+        <td><strong>TemporaryPath</strong></td>
+        <td>
+          <p>
+            Path to a temporary directory into which intermediate results are stored.

+            If this value is set, then the algorithm is split into two preprocessing
steps, the ALS iteration  and a post-processing step which calculates a last ALS half-step.

+            The preprocessing steps calculate the <code>OutBlockInformation</code>
and <code>InBlockInformation</code> for the given rating matrix.
+            The result of the individual steps are stored in the specified directory.

+            By splitting the algorithm into multiple smaller steps, Flink does not have
to split the available memory amongst too many operators.
+            This allows the system to process bigger individual messasges and improves
the overall performance.
--- End diff --

:+1:

> Create machine learning library
> -------------------------------
>
>          Issue Type: Improvement
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>
> Create the infrastructure for Flink's machine learning library. This includes the creation
of the module structure and the implementation of basic types such as vectors and matrices.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message