flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-1695) Create machine learning library
Date Mon, 16 Mar 2015 17:23:40 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14363526#comment-14363526
] 

ASF GitHub Bot commented on FLINK-1695:
---------------------------------------

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/479#discussion_r26506950
  
    --- Diff: docs/ml/alternating_least_squares.md ---
    @@ -0,0 +1,157 @@
    +---
    +mathjax: include
    +title: Alternating Least Squares
    +---
    +<!--
    +Licensed to the Apache Software Foundation (ASF) under one
    +or more contributor license agreements.  See the NOTICE file
    +distributed with this work for additional information
    +regarding copyright ownership.  The ASF licenses this file
    +to you under the Apache License, Version 2.0 (the
    +"License"); you may not use this file except in compliance
    +with the License.  You may obtain a copy of the License at
    +
    +  http://www.apache.org/licenses/LICENSE-2.0
    +
    +Unless required by applicable law or agreed to in writing,
    +software distributed under the License is distributed on an
    +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +KIND, either express or implied.  See the License for the
    +specific language governing permissions and limitations
    +under the License.
    +-->
    +
    +* This will be replaced by the TOC
    +{:toc}
    +
    +## Description
    +
    +The alternating least squares (ALS) algorithm factorizes a given matrix $R$ into two
factors $U$ and $V$ such that $R \approx U^TV$.
    +The unknown row dimension is given as a parameter to the algorithm and is called latent
factors.
    +Since matrix factorization can be used in the context of recommendation, the matrices
$U$ and $V$ can be called user and item matrix, respectively.
    +The $i$th column of the user matrix is denoted by $u_i$ and the $i$th column of the item
matrix is $v_i$.
    +The matrix $R$ can be called the ratings matrix with $$(R)_{i,j} = r_{i,j}$$.
    +
    +In order to find the user and item matrix, the following problem is solved:
    +
    +$$\arg\min_{U,V} \sum_{\{i,j\mid r_{i,j} \not= 0\}} \left(r_{i,j} - u_{i}^Tv_{j}\right)^2
+ 
    +\lambda \left(\sum_{i} n_{u_i} \left\lVert u_i \right\rVert^2 + \sum_{j} n_{v_j} \left\lVert
v_j \right\rVert^2 \right)$$
    +
    +with $\lambda$ being the regularization factor, $$n_{u_i}$$ being the number of items
the user $i$ has rated and $$n_{v_j}$$ being the number of times the item $j$ has been rated.
    +This regularization scheme to avoid overfitting is called weighted-$\lambda$-regularization.
    +Details can be found in the work of [Zhou et al.](http://dx.doi.org/10.1007/978-3-540-68880-8_32).
    +
    +By fixing one of the matrices $U$ or $V$, we obtain a quadratic form which can be solved
directly.
    +The solution of the modified problem is guaranteed to monotonically decrease the overall
cost function.
    +By applying this step alternately to the matrices $U$ and $V$, we can iteratively improve
the matrix factorization.
    +
    +The matrix $R$ is given in its sparse representation as a tuple of $(i, j, r)$ where
$i$ denotes the row index, $j$ the column index and $r$ is the matrix value at position $(i,j)$.
    +
    +
    +## Parameters
    +
    +The alternating least squares implementation can be controlled by the following parameters:
    +
    +   <table class="table table-bordered">
    +    <thead>
    +      <tr>
    +        <th class="text-left" style="width: 20%">Parameters</th>
    +        <th class="text-center">Description</th>
    +      </tr>
    +    </thead>
    +
    +    <tbody>
    +      <tr>
    +        <td><strong>NumFactors</strong></td>
    +        <td>
    +          <p>
    +            The number of latent factors to use for the underlying model.
    +            It is equivalent to the dimension of the calculated user and item vectors.
    +            (Default value: <strong>10</strong>)
    +          </p>
    +        </td>
    +      </tr>
    +      <tr>
    +        <td><strong>Lambda</strong></td>
    +        <td>
    +          <p>
    +            Regularization factor. Tune this value in order to avoid overfitting or poor
performance due to strong generalization.
    +            (Default value: <strong>1</strong>)
    +          </p>
    +        </td>
    +      </tr>
    +      <tr>
    +        <td><strong>Iterations</strong></td>
    +        <td>
    +          <p>
    +            The maximum number of iterations.
    +            (Default value: <strong>10</strong>)
    +          </p>
    +        </td>
    +      </tr>
    +      <tr>
    +        <td><strong>Blocks</strong></td>
    +        <td>
    +          <p>
    +            The number of blocks into which the user and item matrix a grouped. 
    +            The fewer blocks one uses, the less data is sent redundantly. 
    +            However, bigger blocks entail bigger update messages which have to be stored
on the heap. 
    +            If the algorithm fails because of an OutOfMemoryException, then try to increase
the number of blocks. 
    +            (Default value: '''None''')
    +          </p>
    +        </td>
    +      </tr>
    +      <tr>
    +        <td><strong>Seed</strong></td>
    +        <td>
    +          <p>
    +            Random seed used to generate the initial item matrix for the algorithm.
    +            (Default value: <strong>0</strong>)
    +          </p>
    +        </td>
    +      </tr>
    +      <tr>
    +        <td><strong>TemporaryPath</strong></td>
    +        <td>
    +          <p>
    +            Path to a temporary directory into which intermediate results are stored.

    +            If this value is set, then the algorithm is split into two preprocessing
steps, the ALS iteration  and a post-processing step which calculates a last ALS half-step.

    +            The preprocessing steps calculate the <code>OutBlockInformation</code>
and <code>InBlockInformation</code> for the given rating matrix. 
    +            The result of the individual steps are stored in the specified directory.

    +            By splitting the algorithm into multiple smaller steps, Flink does not have
to split the available memory amongst too many operators. 
    +            This allows the system to process bigger individual messasges and improves
the overall performance. 
    --- End diff --
    
    :+1: 


> Create machine learning library
> -------------------------------
>
>                 Key: FLINK-1695
>                 URL: https://issues.apache.org/jira/browse/FLINK-1695
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>
> Create the infrastructure for Flink's machine learning library. This includes the creation
of the module structure and the implementation of basic types such as vectors and matrices.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message