Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Hama Wiki" for change notification.
The "SpMV" page has been changed by Mikalai Parafeniuk:
http://wiki.apache.org/hama/SpMV?action=diff&rev1=8&rev2=9
== Distributed Sparse MatrixVector Multiplication on Hama ==
=== Introduction ===
 In further description we will research problem in form u = Av. Most computational algoritms
spend large percent of time for solving systems of linear equations. In general, linear system
of equations can be represented in matrix form Ax = b, where A is matrix with n rows and n
columns, b  vector of size n, x  unknown solution vector which we are searching. Some approaches
for solving linear systems has iterative nature. Assume, we know the initial approximation
of x = x0. After that we represen our system in form xn = Bxn1 + c, where c  vector of size
n. After that we have to found next approximations of x till the convergence. In real world
most of matrices contain relatively small number of nonzero items in comparison to total
number of matrix items. Such matrices are called sparse matrices, matrices which filled most
with nonzero items are called dense. So this page will describe the problem of sparse matrix
vector multiplication(SpMV) with use of Bulk Synchronous Programming(BSP) model implemented
in Apache Hama project. As shown above, SpMV can be used in different iterative solvers for
system of linear equations.
+ In further description we will research problem in form u = Av. Most computational algoritms
spends large percent of time for solving large systems of linear equations. In general, system
of linear equations can be represented in matrix form Ax = b, where A is matrix with n rows
and n columns, b  vector of size n, x  unknown solution vector which we are searching. Some
approaches for solving linear systems has iterative nature. Assume, we know the initial approximation
of x = x0. After that we represent our system in form xn = Bxn1 + c, where c  vector of
size n. After that we have to found next approximations of x till the convergence. In real
world most of matrices contain relatively small number of nonzero items in comparison to
total number of matrix items. Such matrices are called sparse matrices, matrices which filled
most with nonzero items are called dense. Sparse matrices arise when each variable from the
set is connected with small subset of variables (for example, differential equation of heat
conduction). So, this page will describe the problem of sparse matrix vector multiplication(SpMV)
with use of Bulk Synchronous Programming(BSP) model implemented in Apache Hama project. As
shown above, SpMV can be used in different iterative solvers for system of linear equations.
 Bulk Synchronous model proposes it's own smart way of parallelization of programs. The input
problem is separated by peers. Peers can be a processors, threads, separate machines, different
items of cloud. BSP algorithm is divided in sequence of supersteps. Barrier synchronization
of all peers is made after each superstep. The implementation of BSP(Apache Hama) contains
primitives for defining peer number, communication with other peers with different communication
primitives, optimizations of communication between peers.
+ Bulk Synchronous model proposes it's own smart way of parallelization of programs. We can
specify input path for problem and number of peers. Framework reads the input and divides
it between peers. Peers can be a processors, threads, separate machines, different items of
cloud. BSP algorithm is divided in sequence of supersteps. Barrier synchronization of all
peers is made after each superstep. The implementation of BSP(Apache Hama) contains primitives
for defining peer number, communication with other peers with different communication primitives,
optimizations of communication between peers, also it inherits most of features and approaches
of Hadoop project.
=== Problem description ===
As a sequential problem SpMV is almost trivial problem. But in case of parallel version
we should think about some additional aspects:
 1. Partitioning of matrix and vector components. This means that we should split the input
matrix and vectors by peers, if we want to have benefits from usage of parallel algorithm.
Wise partitioning should be taken or communication time will rise very much and algorithm
will be inefficient.
+ 1. Partitioning of matrix and vector components. This means that we should split the input
matrix and vectors by peers, if we want to have benefits from usage of parallel algorithm.
Wise partitioning should be taken or communication time will rise very much or we will get
great load imbalance and algorithm will be inefficient.
2. Load balancing. This means that each peer must perform nearly the same amount of work,
and none of them should idle.
 3. We should keep communication in bounds. In case of paralel SpMV we should take partitioning
wise to keep communication in appropriate bounds independently of sparsity patterns of input
matrix and vector.
+ 3. We must consider Hadoop and Hama approach for parallelization.
=== Implementation tips ===
 1. Order of distribution and representation. We have two choices in this aspect: represent
matrix first and distribute later, or distribute matrix first and represent later. In first
case (represent first, distribute later) all simple operations will be nonlocal and will
bring some unnecessary overhead. In other case (distribute first, represent later) all local
operations on processor remain local: algorithm first determines responsible processor and
it performs operation locally. Thats why I prefer distribution first representation later
approach.
 2. Data transmission direction. Here we also have two choices: delivery vector component
to processor which possesses nonzero matrix component or vice versa. In most cases a number
of nonzero items in matrix is much larger than vector length, thats why we prefer transmission
of vector.
+ 1. Framework splits the input file to peers automatically. So we don't need to perform
mapping of matrix to peers manually. We only must define how matrix can be written to file
and how it can be readed from it. If we create matrix, which consists from separate cells,
framework will give some subset of cells to each peer. If we create matrix consisting from
rows, framework will give subset of rows to each peer. The ways to influence on partitioning:
creating different writables for matrices as described above, overriding default partitioner
class behavior.
+ 2. We don't need to care about communication in case of rowwise matrix access. First of
all, rows of matrix are splitted automatically by the framework. After that we can compute
inner product of the vector and concrete matrix row, and the result can be directly printed
to output, because it is one of the cells of result vector. In this case we assume, that peer's
memory can fit two vectors. Even if we have million x million matrix and vector of size million,
some megabytes will be enough to store them. Even if we split input vector the gain in memory
will be insignificant.
=== Algorithm description ===
 The generic algorithm will be divided in three supersteps:
+ The generic algorithm will contain one superstep, because no communication is needed:
0. Matrix and vector distribution.
 1. Fanout.
+ 1. Custom partitioning.
2. Local computation.
+ 3. Output of result vector.
+ In setup stage every peer reads input dense vector from file. After that, framework will
partition matrix rows by the algorithm provided in custom partitioner automatically. After
that local computation is performed. We gain some cells of result vector, and they are written
to output file.
 3. Fanin.
 In Fanout phase all processors gets needed v components. In local computation phase local
contribution to result vector is calculated. In Fanin phase all local contributions are sent
to an owner of u. Most of efforts should be taken to choose right matrix and vector distribution
which will improve the comunication volume of Fanout and Fanin phase. As base implementation
of distribution I propose to create Cartesian (column mappings are not dependent of row mappings
and vice versa) cyclicblock distribution with cyclical distribution of matrix diagonal. Also
I assume that distr(u) != distr(v), which gives us more freedom in optimising vector distribution.
This type of distribution has such advantages: it is simple, in fanin only communication with
processor column is needed, in fanout only communication with processor row is needed, we
can easily predict the productivity of algorithm. After matrix distribution we perform vector
distribution in greedy way for each processor: processor grabs items until he reaches it's
optimum state. After that stage some vector components can be unassigned (nearly 10%). We
well distribute them in greedy fashion to. To support efficient local computation used data
structure should provide indeces of rows and columns which have the nonzero item in them.
Local computation must be performed with local indeces.

 === Dealing with large matrices ===
 Current code contains classes which work with matrix in memory. That's why algorithm will
fail in case of large matrices. So I propose some steps to modify SpMV algorithm to work with
large matrices. First of all, simple matrix format based on work with file system will be
created. Let's call this class FileMatrix. This format will give such possibilities:
 1. we can set matrix cell and it will be appended to file, without any check.
 2. we can iterate through entire file for getting all matrix cells.
 Such constraints are chosen because it is hard to imagine, how we can efficiently implement
some matrix operations, for example, get cell with specified index. In the same time this
constraints meets constraints of HDFS (large size of block, data will be written once and
read many times, fast sequential reading of entire file). Creation of such class won't take
much time, and it will be possible to store and read large matrices. The bottleneck in current
algorithm in memory consumption  phase of matrix distribution. Array of local matrices is
stored in memory. We can correct the situation in such way: input matrix is stored in file,
after that we iterate through matrix and map its components to local matrices also presented
as FileMatrix. From now we won't store array of local matrices in memory, in this step we
assume that matrix, taken from local file can fit memory of local processor. But even this
can be improved. When local matrix can't fit local processor memory we will invoke local SpMV
algorithm on matrix parts. I propose to create class, which implements Mapper interface from
linearalgebra package, and split local matrix recursively into chunks presented like FileMatrix
until each chunk can fit local memory. After that local chunks will be verified. I will call
further this process twophase mapping. After making such steps, we will avoid storing entire
matrix in memory, now we assume that matrix can fit entire space of hard drives and can store
components of local vector in memory. Also twophase mapping can be used in RandomMatrixGenerator
for large matrices.
=== Possible improvements ===
+ 1. Significant improvement in total time of algorithm can be achieved by creating custom
partitioner class. It will give us load balancing and therefore better efficiency. This is
the main possibility for optimization, because we decided, that using of rowwise matrix access
i acceptable. Maybe it can be achieved by reordering of input or by customizing partitioning
algorithm of framework.
 1. Implementation of Mondrian distribution. In most cases it gives better results in comparison
with cyclicblock Cartesian scheme.
 2. Implement algorithm for detecting matrix sparsity patterns. This will give us a possibility
to define, for example, if matrix is random sparse matrix, or Laplacian matrix. In case of
random sparse matrices we can use distribution patterns which are independent of matrix sparsity
pattern. In case of Laplacian matrices we diamond distribution can give better result.
 3. In phase of vector distribution when some vectors remain unassigned we can use graph
algoritms to determine the owner of vector component.
=== Literature ===
1. Rob H. Bisseling  Parallel Scientific computation. (chapter 4).
