# hama-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hama Wiki] Update of "SpMV" by Mikalai Parafeniuk
Date Wed, 27 Jun 2012 08:24:54 GMT
```Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hama Wiki" for change notification.

The "SpMV" page has been changed by Mikalai Parafeniuk:
http://wiki.apache.org/hama/SpMV?action=diff&rev1=5&rev2=6

=== Problem description ===
As a sequential problem, SpMV is almost trivial. The parallel version, however, raises several issues:
  1. Partitioning of matrix and vector components. We must split the input matrix and vectors across peers if we want to benefit from a parallel algorithm. The partitioning must be chosen wisely, or communication time will grow so much that the algorithm becomes inefficient.
  2. Load balancing. Each peer must perform nearly the same amount of work, and none of them should idle.
  3. Bounded communication. The partitioning for parallel SpMV should keep the communication volume within appropriate bounds independently of the sparsity patterns of the input matrix and vector.
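As a minimal illustration of the partitioning and load-balancing concerns above (this is my own sketch, not the Hama implementation), the following assigns matrix rows to peers cyclically and counts the non-zeros each peer owns, a rough proxy for its workload:

```python
def cyclic_row_partition(rows, num_peers):
    """Map each row index to a peer: row i goes to peer i % num_peers."""
    return {i: i % num_peers for i in range(rows)}

def peer_loads(matrix, num_peers):
    """Count the non-zeros owned by each peer under cyclic row partitioning."""
    owner = cyclic_row_partition(len(matrix), num_peers)
    loads = [0] * num_peers
    for i, row in enumerate(matrix):
        loads[owner[i]] += sum(1 for x in row if x != 0)
    return loads

# A 4x4 sparse matrix distributed over 2 peers.
A = [
    [4, 0, 0, 1],
    [0, 3, 0, 0],
    [2, 0, 5, 0],
    [0, 0, 0, 6],
]
print(peer_loads(A, 2))  # peer 0 owns rows 0 and 2, peer 1 owns rows 1 and 3
```

If the loads differ greatly, some peers idle while others work, which is exactly the imbalance a wiser partitioning must avoid.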

=== Implementation tips ===
  1. Order of distribution and representation. We have two choices here: represent the matrix first and distribute it later, or distribute the matrix first and represent it later. In the first case (represent first, distribute later) all simple operations become non-local and bring unnecessary overhead. In the second case (distribute first, represent later) all local operations on a processor remain local: the algorithm first determines the responsible processor, which then performs the operation locally. That is why I prefer the distribute-first, represent-later approach.
  2. Data transmission direction. Here we also have two choices: deliver a vector component to the processor that possesses the corresponding non-zero matrix component, or vice versa. In most cases the number of non-zero items in the matrix is much larger than the vector length, which is why we prefer to transmit the vector.
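The distribute-first, represent-later idea can be sketched as follows (an illustrative sketch with hypothetical names, not Hama's API): each peer receives its share of (row, col, value) triplets and only then builds a local representation, remapping global column indices to compact local ones.

```python
def build_local_representation(triplets):
    """From a peer's triplets, build local storage plus a global->local column map."""
    col_map = {}   # global column index -> local column index
    local = []     # (row, local_col, value) entries, all-local after remapping
    for r, c, v in triplets:
        if c not in col_map:
            col_map[c] = len(col_map)
        local.append((r, col_map[c], v))
    return local, col_map

# Triplets assigned to one peer; global columns 7 and 2 become local 0 and 1.
mine = [(0, 7, 4.0), (1, 2, 3.0), (1, 7, 1.0)]
local, col_map = build_local_representation(mine)
print(col_map)  # {7: 0, 2: 1}
```

Because the representation is built after distribution, every lookup during local computation stays on the owning processor.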

=== Algorithm description ===
The generic algorithm is divided into a distribution step and three supersteps:
  0. Matrix and vector distribution.
  1. Fanout.
  2. Local computation.
  3. Fanin.
In the fanout phase every processor receives the components of v it needs. In the local computation phase each processor computes its local contribution to the result vector. In the fanin phase all local contributions are sent to the owner of u. Most of the effort should go into choosing the right matrix and vector distribution, because it determines the communication volume of the fanout and fanin phases. As a base implementation of the distribution I propose a Cartesian (column mappings are independent of row mappings and vice versa) block-cyclic distribution with a cyclic distribution of the matrix diagonal. I also assume that distr(u) != distr(v), which gives us more freedom in optimising the vector distribution. This type of distribution has several advantages: it is simple, the fanin needs communication only within a processor column, the fanout needs communication only within a processor row, and we can easily predict the performance of the algorithm. After the matrix distribution we distribute the vector greedily: each processor grabs items until it reaches its optimum state. After that stage some vector components (nearly 10%) can remain unassigned; we distribute them greedily too. To support efficient local computation, the data structure used should provide the indices of the rows and columns that contain a non-zero item. Local computation must be performed with local indices.
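The three supersteps can be simulated in a single process as follows (names and structure are mine, for illustration only, not Hama's BSP API): fanout of the needed v components, local computation of partial sums, and fanin of those sums to the owner of each u component.

```python
def spmv_bsp(triplets_per_peer, v, u_owner):
    """triplets_per_peer: per-peer lists of (row, col, value).
    v: dense input vector.  u_owner: function mapping a row to its owning peer."""
    # Superstep 1 (fanout): each peer fetches only the v components it needs.
    local_v = []
    for triplets in triplets_per_peer:
        needed = {c for _, c, _ in triplets}
        local_v.append({c: v[c] for c in needed})
    # Superstep 2 (local computation): partial sums per result row.
    partials = []  # messages: (destination peer, row, contribution)
    for p, triplets in enumerate(triplets_per_peer):
        contrib = {}
        for r, c, val in triplets:
            contrib[r] = contrib.get(r, 0.0) + val * local_v[p][c]
        for r, s in contrib.items():
            partials.append((u_owner(r), r, s))
    # Superstep 3 (fanin): owners accumulate the contributions into u.
    u = {}
    for dest, r, s in partials:
        u[r] = u.get(r, 0.0) + s
    return u

# 2 peers, each owning two rows of a 4x4 matrix; u is distributed cyclically.
peers = [
    [(0, 0, 4.0), (0, 3, 1.0), (2, 0, 2.0), (2, 2, 5.0)],  # peer 0
    [(1, 1, 3.0), (3, 3, 6.0)],                             # peer 1
]
v = [1.0, 2.0, 3.0, 4.0]
u = spmv_bsp(peers, v, lambda r: r % 2)
print(u)  # {0: 8.0, 2: 17.0, 1: 6.0, 3: 24.0}
```

Note that the communication volume of steps 1 and 3 depends entirely on which v components each peer needs and where the u components live, which is why the distribution choice dominates performance.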

=== Possible improvements ===
  1. Implementation of the Mondrian distribution. In most cases it gives better results than the block-cyclic Cartesian scheme.
  2. An algorithm for detecting matrix sparsity patterns. This would let us determine, for example, whether a matrix is a random sparse matrix or a Laplacian matrix. For random sparse matrices we can use distribution patterns that are independent of the sparsity pattern; for Laplacian matrices a diamond distribution can give better results.
  3. In the vector distribution phase, when some vector components remain unassigned, we can use graph algorithms to determine the owner of each component.

=== Literature ===
  1. Rob H. Bisseling - Parallel Scientific Computation (chapter 4).
  2. Steve Rennich - Block SpMV on GPU.

```