giraph-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Claudio Martella (JIRA)" <>
Subject [jira] [Created] (GIRAPH-436) Improvement of partition selection for out-of-core graph
Date Wed, 21 Nov 2012 15:26:01 GMT
Claudio Martella created GIRAPH-436:

             Summary: Improvement of partition selection for out-of-core graph
                 Key: GIRAPH-436
             Project: Giraph
          Issue Type: Improvement
          Components: graph
    Affects Versions: 0.2.0
            Reporter: Claudio Martella
            Priority: Minor

Currently, the OOC graph component needs a parameter K that describes how many partitions
should be kept in memory out of N (the number of partitions assigned to that worker). Hence
N - K partitions are constantly scanned and from/to disk.
Now, as discussed on GIRAPH-249, for algorithms were the all the vertices are NOT always active,
e.g. SSSP, this can have an impact on performance, as we might get to (a) scan a partition
on disk with very few active vertices (b) scan a partition on disk with NO active vertices
(c) scan a partition on disk with all active vertices while we have some partitions with NO
active vertices in memory. The impact of OOC graph on other algorithms, e.g. PR-like algorithms,
is less important, as the cost of hitting the disk is completely amortized (each IO byte is
actually computed as all the vertices are active) and efficient (scan-based).

For the affected class of algorithms, two contributions are possible:
1) add a header to the partition file with the stats (a boolean hasActive field) about active
vertices in the partition. If that partition hasn't received messages AND all the vertices
are inactive, it can be skipped completely. Solves for problem (a). Cost: one random IO to
the header at the end of each OOC partition computation.
2) greedy heuristic that tries to keep in memory only K partitions, when possible, that have
hasActive set to true. Once an inactive partition is spilled to disk, and it stays to disk,
it won't be scanned/read back to memory unless necessary (works with contribution (1)). This
accounts for problem (c).

As far as (b) is concerned, there's not much one can do without breaking linear scanning.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see:

View raw message