giraph-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Claudio Martella (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GIRAPH-436) Improvement of partition selection for out-of-core graph
Date Sat, 22 Dec 2012 12:07:12 GMT

    [ https://issues.apache.org/jira/browse/GIRAPH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13538765#comment-13538765
] 

Claudio Martella commented on GIRAPH-436:
-----------------------------------------

You're totally right Alessandro, that's a good point. I was thinking about investigating this
exact statistic, but with 1M vertices for example, the likelihood to have a partition with
no active vertices in SSSP is very very low. Probably, the best way would be to implement
it this way, and act on the data partitioning. For example, with some topology partitioning
and such (FB for example could use the country of residence of each user?), it should be more
likely that inactive vertices are partitioned together. 
                
> Improvement of partition selection for out-of-core graph
> --------------------------------------------------------
>
>                 Key: GIRAPH-436
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-436
>             Project: Giraph
>          Issue Type: Improvement
>          Components: graph
>    Affects Versions: 0.2.0
>            Reporter: Claudio Martella
>            Priority: Minor
>
> Currently, the OOC graph component needs a parameter K that describes how many partitions
should be kept in memory out of N (the number of partitions assigned to that worker). Hence
N - K partitions are constantly scanned and from/to disk.
> Now, as discussed on GIRAPH-249, for algorithms were the all the vertices are NOT always
active, e.g. SSSP, this can have an impact on performance, as we might get to (a) scan a partition
on disk with very few active vertices (b) scan a partition on disk with NO active vertices
(c) scan a partition on disk with all active vertices while we have some partitions with NO
active vertices in memory. The impact of OOC graph on other algorithms, e.g. PR-like algorithms,
is less important, as the cost of hitting the disk is completely amortized (each IO byte is
actually computed as all the vertices are active) and efficient (scan-based).
> For the affected class of algorithms, two contributions are possible:
> 1) add a header to the partition file with the stats (a boolean hasActive field) about
active vertices in the partition. If that partition hasn't received messages AND all the vertices
are inactive, it can be skipped completely. Solves for problem (a). Cost: one random IO to
the header at the end of each OOC partition computation.
> 2) greedy heuristic that tries to keep in memory only K partitions, when possible, that
have hasActive set to true. Once an inactive partition is spilled to disk, and it stays to
disk, it won't be scanned/read back to memory unless necessary (works with contribution (1)).
This accounts for problem (c).
> As far as (b) is concerned, there's not much one can do without breaking linear scanning.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message