giraph-dev mailing list archives

From "Eli Reisman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GIRAPH-247) Introduce edge based partitioning for InputSplits
Date Thu, 12 Jul 2012 17:25:34 GMT

    [ https://issues.apache.org/jira/browse/GIRAPH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13412974#comment-13412974 ]

Eli Reisman commented on GIRAPH-247:
------------------------------------

I'm going to try using the edge count coming off that round of the vertexReader and just clear
my data after each partition is built. If this works, I think we can avoid iterating inside
Partition.getEdgeCount() completely. Good save, Alessandro, thanks!

Will upload new patch ASAP
                
> Introduce edge based partitioning for InputSplits
> -------------------------------------------------
>
>                 Key: GIRAPH-247
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-247
>             Project: Giraph
>          Issue Type: Improvement
>          Components: graph
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-247-1.patch
>
>
> Experiments on larger input data sets while maintaining a low memory profile have revealed
that typical social graph data is very lumpy, and partitioning by vertices can easily overload
some unlucky worker nodes that end up with partitions containing highly connected vertices,
while other nodes process partitions with the same number of vertices but far fewer out-edges
per vertex. This often results in cascading failures during data load-in, even on tiny data
sets.
> By partitioning using edges (the default I set in GiraphJob.MAX_EDGES_PER_PARTITION_DEFAULT
is 200,000 edges per partition; a partition is closed when the user's input format reaches
either this limit or the old default number of vertices, whichever comes first, while reading
InputSplits) I have seen dramatic "de-lumpification" of the data, allowing the processing of
8x larger data sets before memory problems occur at a given configuration setting.
> This needs more tuning, but the patch comes with a -Dgiraph.maxEdgesPerPartition option that
can be set to allow more edges per partition as your data sets grow or your memory limitations
shrink. This might be considered a first attempt; perhaps simply allowing users to choose
between this type of partitioning and the old version as the default would be more compatible
with existing users' needs? That would not be a hard feature to add. But I think this method of
partition production has merit for the typical large-scale graph data that Giraph is designed
to process.
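For readers following along, the close-on-either-limit rule described above can be sketched in a few lines of Java. This is a standalone illustration, not Giraph's actual implementation: the class name, the split() helper, and the vertex cap of 100,000 are hypothetical; only the 200,000-edge default mirrors the figure quoted in the issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Minimal sketch (not Giraph's code) of edge-based partition splitting:
 * the current partition is closed as soon as EITHER the vertex cap or
 * the edge cap is reached, whichever the input reaches first.
 */
public class EdgeBasedSplitter {
    // Hypothetical vertex cap standing in for the "old default # of vertices".
    static final int MAX_VERTICES_PER_PARTITION = 100_000;
    // Mirrors the 200,000 default described for GiraphJob.MAX_EDGES_PER_PARTITION_DEFAULT.
    static final int MAX_EDGES_PER_PARTITION = 200_000;

    /** Returns the vertex count of each partition for a stream of per-vertex out-degrees. */
    static List<Integer> split(int[] outDegrees) {
        List<Integer> partitionSizes = new ArrayList<>();
        int vertices = 0;
        long edges = 0;
        for (int degree : outDegrees) {
            vertices++;
            edges += degree;
            // Close the partition as soon as either limit is hit.
            if (vertices >= MAX_VERTICES_PER_PARTITION || edges >= MAX_EDGES_PER_PARTITION) {
                partitionSizes.add(vertices);
                vertices = 0;
                edges = 0;
            }
        }
        if (vertices > 0) {
            partitionSizes.add(vertices); // final, possibly under-full partition
        }
        return partitionSizes;
    }

    public static void main(String[] args) {
        // A "lumpy" graph: one hub vertex with a huge out-degree among low-degree vertices.
        int[] degrees = new int[1000];
        Arrays.fill(degrees, 10);
        degrees[0] = 250_000; // the hub alone exceeds the edge cap, closing a tiny partition
        System.out.println(split(degrees)); // prints [1, 999]
    }
}
```

The point of the sketch is the "whichever comes first" check: a single hub vertex fills a partition by itself, so no one worker inherits all the hub's edges plus thousands of ordinary vertices.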

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
