giraph-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Eli Reisman" <>
Subject Review Request: Regulating the size of outgoing Collections of BasicVertex over RPC during INPUT_SUPERSTEP by amount of vertices instead of edges creates large variation in message size, can overwhelm Netty buffers.
Date Wed, 18 Jul 2012 00:31:28 GMT

This is an automatically generated e-mail. To reply, visit:

Review request for giraph.


Variable and Configuration option GiraphJob.MAX_VERTICES_PER_PARTITION was unclearly named.
This config setting has nothing to do with regulating partition size, but in fact regulated
the size of collections of vertices (temporarily stored in Partition objects) that were to
be sent over RPC to their assigned host for the calculation super steps. By regulating these
sizes based on # of vertices rather than # of edges per collection sent, the message sizes
varied greatly, and in metrics based testing would overwhelm Netty buffers and cause failure
of otherwise acceptable large jobs on my cluster, given my memory and worker constraints.
This patch:

1. renames vars/config options to clearer titles that indicate they have to do with outgoing
data burst size at the host who reads the data from InputSplit; the vertices in the data bursts
are assembled into their true Partitions after traveling to the remote host where the master
has assigned them according to partitionID. This greatly adds to the clarity of this very
important section of code

2. Adds -Dgiraph.maxEdgesPerTransfer to regulate the data burst sizes by # of edges in a given
temporary collection rather than by # of vertices, which dramatically increased throughput
and kept healthy jobs from being overwhelmed at any point by a large RPC data burst during
INPUT_SUPERSTEP. This repeatedly made a huge difference in the amount of data I could run
in a successful job without failures, using the same memory and worker constraints I was previously
running on "trunk".




July 14, 15, 16th on cluster with many times the data throughput I could achieve without it
using the same memory/worker constraints. Passes 'mvn verify' etc.


Eli Reisman

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message