spark-user mailing list archives

From John Lilley <>
Subject Question about GraphX connected-components
Date Fri, 09 Oct 2015 21:13:47 GMT
We are looking into using the GraphX connected-components algorithm on Hadoop for grouping
operations. Our typical data is on the order of 50-200M vertices with an edge:vertex ratio
between 2 and 30. While there are pathological cases of very large groups, groups tend to be
small. I am trying to get a handle on the level of performance and scaling we should expect,
and on how best to configure GraphX/Spark to get there. After some experimentation, we cannot
get past 100M vertices/edges without running out of memory on a small cluster (8 nodes with
4 cores and 8GB available for YARN on each node). This limit seems low: 64GB/100M is 640 bytes
per vertex, which should be enough. Is this within reason? Does anyone have a sample they
can share with the right configuration for succeeding with this size of data and cluster?
What level of performance should we expect? And when the data set exceeds memory,
does it spill to disk "nicely" or degrade catastrophically?
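For readers landing on this thread: the kind of submission settings usually tuned for a
workload like this can be sketched as below. This is a minimal, hypothetical starting point
for the cluster described above (8 nodes, 4 cores / 8GB YARN each), not a verified
configuration; the driver class name is a placeholder, and the memory and parallelism values
are assumptions to be adjusted empirically.

```shell
# Hypothetical spark-submit for an 8-node YARN cluster (4 cores / 8 GB per node).
# All values are illustrative starting points, not a verified configuration.
# com.example.ConnectedComponentsJob is a placeholder driver class that would
# call GraphX's graph.connectedComponents() internally.
spark-submit \
  --master yarn \
  --num-executors 8 \
  --executor-cores 4 \
  --executor-memory 6g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.default.parallelism=64 \
  --class com.example.ConnectedComponentsJob \
  job.jar
```

The 6g executor heap leaves headroom under the 8GB YARN allocation for the off-heap
overhead YARN adds per container; Kryo serialization is commonly suggested for GraphX
because vertex and edge records are small and numerous.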

John Lilley
