spark-user mailing list archives

From Igor Berman <igor.ber...@gmail.com>
Subject Re: Question about GraphX connected-components
Date Sat, 10 Oct 2015 18:06:04 GMT
Let's start with some basics: you might need to split your data into more
partitions.
Spilling depends on your configuration when you create the graph (look for the
storage level parameters) and on your global configuration.
In addition, your 64GB/100M assumption is probably wrong, since Spark divides
memory into 3 regions: one for in-memory caching, one for shuffling, and one as
a "workspace" for serialization/deserialization etc. See the fraction
parameters.
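For example, something along these lines (the HDFS path and partition count are
placeholders, and this sketch assumes you build the graph from an edge list with
GraphLoader):

import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}
import org.apache.spark.storage.StorageLevel

// Load the edge list with more partitions and MEMORY_AND_DISK storage levels,
// so the graph can spill to disk instead of failing with OOM.
val graph = GraphLoader.edgeListFile(
    sc,
    "hdfs:///path/to/edges",                           // placeholder path
    numEdgePartitions = 512,                           // tune: several partitions per core
    edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
    vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)
  .partitionBy(PartitionStrategy.EdgePartition2D)

// then run the algorithm as usual
val components = graph.connectedComponents().vertices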

So depending on your number of partitions, a worker may try to ingest too much
data at once (#cores * memory pressure of one task per partition).
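Rough arithmetic for the cluster you describe (8GB heap, 4 cores per node),
assuming the default storage/shuffle fractions of Spark 1.x; check the actual
values in your own config:

// Back-of-envelope estimate, assuming spark.storage.memoryFraction = 0.6
// and spark.shuffle.memoryFraction = 0.2 (the defaults).
val executorHeapGB   = 8.0                                      // YARN memory per node
val coresPerExecutor = 4                                        // concurrent tasks per executor
val cacheGB          = executorHeapGB * 0.6                     // ~4.8 GB for cached vertices/edges
val shufflePerTaskGB = executorHeapGB * 0.2 / coresPerExecutor  // ~0.4 GB shuffle space per task
// So each task really has a few hundred MB to work with, not the full
// 640 bytes/vertex budget; more (smaller) partitions keep each task under that.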

There is no such thing as the "right" configuration; it depends on your
application. You can post your configuration and people will suggest some
tuning, but the best way is to try what works for your case based on what you
see in the Spark UI metrics (as a starting point).
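As one possible starting point (the numbers are illustrative only, not a
recommendation, so tune them against what the UI shows):

import org.apache.spark.SparkConf
import org.apache.spark.graphx.GraphXUtils

val conf = new SparkConf()
  .set("spark.executor.memory", "7g")              // leave headroom under the 8 GB YARN cap
  .set("spark.storage.memoryFraction", "0.4")      // shrink the cache region ...
  .set("spark.shuffle.memoryFraction", "0.4")      // ... to give shuffles more room
  .set("spark.default.parallelism", "256")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
GraphXUtils.registerKryoClasses(conf)              // register GraphX classes with Kryo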

On 10 October 2015 at 00:13, John Lilley <john.lilley@redpoint.net> wrote:

> Greetings,
>
> We are looking into using the GraphX connected-components algorithm on
> Hadoop for grouping operations.  Our typical data is on the order of
> 50-200M vertices with an edge:vertex ratio between 2 and 30.  While there
> are pathological cases of very large groups, most groups tend to be small.  I am
> trying to get a handle on the level of performance and scaling we should
> expect, and how to best configure GraphX/Spark to get there.  After some
> trying, we cannot get to 100M vertices/edges without running out of memory
> on a small cluster (8 nodes with 4 cores and 8GB available for YARN on each
> node).  This limit seems low, as 64GB/100M is 640 bytes per vertex, which
> should be enough.  Is this within reason?  Does anyone have a sample they can
> share that has the right configurations for succeeding with this size of
> data and cluster?  What level of performance should we expect?  What
> happens when the data set exceeds memory, does it spill to disk “nicely” or
> degrade catastrophically?
>
>
>
> Thanks,
>
> *John Lilley*
>
>
>
