giraph-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Avery Ching <>
Subject Re: Out of memory with giraph-release-1.0.0-RC3, used to work on old Giraph
Date Wed, 04 Sep 2013 18:18:20 GMT
The amount of memory for the send message cache is per worker =
number of compute threads * number of workers * size of the cache.

The number of partitions doesn't affect the memory usage very much.  My 
advice would be to dial down the cache size a bit with MAX_MSG_REQUEST_SIZE.


On 9/4/13 3:33 AM, Lukas Nalezenec wrote:
> Thanks,
> I was not sure if it really works as I described.
> > Facebook can't be using it like this if, as described, they have 
> billions of vertices and a trillion edges.
> Yes, its strange. I guess configuration does not help so much on large 
> cluster. What might help are properties of input data.
> > So do you, or Avery, have any idea how you might initialize this is 
> a more reasonable way, and how???
> Fast workaround is to set number of partitions to from W^2 to W or 2*W 
> .  It will help if you dont have very large number of workers.
> I would not change MAX_*_REQUEST_SIZE much since it may hurt performance.
> You can do some preprocessing before loading data to Giraph.
> How to change Giraph:
> The caches could be flushed if total sum of vertexes/edges in all 
> caches exceeds some number. Ideally, it should prevent not only 
> OutOfMemory errors but also raising high water mark. Not sure if it 
> (preventing raising HWM) is easy to do.
> I am going to use almost-prebuild partitions. For my use case it would 
> be ideal to detect if some cache is abandoned and i would not be used 
> anymore. It would cut memory usage in caches from ~O(n^3) to ~O(n).  
> It could be done by counting number of cache flushes or cache 
> insertions and if some cache was not touched for long time it would be 
> flushed.
> There could be separated configuration MAX_*_REQUEST_SIZE for per 
> partition caches during loading data.
> I guess there should be simple but efficient way how to trace memory 
> high-water mark. It could look like:
> Loading data: Memory high-water mark: start: 100 Gb end: 300 Gb
> Iteration 1 Computation: Memory high-water mark: start: 300 Gb end: 300 Gb
> Iteration 1 XYZ ....
> Iteration 2 Computation: Memory high-water mark: start: 300 Gb end: 300 Gb
> .
> .
> .
> Lukas
> On 09/04/13 01:12, Jeff Peters wrote:
>> Thank you Lukas!!! That's EXACTLY the kind of model I was building in 
>> my head over the weekend about why this might be happening, and why 
>> increasing the number of AWS instances (and workers) does not solve 
>> the problem without increasing each worker's VM. Surely Facebook 
>> can't be using it like this if, as described, they have billions of 
>> vertices and a trillion edges. So do you, or Avery, have any idea how 
>> you might initialize this is a more reasonable way, and how???
>> On Mon, Sep 2, 2013 at 6:08 AM, Lukas Nalezenec 
>> < 
>> <>> wrote:
>>     Hi
>>     I wasted few days on similar problem.
>>     I guess the problem was that during loading - if you have got W
>>     workers and W^2 partitions there are W^2 partition caches in each
>>     worker.
>>     Each cache can hold 10 000 vertexes by default.
>>     I had 26 000 000 vertexes, 60 workers -> 3600 partitions. It
>>     means that there can be up to 36 000 000 vertexes in caches in
>>     each worker if input files are random.
>>     Workers were assigned 450 000 vertexes but failed when they had
>>     900 000 vertexes in memory.
>>     Btw: Why default number of partitions is W^2 ?
>>     (I can be wrong)
>>     Lukas
>>     On 08/31/13 01:54, Avery Ching wrote:
>>>     Ah, the new caches. =)  These make things a lot faster (bulk
>>>     data sending), but do take up some additional memory.  if you
>>>     look at GiraphConstants, you can find ways to change the cache
>>>     sizes (this will reduce that memory usage).
>>>     For example, MAX_EDGE_REQUEST_SIZE will affect the size of the
>>>     edge cache. MAX_MSG_REQUEST_SIZE will affect the size of the
>>>     message cache.  The caches are per worker, so 100 workers would
>>>     require 50 MB  per worker by default.  Feel free to trim it if
>>>     you like.
>>>     The byte arrays for the edges are the most efficient storage
>>>     possible (although not as performance as the native edge stores).
>>>     Hope that helps,
>>>     Avery
>>>     On 8/29/13 4:53 PM, Jeff Peters wrote:
>>>>     Avery, it would seem that optimizations to Giraph have,
>>>>     unfortunately, turned the majority of the heap into "dark
>>>>     matter". The two snapshots are at unknown points in a superstep
>>>>     but I waited for several supersteps so that the activity had
>>>>     more or less stabilized. About the only thing comparable
>>>>     between the two snapshots are the vertexes, 192561 X
>>>>     "RecsVertex" in the new version and 191995 X "Coloring" in the
>>>>     old system. But with the new Giraph 672710176 out of 824886184
>>>>     bytes are stored as primitive byte arrays. That's probably
>>>>     indicative of some very fine performance optimization work, but
>>>>     it makes it extremely difficult to know what's really out
>>>>     there, and why. I did notice that a number of caches have
>>>>     appeared that did not exist before,
>>>>     namely SendEdgeCache, SendPartitionCache, SendMessageCache
>>>>     and SendMutationsCache.
>>>>     Could any of those account for a larger per-worker footprint in
>>>>     a modern Giraph? Should I simply assume that I need to force
>>>>     AWS to configure its EMR Hadoop so that each instance has fewer
>>>>     map tasks but with a somewhat larger VM max, say 3GB instead of
>>>>     2GB?
>>>>     On Wed, Aug 28, 2013 at 4:57 PM, Avery Ching <
>>>>     <>> wrote:
>>>>         Try dumping a histogram of memory usage from a running JVM
>>>>         and see where the memory is going.  I can't think of
>>>>         anything in particular that changed...
>>>>         On 8/28/13 4:39 PM, Jeff Peters wrote:
>>>>             I am tasked with updating our ancient (circa 7/10/2012)
>>>>             Giraph to giraph-release-1.0.0-RC3. Most jobs run fine
>>>>             but our largest job now runs out of memory using the
>>>>             same AWS elastic-mapreduce configuration we have always
>>>>             used. I have never tried to configure either Giraph or
>>>>             the AWS Hadoop. We build for Hadoop 1.0.2 because
>>>>             that's closest to the 1.0.3 AWS provides us. The 8 X
>>>>             m2.4xlarge cluster we use seems to provide 8*14=112 map
>>>>             tasks fitted out with 2GB heap each. Our code is
>>>>             completely unchanged except as required to adapt to the
>>>>             new Giraph APIs. Our vertex, edge, and message data are
>>>>             completely unchanged. On smaller jobs, that work, the
>>>>             aggregate heap usage high-water mark seems about the
>>>>             same as before, but the "committed heap" seems to run
>>>>             higher. I can't even make it work on a cluster of 12.
>>>>             In that case I get one map task that seems to end up
>>>>             with nearly twice as many messages as most of the
>>>>             others so it runs out of memory anyway. It only takes
>>>>             one to fail the job. Am I missing something here?
>>>>             Should I be configuring my new Giraph in some way I
>>>>             didn't used to need to with the old one?

View raw message