I am experimenting with a memory-intensive Giraph application on top of a large graph (50 million nodes), on a 14 node cluster.
When setting the number of workers to a large number (500 in this example), I get errors for not being able to fulfill the number of requested workers (Please see the log excerpt below). To my understanding, this contradicts with how Yarn/MR map tasks operate, as if the number of map tasks is more than what is currently available in terms of resources, only a subset of the maps are started, and new ones are assigned as new slots become available. In other words, as many map tasks as possible can run concurrently, and new ones are run as resources become available. Is not this the case with Giraph workers? I expect it to be the case, since workers are basically map tasks, so the same should apply to them. However, the log below suggests otherwise, as based on my resources, 37 map tasks (workers) could be created, but the application could not proceed without creating all the 500 workers. Could you please help explaining what is causing this?
Only found 37 responses of 500 needed to start superstep -1. Reporting every 30000 msecs, 296929 more msecs left before giving up.
2015-01-20 01:29:49,007 ERROR [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: checkWorkers: Did not receive enough processes in time (only 37 of 500 required) after waiting 600000msecs). This occurs if you do not have enough map tasks available simultaneously on your Hadoop instance to fulfill the number of requested workers.
2015-01-20 01:29:49,015 FATAL [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: failJob: Killing job job_1421703431598_0006
2015-01-20 01:29:49,015 FATAL [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: failJob: exception java.lang.IllegalStateException: Not enough healthy workers to create input splits