giraph-user mailing list archives

From Matthew Cornell
Subject How do I control which tasks run on which hosts?
Date Mon, 29 Sep 2014 16:04:29 GMT
Hi Folks,

I have a small CDH4 cluster of five hosts (four compute nodes and a head
node - call them 0-3 and 'w') where hosts 0-3 have 4 cores and 16GB RAM
each, and 'w' has 32 cores and 64GB RAM. All five hosts are running
mapreduce tasktracker services, and 'w' is also running the jobtracker.
Resources are tight for my particular Giraph application (a kind of
path-finding), and I've discovered that some assignments of tasks to
hosts work better than others. My command specifies four workers:

    hadoop jar $GIRAPH_HOME/giraph-ex.jar org.apache.giraph.GiraphRunner \
    -libjars ${LIBJARS} \
    relpath.RelPathVertex \
    -wc relpath.RelPathWorkerContext \
    -mc relpath.RelPathMasterCompute \
    -vif relpath.JsonAdjacencyListVertexInputFormat \
    -vip $REL_PATH_INPUT \
    -of relpath.JsonAdjacencyListTextOutputFormat \
    -op $REL_PATH_OUTPUT \
    -ca RelPathVertex.path=$REL_PATH_PATH \
    -w 4

When Giraph (ZooKeeper?) puts three or more of the Giraph map tasks on 'w'
(e.g., 01www or 1wwww), that host maxes out RAM, CPU, and swap, and
the job hangs. However, when the system spreads the work out more evenly so
that 'w' has two or fewer tasks (e.g., 123ww or 0321w), the job
finishes fine.
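
To make the shorthand above concrete: each string appears to list one host
label per map task (five characters, presumably the four workers plus the
master, though that reading is an assumption). The sketch below is only an
illustration, not part of the job itself. The function names `tasks_on_host`
and `predict_outcome` are made up here, and the threshold of three is just what
was observed on this cluster, not a general rule.

```shell
#!/bin/sh
# Hypothetical helpers for reading the placement shorthand (e.g. "01www").

# Print how many tasks in a placement string ran on the given host.
#   $1 = placement string, one character per map task
#   $2 = single-character host label (0-3 or w)
tasks_on_host() {
    count=$(printf '%s' "$1" | tr -cd "$2" | wc -c)
    # Arithmetic expansion strips any padding that wc adds on some systems.
    echo $(($count))
}

# Print "hangs" when three or more tasks share 'w', otherwise "finishes".
# This only encodes the observation reported for this cluster.
predict_outcome() {
    if [ "$(tasks_on_host "$1" w)" -ge 3 ]; then
        echo "hangs"
    else
        echo "finishes"
    fi
}

# Summarize the four placements mentioned in the message.
for placement in 01www 1wwww 123ww 0321w; do
    echo "$placement: $(tasks_on_host "$placement" w) on w, $(predict_outcome "$placement")"
done
```

Run against the four placements mentioned, this reports 3 and 4 tasks on 'w'
for the two that hang, and 2 and 1 for the two that finish.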

My questions are: 1) what program decides the task-to-host assignment, and
2) how do I control it? Thanks very much!

Matthew Cornell | 413-626-3621 | 34 Dickinson Street, Amherst MA 01002
