hadoop-common-user mailing list archives

From Rob Stewart <robstewar...@gmail.com>
Subject clarifying same-node parallelism options
Date Fri, 03 Feb 2012 01:31:08 GMT

I'd like to clarify the difference between two Hadoop parameters, and
the MultithreadedMapper class relating to same-node parallelism.
Please correct my assumptions if they are wrong:

-- mapreduce.tasktracker.[map/reduce].tasks.maximum
This is the maximum number of tasks that a task tracker will host at
any one time. As each task is by default assigned its own JVM, it
could equally be said that this parameter states the maximum number
of JVMs that could be running on a slave node at any one time (in
addition to the TaskTracker JVM itself). Is that correct?

-- mapreduce.job.jvm.numtasks
The number of tasks that a JVM will sequentially accept and execute
before it is killed. A value of -1 means an indefinite number of
tasks.

1) What effect does setting the number of tasks per JVM to -1 have?
Does such a JVM greedily consume all incoming tasks to the node? What
if you set the maximum number of tasks to, say, 10, but also set the
maximum number of tasks per JVM to -1? Will the first 10 incoming
tasks spawn 10 JVMs, or will the first incoming task spawn one JVM,
which then greedily consumes the remaining 9 as well? What about a
node that is to receive 100 tasks for a job? If you set
mapreduce.job.jvm.numtasks=-1 and
mapreduce.tasktracker.map.tasks.maximum=10, will 10 JVMs be
created initially, which then stay alive for the duration, with no
more JVMs being created?
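
To make the "10 JVMs that stay alive" reading concrete, here is a toy
model in plain Java. It is not Hadoop code and does not confirm what
the TaskTracker really does. It assumes that tasks are dealt
round-robin to slots and that each slot restarts its JVM after a fixed
number of tasks. The class and method names (JvmReuseModel,
jvmsStarted) are made up for illustration.

```java
// Toy model only: not Hadoop code. Assumes tasks are dealt round-robin to
// slots and that each slot restarts its JVM after numTasksPerJvm tasks.
final class JvmReuseModel {

    /**
     * @param totalTasks     tasks the node receives for one job
     * @param slots          stands in for mapreduce.tasktracker.map.tasks.maximum
     * @param numTasksPerJvm stands in for mapreduce.job.jvm.numtasks (-1 = no limit)
     * @return number of JVMs this model would start
     */
    static int jvmsStarted(int totalTasks, int slots, int numTasksPerJvm) {
        if (totalTasks < 0 || slots <= 0 || numTasksPerJvm == 0 || numTasksPerJvm < -1) {
            throw new IllegalArgumentException("invalid arguments");
        }
        int jvms = 0;
        for (int slot = 0; slot < slots; slot++) {
            // Round-robin: the first (totalTasks % slots) slots get one extra task.
            int tasksInSlot = totalTasks / slots + (slot < totalTasks % slots ? 1 : 0);
            if (tasksInSlot == 0) {
                continue; // a slot with no work never starts a JVM
            }
            if (numTasksPerJvm == -1) {
                jvms += 1; // one JVM per slot, reused for every task in that slot
            } else {
                jvms += (tasksInSlot + numTasksPerJvm - 1) / numTasksPerJvm; // ceiling division
            }
        }
        return jvms;
    }

    public static void main(String[] args) {
        System.out.println(jvmsStarted(100, 10, -1)); // unlimited reuse
        System.out.println(jvmsStarted(100, 10, 1));  // no reuse
        System.out.println(jvmsStarted(100, 10, 3));  // partial reuse
    }
}
```

Under those assumptions, 100 tasks on 10 slots give 10 JVMs with
unlimited reuse and 100 JVMs with no reuse. Whether Hadoop behaves
this way is exactly what I am asking.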

2) When would you want to set mapreduce.job.jvm.numtasks to more than
1? A very large number of very small map tasks, perhaps?

3) Does setting mapreduce.job.jvm.numtasks to more than 1 implement
some sort of queueing mechanism for each JVM, i.e. some task queue
that has its own scheduler accepting tasks from the task tracker? Or
is it simply a case of the JVM not dying once it has evaluated a
task, and instead signalling that it is ready for another?

4) I'm not sure I understand the purpose of the MultithreadedMapper
class. Imagine a cluster of 8-core slave nodes. What is the
difference between using a single-threaded Hadoop environment with
the maximum number of tasks set to 8, versus using MultithreadedMapper
with the maximum number of threads set to 8 and the maximum number of
tasks set to 1? In either case, you'd be evaluating 8 map tasks in
parallel on the multicore node: one by using 8 JVMs as OS processes,
the other by using 8 threads in one JVM, i.e. 1 OS process. Are there
separate cases for arguing both approaches? I have read that Hadoop is
not thread-safe, though I'm not sure of the implications of this. Is
it a performance penalty or, worse, does it make your code
non-deterministic, or wreak havoc with platform stability?
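
To show the kind of non-determinism I have in mind for the
8-threads-in-one-JVM case, here is a sketch using only the JDK. It
does not use MultithreadedMapper or any Hadoop class. The names
(SharedStateSketch, run, unsafeCount, safeCount) are hypothetical, and
it assumes the worker threads share some mutable state, which may or
may not be true of a real mapper.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Plain-JDK illustration; none of this is Hadoop API.
final class SharedStateSketch {
    static int unsafeCount = 0;                            // unsynchronized shared state
    static final AtomicLong safeCount = new AtomicLong();  // thread-safe alternative

    /** Runs `threads` workers that each "process" `recordsPerThread` records. */
    static void run(int threads, int recordsPerThread) {
        unsafeCount = 0;
        safeCount.set(0);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < recordsPerThread; i++) {
                    unsafeCount++;               // read-modify-write race: updates can be lost
                    safeCount.incrementAndGet(); // atomic: no update is lost
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
    }

    public static void main(String[] args) {
        run(8, 100_000);
        System.out.println("unsafe counter: " + unsafeCount);     // may differ between runs
        System.out.println("safe counter:   " + safeCount.get()); // always threads * records
    }
}
```

With 8 separate JVMs this kind of shared-memory race cannot happen
between tasks, since each process has its own heap. With 8 threads in
one JVM it can, unless the shared state is protected.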


