hadoop-common-user mailing list archives

From Rob Stewart <robstewar...@gmail.com>
Subject Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum
Date Fri, 10 Feb 2012 13:02:21 GMT
Hi Harsh,

On 10 February 2012 12:42, Harsh J <harsh@cloudera.com> wrote:

> 4 JVMs if you have 4 tasks in your Job  (# of map tasks of a job is
> dependent on its input).
> Each JVM will then run the MultithreadedMapper code, which will then
> run 4 threads to call your map() inside of it cause you've asked that
> of it.

So the MultithreadedMapper class splits *one* map task across N
threads? How is this achieved? I wasn't aware that a map task could be
implicitly sub-divided. I was under the (false?) impression that the
purpose of MultithreadedMapper was to let N independent map tasks be
run as threads inside one JVM.

Also, from what you say, if you have map.tasks.maximum = 4 and
setNumberOfThreads(4), then in all, each compute node could be running
up to 16 map threads at any one time?
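
To make the arithmetic concrete, here is a minimal sketch of the upper
bound I have in mind. It is plain Java with no Hadoop dependency; the
class and method names are my own, purely for illustration, and it
assumes every map slot on the node is occupied by a task that uses
MultithreadedMapper with the same thread count.

```java
// Illustrative only: plain arithmetic, not part of any Hadoop API.
final class MapThreadBound {

    /**
     * Upper bound on concurrently running map() threads on one node,
     * assuming every slot runs a task with the same thread-pool size.
     */
    static int maxMapThreadsPerNode(int mapSlotsPerNode, int threadsPerTask) {
        if (mapSlotsPerNode < 0 || threadsPerTask < 1) {
            throw new IllegalArgumentException("slots must be >= 0 and threads >= 1");
        }
        // One JVM per occupied slot, threadsPerTask map() threads per JVM.
        return Math.multiplyExact(mapSlotsPerNode, threadsPerTask);
    }

    public static void main(String[] args) {
        // 4 map slots x 4 threads per task
        System.out.println(maxMapThreadsPerNode(4, 4)); // prints 16
    }
}
```

This counts only the threads that call map(); each JVM will also have
its own housekeeping threads on top of that.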

I'm trying to identify the performance penalty or benefit of achieving
node concurrency with threads rather than with multiple JVMs. I was
hoping that by setting map.tasks.maximum = 1 and
setNumberOfThreads(#cores) I would achieve that. Maybe not?
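
For what it's worth, this is how I currently picture the "one JVM,
#cores threads" setup, going by your description above. It is only a
sketch in plain Java, not Hadoop code: the class, the runOneTask helper
and the stand-in map() are all hypothetical, and the shared iterator
merely plays the role of the task's record reader.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of one map task whose records are shared by N threads.
final class SingleTaskThreadedMapSketch {

    /** Stand-in for a user map() function; here it just measures the record. */
    static int map(String record) {
        return record.length();
    }

    /** Processes one task's records with `threads` workers; returns records handled. */
    static int runOneTask(List<String> records, int threads) {
        Iterator<String> reader = records.iterator(); // shared "record reader"
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                while (true) {
                    String record;
                    // Only fetching the next record is serialized.
                    synchronized (reader) {
                        if (!reader.hasNext()) {
                            return;
                        }
                        record = reader.next();
                    }
                    map(record); // map() calls run concurrently across threads
                    processed.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return processed.get();
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            records.add("record-" + i);
        }
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(runOneTask(records, cores)); // prints 1000
    }
}
```

If that picture is right, the threads only pay off when map() itself is
the expensive part, since reading the input stays serialized.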


