Mailing-List: contact mapreduce-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mapreduce-user@hadoop.apache.org
MIME-Version: 1.0
Date: Tue, 19 Apr 2011 17:58:21 +0800
Message-ID: <BANLkTimBc9+kBVREgW9RvpyZBf4v0BghrQ@mail.gmail.com>
Subject: Questions about MultithreadedMapper
From: Juwei Shi <shijuwei@gmail.com>
To: mapreduce-user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=90e6ba5bb9dd5685e504a1428cc9

--90e6ba5bb9dd5685e504a1428cc9
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable

Hi,


I am looking at the multithreaded map task feature. The new API provides
the org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper class, which
runs the map function on multiple threads within each map task. The number
of threads in the pool that runs the map function can be set with the
setNumberOfThreads API.
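
For reference, here is a minimal sketch of how I understand such a job is
configured. The LineLengthMapper class, the job name, and the thread count
of 4 are hypothetical placeholders chosen for illustration; only the
MultithreadedMapper and Job calls come from the Hadoop API. Older releases
may need "new Job(conf, name)" instead of Job.getInstance.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultithreadedJobSketch {

    // Hypothetical mapper (not from the original mail):
    // emits (line length, 1) for every input line.
    public static class LineLengthMapper
            extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(new IntWritable(line.getLength()), new IntWritable(1));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "multithreaded-map-sketch");
        job.setJarByClass(MultithreadedJobSketch.class);

        // The framework runs MultithreadedMapper as the task's mapper ...
        job.setMapperClass(MultithreadedMapper.class);
        // ... and MultithreadedMapper delegates to the real mapper class.
        MultithreadedMapper.setMapperClass(job, LineLengthMapper.class);
        // Size of the thread pool per map task; 4 is an arbitrary value.
        MultithreadedMapper.setNumberOfThreads(job, 4);

        job.setNumReduceTasks(0); // map-only job keeps the sketch short
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```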


Here I want to clarify the scenarios in which we should enable
multithreaded map tasks. Generally, Hadoop MapReduce provides the
mapred.tasktracker.map.tasks.maximum parameter to control the number of
concurrent map tasks per node (there is a corresponding parameter for
reduce tasks). We can start more child task JVMs to increase CPU
utilization, so we do not need multithreaded tasks in most scenarios.
However, multithreaded tasks may be useful in the following scenarios:

1)      When the workload is bound by memory or I/O, not CPU. For example,
we want to load the input of each running map task into memory, and we can
load at most 50 GB of input into the cluster at a time, but the CPU of the
cluster is not fully utilized. Then we can enable multithreaded tasks to
increase the CPU utilization.

2)      When the tasks are unbalanced. I have encountered this problem when
processing very large social graphs. If I assign 200 map tasks (on average
8 concurrent map tasks per node, 7 nodes in total), 99% of the tasks
complete within 1 hour, but the remaining 1% take more than 10 hours. This
is caused by the unbalanced degree distribution of the social graph. The
CPU utilization of the nodes that are still running is lower than 20% once
most tasks have completed. I think that we can enable multithreaded tasks
in this case to increase the CPU utilization.
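
The numbers in scenario 2 can be worked through with a small sketch. It
assumes that every node has exactly 8 map slots and that one slot keeps
roughly one core busy; both are simplifications of mine, and the class and
method names are hypothetical.

```java
public class SkewArithmetic {

    // Total map slots in the cluster: nodes * concurrent map tasks per node.
    public static int totalSlots(int nodes, int slotsPerNode) {
        return nodes * slotsPerNode;
    }

    // Number of long-running tasks, given their share of all tasks in percent.
    public static int stragglers(int tasks, int stragglerPercent) {
        return tasks * stragglerPercent / 100;
    }

    // Fraction (0.0 to 1.0) of slots in use while only `busy` tasks run.
    public static double busyFraction(int busy, int slots) {
        return (double) busy / slots;
    }

    public static void main(String[] args) {
        int slots = totalSlots(7, 8);   // 7 nodes, 8 slots each
        int slow = stragglers(200, 1);  // 1% of 200 tasks
        System.out.println("map slots in the cluster: " + slots);
        System.out.println("long-running tasks: " + slow);
        System.out.println("cluster-wide share of busy slots: "
                + busyFraction(slow, slots));
        System.out.println("share of busy slots on a node with one straggler: "
                + busyFraction(1, 8));
    }
}
```

Under these assumptions, 2 of 56 slots (roughly 3.6%) stay busy after the
first hour, and a node that holds one straggler uses 1 of its 8 slots
(12.5%), which is consistent with the utilization below 20% that I observe.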


My questions:

1.       Is the above understanding right?

2.       Why is there no multithreaded reducer interface?

3.       How should the number of threads be set? (Is it the number that
keeps all cores utilized?)

4.       I have seen some earlier articles point out that we should pay
attention to thread safety when using the multithreaded mapper. I cannot
quite understand this. The basic MapReduce model naturally isolates each
key. I guess that each key is processed within a single thread even if we
enable the multithreaded mapper, so how could multiple threads interact
with each other?
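
To make question 4 concrete, below is a minimal sketch in plain Java, not
Hadoop code. It assumes that the map function touches state that lives
outside the record, such as a static field; the class, the counters, and
the runAll helper are hypothetical names. Each record is still handled by
exactly one thread, yet the threads meet at the shared counter.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class SharedStateSketch {

    // Hypothetical state reachable from every map thread.
    public static long plainCount = 0;                     // not thread safe
    public static final AtomicLong atomicCount = new AtomicLong();

    // Stand-in for a map() call on one record.
    static void map(int record) {
        plainCount++;                  // read-modify-write, updates can be lost
        atomicCount.incrementAndGet(); // atomic, no update is lost
    }

    // Processes `records` records on `threads` threads; thread t takes
    // records t, t + threads, t + 2 * threads, and so on.
    public static void runAll(int records, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int first = t;
            pool.execute(() -> {
                for (int r = first; r < records; r += threads) {
                    map(r);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("map threads did not finish");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        runAll(1_000_000, 8);
        System.out.println("atomic counter: " + atomicCount.get());
        System.out.println("plain counter:  " + plainCount);
    }
}
```

The atomic counter always ends at the number of records, while the plain
counter may end lower because concurrent increments can overwrite each
other. The same concern would apply to any shared cache, buffer, or
non-thread-safe library object that a mapper reaches from several threads.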


Discussion and comments are welcome!

--
- Juwei

--90e6ba5bb9dd5685e504a1428cc9--