From: Juwei Shi
To: mapreduce-user@hadoop.apache.org
Date: Tue, 19 Apr 2011 17:58:21 +0800
Subject: Questions about MultithreadedMapper

Hi,

I am looking at the multithreaded map task feature. The new API provides the org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper class, which runs multiple threads within each map task. The number of threads in the pool that runs the map function can be set with the setNumberOfThreads API.

Here I want to clarify the scenarios in which we should enable multithreaded map tasks. Generally, Hadoop MapReduce provides the mapred.tasktracker.map.tasks.maximum parameter to control how many map tasks run concurrently on each TaskTracker (there is a corresponding parameter for reduce tasks). We can start more child task JVMs to increase CPU utilization, so multithreaded tasks are not needed in most scenarios. However, they may be worth enabling in some specific scenarios:

1) When the workload is bound by memory or I/O, not CPU. For example, we want to load the input of the running map tasks into memory, and we can load at most 50 GB of input across the cluster, but the cluster's CPUs are not fully utilized. Then we can enable multithreaded tasks to increase CPU utilization.

2) When the tasks are unbalanced. I have encountered this problem when processing very large social graphs. If I assign 200 map tasks (on average 8 concurrent map tasks per node, 7 nodes in total), 99% of the tasks complete within 1 hour, but the remaining 1% take more than 10 hours. This is caused by the unbalanced degree distribution of the social graph. Once most tasks have completed, the CPU utilization of the nodes that are still running is below 20%. I think enabling multithreaded tasks in this situation could increase CPU utilization.
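
As a rough check on the numbers in scenario 2, here is a small back-of-the-envelope sketch in plain Java. It assumes one core per map slot and that a node is left running a single straggler once the short tasks finish; both are assumptions for illustration, not measurements from my cluster. The class and method names are hypothetical.

```java
// Back-of-the-envelope model of scenario 2. Assumptions (not measured):
// each map slot corresponds to one core, and a node runs only its
// straggler task once the short tasks have finished.
final class SkewEstimate {
    // Number of straggler tasks, given the total task count and straggler fraction.
    static long stragglerCount(int totalTasks, double stragglerFraction) {
        return Math.round(totalTasks * stragglerFraction);
    }

    // Fraction of a node's map slots that are busy.
    static double slotUtilization(int busySlots, int slotsPerNode) {
        return (double) busySlots / slotsPerNode;
    }

    public static void main(String[] args) {
        long stragglers = stragglerCount(200, 0.01); // 1% of 200 map tasks
        double utilization = slotUtilization(1, 8);  // one straggler on an 8-slot node
        System.out.println(stragglers + " stragglers, slot utilization " + utilization);
    }
}
```

Under these assumptions, a node holding one straggler uses 1 of its 8 slots, which is consistent with the observed utilization below 20%.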

My questions:

1. Is the above understanding right?

2. Why is there no multithreaded reducer interface?

3. How should I set the right number of threads? (The number that lets all cores be utilized?)

4. Some earlier articles point out that we should pay attention to thread safety when using the multithreaded mapper. I cannot quite understand this. The basic MapReduce model naturally isolates each key. I would guess that each key is processed within a single thread even when the multithreaded mapper is enabled, so how could multiple threads interact with each other?
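
To make question 4 concrete: one way threads might interact, even if each record is handled by exactly one thread, is through state they share, such as the record source or the output. The sketch below is a plain-Java model of that pattern, not Hadoop's implementation; the class and method names are hypothetical, and sizing the pool with availableProcessors() is only a starting guess for CPU-bound work (question 3).

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical model: several worker threads pull records from one shared
// source and write to one shared output. This is not Hadoop code.
final class SharedStateSketch {
    // Counts occurrences of each record using `threads` workers.
    static Map<String, Integer> countWords(List<String> records, int threads) {
        Iterator<String> source = records.iterator();            // shared, not thread-safe
        Map<String, Integer> output = new ConcurrentHashMap<>(); // shared, thread-safe
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                while (true) {
                    String record;
                    // The iterator is shared, so hasNext() and next() must
                    // happen together under one lock.
                    synchronized (source) {
                        if (!source.hasNext()) {
                            return;
                        }
                        record = source.next();
                    }
                    // merge() is atomic on ConcurrentHashMap; a plain HashMap
                    // here could lose updates under concurrent writes.
                    output.merge(record, 1, Integer::sum);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return output;
    }

    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors();
        System.out.println(countWords(Arrays.asList("a", "b", "a", "c", "a"), threads));
    }
}
```

In this model each record is still processed by a single thread, yet the code is only correct because access to the shared iterator is synchronized and the shared map is a concurrent one. Whether the same concern applies to user state inside a multithreaded mapper is exactly what I would like to confirm.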

Discussion and comments are welcome!


--
- Juwei