hadoop-mapreduce-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Multithreaded Mapper and Map runner
Date Thu, 17 Jun 2010 04:59:14 GMT
If only a thread were created to run the mapper/reducer, how would
mapred.child.java.opts take effect?

Please refer to src/mapred/org/apache/hadoop/mapred/TaskRunner.java which is
not very long.
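
To illustrate the reasoning: JVM options such as a heap limit only apply to a process that is started with them, so a per-task option string implies a separate child JVM rather than a thread inside the TaskTracker. The following is a minimal sketch of how a launcher might assemble such a command line. It is not the actual TaskRunner code; the class name, method name, classpath, and main class are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class ChildJvmCommand {
    // Hypothetical helper (not Hadoop code): turns an option string such as the
    // value of mapred.child.java.opts into the argument list for a new JVM process.
    static List<String> build(String childJavaOpts, String classpath, String mainClass) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        // Each whitespace-separated token becomes one JVM argument.
        for (String opt : childJavaOpts.trim().split("\\s+")) {
            if (!opt.isEmpty()) {
                cmd.add(opt);
            }
        }
        cmd.add("-classpath");
        cmd.add(classpath);
        cmd.add(mainClass);
        return cmd;
    }

    public static void main(String[] args) {
        // "job.jar" and "example.ChildMain" are placeholder values for illustration.
        System.out.println(build("-Xmx200m", "job.jar", "example.ChildMain"));
    }
}
```

A list like this could be handed to java.lang.ProcessBuilder to start the child process. A thread, by contrast, would simply inherit the options of the JVM that created it.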

On Wed, Jun 16, 2010 at 9:10 PM, Jyothish Soman <jyothish.soman@gmail.com> wrote:

> I have another question, as a cross-check. Does the number set in
> mapred.tasktracker.map/reduce.tasks.maximum create that many JVM instances,
> or does it just create that many threads? I could not find any explicit
> statement about it, but everywhere it is described as if each one were a
> JVM instance.
> Please clarify.
> On Mon, Jun 14, 2010 at 2:04 AM, Jyothish Soman <jyothish.soman@gmail.com> wrote:
>> OK, I understand this part: even though the architecture of Hadoop is
>> designed for thread safety, implementation-level details make it
>> thread-unsafe.
>> Thank you for the comments. I did a good background check and figured out
>> that, staying within the Hadoop framework, the best way to manage multicore
>> is virtualization, not just simple multithreading.
>> Regards,
>> Jyothish Soman
>> On Fri, Jun 11, 2010 at 7:09 PM, Aaron Kimball <aaron@cloudera.com> wrote:
>>> This will likely break most programs you try to run. Many mapper
>>> implementations are not thread safe.
>>> That having been said, if you want to force all programs using the old
>>> API (org.apache.hadoop.mapred.*) to run on the multithreaded maprunner, you
>>> can do this by setting mapred.map.runner.class to
>>> org.apache.hadoop.mapred.lib.MultithreadedMapRunner in mapred-site.xml.
>>> Rather than do this in mapred-site.xml, it is far preferable to
>>> explicitly call jobConf.setMapRunnerClass() in the applications that require
>>> the multithreaded map runner.
>>> In the new API, the MapRunnable interface is not used. Instead the
>>> Mapper.run() method controls the execution of the map() method. For your own
>>> applications, you should subclass
>>> org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper instead of
>>> o.a.h.mapreduce.Mapper. This will provide a multithreaded run() method. I am
>>> pretty sure that you cannot independently switch out the run() layer of an
>>> existing application except by modifying its source to subclass the
>>> MultithreadedMapper.
>>> Finally, you should really ask yourself why you're doing this. If you
>>> have multi-core machines, the best way to manage parallelism is to configure
>>> Hadoop to use multiple task slots per machine. Set
>>> mapred.tasktracker.map.tasks.maximum to '8' to use eight map tasks per node
>>> (This is changed to mapreduce.tasktracker.map.tasks.maximum in 0.21+). This
>>> allows single-threaded mapper code to efficiently process multiple input
>>> splits in parallel. The only time when it's better to use multithreaded
>>> maprunners is when a specific map() process is high-latency; e.g., you're
>>> running a web crawler in a mapper, and you want to overlap requests to
>>> foreign sites. But since this is not the norm, you should generally leave
>>> things single-threaded.
>>> Hope this helps
>>> Cheers
>>> - Aaron
>>> On Fri, Jun 11, 2010 at 7:30 AM, Jyothish Soman <jyothish.soman@gmail.com> wrote:
>>>> Hi,
>>>> I am a newbie to Hadoop. I want to use the multithreaded map runner by
>>>> default, so I tried changing the MapTask.java code. It failed to compile
>>>> with ant because of a conflict between the mapreduce and mapred libraries.
>>>> Can you please suggest a way to make this work?
>>>> Regards,
>>>> Jyothish Soman
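
As an illustration of the high-latency case mentioned earlier in the thread (a crawler-style map() that spends most of its time waiting on remote sites), here is a minimal sketch using only the Java standard library. It does not use any Hadoop API. The class and method names are hypothetical, and the remote request is simulated with a sleep, so it only shows why overlapping waits with several threads can help when per-record latency, not CPU, is the bottleneck.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OverlappedFetch {
    // Stand-in for a slow remote request: sleeps instead of touching the network.
    static String slowFetch(String url, long latencyMillis) throws InterruptedException {
        Thread.sleep(latencyMillis);
        return "fetched:" + url;
    }

    // Submits every request to a fixed pool so the waits overlap,
    // then collects the results in the original input order.
    static List<String> fetchAll(List<String> urls, int threads, long latencyMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                pending.add(pool.submit(() -> slowFetch(url, latencyMillis)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : pending) {
                try {
                    results.add(f.get());
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> urls = List.of("a", "b", "c", "d");
        long start = System.nanoTime();
        List<String> results = fetchAll(urls, 4, 200);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        // With 4 threads the four simulated waits overlap instead of running back to back.
        System.out.println(results + " in about " + elapsedMillis + " ms");
    }
}
```

For CPU-bound map() code this pattern buys little, which matches the advice above to prefer multiple single-threaded task slots per node in the common case.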
