hadoop-common-user mailing list archives

From Arun C Murthy <...@yahoo-inc.com>
Subject Re: map tasks and processes
Date Fri, 15 Aug 2008 17:41:58 GMT

On Aug 15, 2008, at 9:15 AM, charles du wrote:

> Thanks a lot for the information.
>
> I used the '-file' option provided by hadoop-streaming to upload
> read-only files for my map/reduce job, and read them as local files in
> my perl script. I am wondering if it is similar to what the distributed
> cache does performance-wise? Thanks.
>
>

For read-only files, -cache* is slightly better since it uses the
DistributedCache; with -file, the data file is packaged into the
job.jar and shipped with every job submission.

http://hadoop.apache.org/core/docs/current/streaming.html#Large+files+and+archives+in+Hadoop+Streaming
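
As a rough sketch (the streaming jar path, HDFS URI, and file names
below are made up for illustration), the two approaches look like this
on the command line:

  # -file: lookup.dat is bundled into the job.jar and shipped with
  # every job submission
  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
      -input /user/tp/input -output /user/tp/output \
      -mapper mapper.pl -file mapper.pl -file lookup.dat

  # -cacheFile: lookup.dat is fetched from HDFS via the DistributedCache
  # and symlinked as 'lookup.dat' in the task's working directory
  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
      -input /user/tp/input -output /user/tp/output \
      -mapper mapper.pl -file mapper.pl \
      -cacheFile hdfs://namenode:9000/user/tp/lookup.dat#lookup.dat

Either way the perl script opens 'lookup.dat' as a local file; the
-cacheFile route just avoids re-shipping the data inside the job.jar.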

Arun

> tp.
>
> On Tue, Aug 12, 2008 at 5:07 PM, Arun C Murthy <acm@yahoo-inc.com> wrote:
>
>>
>> On Aug 12, 2008, at 11:21 AM, charles du wrote:
>>
>>> Hi:
>>>
>>> Does hadoop always start a new process for each map task?
>>>
>>>
>> Yes. http://issues.apache.org/jira/browse/HADOOP-249 is open to
>> optimize that.
>>
>> Till HADOOP-249 is fixed, you could try launching fewer, fatter maps
>> that each do more work, to amortize your long initialization. Please
>> take a look at InputFormat.getSplits for how to do that:
>> http://hadoop.apache.org/core/docs/r0.17.1/api/org/apache/hadoop/mapred/InputFormat.html
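>>
>> Since this is a streaming job, a rough alternative to overriding
>> getSplits is to raise the minimum split size via -jobconf (the value
>> and paths below are illustrative only, not a recommendation):
>>
>>   hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
>>       -input /user/tp/input -output /user/tp/output \
>>       -mapper mapper.pl -file mapper.pl \
>>       -jobconf mapred.min.split.size=536870912
>>
>> With FileInputFormat-based inputs this should yield fewer, larger
>> splits, so fewer map processes pay the initialization cost.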
>>
>> Also, you could consider using the DistributedCache for distributing
>> read-only data if necessary for your maps/reduces:
>>
>> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache
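>>
>> As a sketch (archive name, HDFS URI, and paths are made up), a bundle
>> of read-only files can also be shipped as a jarred archive that the
>> DistributedCache unpacks on each node:
>>
>>   hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
>>       -input /user/tp/input -output /user/tp/output \
>>       -mapper mapper.pl -file mapper.pl \
>>       -cacheArchive hdfs://namenode:9000/user/tp/refdata.jar#refdata
>>
>> The archive is unjarred into a 'refdata' directory in the task's
>> working directory, so the script can read e.g. refdata/lookup.dat.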
>>
>> Arun
>>
>>
>>> I have a 20-machine cluster and configured each task tracker to run
>>> 2 concurrent tasks at most, so the cluster can run 40 tasks in
>>> parallel. If I start a hadoop job with 1000 tasks, will hadoop create
>>> 1000 map processes during the execution of the job, or will it start
>>> 40 processes at the beginning and process the 1000 tasks one by one
>>> (of course, at any particular time, only 40 running)?
>>>
>>> My map tasks have a long initialization time before they start
>>> processing data files, so it would be ideal if map processes could be
>>> reused among different tasks instead of creating a new process for
>>> each of them. Is there a way to do it?
>>>
>>> Thanks.
>>>
>>> --
>>> tp
>>>
>>
>>
>
>
> -- 
> tp

