hadoop-common-user mailing list archives

From Gang Luo <lgpub...@yahoo.com.cn>
Subject Re: when to send distributed cache file
Date Wed, 17 Mar 2010 20:58:02 GMT
Thanks Ravi.

Here are some observations. I ran job1 to generate some data used by the following job2, without replication. The total size of the job1 output is 25 MB, spread across 50 files. I use the distributed cache to send all the files to the nodes running job2's tasks. When job2 starts, it stays at "map 0% reduce 0%" for 10 minutes. When the job1 output is in 10 files instead (using 10 reducers in job1), the time spent here is 2 minutes.

So, I think the time to distribute cache files is actually counted as part of the total time of the MR job. And in order to send a cache file from HDFS to local disk, it transfers at least one block (64 MB by default) even if the file is only 1 MB. Is that right? If so, how much space does that cache file take on the local disk, 64 MB or 1 MB?


----- Original Message ----
From: Ravi Phulari <rphulari@yahoo-inc.com>
To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>; Gang Luo
Sent: 2010/3/17 (Wed) 3:52:24 PM
Subject: Re: when to send distributed cache file

Hello Gang,
      The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.
I am not sure whether the time required to distribute the cache is counted in the MapReduce job time, but the distribution is set up as part of the job submission process in JobClient.
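For reference, the copy-before-any-task behavior described above is driven by files registered on the job configuration before submission. A minimal sketch using the old 0.20-era API (the one current at the time of this thread); the class name and HDFS path are illustrative only, and this needs a Hadoop cluster and the Hadoop jars on the classpath to actually run:

```java
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheSketch {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CacheSketch.class);

        // Register an HDFS file at submission time. JobClient records the URI
        // in the job configuration; each TaskTracker then localizes the file
        // to its local disk before running any task of this job.
        // (Hypothetical path, standing in for job1's output files.)
        DistributedCache.addCacheFile(
                new URI("/user/gang/job1-output/part-00000"), conf);

        // ... set mapper/reducer classes, input/output paths, then submit
        // the job with JobClient.runJob(conf) ...
    }
}
```

Inside a task, the localized copies can then be looked up with `DistributedCache.getLocalCacheFiles(conf)`, which returns the local-disk `Path[]` for the registered files.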

On 3/17/10 11:32 AM, "Gang Luo" <lgpublic@yahoo.com.cn> wrote:

Hi all,
I am wondering when Hadoop distributes the cache files. Is it the moment we call DistributedCache.addCacheFile()? And will the time to distribute the cache be counted as part of the MapReduce job time?


