hadoop-user mailing list archives

From Kai Voigt <k@123.org>
Subject Re: distributed cache
Date Sat, 22 Dec 2012 12:51:49 GMT

Simple math. Assume you have n TaskTrackers in your cluster that will need to access the
files in the distributed cache, and r is the replication level of those files.

Copying the files into HDFS requires r copy operations over the network. The n TaskTrackers
then need to get their local copies from HDFS; since they can read from r DataNodes in
parallel, that step takes roughly n/r concurrent operations. In total, that's r + n/r
operations. If you optimize r as a function of n, minimizing r + n/r gives sqrt(n) as the
optimal replication level. So 10 is a reasonable default setting for most clusters that
are not 500+ nodes big.
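The claim that r = sqrt(n) minimizes r + n/r can be checked numerically; a quick sketch in plain Python (the cluster sizes are illustrative):

```python
import math

def cache_cost(n, r):
    """Approximate network copy operations for a distributed cache file:
    r copies to seed HDFS, plus n/r for n TaskTrackers fanning out
    across r replicas."""
    return r + n / r

def best_replication(n):
    """Brute-force the replication level r that minimizes cache_cost."""
    return min(range(1, n + 1), key=lambda r: cache_cost(n, r))

# For a 100-node cluster the minimum lands at r = sqrt(100) = 10,
# matching the default replication level for distributed cache files.
print(best_replication(100))   # -> 10
# Larger clusters want a higher replication level:
print(best_replication(400))   # -> 20, i.e. sqrt(400)
```

This is why the default of 10 stops being optimal once the cluster grows well past 100 nodes.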


On 22.12.2012, at 13:46, Lin Ma <linlma@gmail.com> wrote:

> Thanks Kai. What is the purpose of using the higher replication count?
> regards,
> Lin
> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <k@123.org> wrote:
> Hi,
> On 22.12.2012, at 13:03, Lin Ma <linlma@gmail.com> wrote:
> > I want to confirm that when a mapper or reducer on a task node accesses a distributed
> > cache file, the file resides on disk, not in memory. I want to make sure distributed
> > cache files are not fully loaded into memory, where they would compete with the
> > mapper/reducer tasks for memory. Is that correct?
> Yes, you are correct. The JobTracker will put files for the distributed cache into HDFS
> with a higher replication count (10 by default). Whenever a TaskTracker needs those files
> for a task it is launching locally, it will fetch a copy to its local disk, so it won't
> need to do this again for future tasks on this node. After a job is done, all local copies
> and the HDFS copies of files in the distributed cache are cleaned up.
> Kai
> --
> Kai Voigt
> k@123.org
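
The fetch-once-per-node behavior Kai describes can be sketched as a toy simulation (plain Python; the `TaskTracker` class and its methods here are illustrative, not Hadoop's actual API):

```python
class TaskTracker:
    """Toy model of a node that fetches a cache file from HDFS once,
    then reuses the local disk copy for every later task on the node."""

    def __init__(self, name):
        self.name = name
        self.local_cache = set()   # files already on local disk
        self.hdfs_fetches = 0      # how often we actually hit HDFS

    def run_task(self, cache_file):
        if cache_file not in self.local_cache:
            # First task needing this file on this node:
            # copy it from HDFS to local disk.
            self.hdfs_fetches += 1
            self.local_cache.add(cache_file)
        # The task then reads the file from local disk, not memory.

node = TaskTracker("tt01")
for _ in range(5):             # five tasks of the same job on one node
    node.run_task("lookup.dat")
print(node.hdfs_fetches)       # -> 1: only the first task hits HDFS
```

However many tasks of the job land on the node, HDFS is contacted once per file, which is the point of caching the copy on local disk.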
