hadoop-mapreduce-user mailing list archives

From Lin Ma <lin...@gmail.com>
Subject Re: distributed cache
Date Sat, 22 Dec 2012 13:24:12 GMT
Hi Kai,

Smart answer! :-)

   - Your assumption is that each distributed cache replica can serve only
   one TaskTracker download session at a time (which is where the n/r
   concurrency figure comes from). Why can't a single replica serve multiple
   concurrent download sessions? For example, if one TaskTracker takes
   elapsed time t to download a file from a given replica, couldn't two
   TaskTrackers download from that same replica in parallel in time t as
   well, or perhaps 1.5t, which would be faster than the sequential download
   time 2t you mentioned?
   - "In total, r+n/r concurrent operations. If you optimize r depending on
   n, SQRT(n) is the optimal replication level." -- how do you derive
   SQRT(n) as the minimizer of r+n/r? I would appreciate a pointer to more
   details.
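For what it's worth, the SQRT(n) claim falls out of basic calculus on the cost model quoted above: setting the derivative of f(r) = r + n/r to zero gives 1 - n/r^2 = 0, i.e. r = sqrt(n). A minimal sketch (the cluster size n = 400 here is a hypothetical example, not a number from the thread):

```python
import math

def total_operations(n, r):
    """Cost model from the thread: r copy operations to seed the HDFS
    replicas, plus n/r concurrent downloads by the n TaskTrackers."""
    return r + n / r

n = 400  # hypothetical cluster size

# Minimizing r + n/r analytically:
#   d/dr (r + n/r) = 1 - n/r**2 = 0  =>  r = sqrt(n)
r_opt = math.sqrt(n)

# Brute-force check: sqrt(n) also wins over every integer choice of r.
best = min(range(1, n + 1), key=lambda r: total_operations(n, r))
print(r_opt, best)  # 20.0 20
```

So for a 400-node cluster the model says ~20 replicas is optimal, which is why the default of 10 is described as reasonable for clusters below roughly 500 nodes.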


On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <k@123.org> wrote:

> Hi,
> simple math. Assuming you have n TaskTrackers in your cluster that will
> need to access the files in the distributed cache. And r is the replication
> level of those files.
> Copying the files into HDFS requires r copy operations over the network.
> The n TaskTrackers need to get their local copies from HDFS, so the n
> TaskTrackers copy from r DataNodes, giving n/r concurrent operations. In
> total, r+n/r concurrent operations. If you optimize r depending on n,
> SQRT(n) is the optimal replication level. So 10 is a reasonable default
> setting for most clusters that are not larger than 500 nodes.
> Kai
> Am 22.12.2012 um 13:46 schrieb Lin Ma <linlma@gmail.com>:
> Thanks Kai, what is the purpose of using a higher replication count?
> regards,
> Lin
> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <k@123.org> wrote:
>> Hi,
>> Am 22.12.2012 um 13:03 schrieb Lin Ma <linlma@gmail.com>:
>> > I want to confirm that when a mapper or reducer on a task node
>> accesses a distributed cache file, the file resides on disk, not in
>> memory. I just want to make sure the distributed cache file is not fully
>> loaded into memory, where it would compete for memory with the
>> mapper/reducer tasks. Is that correct?
>> Yes, you are correct. The JobTracker will put files for the distributed
>> cache into HDFS with a higher replication count (10 by default). Whenever a
>> TaskTracker needs those files for a task it is launching locally, it will
>> fetch a copy to its local disk. So it won't need to do this again for
>> future tasks on this node. After a job is done, all local copies and the
>> HDFS copies of files in the distributed cache are cleaned up.
>> Kai
>> --
>> Kai Voigt
>> k@123.org
> --
> Kai Voigt
> k@123.org
