Subject: Re: distributed cache
From: Lin Ma <linlma@gmail.com>
To: Kai Voigt, user@hadoop.apache.org
Date: Sat, 22 Dec 2012 21:24:12 +0800

Hi Kai,

Smart answer! :-)

- The assumption you make is that one distributed cache replica can serve
only one download session for a TaskTracker node at a time (this is why you
get concurrency n/r). The question is: why can't one distributed cache
replica serve multiple concurrent download sessions? For example, suppose a
TaskTracker takes elapsed time t to download a file from a specific replica.
Isn't it possible for 2 TaskTrackers to download from that same replica in
parallel, each also taking time t (or say 1.5t), which would be faster than
the sequential download time 2t you mentioned?

- "In total, r+n/r concurrent operations. If you optimize r depending on n,
SQRT(n) is the optimal replication level." -- how do you get SQRT(n) as the
minimizer of r+n/r? I'd appreciate a pointer to more details.

regards,
Lin
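For reference, the optimization Kai mentions works out as follows if you
treat r as a continuous variable (a sketch; the second-derivative check is
added here, it is not spelled out in the thread):

    f(r)   = r + n/r
    f'(r)  = 1 - n/r^2 = 0   =>   r^2 = n   =>   r = SQRT(n)
    f''(r) = 2n/r^3 > 0 for r > 0, so r = SQRT(n) is a minimum.

For example, n = 100 TaskTrackers gives r = SQRT(100) = 10, which lines up
with the default replication of 10 mentioned in Kai's reply below.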
On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <k@123.org> wrote:

> Hi,
>
> Simple math. Assume you have n TaskTrackers in your cluster that will
> need to access the files in the distributed cache, and that r is the
> replication level of those files.
>
> Copying the files into HDFS requires r copy operations over the network.
> The n TaskTrackers then need to get their local copies from HDFS, and
> since they fetch from r DataNodes, that is roughly n/r copy operations
> per DataNode. In total, r + n/r concurrent operations. If you optimize r
> depending on n, SQRT(n) is the optimal replication level. So 10 is a
> reasonable default setting for most clusters that are not 500+ nodes big.
>
> Kai
>
> On 22.12.2012 at 13:46, Lin Ma <linlma@gmail.com> wrote:
>
> Thanks Kai -- a higher replication count for what purpose?
>
> regards,
> Lin
>
> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <k@123.org> wrote:
>
>> Hi,
>>
>> On 22.12.2012 at 13:03, Lin Ma <linlma@gmail.com> wrote:
>>
>> > I want to confirm that when a mapper or reducer on a task node
>> > accesses a distributed cache file, the file resides on disk, not in
>> > memory. I just want to make sure a distributed cache file is not fully
>> > loaded into memory, where it would compete for memory with the
>> > mapper/reducer tasks. Is that correct?
>>
>> Yes, you are correct. The JobTracker will put files for the distributed
>> cache into HDFS with a higher replication count (10 by default).
>> Whenever a TaskTracker needs those files for a task it is launching
>> locally, it will fetch a copy to its local disk, so it won't need to do
>> this again for future tasks on this node. After a job is done, all
>> local copies and the HDFS copies of files in the distributed cache are
>> cleaned up.
>>
>> Kai
>>
>> --
>> Kai Voigt
>> k@123.org
>
> --
> Kai Voigt
> k@123.org
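To make the mechanics in Kai's answer concrete, here is a minimal sketch
against the Hadoop 1.x "mapred" API (the current API when this thread was
written). The HDFS path, file name, and class names are made up for
illustration; mapred.submit.replication is, to my knowledge, the 1.x
property for the replication of job files pushed to HDFS at submit time
(default 10), but verify it against your version:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class CacheSketch {

        // Driver side: register a file and pick the replication r.
        public static JobConf buildConf() throws Exception {
            JobConf conf = new JobConf(CacheSketch.class);
            // Replication for job files pushed to HDFS at submit time;
            // this is the r in Kai's r + n/r argument (default 10).
            conf.setInt("mapred.submit.replication", 10);
            // Hypothetical HDFS path; the #fragment names the local link.
            DistributedCache.addCacheFile(
                    new URI("/user/lin/lookup.txt#lookup.txt"), conf);
            return conf;
        }

        // Task side: the TaskTracker has already fetched a copy to its
        // local disk before the task starts; nothing is preloaded into
        // the task's heap.
        public static class MyMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, Text> {

            private Path localCopy;

            @Override
            public void configure(JobConf job) {
                try {
                    // Local on-disk paths, one per cached file.
                    Path[] files = DistributedCache.getLocalCacheFiles(job);
                    localCopy = files[0];
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }

            @Override
            public void map(LongWritable key, Text value,
                            OutputCollector<Text, Text> out,
                            Reporter reporter) throws IOException {
                // Read from local disk as needed; the file only competes
                // for task memory if your own code loads it into the heap.
                BufferedReader in = new BufferedReader(
                        new FileReader(localCopy.toString()));
                try {
                    // ... look up values from the cached file ...
                } finally {
                    in.close();
                }
            }
        }
    }

This matches what Kai describes: the copy is fetched once per node, reused
by later tasks on that node, and cleaned up when the job finishes.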