From: "Agarwal, Nikhil"
To: "user@hadoop.apache.org"
Subject: RE: Map Tasks do not obey data locality principle........
Date: Thu, 16 May 2013 06:21:42 +0000

Agreed. Thanks for replying. As hints, what I have given is the IP address of the node where the file resides, but it still does not follow data locality.

One clarification: if the map task for file A is submitted to a TaskTracker running on a different node, does that necessarily mean that the entire file A was transferred to the other node?

Regards,
Nikhil

-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com]
Sent: Thursday, May 16, 2013 11:47 AM
To:
Subject: Re: Map Tasks do not obey data locality principle........

The scheduling is done based on block locations filled in by the input splits. If there are no hints being provided by your FS, then the result you're seeing is correct.

Note that if you don't use a block concept, you ought to consider a whole file as one block and return a location based on that.

Essentially, your
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#getFileBlockLocations(org.apache.hadoop.fs.FileStatus,%20long,%20long)
form of API calls has to return valid values for scheduling to work.

On Thu, May 16, 2013 at 11:38 AM, Agarwal, Nikhil wrote:
> No, it does not. I have kept the granularity at file level rather than a block.
> I do not think that should affect the mapping of tasks?
>
> Regards,
> Nikhil
>
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com]
> Sent: Thursday, May 16, 2013 2:31 AM
> To:
> Subject: Re: Map Tasks do not obey data locality principle........
>
> Also, does your custom FS report block locations in the exact same format as how HDFS does?
>
> On Tue, May 14, 2013 at 4:25 PM, Agarwal, Nikhil wrote:
>> Hi,
>>
>> I have a 3-node cluster, with the JobTracker running on one machine and
>> TaskTrackers on the other two (say, slave1 and slave2). Instead of using
>> HDFS, I have written my own FileSystem implementation. Since, unlike
>> HDFS, I am unable to provide a shared filesystem view to the JobTracker
>> and TaskTrackers, I mounted the root container of slave2 on a
>> directory in slave1 (NFS mount). By doing this I am able to submit an MR
>> job to the JobTracker, with input path as
>> my_scheme://slave1_IP:Port/dir1, etc. MR runs successfully, but
>> data locality is not ensured, i.e. if files A,B,C are kept on
>> slave1 and D,E,F on slave2, then according to data locality the map
>> tasks of A,B,C should be submitted to the TaskTracker running on slave1
>> and those of D,E,F to the one on slave2. Instead of this,
>> it randomly schedules the map tasks to any of the TaskTrackers. If the map
>> task of file A is submitted to the TaskTracker running on slave2, then it
>> implies that file A is being fetched over the network by slave2.
>>
>> How do I prevent this from happening?
>>
>> Thanks,
>>
>> Nikhil
>
> --
> Harsh J

--
Harsh J
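
For readers following the thread, here is a rough sketch of the "consider a whole file as one block" advice. It is not code from either poster. To stay runnable without Hadoop on the classpath it uses a hypothetical stand-in class, SimpleBlockLocation, in place of org.apache.hadoop.fs.BlockLocation, and a hypothetical helper, WholeFileLocality.locate, whose parameters only mirror the shape of FileSystem#getFileBlockLocations(FileStatus, long, long). The ownerHost argument is assumed to be whatever host identifier the custom file system knows for the node storing the file.

```java
/** Simplified, hypothetical stand-in for org.apache.hadoop.fs.BlockLocation. */
final class SimpleBlockLocation {
    final String[] hosts; // hosts that store this block
    final long offset;    // start of the block within the file
    final long length;    // block length in bytes

    SimpleBlockLocation(String[] hosts, long offset, long length) {
        this.hosts = hosts.clone();
        this.offset = offset;
        this.length = length;
    }
}

/** Sketch: a file system with no block concept reports each file as one block. */
final class WholeFileLocality {
    /**
     * Returns the blocks overlapping the byte range that begins at 'start'.
     * Because the whole file is a single block, the answer is either that
     * one block or nothing.
     */
    static SimpleBlockLocation[] locate(String ownerHost, long fileLength,
                                        long start, long len) {
        if (start < 0 || len < 0) {
            throw new IllegalArgumentException("Invalid start or len parameter");
        }
        // A range starting at or beyond end of file overlaps no block
        // (this also covers empty files).
        if (start >= fileLength) {
            return new SimpleBlockLocation[0];
        }
        // One block spanning the entire file, located on the node that stores it.
        return new SimpleBlockLocation[] {
            new SimpleBlockLocation(new String[] {ownerHost}, 0L, fileLength)
        };
    }
}
```

In a real FileSystem subclass the same logic would sit inside an override of getFileBlockLocations and build real BlockLocation objects. Whether the scheduler then treats a task as local presumably also depends on the reported hosts matching the names the TaskTrackers register under, which is worth checking when the hints are raw IP addresses.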