From: Ted Dunning
Date: Wed, 8 May 2013 19:10:54 -0700
Subject: Re: MapReduce - FileInputFormat and Locality
To: "common-user@hadoop.apache.org"

I think that you just said what the OP said.

Your two cases reduce to the same single case that they had.

Whether this matters is another question, but it seems like it could in
cases where splits != blocks, especially if a split starts near the end of
a block, which could give an illusion of locality.

My guess is that since data locality is typically very high, this doesn't
matter much.

On Wed, May 8, 2013 at 3:00 PM, Vinod Kumar Vavilapalli <vinodkv@hortonworks.com> wrote:

> I think you misread it.
>
> If a given split has only one block, it uses all the locations of that
> block.
>
> If it so happens that a given split has multiple blocks, it uses all the
> locations of the first block.
>
> HTH,
> +Vinod Kumar Vavilapalli
> Hortonworks Inc.
> http://hortonworks.com/
>
> On May 8, 2013, at 7:21 AM, Brian C. Huffman wrote:
>
> All,
>
> I'm trying to understand how the current FileInputFormat implements
> locality. As far as I can tell, it calculates splits using getSplits and
> each split will contain the node that hosts the first block of data in
> that split. Is my understanding correct?
>
> Looking at the FileInputFormat for the old API (mapred), it appears that
> it does more to implement locality, using getSplitHosts to "return the
> hosts that contribute most for a given split"
>
> If I understand correctly, why was this changed?
>
> Thanks,
> Brian
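
The "illusion of locality" case raised in this thread (a split that starts near the end of a block) can be illustrated with a toy model. The sketch below is not Hadoop code: the `Block` class, the function names, the host names, and the byte sizes are all hypothetical, and the "most bytes" ranking is a simplification that only loosely mirrors the idea quoted for the old API's getSplitHosts ("the hosts that contribute most for a given split"). It compares the hosts of the split's first block with the hosts that actually store most of the split's bytes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Toy stand-in for an HDFS block: a byte range plus its replica hosts."""
    offset: int
    length: int
    hosts: tuple


def first_block_hosts(blocks, start, length):
    """Return the hosts of the block that contains the split's first byte."""
    for b in blocks:
        if b.offset <= start < b.offset + b.length:
            return list(b.hosts)
    raise ValueError("split start lies outside the file")


def bytes_per_host(blocks, start, length):
    """Count how many bytes of the split each host stores locally."""
    end = start + length
    local = defaultdict(int)
    for b in blocks:
        overlap = min(end, b.offset + b.length) - max(start, b.offset)
        if overlap > 0:
            for h in b.hosts:
                local[h] += overlap
    return dict(local)


def top_contributing_hosts(blocks, start, length, n=3):
    """Return the n hosts holding the most bytes of the split."""
    local = bytes_per_host(blocks, start, length)
    # Sort by bytes descending, then by host name for a stable order.
    ranked = sorted(local, key=lambda h: (-local[h], h))
    return ranked[:n]


# Hypothetical file: two 128-byte blocks on disjoint sets of hosts.
blocks = [
    Block(0, 128, ("a", "b", "c")),
    Block(128, 128, ("d", "e", "f")),
]

# The split starts 8 bytes before the end of the first block.
split_start, split_length = 120, 128

print(first_block_hosts(blocks, split_start, split_length))
print(bytes_per_host(blocks, split_start, split_length))
print(top_contributing_hosts(blocks, split_start, split_length))
```

In this made-up layout the first-block rule reports hosts a, b, and c, each of which holds only 8 of the split's 128 bytes, while hosts d, e, and f hold the other 120. When splits line up with blocks, the two rules agree, which fits the guess above that the difference rarely matters in practice.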