Subject: Data Locality Importance
From: Mike Sam
To: common-user@hadoop.apache.org
Date: Fri, 21 Mar 2014 19:06:38 -0700

How important is data locality to Hadoop? If we separate the HDFS cluster from the MR cluster, we lose data locality, but how bad is that, assuming we provide a reasonable network connection between the two clusters?

EMR kills data locality when using S3 as storage, yet we do not see a significant difference in job time compared with running the same job from the HDFS cluster of the same setup. So I am wondering how important data locality is to Hadoop in practice.

Thanks,
Mike
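
One rough way to frame the question is a back-of-envelope model of a single map task: time to read the input split plus time to process it. The sketch below is only an illustration under stated assumptions. The function names, the serial read-then-compute model (no overlap of I/O and CPU), and every number in the usage line are hypothetical, not measurements from any cluster or anything Hadoop itself reports.

```python
def read_seconds(split_mb, throughput_mb_s):
    """Time to read one input split at a given sustained throughput."""
    return split_mb / throughput_mb_s


def remote_slowdown(split_mb, compute_s, local_mb_s, remote_mb_s):
    """Ratio of task time with remote reads to task time with local reads.

    Assumes (hypothetically) that reading and computing happen one after
    the other, so task time = read time + compute time.
    """
    local_task = read_seconds(split_mb, local_mb_s) + compute_s
    remote_task = read_seconds(split_mb, remote_mb_s) + compute_s
    return remote_task / local_task


if __name__ == "__main__":
    # Hypothetical inputs: 128 MB split, 10 s of CPU work per split,
    # 128 MB/s local disk read, 64 MB/s effective network read.
    print(remote_slowdown(split_mb=128, compute_s=10,
                          local_mb_s=128, remote_mb_s=64))
```

Under this model the penalty for losing locality shrinks as the per-split compute time grows relative to the read time, which would be one possible explanation for CPU-heavy jobs showing little difference between S3 and HDFS.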