From: doopha shaf
To: core-user@hadoop.apache.org
Subject: general question - how hadoop works
Date: Sun, 20 Dec 2009 12:09:15 -0800 (PST)
Message-ID: <26866842.post@talk.nabble.com>

Trying to figure out how Hadoop actually achieves its speed.
Assuming that data locality is central to Hadoop's efficiency, how does the magic actually happen, given that data still gets moved all over the network to reach the reducers? For example, suppose I have 1 GB of logs spread evenly across 10 data nodes and, for the sake of argument, I use the identity mapper, so the map output is as large as the input. If the reducers are spread across the same 10 nodes, then roughly 90% of the data still needs to move across the network. How does the network not become saturated this way? What did I miss?

Thanks,
D.S.
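To make the arithmetic behind the 90% figure concrete, here is a minimal back-of-the-envelope sketch in Python. It is not Hadoop code. The function name `shuffle_traffic` and the model behind it are assumptions for illustration: input spread evenly over the nodes, an identity mapper whose output equals its input, one reducer per node, and keys partitioned uniformly across reducers.

```python
def shuffle_traffic(total_bytes, num_nodes):
    """Estimate cross-network shuffle volume under an idealized model.

    Assumptions (illustrative, not taken from Hadoop itself):
      - input is spread evenly across num_nodes
      - the identity mapper emits exactly what it reads
      - one reducer runs on each node
      - keys are partitioned uniformly across reducers
    """
    per_node_output = total_bytes / num_nodes
    # Each node keeps the 1/num_nodes share destined for its own reducer;
    # the rest has to travel to reducers on the other nodes.
    remote_fraction = (num_nodes - 1) / num_nodes
    return {
        "remote_fraction": remote_fraction,
        "total_remote_bytes": total_bytes * remote_fraction,
        "sent_per_node_bytes": per_node_output * remote_fraction,
    }


GB = 10**9
estimate = shuffle_traffic(1 * GB, 10)
for name, value in estimate.items():
    print(name, value)
```

Under these assumptions about 0.9 GB crosses the network in total, but each node only sends (and receives) about 90 MB, because the transfer is spread over all ten nodes rather than funneled through a single link.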