From: Harsh J
Date: Sat, 9 Feb 2013 10:42:13 +0530
Subject: Re: How MapReduce selects data blocks for processing user request

Hi Mehal,

> I am confused over how MapReduce tasks select data blocks for processing user requests ?

I suggest reading chapter 6 of Tom White's Hadoop: The Definitive Guide,
titled "How MapReduce Works". It explains almost everything you need to
know in very clear language, and this or other such good books should
help you in general.

> As data block replication replicates single data block over multiple datanodes, during job processing how uniquely data blocks are selected for processing user requests ?

The first point to clear up is that MapReduce is not hard-tied to HDFS.
It generates splits on any FS, and the splits are unique, based on your
given input path. Each split maps to exactly one task, so each task's
input is already fixed at submit time. A split is defined by its path,
its start offset into the file, and the length to process from that
offset, which "uniquely" identifies it.

> How does it guarantees that no same block gets chosen twice or thrice for different mapper task.

See above: each "block" (or "split" in MR terms) is defined by its start
offset and length. No two splits generated for a single file are the
same, since we generate them that way, so the case you're worried about
won't arise.
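To make the (path, start offset, length) idea concrete, here is a minimal sketch in plain Python. It is not Hadoop's actual code: the names `Split` and `make_splits`, the input path, and the sizes are all made up for illustration, and it assumes a fixed split size with no min/max tuning. It only shows why consecutive byte ranges cut from one file cannot repeat or overlap.

```python
from typing import List, NamedTuple


class Split(NamedTuple):
    """Hypothetical stand-in for an input split (not a Hadoop class)."""
    path: str    # file the split belongs to
    start: int   # byte offset where the split begins
    length: int  # number of bytes to process from start


def make_splits(path: str, file_length: int, split_size: int) -> List[Split]:
    """Cut one file into consecutive, non-overlapping byte ranges."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    offset = 0
    while offset < file_length:
        # The last split may be shorter than split_size.
        length = min(split_size, file_length - offset)
        splits.append(Split(path, offset, length))
        # The next split starts exactly where this one ends.
        offset += length
    return splits


if __name__ == "__main__":
    # Assumed example: a 300-byte file cut into 128-byte splits.
    for s in make_splits("/data/input.txt", 300, 128):
        print(s)
```

Since each range starts where the previous one ended, the ranges are disjoint and together cover the file once, however many replicas of the underlying data exist. One split goes to one map task.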
On Sat, Feb 9, 2013 at 6:10 AM, Mehal Patel wrote:
> Hello All,
>
> I am confused over how MapReduce tasks select data blocks for processing
> user requests ?
>
> As data block replication replicates single data block over multiple
> datanodes, during job processing how uniquely
> data blocks are selected for processing user requests ? How does it
> guarantees that no same block gets chosen twice or thrice
> for different mapper task.
>
>
> Thank you
>
> -Mehal

--
Harsh J