From: Harsh J <harsh@cloudera.com>
Date: Sat, 9 Feb 2013 10:42:13 +0530
Subject: Re: How MapReduce selects data blocks for processing user request
To: "<user@hadoop.apache.org>" <user@hadoop.apache.org>

Hi Mehal,

> I am confused over how MapReduce tasks select data blocks for processing user requests ?

I suggest reading chapter 6 of Tom White's Hadoop: The Definitive
Guide, titled "How MapReduce Works". It explains almost everything you
need to know in very clear language, and it (or a similarly good book)
should help you with Hadoop in general.

> As data block replication replicates single data block over multiple datanodes, during job processing how uniquely data blocks are selected for processing user requests ?

The first point to clear up is that MapReduce is not hard-tied to
HDFS. It generates splits on any FS, and the splits are unique, based
on your given input path. Each split maps to exactly one map task, so
each task's input is already fixed at submit time. A split is defined
by its path, its start offset into the file, and the length to be
processed after that offset, which together "uniquely" identify it.
Replication doesn't change this: the replica hosts only serve as
locality hints for where to schedule the task.
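
To make the (path, start, length) idea concrete, here is a minimal
Python sketch of how a file could be cut into splits at submit time.
It is a simplified model, not Hadoop's actual split code: the names
Split and compute_splits, the fixed split_size and the example path are
all assumptions made for illustration.

```python
from typing import List, NamedTuple


class Split(NamedTuple):
    """Hypothetical stand-in for an MR input split."""
    path: str
    start: int   # byte offset into the file
    length: int  # number of bytes to process from start


def compute_splits(path: str, file_len: int, split_size: int) -> List[Split]:
    """Cut [0, file_len) into consecutive, non-overlapping ranges."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    offset = 0
    while offset < file_len:
        # The last range may be shorter than split_size.
        length = min(split_size, file_len - offset)
        splits.append(Split(path, offset, length))
        offset += length
    return splits


if __name__ == "__main__":
    # Illustrative values only: a 300-byte file cut into 128-byte ranges.
    for s in compute_splits("/data/input.txt", 300, 128):
        print(s)
```

Because each range starts where the previous one ended, every byte of
the file belongs to exactly one split in this model.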

> How does it guarantees that no same block gets chosen twice or thrice for different mapper task.

See above - each "block" (or rather a "split", in MR terms) is defined
by its start offset and length. No two splits generated for a single
file are the same, since they are generated as distinct ranges of the
file - hence the case you're worried about can't arise.
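
On the replication part of the question, the following Python sketch
shows why replicas don't turn into duplicate work: tasks are planned
per byte range, and the hosts holding replicas are attached only as
scheduling hints. This is a toy model under stated assumptions, not
Hadoop's scheduler; plan_map_tasks, ranges_overlap, the dict layout and
the host names are all hypothetical.

```python
from typing import Dict, List, Tuple

# (start, length) -> hosts holding a replica of that range (hypothetical)
BlockMap = Dict[Tuple[int, int], List[str]]


def plan_map_tasks(path: str, blocks: BlockMap) -> List[dict]:
    """Create one task per (path, start, length); replicas are only hints."""
    tasks = []
    for (start, length), hosts in sorted(blocks.items()):
        tasks.append({
            "path": path,
            "start": start,
            "length": length,
            "preferred_hosts": list(hosts),  # where the task may run locally
        })
    return tasks


def ranges_overlap(tasks: List[dict]) -> bool:
    """Return True if any two tasks cover a common byte of the file."""
    end_of_previous = 0
    for t in sorted(tasks, key=lambda t: t["start"]):
        if t["start"] < end_of_previous:
            return True
        end_of_previous = t["start"] + t["length"]
    return False


if __name__ == "__main__":
    # Two ranges, each replicated on three hosts (made-up host names).
    blocks = {
        (0, 128): ["dn1", "dn2", "dn3"],
        (128, 72): ["dn2", "dn3", "dn4"],
    }
    tasks = plan_map_tasks("/data/input.txt", blocks)
    print(len(tasks), ranges_overlap(tasks))
```

With two ranges replicated three times each, the plan still contains
two tasks rather than six, and no byte is assigned to more than one
task.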

On Sat, Feb 9, 2013 at 6:10 AM, Mehal Patel <mehal01988@gmail.com> wrote:
> Hello All,
>
> I am confused over how MapReduce tasks select data blocks for processing
> user requests ?
>
> As data block replication replicates single data block over multiple
> datanodes, during job processing how uniquely
> data blocks are selected for processing user requests ? How does it
> guarantees that no same block gets chosen twice or thrice
> for different mapper task.
>
>
> Thank you
>
> -Mehal


--
Harsh J