From: Jay Vyas
Subject: Re: Input splits for sequence file input
Date: Mon, 3 Dec 2012 00:52:56 -0500
To: user@hadoop.apache.org
This question is fundamentally flawed: it assumes that a mapper will ask for anything.

The mapper class's "run" method reads from a record reader. The question you really should ask is:

How does a RecordReader read records across block boundaries?
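To make that division of labor concrete, here is a self-contained toy in plain Java (no Hadoop dependency; the names only mirror, and simplify, the real org.apache.hadoop.mapreduce classes). The point it illustrates: the mapper's run loop only ever calls nextKeyValue() on a record reader and consumes whole records, so bytes, blocks, and split boundaries are entirely the reader's problem.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy mapper/record-reader pair. The mapper never sees bytes, blocks, or
// splits -- it just pulls complete records from the reader, one at a time.
public class MapperLoopDemo {

    // Stand-in for a RecordReader: hands out complete records on demand.
    static class ToyRecordReader {
        private final Iterator<String> it;
        private String current;

        ToyRecordReader(List<String> records) { this.it = records.iterator(); }

        boolean nextKeyValue() {          // advance to the next record, if any
            if (!it.hasNext()) return false;
            current = it.next();
            return true;
        }

        String getCurrentValue() { return current; }
    }

    // Stand-in for Mapper.run(): a plain pull loop over the reader.
    static List<String> run(ToyRecordReader reader) {
        List<String> out = new ArrayList<>();
        while (reader.nextKeyValue()) {
            out.add(reader.getCurrentValue().toUpperCase()); // the "map" step
        }
        return out;
    }

    public static void main(String[] args) {
        ToyRecordReader reader = new ToyRecordReader(List.of("alpha", "bravo"));
        System.out.println(run(reader)); // [ALPHA, BRAVO]
    }
}
```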

Jay Vyas
http://jayunit100.blogspot.com

On Dec 2, 2012, at 9:08 PM, Jeff Zhang <zjffdu@gmail.com> wrote:

The createRecordReader method will handle the record boundary issue. You can check the code for details.

On Mon, Dec 3, 2012 at 6:03 AM, Jeff LI <uniquejeff@gmail.com> wrote:
Hello,

I was reading about the relationship between input splits and HDFS blocks, and a question came up:

If a logical record crosses an HDFS block boundary, let's say block#1 and block#2, does the mapper assigned to this input split ask for (1) both blocks, or (2) block#1 and just the part of block#2 that this logical record extends to, or (3) block#1 and the part of block#2 up to some sync point that covers this particular logical record?  Note the input is a sequence file.

I guess my question really is: does Hadoop operate on a block basis, or does it respect some sort of logical structure within a block when it's trying to feed the mappers with input data?

Cheers

Jeff




--
Best Regards

Jeff Zhang
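For a sequence file, the behavior the reader implements is closest to the original question's option (3), and sync markers are the mechanism. Below is a hedged, self-contained sketch (not the real SequenceFileRecordReader, whose on-disk format is considerably more involved): a reader assigned the split [start, end) first seeks forward to the first sync marker at or after start, then reads whole records until its position crosses end. A record that straddles a block/split boundary is therefore finished by the reader that owns the sync point before it, and skipped by the next reader, so every record is read exactly once.

```java
import java.util.ArrayList;
import java.util.List;

// Toy "sync marker" file: SYNC precedes every record. Splits are arbitrary
// byte ranges; readers use the markers to agree on who owns each record.
public class SyncSplitDemo {
    static final char SYNC = '#';

    // Read every record whose sync marker lies in [start, end). The last
    // record may extend past end (into the "next block"); that is fine,
    // because the next reader seeks forward to ITS first sync marker,
    // so nothing is read twice and nothing is lost.
    static List<String> readSplit(String data, int start, int end) {
        int pos = start;
        // Seek to the first sync marker at or after the split start.
        while (pos < data.length() && data.charAt(pos) != SYNC) pos++;
        List<String> records = new ArrayList<>();
        while (pos < end && pos < data.length()) {
            pos++; // step over the sync marker
            int recStart = pos;
            while (pos < data.length() && data.charAt(pos) != SYNC) pos++;
            records.add(data.substring(recStart, pos));
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "#alpha#bravo#charlie";
        // Pretend the HDFS block boundary falls at byte 9, inside "bravo":
        // the first reader finishes "bravo"; the second starts at "charlie".
        System.out.println(readSplit(data, 0, 9));             // [alpha, bravo]
        System.out.println(readSplit(data, 9, data.length())); // [charlie]
    }
}
```

So the split is computed on a byte/block basis, but the reader re-imposes the logical record structure on top of it via the sync points.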