From: Mohammad Tariq
Date: Thu, 14 Mar 2013 17:12:05 +0530
Subject: Re: Block vs FileSplit vs record vs line
To: "user@hadoop.apache.org"

Just to add to what Manish sir has said: HDFS blocks and MR file splits are two different things. A block is a physical unit of storage in HDFS, whereas a file split is just a logical division of your data, such that each split goes to a mapper for processing. How splits are created depends on the InputFormat you use. But you do not get a separate mapper for every record in a split. For example, if you process a huge CSV file with (say) 1 million rows, you won't get 1 million mappers, as that would add a lot of overhead; each mapper works through all the records in its split. The framework tries to do everything as efficiently as possible.

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com

On Thu, Mar 14, 2013 at 4:59 PM, Manish Bhoge wrote:

> Sai,
> Each file is divided into splits as per the map input format, and each
> split corresponds to one map. You rightly stated 1 split = 1 block = 1 map.
> A record can be a combination of blocks, as defined by the RecordReader
> code. One record can span a series of maps or splits or blocks.
>
> Hope this clears it up.
>
> ------------------------------
> From: Sai Sai
> To: user@hadoop.apache.org
> Subject: Re: Block vs FileSplit vs record vs line
> Sent: Thu, Mar 14, 2013 8:45:53 AM
>
> Just wondering if this is the right way to understand this:
> A large file is split into multiple blocks and each block is split into
> multiple file splits and each file split has multiple records and each
> record has multiple lines. Each line is processed by 1 instance of mapper.
> Any help is appreciated.
> Thanks
> Sai
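To make the block / split / record distinction in this thread concrete, here is a small sketch. It is a simplified Python model, not Hadoop code: the function names (compute_split_size, compute_splits, records_for_split) are hypothetical, and the logic only approximates what FileInputFormat and LineRecordReader do for plain text files, assuming the default 1.1 "slop" factor and newline-terminated records.

```python
def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    # Assumed model of the split-size rule: clamp the block size
    # between the configured minimum and maximum split sizes.
    return max(min_size, min(max_size, block_size))


def compute_splits(file_length, split_size, slop=1.1):
    # Cut the file into (offset, length) ranges. A short tail is folded
    # into the last split when it is within the slop factor.
    splits = []
    offset = 0
    remaining = file_length
    while remaining / split_size > slop:
        splits.append((offset, split_size))
        offset += split_size
        remaining -= split_size
    if remaining > 0:
        splits.append((offset, remaining))
    return splits


def records_for_split(data, start, length):
    # Model of a line-oriented record reader: every split except the
    # first discards its first (possibly partial) line, and every split
    # keeps reading as long as a line *starts* at or before its end.
    end = start + length
    pos = start
    if start != 0:
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1
    records = []
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        next_pos = len(data) if newline == -1 else newline + 1
        records.append(data[pos:next_pos].rstrip(b"\n"))
        pos = next_pos
    return records


if __name__ == "__main__":
    # Ten 5-byte rows and a toy 16-byte "block size".
    data = b"".join(b"row%d\n" % i for i in range(10))
    split_size = compute_split_size(block_size=16)
    for offset, length in compute_splits(len(data), split_size):
        # One map task per split; it sees every record in that split.
        print(offset, length, records_for_split(data, offset, length))
```

In this model the number of map tasks equals the number of splits, not the number of rows, and a row that straddles a split boundary is read whole by exactly one split. The record boundaries never have to line up with block or split boundaries.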