Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (nike.apache.org: domain of dontariq@gmail.com designates
 209.85.128.175 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <1363260557.74062.androidMobile@web161904.mail.bf1.yahoo.com>
References: <1363260557.74062.androidMobile@web161904.mail.bf1.yahoo.com>
From: Mohammad Tariq <dontariq@gmail.com>
Date: Thu, 14 Mar 2013 17:12:05 +0530
Message-ID: 
 <CAMVC6RPw=N0q2f6K5gaQROCMGs0c=Or+_n0rf7Af770XuL7OJw@mail.gmail.com>
Subject: Re: Block vs FileSplit vs record vs line
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Content-Type: multipart/alternative; boundary=f46d043891078328db04d7e10576

--f46d043891078328db04d7e10576
Content-Type: text/plain; charset=ISO-8859-1

Just to add to what Manish sir has said: HDFS blocks and MR file splits are
two different things. A block is the physical unit in which HDFS stores
data. A file split is just a logical division of your data, and each split
goes to one mapper for processing. How the splits are created depends on
the InputFormat you use. A mapper is not created per record or per line,
though. For example, if you process a huge CSV file with (say) 1 million
rows, you won't get 1 million mappers, as that would add a lot of overhead.
Each mapper works through all the records in its split, one at a time.
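
To make the split count concrete, here is a rough sketch in plain Java,
modeled on how FileInputFormat sizes its splits as far as I understand it.
It does not use the Hadoop API. The class name SplitSketch, the method
names, the 128 MB block size and the file sizes are all my own assumptions
for illustration, so treat it as a sketch and not as the actual Hadoop code.

```java
public class SplitSketch {
    // Assumed tolerance: a tail of up to 10% over the split size is folded
    // into the last split instead of getting a tiny split of its own.
    static final double SPLIT_SLOP = 1.1;

    // Split size is the block size, clamped between a minimum and a maximum.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits (and so map tasks) for one splittable file.
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long bytesRemaining = fileLength;
        while ((double) bytesRemaining / splitSize > SPLIT_SLOP) {
            splits++;
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            splits++; // whatever is left becomes the final split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        long blockSize = 128 * mib; // assumed block size
        long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);

        // A 1 GB file gives 8 splits, however many rows it contains.
        System.out.println(countSplits(1024 * mib, splitSize));
        // A 130 MB file stays in a single split because of the slop.
        System.out.println(countSplits(130 * mib, splitSize));
    }
}
```

The point is that the number of mappers follows from the file size and the
split size, not from the number of rows in the file.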

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Thu, Mar 14, 2013 at 4:59 PM, Manish Bhoge <manishbhoge@rocketmail.com> wrote:

> Sai,
> Each file is divided into split as per the map input format, each split is
> equal to a map. You rightly stated 1 split=1 block=1 map. Record can be
> combination of block defined by recordreader code. One record can be series
> of maps or splits or blocks.
>
> Hope this will clear.
>
>  ------------------------------
> * From: * Sai Sai <saigraph@yahoo.in>;
> * To: * user@hadoop.apache.org <user@hadoop.apache.org>;
> * Subject: * Re: Block vs FileSplit vs record vs line
> * Sent: * Thu, Mar 14, 2013 8:45:53 AM
>
>   Just wondering if this is right way to understand this:
> A large file is split into multiple blocks and each block is split into
> multiple file splits and each file split has multiple records and each
> record has multiple lines. Each line is processed by 1 instance of mapper.
> Any help is appreciated.
> Thanks
> Sai

--f46d043891078328db04d7e10576--