From: mark charts <mcharts@yahoo.com>
Date: Wed, 17 Dec 2014 16:15:23 +0000 (UTC)
"user@hadoop.apache.org" Message-ID: <857210796.179464.1418832923662.JavaMail.yahoo@jws10035.mail.ne1.yahoo.com> In-Reply-To: References: Subject: Re: How many blocks does one input split have? MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="----=_Part_179463_147626630.1418832923650" X-Virus-Checked: Checked by ClamAV on apache.org ------=_Part_179463_147626630.1418832923650 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Hello. FYI. "The way HDFS has been set up, it breaks down very large files into large b= locks(for example, measuring 128MB), and stores three copies of these block= s ondifferent nodes in the cluster. HDFS has no awareness of the content of= thesefiles.=C2=A0In YARN, when a MapReduce job is started, the Resource Ma= nager (thecluster resource management and job scheduling facility) creates = anApplication Master daemon to look after the lifecycle of the job. (In Had= oop 1,the JobTracker monitored individual jobs as well as handling job =C2= =ADschedulingand cluster resource management. One of the first things the A= pplication Masterdoes is determine which file blocks are needed for process= ing. The Application=C2=A0Master requests details from the NameNode on wher= e the replicas of the needed data blocks are stored. Using the location dat= a for the file blocks, the Application=C2=A0Master makes requests to the Re= source Manager to have map tasks process specific=C2=A0blocks on the slave = nodes where they=E2=80=99re stored. The key to efficient MapReduce processi= ng is that, wherever possible, data isprocessed locally =E2=80=94 on the sl= ave node where it=E2=80=99s stored.Before looking at how the data blocks ar= e processed, you need to look moreclosely at how Hadoop stores data. In Had= oop, files are composed of individualrecords, which are ultimately processe= d one-by-one by mapper tasks. Forexample, the sample data set we use in thi= s book contains information aboutcompleted flights within the United States= between 1987 and 2008. We have onelarge file for each year, and within eve= ry file, each individual line represents asingle flight. In other words, on= e line represents one record. Now, rememberthat the block size for the Hado= op cluster is 64MB, which means that the lightdata files are broken into ch= unks of exactly 64MB. Do you see the problem? If each map task processes all records in a specifi= cdata block, what happens to those records that span block boundaries?File = blocks are exactly 64MB (or whatever you set the block size to be), andbeca= use HDFS has no conception of what=E2=80=99s inside the file blocks, it can= =E2=80=99t gaugewhen a record might spill over into another block. To solve= this problem,Hadoop uses a logical representation of the data stored in fi= le blocks, known asinput splits. When a MapReduce job client calculates the= input splits, it figuresout where the first whole record in a block begins= and where the last recordin the block ends. In cases where the last record= in a block is incomplete, theinput split includes location information for= the next block and the byte offsetof the data needed to complete the recor= d.=C2=A0 You can configure the Application Master daemon (or JobTracker, if= you=E2=80=99re inHadoop 1) to calculate the input splits instead of the jo= b client, which wouldbe faster for jobs processing a large number of data b= locks.MapReduce data processing is driven by this concept of input splits. 
= Thenumber of input splits that are calculated for a specific application de= terminesthe number of mapper tasks. Each of these mapper tasks is assigned,= wherepossible, to a slave node where the input split is stored. The Resour= ce Manager(or JobTracker, if you=E2=80=99re in Hadoop 1) does its best to e= nsure that input splitsare processed locally." =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0sic Courtesy of=C2=A0Dirk deRoos, Paul C. Zikopoulos, Bruce Brown,Rafael Coss, = and Roman B. Melnyk Mark Charts =20 On Wednesday, December 17, 2014 10:32 AM, Dieter De Witte wrote: =20 Hi, Check this post: http://stackoverflow.com/questions/17727468/hadoop-input-s= plit-size-vs-block-size Regards, D 2014-12-17 15:16 GMT+01:00 Todd : Hi Hadoopers, I got a question about how many blocks does one input split have? It is ran= dom or the number can be configured or fixed(can't be changed)? Thanks! ------=_Part_179463_147626630.1418832923650 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable

Hello.

FYI.
"The way HDFS has been set up, it breaks down = very large files into large blocks
(for example, measuring 128M= B), and stores three copies of these blocks on
different nodes i= n the cluster. HDFS has no awareness of the content of these
fil= es.
 
In YARN, when a MapReduce job is started, th= e Resource Manager (the
cluster resource management and job sched= uling facility) creates an
Application Master daemon to look af= ter the lifecycle of the job. (In Hadoop 1,
the JobTracker monito= red individual jobs as well as handling job =C2=ADscheduling
and= cluster resource management. One of the first things the Application Maste= r
does is determine which file blocks are needed for processing. = The Application 
Master requests details from the NameNode o= n where the replicas of the needed data blocks are stored. Using the locati= on data for the file blocks, the Application 
Master makes r= equests to the Resource Manager to have map tasks process specific 
blocks on the slave nodes where they=E2=80=99re stored.

The key to efficient MapReduce processing is that, wherever possible, data
is processed locally — on the slave node where it's stored.

Before looking at how the data blocks are processed, you need to look more
closely at how Hadoop stores data. In Hadoop, files are composed of
individual records, which are ultimately processed one-by-one by mapper
tasks. For example, the sample data set we use in this book contains
information about completed flights within the United States between 1987
and 2008. We have one large file for each year, and within every file, each
individual line represents a single flight. In other words, one line
represents one record. Now, remember that the block size for the Hadoop
cluster is 64MB, which means that the flight data files are broken into
chunks of exactly 64MB.

Do you see the problem? If each map task processes all records in a
specific data block, what happens to those records that span block
boundaries? File blocks are exactly 64MB (or whatever you set the block
size to be), and because HDFS has no conception of what's inside the file
blocks, it can't gauge when a record might spill over into another block.
To solve this problem, Hadoop uses a logical representation of the data
stored in file blocks, known as input splits. When a MapReduce job client
calculates the input splits, it figures out where the first whole record in
a block begins and where the last record in the block ends. In cases where
the last record in a block is incomplete, the input split includes location
information for the next block and the byte offset of the data needed to
complete the record.

You can configure the Application Master daemon (or JobTracker, if you're
in Hadoop 1) to calculate the input splits instead of the job client, which
would be faster for jobs processing a large number of data blocks.
MapReduce data processing is driven by this concept of input splits. The
number of input splits that are calculated for a specific application
determines the number of mapper tasks. Each of these mapper tasks is
assigned, where possible, to a slave node where the input split is stored.
The Resource Manager (or JobTracker, if you're in Hadoop 1) does its best
to ensure that input splits are processed locally." [sic]

Courtesy of Dirk deRoos, Paul C. Zikopoulos, Bruce Brown, Rafael Coss, and
Roman B. Melnyk
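
To make the quoted arithmetic concrete: the split size is derived from the
block size and two configurable bounds. The sketch below mirrors, from
memory, the computeSplitSize() logic in Hadoop 2's FileInputFormat; treat
it as an illustration rather than the authoritative source (the class name
and the printed scenarios are mine).

// Believed rule: splitSize = max(minSize, min(maxSize, blockSize))
public class SplitSizeSketch {

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L << 20; // a 128MB HDFS block

        // Defaults (minSize = 1, maxSize = Long.MAX_VALUE): the split size
        // equals the block size, so one input split covers exactly one block.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));

        // Raise minSize to 256MB: each split now spans two 128MB blocks.
        System.out.println(computeSplitSize(blockSize, 256L << 20, Long.MAX_VALUE));

        // Cap maxSize at 32MB: each 128MB block is carved into four splits.
        System.out.println(computeSplitSize(blockSize, 1L, 32L << 20));
    }
}

So the number of blocks per split is neither random nor hard-wired: with
the defaults one split corresponds to one block, and the bounds let a split
span several blocks or a fraction of one.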

Mark Charts
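
P.S. The record-boundary behavior described in the quoted passage can be
sketched in a few lines. This is not Hadoop's LineRecordReader, just a
self-contained illustration (names are mine) of the convention it is
usually described as following: each record belongs to the split that
contains its first byte, so a reader skips a partial leading record and may
read past the end of its byte range to finish its last one.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class SplitReaderSketch {

    // Reads the line-oriented records owned by the logical split [start, end).
    static List<String> readSplit(RandomAccessFile file, long start, long end)
            throws IOException {
        List<String> records = new ArrayList<>();
        if (start == 0) {
            file.seek(0);
        } else {
            // Back up one byte and discard through the first newline. If a
            // record begins exactly at 'start', only the preceding newline
            // is consumed and the reader lands right on that record.
            file.seek(start - 1);
            file.readLine();
        }
        // Read every record whose first byte lies inside [start, end);
        // the final readLine() may legitimately run past 'end'.
        while (file.getFilePointer() < end) {
            String line = file.readLine();
            if (line == null) {
                break; // end of file
            }
            records.add(line);
        }
        return records;
    }
}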

On Wednesday, December 17, 2014 10:32 AM, Dieter De Witte <drdwitte@gmail.com> wrote:

Hi,

Check this post:
http://stackoverflow.com/questions/17727468/hadoop-input-split-size-vs-block-size

Regards, D

2014-12-17 15:16 GMT+01:00 Todd <bit1129@163.com>:

Hi Hadoopers,

I have a question: how many blocks does one input split have? Is the
number random, can it be configured, or is it fixed (i.e., it can't be
changed)?

Thanks!
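
For the configurability part of the question: in the Hadoop 2 MapReduce
API the bounds that drive the split size can be set per job. A hedged
example follows; the helper methods and property names are the ones I
believe FileInputFormat exposes, so verify them against your Hadoop
version's documentation.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitConfigExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");

        // Lower and upper bounds fed into the split-size computation.
        FileInputFormat.setMinInputSplitSize(job, 256L << 20); // 256MB
        FileInputFormat.setMaxInputSplitSize(job, 512L << 20); // 512MB

        // Equivalently, the underlying properties can be set directly
        // (or passed on the command line with -D):
        // job.getConfiguration().setLong(
        //     "mapreduce.input.fileinputformat.split.minsize", 256L << 20);
        // job.getConfiguration().setLong(
        //     "mapreduce.input.fileinputformat.split.maxsize", 512L << 20);
    }
}

With the defaults left alone, one split covers one block; the count only
changes when these bounds (or the file's block size) change.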