hadoop-user mailing list archives

From mark charts <mcha...@yahoo.com>
Subject Re: How many blocks does one input split have?
Date Wed, 17 Dec 2014 16:15:23 GMT

"The way HDFS has been set up, it breaks down very large files into large blocks (for example, measuring 128MB), and stores three copies of these blocks on different nodes in the cluster. HDFS has no awareness of the content of these files.

In YARN, when a MapReduce job is started, the Resource Manager (the cluster resource management and job scheduling facility) creates an Application Master daemon to look after the lifecycle of the job. (In Hadoop 1, the JobTracker monitored individual jobs as well as handling job scheduling and cluster resource management.) One of the first things the Application Master does is determine which file blocks are needed for processing. The Application Master requests details from the NameNode on where the replicas of the needed data blocks are stored. Using the location data for the file blocks, the Application Master makes requests to the Resource Manager to have map tasks process specific blocks on the slave nodes where they're stored. The key to efficient MapReduce processing is that, wherever possible, data is processed locally, on the slave node where it's stored.

Before looking at how the data blocks are processed, you need to look more closely at how Hadoop stores data. In Hadoop, files are composed of individual records, which are ultimately processed one-by-one by mapper tasks. For example, the sample data set we use in this book contains information about completed flights within the United States between 1987 and 2008. We have one large file for each year, and within every file, each individual line represents a single flight. In other words, one line represents one record. Now, remember that the block size for the Hadoop cluster is 64MB, which means that the flight data files are broken into chunks of exactly 64MB.

Do you see the problem? If each map task processes all records in a specific data block, what happens to those records that span block boundaries? File blocks are exactly 64MB (or whatever you set the block size to be), and because HDFS has no conception of what's inside the file blocks, it can't gauge when a record might spill over into another block. To solve this problem, Hadoop uses a logical representation of the data stored in file blocks, known as input splits. When a MapReduce job client calculates the input splits, it figures out where the first whole record in a block begins and where the last record in the block ends. In cases where the last record in a block is incomplete, the input split includes location information for the next block and the byte offset of the data needed to complete the record.

You can configure the Application Master daemon (or JobTracker, if you're in Hadoop 1) to calculate the input splits instead of the job client, which would be faster for jobs processing a large number of data blocks.

MapReduce data processing is driven by this concept of input splits. The number of input splits that are calculated for a specific application determines the number of mapper tasks. Each of these mapper tasks is assigned, where possible, to a slave node where the input split is stored. The Resource Manager (or JobTracker, if you're in Hadoop 1) does its best to ensure that input splits are processed locally."
Courtesy of Dirk deRoos, Paul C. Zikopoulos, Bruce Brown, Rafael Coss, and Roman B. Melnyk
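To answer the original question directly: with the standard FileInputFormat, a split covers one block by default, but the relationship is configurable. The effective split size is clamped between a configurable minimum and maximum, with the HDFS block size as the default. The sketch below mirrors that clamping rule in plain Java (the real logic lives in Hadoop's FileInputFormat; the property names in the comments are the Hadoop 2 names):

```java
public class SplitSizeSketch {
    // Sketch of the clamping rule FileInputFormat uses:
    //   effective split size = max(minSize, min(maxSize, blockSize))
    // minSize/maxSize come from mapreduce.input.fileinputformat.split.minsize
    // and mapreduce.input.fileinputformat.split.maxsize, or from
    // FileInputFormat.setMinInputSplitSize()/setMaxInputSplitSize() on the Job.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long blockSize = 64 * mb;  // the 64MB example block size from the excerpt

        // Defaults (minSize = 1, maxSize = Long.MAX_VALUE): one split per block.
        System.out.println(computeSplitSize(blockSize, 1, Long.MAX_VALUE) / mb);        // 64

        // Raise the minimum to 256MB: one split now spans several blocks.
        System.out.println(computeSplitSize(blockSize, 256 * mb, Long.MAX_VALUE) / mb); // 256

        // Cap the maximum at 32MB: one block now yields several splits.
        System.out.println(computeSplitSize(blockSize, 1, 32 * mb) / mb);               // 32
    }
}
```

So the count is neither random nor fixed: untouched, one split maps to one block, and the min/max settings let a split span multiple blocks or a block produce multiple splits.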

Mark Charts


     On Wednesday, December 17, 2014 10:32 AM, Dieter De Witte <drdwitte@gmail.com> wrote:


Check this post: http://stackoverflow.com/questions/17727468/hadoop-input-split-size-vs-block-size

Regards, D

2014-12-17 15:16 GMT+01:00 Todd <bit1129@163.com>:
Hi Hadoopers,

I have a question: how many blocks does one input split have? Is the number random, configurable, or fixed (can't be changed)?
