hadoop-mapreduce-user mailing list archives

From "bit1129@163.com" <bit1...@163.com>
Subject Re: Re: How many blocks does one input split have?
Date Thu, 18 Dec 2014 03:54:59 GMT
Sure, thanks Mark. That means a completed mapper task is not reused to work on the pending
input splits.



bit1129@163.com
 
From: daemeon reiydelle
Date: 2014-12-18 11:11
To: user
CC: mark charts
Subject: Re: Re: How many blocks does one input split have?
There would be thousands of tasks, but not all fired off at the same time. The number of parallel
tasks is configurable but typically 1 per data node core.
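A rough sketch, assuming the stock Hadoop 2.x/YARN property names and their shipped defaults (values here are illustrative fallbacks, not from this thread), of how that per-node limit falls out of the container settings:

    import org.apache.hadoop.conf.Configuration;

    public class PerNodeMapParallelism {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Resources a NodeManager advertises (defaults used as fallbacks).
        int nodeVcores = conf.getInt("yarn.nodemanager.resource.cpu-vcores", 8);
        long nodeMemMb = conf.getLong("yarn.nodemanager.resource.memory-mb", 8192);
        // Resources each map container requests.
        int mapVcores = conf.getInt("mapreduce.map.cpu.vcores", 1);
        long mapMemMb = conf.getLong("mapreduce.map.memory.mb", 1024);
        long byCpu = nodeVcores / mapVcores;
        long byMem = nodeMemMb / mapMemMb;
        System.out.println("~concurrent map tasks per node: " + Math.min(byCpu, byMem));
      }
    }

With the defaults this works out to roughly one map task per core, as noted above.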


.......

On Wed, Dec 17, 2014 at 6:31 PM, bit1129@163.com <bit1129@163.com> wrote:
Thanks Mark and Dieter for the reply.

Actually, I have another question in mind. What's the relationship between an input split and
a mapper task? Is it a one-to-one relation, or can a mapper task handle more than one input split?

If a mapper task can only handle one input split, and there are many input splits (say, the
original file is 1TB or larger, so there may be thousands of input splits), then thousands
of mapper tasks would be created.
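For a rough sense of the numbers (illustrative arithmetic only, assuming a 128MB block size and one split per block):

    public class SplitCount {
      public static void main(String[] args) {
        long inputBytes = 1L << 40;       // 1 TB of input
        long splitBytes = 128L << 20;     // 128 MB per split (one block)
        long numSplits = (inputBytes + splitBytes - 1) / splitBytes;
        System.out.println(numSplits);    // 8192 splits -> roughly 8192 map tasks
      }
    }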



bit1129@163.com
 
From: mark charts
Date: 2014-12-18 00:15
To: user@hadoop.apache.org
Subject: Re: How many blocks does one input split have?
Hello.


FYI.

"The way HDFS has been set up, it breaks down very large files into large blocks
(for example, measuring 128MB), and stores three copies of these blocks on
different nodes in the cluster. HDFS has no awareness of the content of these
files.
 
In YARN, when a MapReduce job is started, the Resource Manager (the
cluster resource management and job scheduling facility) creates an
Application Master daemon to look after the lifecycle of the job. (In Hadoop 1,
the JobTracker monitored individual jobs as well as handling job scheduling
and cluster resource management.) One of the first things the Application Master
does is determine which file blocks are needed for processing. The Application 
Master requests details from the NameNode on where the replicas of the needed data blocks
are stored. Using the location data for the file blocks, the Application 
Master makes requests to the Resource Manager to have map tasks process specific 
blocks on the slave nodes where they’re stored.
The key to efficient MapReduce processing is that, wherever possible, data is
processed locally ― on the slave node where it’s stored.
Before looking at how the data blocks are processed, you need to look more
closely at how Hadoop stores data. In Hadoop, files are composed of individual
records, which are ultimately processed one-by-one by mapper tasks. For
example, the sample data set we use in this book contains information about
completed flights within the United States between 1987 and 2008. We have one
large file for each year, and within every file, each individual line represents a
single flight. In other words, one line represents one record. Now, remember
that the block size for the Hadoop cluster is 64MB, which means that the flight
data files are broken into chunks of exactly 64MB.

Do you see the problem? If each map task processes all records in a specific
data block, what happens to those records that span block boundaries?
File blocks are exactly 64MB (or whatever you set the block size to be), and
because HDFS has no conception of what’s inside the file blocks, it can’t gauge
when a record might spill over into another block. To solve this problem,
Hadoop uses a logical representation of the data stored in file blocks, known as
input splits. When a MapReduce job client calculates the input splits, it figures
out where the first whole record in a block begins and where the last record
in the block ends. In cases where the last record in a block is incomplete, the
input split includes location information for the next block and the byte offset
of the data needed to complete the record. 
You can configure the Application Master daemon (or JobTracker, if you’re in
Hadoop 1) to calculate the input splits instead of the job client, which would
be faster for jobs processing a large number of data blocks.
MapReduce data processing is driven by this concept of input splits. The
number of input splits that are calculated for a specific application determines
the number of mapper tasks. Each of these mapper tasks is assigned, where
possible, to a slave node where the input split is stored. The Resource Manager
(or JobTracker, if you’re in Hadoop 1) does its best to ensure that input splits
are processed locally." [sic]

Courtesy of Dirk deRoos, Paul C. Zikopoulos, Bruce Brown,
Rafael Coss, and Roman B. Melnyk
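The block layout and replica placement the passage describes can be inspected from a client with the standard FileSystem.getFileBlockLocations() call. A minimal sketch (the HDFS path below is made up for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLayout {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical input file; replace with a real HDFS path.
        FileStatus status = fs.getFileStatus(new Path("/data/flights/1987.csv"));
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
          // Each block reports its offset, length and the datanodes holding replicas;
          // this is the locality information the Application Master works from.
          System.out.println(b.getOffset() + " len=" + b.getLength()
              + " hosts=" + String.join(",", b.getHosts()));
        }
      }
    }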



Mark Charts
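The "number of input splits determines the number of mapper tasks" point can also be checked directly: FileInputFormat's getSplits() is what the job client runs, and one map task is scheduled per split it returns. A minimal sketch (the input path is hypothetical):

    import java.util.List;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class CountSplits {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        FileInputFormat.addInputPath(job, new Path("/data/flights"));  // hypothetical path
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        // One map task will be scheduled per split.
        System.out.println("splits = mappers = " + splits.size());
      }
    }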




On Wednesday, December 17, 2014 10:32 AM, Dieter De Witte <drdwitte@gmail.com> wrote:


Hi,

Check this post: http://stackoverflow.com/questions/17727468/hadoop-input-split-size-vs-block-size

Regards, D


2014-12-17 15:16 GMT+01:00 Todd <bit1129@163.com>:
Hi Hadoopers,

I have a question: how many blocks does one input split have? Is it random, can the number
be configured, or is it fixed (can't be changed)?
Thanks!
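For what it's worth, with FileInputFormat the answer is: configurable. By default one split covers exactly one block, but the min/max split size settings can make a split span several blocks, or cut one block into several splits. A minimal sketch using the standard setters, assuming a 128MB block size (the 256MB/64MB values are only examples):

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeSketch {
      public static void main(String[] args) throws Exception {
        // FileInputFormat computes splitSize = max(minSize, min(maxSize, blockSize)),
        // so with the defaults one input split maps to exactly one block.
        Job job = Job.getInstance();
        // Ask for splits of at least 256 MB: each split then spans two 128 MB blocks.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
        // Or cap them at 64 MB: each 128 MB block then yields two splits.
        // FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
      }
    }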

