hadoop-common-user mailing list archives

From Dieter De Witte <drdwi...@gmail.com>
Subject Re: How many blocks does one input split have?
Date Wed, 17 Dec 2014 19:56:55 GMT
Well-formulated answer, thanks for sharing!

2014-12-17 17:15 GMT+01:00 mark charts <mcharts@yahoo.com>:
>
> Hello.
>
>
> FYI.
>
> "The way HDFS has been set up, it breaks down very large files into large
> blocks
> (for example, measuring 128MB), and stores three copies of these blocks on
> different nodes in the cluster. HDFS has no awareness of the content of
> these
> files.
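The fixed-size chunking described above is easy to picture with a short sketch. This is illustrative Python, not HDFS code; the 128MB default matches the example in the text:

```python
def file_blocks(file_size: int, block_size: int = 128 * 1024 * 1024):
    """Byte ranges of the blocks a file is broken into: every block
    is exactly block_size bytes except possibly the last one."""
    return [(off, min(off + block_size, file_size))
            for off in range(0, file_size, block_size)]
```

Each of these block ranges would then be stored (three replicas each, per the text) on different nodes, with no awareness of the records inside.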
>
> In YARN, when a MapReduce job is started, the Resource Manager (the
> cluster resource management and job scheduling facility) creates an
> Application Master daemon to look after the lifecycle of the job. (In
> Hadoop 1,
> the JobTracker monitored individual jobs as well as handling job
> scheduling
> and cluster resource management.) One of the first things the Application
> Master
> does is determine which file blocks are needed for processing. The
> Application
> Master requests details from the NameNode on where the replicas of the
> needed data blocks are stored. Using the location data for the file blocks,
> the Application
> Master makes requests to the Resource Manager to have map tasks process
> specific
> blocks on the slave nodes where they’re stored.
> The key to efficient MapReduce processing is that, wherever possible, data
> is
> processed locally — on the slave node where it’s stored.
> Before looking at how the data blocks are processed, you need to look more
> closely at how Hadoop stores data. In Hadoop, files are composed of
> individual
> records, which are ultimately processed one-by-one by mapper tasks. For
> example, the sample data set we use in this book contains information about
> completed flights within the United States between 1987 and 2008. We have
> one
> large file for each year, and within every file, each individual line
> represents a
> single flight. In other words, one line represents one record. Now,
> remember
> that the block size for the Hadoop cluster is 64MB, which means that the
> flight
> data files are broken into chunks of exactly 64MB.
>
> Do you see the problem? If each map task processes all records in a
> specific
> data block, what happens to those records that span block boundaries?
> File blocks are exactly 64MB (or whatever you set the block size to be),
> and
> because HDFS has no conception of what’s inside the file blocks, it can’t
> gauge
> when a record might spill over into another block. To solve this problem,
> Hadoop uses a logical representation of the data stored in file blocks,
> known as
> input splits. When a MapReduce job client calculates the input splits, it
> figures
> out where the first whole record in a block begins and where the last
> record
> in the block ends. In cases where the last record in a block is
> incomplete, the
> input split includes location information for the next block and the byte
> offset
> of the data needed to complete the record.
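The boundary handling just described can be sketched in a few lines. This is an illustrative Python sketch, not Hadoop's actual LineRecordReader code: each reader skips the partial record at the start of its split and reads one record past its end, so a record that spans a block boundary is completed by exactly one reader rather than dropped or processed twice.

```python
def read_split_records(data: bytes, start: int, end: int):
    """Yield the newline-delimited records handled by the reader
    for the split [start, end) of data."""
    pos = start
    if start != 0:
        # Skip through the first newline at or after `start`: the
        # previous split's reader finishes the record that spills
        # over into this split.
        first_nl = data.find(b"\n", start)
        if first_nl == -1:
            return  # no record boundary in this split
        pos = first_nl + 1
    # Allow reading one record past `end`, so a boundary-spanning
    # record is completed here.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:            # final record lacks a trailing newline
            yield data[pos:]
            return
        yield data[pos:nl]
        pos = nl + 1
```

Running the readers for adjacent splits over the same byte range shows each record appearing exactly once, even when the split boundary falls in the middle of a record.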
> You can configure the Application Master daemon (or JobTracker, if you’re
> in
> Hadoop 1) to calculate the input splits instead of the job client, which
> would
> be faster for jobs processing a large number of data blocks.
> MapReduce data processing is driven by this concept of input splits. The
> number of input splits that are calculated for a specific application
> determines
> the number of mapper tasks. Each of these mapper tasks is assigned, where
> possible, to a slave node where the input split is stored. The Resource
> Manager
> (or JobTracker, if you’re in Hadoop 1) does its best to ensure that input
> splits
> are processed locally."
>
> Courtesy of Dirk deRoos, Paul C. Zikopoulos, Bruce Brown,
> Rafael Coss, and Roman B. Melnyk
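To tie this back to the original question: with the default FileInputFormat, a split normally covers exactly one block, but the split size is configurable, not random. Hadoop computes it as max(minSize, min(maxSize, blockSize)) (FileInputFormat.computeSplitSize); in Hadoop 2.x the min and max come from mapreduce.input.fileinputformat.split.minsize and .maxsize. Raising the minimum above the block size gives splits that span multiple blocks; lowering the maximum gives several splits per block. An illustrative Python sketch of the arithmetic (not Hadoop source, and it ignores Hadoop's 10% slack on the last split):

```python
def compute_split_size(block_size: int, min_size: int = 1,
                       max_size: int = 2**63 - 1) -> int:
    # Mirrors the formula in FileInputFormat.computeSplitSize:
    #   max(minSize, min(maxSize, blockSize))
    return max(min_size, min(max_size, block_size))

def num_splits(file_size: int, split_size: int) -> int:
    # One split (and hence one map task) per split_size bytes,
    # rounded up; the last split may be smaller.
    return -(-file_size // split_size)  # ceiling division
```

With the defaults, split size equals block size, so one split per block and one map task per split.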
>
>
>
> Mark Charts
>
>
>
>
>   On Wednesday, December 17, 2014 10:32 AM, Dieter De Witte <
> drdwitte@gmail.com> wrote:
>
>
> Hi,
>
> Check this post:
> http://stackoverflow.com/questions/17727468/hadoop-input-split-size-vs-block-size
>
> Regards, D
>
>
> 2014-12-17 15:16 GMT+01:00 Todd <bit1129@163.com>:
>
> Hi Hadoopers,
>
> I have a question: how many blocks does one input split have? Is the
> number random, configurable, or fixed (i.e., can't be changed)?
> Thanks!
>
>
>
>
