hadoop-user mailing list archives

From daemeon reiydelle <daeme...@gmail.com>
Subject Re: Re: How many blocks does one input split have?
Date Thu, 18 Dec 2014 03:11:48 GMT
There would be thousands of tasks, but they would not all be fired off at the
same time. The number of parallel tasks is configurable, but is typically one
per data node.
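
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes one map task per input split, a split size equal to a 128MB block, and a purely hypothetical cluster that runs 100 map tasks at once; the helper names are made up for illustration, and any end-of-file slack the real split calculation may apply is ignored.

```python
MB = 1024 ** 2
TB = 1024 ** 4


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def num_splits(file_size: int, split_size: int) -> int:
    """Approximate number of input splits (one map task each) for one file."""
    return ceil_div(file_size, split_size)


def num_waves(tasks: int, parallel_slots: int) -> int:
    """Rounds of execution needed when only parallel_slots tasks run at once."""
    return ceil_div(tasks, parallel_slots)


# Hypothetical inputs: a 1TB file, 128MB splits, 100 concurrent map tasks.
splits = num_splits(1 * TB, 128 * MB)
waves = num_waves(splits, 100)
print(splits, waves)  # 8192 82
```

So a 1TB file does yield thousands of map tasks, but under these assumptions they would run in roughly 82 rounds rather than all at once.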


On Wed, Dec 17, 2014 at 6:31 PM, bit1129@163.com <bit1129@163.com> wrote:
> Thanks Mark and Dieter for the reply.
> Actually, I have another question in mind. What is the relationship between
> an input split and a mapper task? Is it a one-to-one relationship, or can a
> mapper task handle more than one input split?
> If a mapper task can only handle one input split, then when there are many
> input splits (say the original file is 1TB or larger, so there may be
> thousands of input splits), thousands of mapper tasks would be created.
> ------------------------------
> bit1129@163.com
> *From:* mark charts <mcharts@yahoo.com>
> *Date:* 2014-12-18 00:15
> *To:* user@hadoop.apache.org
> *Subject:* Re: How many blocks does one input split have?
> Hello.
> FYI.
> "The way HDFS has been set up, it breaks down very large files into large
> blocks (for example, measuring 128MB), and stores three copies of these
> blocks on different nodes in the cluster. HDFS has no awareness of the
> content of these files.
> In YARN, when a MapReduce job is started, the Resource Manager (the
> cluster resource management and job scheduling facility) creates an
> Application Master daemon to look after the lifecycle of the job. (In
> Hadoop 1, the JobTracker monitored individual jobs as well as handling job
> scheduling and cluster resource management.) One of the first things the
> Application Master does is determine which file blocks are needed for
> processing. The Application Master requests details from the NameNode on
> where the replicas of the needed data blocks are stored. Using the location
> data for the file blocks, the Application Master makes requests to the
> Resource Manager to have map tasks process specific blocks on the slave
> nodes where they’re stored.
> The key to efficient MapReduce processing is that, wherever possible, data
> is
> processed locally — on the slave node where it’s stored.
> Before looking at how the data blocks are processed, you need to look more
> closely at how Hadoop stores data. In Hadoop, files are composed of
> individual records, which are ultimately processed one-by-one by mapper
> tasks. For example, the sample data set we use in this book contains
> information about completed flights within the United States between 1987
> and 2008. We have one large file for each year, and within every file, each
> individual line represents a single flight. In other words, one line
> represents one record. Now, remember that the block size for the Hadoop
> cluster is 64MB, which means that the flight data files are broken into
> chunks of exactly 64MB.
> Do you see the problem? If each map task processes all records in a
> specific data block, what happens to those records that span block
> boundaries?
> File blocks are exactly 64MB (or whatever you set the block size to be),
> and because HDFS has no conception of what’s inside the file blocks, it
> can’t gauge when a record might spill over into another block. To solve
> this problem, Hadoop uses a logical representation of the data stored in
> file blocks, known as input splits. When a MapReduce job client calculates
> the input splits, it figures out where the first whole record in a block
> begins and where the last record in the block ends. In cases where the last
> record in a block is incomplete, the input split includes location
> information for the next block and the byte offset of the data needed to
> complete the record.
> You can configure the Application Master daemon (or JobTracker, if you’re
> in Hadoop 1) to calculate the input splits instead of the job client, which
> would be faster for jobs processing a large number of data blocks.
> MapReduce data processing is driven by this concept of input splits. The
> number of input splits that are calculated for a specific application
> determines the number of mapper tasks. Each of these mapper tasks is
> assigned, where possible, to a slave node where the input split is stored.
> The Resource Manager (or JobTracker, if you’re in Hadoop 1) does its best
> to ensure that input splits are processed locally." *sic*
> Courtesy of Dirk deRoos, Paul C. Zikopoulos, Bruce Brown,
> Rafael Coss, and Roman B. Melnyk
> Mark Charts
>   On Wednesday, December 17, 2014 10:32 AM, Dieter De Witte <drdwitte@gmail.com> wrote:
> Hi,
> Check this post:
> http://stackoverflow.com/questions/17727468/hadoop-input-split-size-vs-block-size
> Regards, D
> 2014-12-17 15:16 GMT+01:00 Todd <bit1129@163.com>:
> Hi Hadoopers,
> I have a question: how many blocks does one input split have? Is the
> number random, configurable, or fixed (cannot be changed)?
> Thanks!
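
On the original question of how many blocks one split covers: to my understanding, file-based input formats in Hadoop derive the split size by clamping the block size between a configured minimum and maximum (the `mapreduce.input.fileinputformat.split.minsize` and `mapreduce.input.fileinputformat.split.maxsize` properties). The Python sketch below mirrors that rule. The function names and the defaults of 1 byte and Java's `Long.MAX_VALUE` are assumptions made for illustration, so check them against your Hadoop version.

```python
MB = 1024 ** 2
LONG_MAX = 2 ** 63 - 1  # stands in for Java's Long.MAX_VALUE (assumed default max)


def compute_split_size(block_size, min_size=1, max_size=LONG_MAX):
    """Clamp the block size between the configured minimum and maximum."""
    return max(min_size, min(max_size, block_size))


def blocks_per_split(block_size, min_size=1, max_size=LONG_MAX):
    """Split size divided by block size; below 1 means several splits per block."""
    return compute_split_size(block_size, min_size, max_size) / block_size


default_ratio = blocks_per_split(128 * MB)                  # one block per split
raised_min = blocks_per_split(128 * MB, min_size=256 * MB)  # two blocks per split
lowered_max = blocks_per_split(128 * MB, max_size=64 * MB)  # half a block per split
print(default_ratio, raised_min, lowered_max)  # 1.0 2.0 0.5
```

Under these assumptions the number is neither random nor fixed: with default settings a split lines up with one block, and the minimum and maximum split sizes let you make it larger or smaller.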
