hadoop-user mailing list archives

From Dieter De Witte <drdwi...@gmail.com>
Subject Re: Re: How many blocks does one input split have?
Date Thu, 18 Dec 2014 15:02:29 GMT
1 map task = 1 input split, but a Mapper class can handle multiple tasks,
albeit one at a time.
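
To make the "one map task per input split" rule concrete, here is a minimal sketch of the arithmetic. It assumes the split size is chosen as max(min_size, min(max_size, block_size)) and that a trailing remainder of up to 10% is folded into the last split, which is how I understand Hadoop's file-based input formats to behave. The function names and defaults are illustrative only, not Hadoop's API.

```python
MB = 1024 * 1024


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size under the assumed rule max(min_size, min(max_size, block_size))."""
    return max(min_size, min(max_size, block_size))


def count_splits(file_size, split_size, slop=1.1):
    """Count splits for one file; a tail of up to 10% joins the last split (assumed)."""
    splits = 0
    remaining = file_size
    while remaining / split_size > slop:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1
    return splits


if __name__ == "__main__":
    block = 128 * MB
    split = compute_split_size(block)
    one_tb = 1024 * 1024 * MB
    print("blocks per split:", split // block)
    print("splits (= map tasks) for a 1TB file:", count_splits(one_tb, split))
```

With the default limits, one split covers one block, so a 1TB file at 128MB blocks gives thousands of map tasks. Raising the minimum split size above the block size makes each split span several blocks and lowers the task count.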

2014-12-18 4:54 GMT+01:00 bit1129@163.com <bit1129@163.com>:
> Sure, thanks Mark. That means a completed mapper task is not reused to
> work on the pending input splits.
> ------------------------------
> bit1129@163.com
> *From:* daemeon reiydelle <daemeonr@gmail.com>
> *Date:* 2014-12-18 11:11
> *To:* user <user@hadoop.apache.org>
> *CC:* mark charts <mcharts@yahoo.com>
> *Subject:* Re: Re: How many blocks does one input split have?
> There would be thousands of tasks, but not all fired off at the same time.
> The number of parallel tasks is configurable but typically 1 per data node
> core.
> *.......*
> On Wed, Dec 17, 2014 at 6:31 PM, bit1129@163.com <bit1129@163.com> wrote:
>> Thanks Mark and Dieter for the reply.
>> Actually, I have another question in mind. What's the relationship between
>> an input split and a mapper task? Is it a one-to-one relation, or can a
>> mapper task handle more than one input split?
>> If a mapper task can only handle one input split, then with many input
>> splits (say the original file is 1TB or larger, so there may be
>> thousands of input splits), thousands of mapper tasks would be created.
>> ------------------------------
>> bit1129@163.com
>> *From:* mark charts <mcharts@yahoo.com>
>> *Date:* 2014-12-18 00:15
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: How many blocks does one input split have?
>> Hello.
>> FYI.
>> "The way HDFS has been set up, it breaks down very large files into large
>> blocks (for example, measuring 128MB), and stores three copies of these
>> blocks on different nodes in the cluster. HDFS has no awareness of the
>> content of these files.
>> In YARN, when a MapReduce job is started, the Resource Manager (the
>> cluster resource management and job scheduling facility) creates an
>> Application Master daemon to look after the lifecycle of the job. (In
>> Hadoop 1, the JobTracker monitored individual jobs as well as handling job
>> scheduling and cluster resource management.) One of the first things the
>> Application Master does is determine which file blocks are needed for
>> processing. The Application Master requests details from the NameNode on
>> where the replicas of the needed data blocks are stored. Using the location
>> data for the file blocks, the Application Master makes requests to the
>> Resource Manager to have map tasks process specific blocks on the slave
>> nodes where they’re stored.
>> The key to efficient MapReduce processing is that, wherever possible,
>> data is processed locally, on the slave node where it’s stored.
>> Before looking at how the data blocks are processed, you need to look more
>> closely at how Hadoop stores data. In Hadoop, files are composed of
>> individual records, which are ultimately processed one-by-one by mapper
>> tasks. For example, the sample data set we use in this book contains
>> information about completed flights within the United States between 1987
>> and 2008. We have one large file for each year, and within every file, each
>> individual line represents a single flight. In other words, one line
>> represents one record. Now, remember that the block size for the Hadoop
>> cluster is 64MB, which means that the flight data files are broken into
>> chunks of exactly 64MB.
>> Do you see the problem? If each map task processes all records in a
>> specific data block, what happens to those records that span block
>> boundaries? File blocks are exactly 64MB (or whatever you set the block
>> size to be), and because HDFS has no conception of what’s inside the file
>> blocks, it can’t gauge when a record might spill over into another block.
>> To solve this problem, Hadoop uses a logical representation of the data
>> stored in file blocks, known as input splits. When a MapReduce job client
>> calculates the input splits, it figures out where the first whole record in
>> a block begins and where the last record in the block ends. In cases where
>> the last record in a block is incomplete, the input split includes location
>> information for the next block and the byte offset of the data needed to
>> complete the record.
>> You can configure the Application Master daemon (or JobTracker, if you’re
>> in Hadoop 1) to calculate the input splits instead of the job client, which
>> would be faster for jobs processing a large number of data blocks.
>> MapReduce data processing is driven by this concept of input splits. The
>> number of input splits that are calculated for a specific application
>> determines the number of mapper tasks. Each of these mapper tasks is
>> assigned, where possible, to a slave node where the input split is stored.
>> The Resource Manager (or JobTracker, if you’re in Hadoop 1) does its best
>> to ensure that input splits are processed locally." *sic*
>> Courtesy of Dirk deRoos, Paul C. Zikopoulos, Bruce Brown,
>> Rafael Coss, and Roman B. Melnyk
>> Mark Charts
>> On Wednesday, December 17, 2014 10:32 AM, Dieter De Witte
>> <drdwitte@gmail.com> wrote:
>> Hi,
>> Check this post:
>> http://stackoverflow.com/questions/17727468/hadoop-input-split-size-vs-block-size
>> Regards, D
>> 2014-12-17 15:16 GMT+01:00 Todd <bit1129@163.com>:
>> Hi Hadoopers,
>> I have a question: how many blocks does one input split have? Is it
>> random, or can the number be configured, or is it fixed (can't be changed)?
>> Thanks!
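
The passage quoted in this thread asks what happens to records that span block boundaries. The sketch below shows one way a line-oriented reader can make every record land in exactly one split. A reader whose split does not start at byte 0 discards its first (possibly partial) line, and every reader keeps going past the end of its split until the line it started is complete. As far as I know this mirrors the approach of Hadoop's text record reader, but the code is a simplified in-memory model and every name in it is hypothetical.

```python
def make_splits(total_len, split_size):
    """Cut [0, total_len) into (start, length) byte ranges; purely logical."""
    return [(s, min(split_size, total_len - s))
            for s in range(0, total_len, split_size)]


def read_split(data, start, length):
    """Return the newline-terminated records owned by one split."""
    end = start + length
    pos = start
    if start != 0:
        # The previous split's reader finishes the line in progress here,
        # so skip forward to the byte after the next newline.
        nl = data.find(b"\n", pos)
        if nl == -1:
            return []
        pos = nl + 1
    records = []
    # A line that starts at or before `end` belongs to this split,
    # even if its bytes run past the split (or block) boundary.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        line_end = len(data) if nl == -1 else nl
        records.append(data[pos:line_end].decode())
        pos = line_end + 1
    return records


if __name__ == "__main__":
    data = b"flight-1\nflight-22\nflight-333\n"
    for start, length in make_splits(len(data), 12):
        print((start, length), read_split(data, start, length))
```

Because each split is only a byte range plus a reading rule, the records do not need to line up with block boundaries. The cost is that a reader may fetch a few bytes from a neighbouring block to finish its last record.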
