From: mark charts <mcharts@yahoo.com>
Date: Wed, 17 Dec 2014 16:15:23 +0000 (UTC)
"user@hadoop.apache.org" Message-ID: <857210796.179464.1418832923662.JavaMail.yahoo@jws10035.mail.ne1.yahoo.com> In-Reply-To: References: Subject: Re: How many blocks does one input split have? MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="----=_Part_179463_147626630.1418832923650" X-Virus-Checked: Checked by ClamAV on apache.org ------=_Part_179463_147626630.1418832923650 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Hello. FYI. "The way HDFS has been set up, it breaks down very large files into large b= locks(for example, measuring 128MB), and stores three copies of these block= s ondifferent nodes in the cluster. HDFS has no awareness of the content of= thesefiles.=C2=A0In YARN, when a MapReduce job is started, the Resource Ma= nager (thecluster resource management and job scheduling facility) creates = anApplication Master daemon to look after the lifecycle of the job. (In Had= oop 1,the JobTracker monitored individual jobs as well as handling job =C2= =ADschedulingand cluster resource management. One of the first things the A= pplication Masterdoes is determine which file blocks are needed for process= ing. The Application=C2=A0Master requests details from the NameNode on wher= e the replicas of the needed data blocks are stored. Using the location dat= a for the file blocks, the Application=C2=A0Master makes requests to the Re= source Manager to have map tasks process specific=C2=A0blocks on the slave = nodes where they=E2=80=99re stored. The key to efficient MapReduce processi= ng is that, wherever possible, data isprocessed locally =E2=80=94 on the sl= ave node where it=E2=80=99s stored.Before looking at how the data blocks ar= e processed, you need to look moreclosely at how Hadoop stores data. In Had= oop, files are composed of individualrecords, which are ultimately processe= d one-by-one by mapper tasks. Forexample, the sample data set we use in thi= s book contains information aboutcompleted flights within the United States= between 1987 and 2008. We have onelarge file for each year, and within eve= ry file, each individual line represents asingle flight. In other words, on= e line represents one record. Now, rememberthat the block size for the Hado= op cluster is 64MB, which means that the lightdata files are broken into ch= unks of exactly 64MB. Do you see the problem? If each map task processes all records in a specifi= cdata block, what happens to those records that span block boundaries?File = blocks are exactly 64MB (or whatever you set the block size to be), andbeca= use HDFS has no conception of what=E2=80=99s inside the file blocks, it can= =E2=80=99t gaugewhen a record might spill over into another block. To solve= this problem,Hadoop uses a logical representation of the data stored in fi= le blocks, known asinput splits. When a MapReduce job client calculates the= input splits, it figuresout where the first whole record in a block begins= and where the last recordin the block ends. In cases where the last record= in a block is incomplete, theinput split includes location information for= the next block and the byte offsetof the data needed to complete the recor= d.=C2=A0 You can configure the Application Master daemon (or JobTracker, if= you=E2=80=99re inHadoop 1) to calculate the input splits instead of the jo= b client, which wouldbe faster for jobs processing a large number of data b= locks.MapReduce data processing is driven by this concept of input splits. 
= Thenumber of input splits that are calculated for a specific application de= terminesthe number of mapper tasks. Each of these mapper tasks is assigned,= wherepossible, to a slave node where the input split is stored. The Resour= ce Manager(or JobTracker, if you=E2=80=99re in Hadoop 1) does its best to e= nsure that input splitsare processed locally." =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0sic Courtesy of=C2=A0Dirk deRoos, Paul C. Zikopoulos, Bruce Brown,Rafael Coss, = and Roman B. Melnyk Mark Charts =20 On Wednesday, December 17, 2014 10:32 AM, Dieter De Witte wrote: =20 Hi, Check this post: http://stackoverflow.com/questions/17727468/hadoop-input-s= plit-size-vs-block-size Regards, D 2014-12-17 15:16 GMT+01:00 Todd : Hi Hadoopers, I got a question about how many blocks does one input split have? It is ran= dom or the number can be configured or fixed(can't be changed)? Thanks! ------=_Part_179463_147626630.1418832923650 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable

Hello.

FYI.
"The way HDFS has been set up, it breaks down = very large files into large blocks
(for example, measuring 128M= B), and stores three copies of these blocks on
different nodes i= n the cluster. HDFS has no awareness of the content of these
fil= es.
 
In YARN, when a MapReduce job is started, th= e Resource Manager (the
cluster resource management and job sched= uling facility) creates an
Application Master daemon to look af= ter the lifecycle of the job. (In Hadoop 1,
the JobTracker monito= red individual jobs as well as handling job =C2=ADscheduling
and= cluster resource management. One of the first things the Application Maste= r
does is determine which file blocks are needed for processing. = The Application 
Master requests details from the NameNode o= n where the replicas of the needed data blocks are stored. Using the locati= on data for the file blocks, the Application 
Master makes r= equests to the Resource Manager to have map tasks process specific 
blocks on the slave nodes where they=E2=80=99re stored.

The key to efficient MapReduce processing is that, wherever possible, data
is processed locally — on the slave node where it's stored.

Before looking at how the data blocks are processed, you need to look more
closely at how Hadoop stores data. In Hadoop, files are composed of
individual records, which are ultimately processed one-by-one by mapper
tasks. For example, the sample data set we use in this book contains
information about completed flights within the United States between 1987
and 2008. We have one large file for each year, and within every file, each
individual line represents a single flight. In other words, one line
represents one record. Now, remember that the block size for the Hadoop
cluster is 64MB, which means that the flight data files are broken into
chunks of exactly 64MB.

Do you see the problem? If each map task processes all records in a
specific data block, what happens to those records that span block
boundaries? File blocks are exactly 64MB (or whatever you set the block
size to be), and because HDFS has no conception of what's inside the file
blocks, it can't gauge when a record might spill over into another block.
To solve this problem, Hadoop uses a logical representation of the data
stored in file blocks, known as input splits. When a MapReduce job client
calculates the input splits, it figures out where the first whole record in
a block begins and where the last record in the block ends. In cases where
the last record in a block is incomplete, the input split includes location
information for the next block and the byte offset of the data needed to
complete the record.

You can configure the Application Master daemon (or JobTracker, if you're
in Hadoop 1) to calculate the input splits instead of the job client, which
would be faster for jobs processing a large number of data blocks.
MapReduce data processing is driven by this concept of input splits. The
number of input splits that are calculated for a specific application
determines the number of mapper tasks. Each of these mapper tasks is
assigned, where possible, to a slave node where the input split is stored.
The Resource Manager (or JobTracker, if you're in Hadoop 1) does its best
to ensure that input splits are processed locally." [sic]

Courtesy of Dirk deRoos, Paul C. Zikopoulos, Bruce Brown, Rafael Coss, and
Roman B. Melnyk
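
To make the quoted arithmetic concrete: the split size is derived from the
block size and two configurable bounds. The sketch below mirrors, from
memory, the computeSplitSize() logic in Hadoop 2's FileInputFormat; treat
it as an illustration rather than the authoritative source (the class name
and the printed scenarios are mine).

// Believed rule: splitSize = max(minSize, min(maxSize, blockSize))
public class SplitSizeSketch {

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L << 20; // a 128MB HDFS block

        // Defaults (minSize = 1, maxSize = Long.MAX_VALUE): the split size
        // equals the block size, so one input split covers exactly one block.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));

        // Raise minSize to 256MB: each split now spans two 128MB blocks.
        System.out.println(computeSplitSize(blockSize, 256L << 20, Long.MAX_VALUE));

        // Cap maxSize at 32MB: each 128MB block is carved into four splits.
        System.out.println(computeSplitSize(blockSize, 1L, 32L << 20));
    }
}

So the number of blocks per split is neither random nor hard-wired: with
the defaults one split corresponds to one block, and the bounds let a split
span several blocks or a fraction of one.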

Mark Charts
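
P.S. The record-boundary behavior described in the quoted passage can be
sketched in a few lines. This is not Hadoop's LineRecordReader, just a
self-contained illustration (names are mine) of the convention it is
usually described as following: each record belongs to the split that
contains its first byte, so a reader skips a partial leading record and may
read past the end of its byte range to finish its last one.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class SplitReaderSketch {

    // Reads the line-oriented records owned by the logical split [start, end).
    static List<String> readSplit(RandomAccessFile file, long start, long end)
            throws IOException {
        List<String> records = new ArrayList<>();
        if (start == 0) {
            file.seek(0);
        } else {
            // Back up one byte and discard through the first newline. If a
            // record begins exactly at 'start', only the preceding newline
            // is consumed and the reader lands right on that record.
            file.seek(start - 1);
            file.readLine();
        }
        // Read every record whose first byte lies inside [start, end);
        // the final readLine() may legitimately run past 'end'.
        while (file.getFilePointer() < end) {
            String line = file.readLine();
            if (line == null) {
                break; // end of file
            }
            records.add(line);
        }
        return records;
    }
}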

On Wednesday, December 17, 2014 10:32 AM, Dieter De Witte <drdwitte@gmail.com> wrote:

Hi,

Check this post:
http://stackoverflow.com/questions/17727468/hadoop-input-split-size-vs-block-size

Regards, D

2014-12-17 15:16 GMT+01:00 Todd <bit1129@163.com>:

Hi Hadoopers,

I have a question: how many blocks does one input split have? Is the
number random, can it be configured, or is it fixed (i.e., it can't be
changed)?

Thanks!
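
For the configurability part of the question: in the Hadoop 2 MapReduce
API the bounds that drive the split size can be set per job. A hedged
example follows; the helper methods and property names are the ones I
believe FileInputFormat exposes, so verify them against your Hadoop
version's documentation.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitConfigExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");

        // Lower and upper bounds fed into the split-size computation.
        FileInputFormat.setMinInputSplitSize(job, 256L << 20); // 256MB
        FileInputFormat.setMaxInputSplitSize(job, 512L << 20); // 512MB

        // Equivalently, the underlying properties can be set directly
        // (or passed on the command line with -D):
        // job.getConfiguration().setLong(
        //     "mapreduce.input.fileinputformat.split.minsize", 256L << 20);
        // job.getConfiguration().setLong(
        //     "mapreduce.input.fileinputformat.split.maxsize", 512L << 20);
    }
}

With the defaults left alone, one split covers one block; the count only
changes when these bounds (or the file's block size) change.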