Date: Sat, 10 May 2014 22:10:21 +0000 (UTC)
From: "Hudson (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Reply-To: mapreduce-issues@hadoop.apache.org
Subject: [jira] [Commented] (MAPREDUCE-5402) DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13992790#comment-13992790 ]

Hudson commented on MAPREDUCE-5402:
-----------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk #1751 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1751/])
MAPREDUCE-5402. In DynamicInputFormat, change MAX_CHUNKS_TOLERABLE, MAX_CHUNKS_IDEAL, MIN_RECORDS_PER_CHUNK and SPLIT_RATIO to be configurable. Contributed by Tsuyoshi OZAWA (szetszwo: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1592703)
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpConstants.java
* /hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/lib/DynamicInputFormat.java
* /hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/lib/TestDynamicInputFormat.java


> DynamicInputFormat should allow overriding of MAX_CHUNKS_TOLERABLE
> ------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5402
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5402
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: distcp, mrv2
>            Reporter: David Rosenstrauch
>            Assignee: Tsuyoshi OZAWA
>             Fix For: 2.5.0
>
>         Attachments: MAPREDUCE-5402.1.patch, MAPREDUCE-5402.2.patch, MAPREDUCE-5402.3.patch, MAPREDUCE-5402.4-2.patch, MAPREDUCE-5402.4.patch, MAPREDUCE-5402.5.patch
>
>
> In MAPREDUCE-2765, which provided the design spec for DistCpV2, the author describes the implementation of DynamicInputFormat, with one of the main motivations cited being to reduce the chance of long-tails where a few leftover mappers run much longer than the rest.
> However, I today ran into a situation where I experienced exactly such a long tail using DistCpV2 and DynamicInputFormat.
> And when I tried to alleviate the problem by overriding the number of mappers and the split ratio used by the DynamicInputFormat, I was prevented from doing so by the hard-coded limit set in the code by the MAX_CHUNKS_TOLERABLE constant. (Currently set to 400.)
> This constant is actually set quite low for production use. (See a description of my use case below.) And although MAPREDUCE-2765 states that this is an "overridable maximum", when reading through the code there does not actually appear to be any mechanism available to override it.
> This should be changed. It should be possible to expand the maximum # of chunks beyond this arbitrary limit.
> For example, here is the situation I ran into today:
> I ran a distcpv2 job on a cluster with 8 machines containing 128 map slots. The job consisted of copying ~2800 files from HDFS to Amazon S3. I overrode the number of mappers for the job from the default of 20 to 128, so as to more properly parallelize the copy across the cluster. The number of chunk files created was calculated as 241, and mapred.num.entries.per.chunk was calculated as 12.
> As the job ran on, it reached a point where there were only 4 remaining map tasks, which had each been running for over 2 hours. The reason for this was that each of the 12 files that those mappers were copying was quite large (several hundred megabytes in size) and took ~20 minutes to copy. However, during this time, all the other 124 mappers sat idle.
> In theory I should be able to alleviate this problem with DynamicInputFormat. If I were able to, say, quadruple the number of chunk files created, that would have made each chunk contain only 3 files, and these large files would have gotten distributed better around the cluster and copied in parallel.
> However, when I tried to do that - by overriding mapred.listing.split.ratio to, say, 10 - DynamicInputFormat responded with an exception ("Too many chunks created with splitRatio:10, numMaps:128. Reduce numMaps or decrease split-ratio to proceed.") - presumably because I exceeded the MAX_CHUNKS_TOLERABLE value of 400.
> Is there any particular logic behind this MAX_CHUNKS_TOLERABLE limit? I can't personally see any.
> If this limit has no particular logic behind it, then it should be overridable - or even better: removed altogether. After all, I'm not sure I see any need for it. Even if numMaps * splitRatio resulted in an extraordinarily large number, if the code were modified so that the number of chunks got calculated as Math.min(numMaps * splitRatio, numFiles), then there would be no need for MAX_CHUNKS_TOLERABLE. In this worst-case scenario where the product of numMaps and splitRatio is large, capping the number of chunks at the number of files (numberOfChunks = numberOfFiles) would result in 1 file per chunk - the maximum parallelization possible. That may not be the best-tuned solution for some users, but I would think that it should be left up to the user to deal with the potential consequence of not having tuned their job properly. Certainly that would be better than having an arbitrary hard-coded limit that *prevents* proper parallelization when dealing with large files and/or large numbers of mappers.
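
The last paragraph of the description reduces the proposed fix to a one-line change in how the chunk count is computed. Below is a minimal sketch of that cap; the class, method, and variable names are illustrative only and are not the actual DynamicInputFormat code.

    // Illustrative sketch of the cap proposed in the description above,
    // not the actual DynamicInputFormat implementation.
    public final class ChunkCountSketch {

        // The "ideal" chunk count is numMaps * splitRatio, but never create
        // more chunks than there are files to copy; at the extreme this
        // degenerates to one file per chunk, the maximum parallelization.
        static int numberOfChunks(int numMaps, int splitRatio, int numFiles) {
            return Math.min(numMaps * splitRatio, numFiles);
        }

        public static void main(String[] args) {
            // The scenario from the description: 128 maps, split ratio 10,
            // ~2800 files to copy -> 1280 chunks, still below the file count.
            System.out.println(numberOfChunks(128, 10, 2800)); // prints 1280
        }
    }

With a cap like this, the 400-chunk ceiling never comes into play, which is the reporter's point; whether such a permissive limit is the right default is a separate tuning question.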
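
For completeness, here is a rough sketch of what the committed change (per the Hudson comment at the top) lets a job do: raise the formerly hard-coded limits through the job Configuration instead of hitting the "Too many chunks" exception. The distcp.dynamic.* property names and the values below are assumptions inferred from the commit message and the DistCpConstants.java file it touches, not confirmed here; check that file for the authoritative keys and defaults.

    import org.apache.hadoop.conf.Configuration;

    // Minimal sketch, assuming the configuration keys introduced by the patch
    // follow the distcp.dynamic.* naming implied by the DistCpConstants.java
    // change; verify the actual key names and defaults in the committed source.
    public class DynamicChunkLimitsExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // Raise the ceiling that used to be the hard-coded
            // MAX_CHUNKS_TOLERABLE value of 400.
            conf.setInt("distcp.dynamic.max.chunks.tolerable", 2000);

            // Allow a more aggressive split ratio than the built-in default,
            // matching the value the reporter tried to use above.
            conf.setInt("distcp.dynamic.split.ratio", 10);

            // This Configuration would then be handed to the DistCp driver;
            // the same overrides could also be passed as -D options on the
            // "hadoop distcp" command line.
            System.out.println("max chunks tolerable = "
                    + conf.getInt("distcp.dynamic.max.chunks.tolerable", 400));
        }
    }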