Mailing-List: contact common-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: common-issues@hadoop.apache.org
Date: Fri, 22 Nov 2013 22:34:35 +0000 (UTC)
From: "Mike Liddell (JIRA)" <jira@apache.org>
To: common-issues@hadoop.apache.org
Message-ID: <JIRA.12680828.1385159650669.17041.1385159675056@arcas>
In-Reply-To: <JIRA.12680828.1385159650669@arcas>
References: <JIRA.12680828.1385159650669@arcas>
Subject: [jira] [Created] (HADOOP-10124) Option to shuffle splits of equal
 size
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

Mike Liddell created HADOOP-10124:
-------------------------------------

             Summary: Option to shuffle splits of equal size
                 Key: HADOOP-10124
                 URL: https://issues.apache.org/jira/browse/HADOOP-10124
             Project: Hadoop Common
          Issue Type: Improvement
            Reporter: Mike Liddell


MapReduce split calculation has the following base logic (via JobClient
and the major InputFormat implementations):
◾ Enumerate input files in natural (aka linear) order.
◾ Create one split for each 'block-size' of each input. Apart from
  rack-awareness, combining and so on, the splits stay in the natural
  order of the input files.
◾ Sort the splits by size using a stable sort.

When data from multiple storage services is used in a single Hadoop job,
we get better I/O utilization if the list of splits does round-robin or
random access across the services.
This scenario arises in Azure HDInsight, where jobs can easily read from
many storage accounts and each storage account has hard limits on
throughput. Concurrent access to the accounts is substantially better
than reading from one account at a time.

Two common scenarios can cause a non-ideal access pattern:
 1. many/all input files are the same size
 2. files have different sizes, but many/all input files have
    size > blocksize.
In the second scenario, each file will have one or more splits with size
exactly equal to the block size, so it basically degenerates to the
first scenario.
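
To make the second scenario concrete, here is a minimal sketch in Java. It
is not Hadoop code: NaturalOrderDemo and orderedSplits are hypothetical
names, splits are cut at exact block boundaries (a simplification), and
"largest first" is an assumed sort direction. It shows that after a stable
sort by size, the full-block splits still appear in file order, so splits
from the same account stay clustered.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class; not part of Hadoop.
final class NaturalOrderDemo {
    // Cuts each file into block-size splits in file order, then applies a
    // stable sort by split length (assumed largest first).
    // Returns one "account:length" label per split.
    static List<String> orderedSplits(String[] accounts, long[] fileSizes, long blockSize) {
        List<long[]> splits = new ArrayList<>(); // each entry is {fileIndex, length}
        for (int f = 0; f < fileSizes.length; f++) {
            for (long off = 0; off < fileSizes[f]; off += blockSize) {
                splits.add(new long[] {f, Math.min(blockSize, fileSizes[f] - off)});
            }
        }
        // List.sort is stable, so equal-length splits keep their file order.
        splits.sort(Comparator.comparingLong((long[] s) -> s[1]).reversed());
        List<String> labels = new ArrayList<>();
        for (long[] s : splits) {
            labels.add(accounts[(int) s[0]] + ":" + s[1]);
        }
        return labels;
    }
}
```

With one 300-byte file in account A, one 250-byte file in account B and a
block size of 128, the result is [A:128, A:128, B:128, B:122, A:44]: the
equal-size splits at the front are grouped by account rather than
interleaved.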

There are various ways to solve the problem, but the simplest is to alter
the MapReduce JobClient to sort splits by size _and_ randomize the order
of splits with equal size. This keeps the old behavior effectively
unchanged while also fixing both common problematic scenarios.
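
One possible way to get "sort by size and randomize ties" is to shuffle
the list first and then apply a stable sort, since a stable sort keeps
equal-size splits in whatever (now random) order they arrived in. The
sketch below is an illustration only, not the actual JobClient change:
SimpleSplit and SplitShuffler are hypothetical names, and largest-first
ordering is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Hypothetical stand-in for an input split; not a Hadoop class.
final class SimpleSplit {
    final String account; // storage account the split reads from
    final long length;    // split size in bytes

    SimpleSplit(String account, long length) {
        this.account = account;
        this.length = length;
    }
}

// Hypothetical helper; not part of Hadoop.
final class SplitShuffler {
    // Returns a copy of the splits ordered by size (assumed largest first),
    // with splits of equal size in random order relative to each other.
    static List<SimpleSplit> sortAndShuffle(List<SimpleSplit> splits, Random rng) {
        List<SimpleSplit> result = new ArrayList<>(splits);
        // Step 1: randomize the whole list.
        Collections.shuffle(result, rng);
        // Step 2: stable sort by size; ties keep the shuffled order.
        result.sort(Comparator.comparingLong((SimpleSplit s) -> s.length).reversed());
        return result;
    }
}
```

The size ordering is the same as with a plain sort; only the relative
order inside each group of equal-size splits changes.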

Some rare scenarios will still suffer from bad access patterns. For
example, if two storage accounts are used and the files from one storage
account are all smaller than those from the other, then problems can
arise. Addressing these scenarios would be further work, perhaps by
completely randomizing the split order. These problematic scenarios are
considered rare and do not require immediate attention.

If further algorithms for split ordering are necessary, the
implementation in JobClient will change to being interface-based (e.g.
an interface splitOrderer) with various standard implementations. At
this time there is only the need for two implementations, so a simple
Boolean flag and if/then logic is used.
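
As a rough sketch of what the interface-based variant could look like:
every name below (SplitOrderer, SizeOrderer, ShuffledSizeOrderer,
SplitOrderers.forFlag) is hypothetical, and largest-first ordering is an
assumption. The Boolean flag simply selects one of two implementations.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;
import java.util.function.ToLongFunction;

// Hypothetical interface; the issue only suggests the name "splitOrderer".
interface SplitOrderer {
    <T> List<T> order(List<T> splits, ToLongFunction<T> sizeOf);
}

// Stable sort by size only (assumed largest first); ties keep input order.
final class SizeOrderer implements SplitOrderer {
    @Override
    public <T> List<T> order(List<T> splits, ToLongFunction<T> sizeOf) {
        List<T> result = new ArrayList<>(splits);
        result.sort(Comparator.<T>comparingLong(sizeOf).reversed());
        return result;
    }
}

// Sort by size, with equal-size splits in random order.
final class ShuffledSizeOrderer implements SplitOrderer {
    private final Random rng;

    ShuffledSizeOrderer(Random rng) {
        this.rng = rng;
    }

    @Override
    public <T> List<T> order(List<T> splits, ToLongFunction<T> sizeOf) {
        List<T> result = new ArrayList<>(splits);
        Collections.shuffle(result, rng); // randomize first
        result.sort(Comparator.<T>comparingLong(sizeOf).reversed()); // stable sort keeps shuffled ties
        return result;
    }
}

// Equivalent of the Boolean flag with if/then logic.
final class SplitOrderers {
    static SplitOrderer forFlag(boolean shuffleEqualSizes, Random rng) {
        return shuffleEqualSizes ? new ShuffledSizeOrderer(rng) : new SizeOrderer();
    }
}
```

Adding a third strategy, such as fully random ordering, would then mean
adding one more implementation rather than another branch in JobClient.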


--
This message was sent by Atlassian JIRA
(v6.1#6144)