hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mike Liddell (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HADOOP-10124) Option to shuffle splits of equal size
Date Fri, 22 Nov 2013 22:34:35 GMT
Mike Liddell created HADOOP-10124:

             Summary: Option to shuffle splits of equal size
                 Key: HADOOP-10124
                 URL: https://issues.apache.org/jira/browse/HADOOP-10124
             Project: Hadoop Common
          Issue Type: Improvement
            Reporter: Mike Liddell

Mapreduce split calculation has the following base logic (via JobClient and the major InputFormat
implementations ):
◾enumerate input files in natural (aka linear) order.
◾create one split for each 'block-size' of each input. Apart from rack-awareness, combining
and so on, the input file order remains in its natural order.
◾sort the splits by size using a stable sort based on splitsize.

When data from multiple storage services are used in a single hadoop job, we get better I/O
utilization if the list of splits does round-robin or random-access across the services. 
The particular scenario arises in Azure HDInsight where jobs can easily read from many storage
accounts and each storage account has hard limits on throughtput.  Concurrent access to the
accounts is substantially better than 
Two common scenarios can cause non-ideal access pattern:
 1. many/all input files are the same size
 2. files have different sizes, but many/all input files have size>blocksize.
 In the second scenario, for each file will have one or more splits with size exactly equal
to block size so it basically degenerates to the first scenario.

There are various ways to solve the problem but the simplest is to alter the mapreduce JobClient
to sort splits by size _and_ randomize the order of splits with equal size. This keeps the
old behavior effectively unchanged while also fixing both common problematic scenarios.

Some rare scenarios will still suffer bad access patterns due. For example if two storage
accounts are used and the files from one storage account are all smaller than from the other
then problems can arise. Addressing these scenarios would be further work, perhaps by completely
randomizing the split order. These problematic scenarios are considered rare and not requiring
immediate attention.

If further algorithms for split ordering are necessary, the implementation in JobClient will
change to being interface-based (eg interface splitOrderer) with various standard implementations.
 At this time there is only the need for two implementations and so simple Boolean flag and
if/then logic is used.

This message was sent by Atlassian JIRA

View raw message