hadoop-common-issues mailing list archives

From "Suresh Srinivas (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-10124) Option to shuffle splits of equal size
Date Fri, 22 Nov 2013 22:50:35 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-10124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830384#comment-13830384 ]

Suresh Srinivas commented on HADOOP-10124:

[~mikeliddell], can you please set the Affects Version/s? I think you want this in branch-1.
Also, is this functionality relevant for trunk?

I am going to move this to the MapReduce project.

> Option to shuffle splits of equal size
> --------------------------------------
>                 Key: HADOOP-10124
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10124
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Mike Liddell
>         Attachments: HADOOP-10124.1.patch
> MapReduce split calculation has the following base logic (via JobClient and the major
InputFormat implementations):
> ◾ enumerate input files in natural (aka linear) order.
> ◾ create one split for each 'block-size' of each input. Apart from rack-awareness, combining
and so on, the splits stay in the natural order of the input files.
> ◾ sort the splits by size using a stable sort based on split size.
> When data from multiple storage services is used in a single Hadoop job, we get better
I/O utilization if the list of splits alternates round-robin or randomly across the services.
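
To illustrate why the stable sort matters here, the following is a minimal Java sketch of the ordering described above. The `Split` class, the account names "A" and "B", and the largest-first sort direction are assumptions made for illustration; this is not the JobClient code itself.

```java
import java.util.ArrayList;
import java.util.List;

final class BaselineSplitOrder {
    /** Hypothetical stand-in for an input split: owning storage account and length in bytes. */
    static final class Split {
        final String account;
        final long length;

        Split(String account, long length) {
            this.account = account;
            this.length = length;
        }
    }

    /** Sorts a copy of the splits by length, largest first (the direction is an assumption). */
    static List<Split> sortBySize(List<Split> splits) {
        List<Split> sorted = new ArrayList<>(splits);
        // List.sort is stable, so splits of equal length keep their enumeration order.
        sorted.sort((a, b) -> Long.compare(b.length, a.length));
        return sorted;
    }

    /** Concatenates the account names in list order. */
    static String accountOrder(List<Split> splits) {
        StringBuilder sb = new StringBuilder();
        for (Split s : splits) {
            sb.append(s.account);
        }
        return sb.toString();
    }

    /** Three equal-size splits from account "A" followed by three from account "B". */
    static List<Split> sampleSplits() {
        long blockSize = 64L * 1024 * 1024;
        List<Split> splits = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            splits.add(new Split("A", blockSize));
        }
        for (int i = 0; i < 3; i++) {
            splits.add(new Split("B", blockSize));
        }
        return splits;
    }

    public static void main(String[] args) {
        // All sizes are equal, so the stable sort leaves the order untouched.
        System.out.println(accountOrder(sortBySize(sampleSplits()))); // prints AAABBB
    }
}
```

With equal-size splits the sort changes nothing, so all splits of one account stay adjacent and consecutive tasks read from the same account.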

> The particular scenario arises in Azure HDInsight, where jobs can easily read from many
storage accounts and each storage account has hard limits on throughput. Concurrent access
to the accounts is substantially better than 
> Two common scenarios can cause a non-ideal access pattern:
>  1. many/all input files are the same size
>  2. files have different sizes, but many/all input files have size > blocksize.
>  In the second scenario, each file will have one or more splits with size exactly
equal to the block size, so it basically degenerates to the first scenario.
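
The degeneration in the second scenario can be seen with a little arithmetic. The sketch below is a simplification that assumes a file is cut into block-size pieces plus one smaller remainder; `splitLengths` is a hypothetical helper, and the extra rules applied by the real InputFormat implementations (rack-awareness, combining and so on) are omitted.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitLengths {
    /** Cuts a file of fileSize units into block-size splits plus one smaller remainder. */
    static List<Long> splitLengths(long fileSize, long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<Long> lengths = new ArrayList<>();
        long remaining = fileSize;
        // Emit full blocks while at least one whole block is left.
        while (remaining >= blockSize) {
            lengths.add(blockSize);
            remaining -= blockSize;
        }
        // Whatever is left over becomes one final, smaller split.
        if (remaining > 0) {
            lengths.add(remaining);
        }
        return lengths;
    }

    public static void main(String[] args) {
        // Sizes are in arbitrary units (think MB) to keep the numbers readable.
        System.out.println(splitLengths(200, 64)); // prints [64, 64, 64, 8]
        System.out.println(splitLengths(150, 64)); // prints [64, 64, 22]
    }
}
```

Two files of different sizes both contribute several splits of exactly 64 units, so after sorting by size those splits tie and only the file enumeration order decides their sequence.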
> There are various ways to solve the problem, but the simplest is to alter the MapReduce
JobClient to sort splits by size _and_ randomize the order of splits with equal size. This
keeps the old behavior effectively unchanged while also fixing both common problematic scenarios.
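
One way to get "sorted by size, random among equal sizes" is to shuffle the list first and then apply a stable sort. The Java sketch below shows that approach under stated assumptions: the `Split` class and method names are hypothetical, the largest-first direction is assumed, and the attached patch may well implement the idea differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class ShuffledSplitOrder {
    /** Hypothetical stand-in for an input split: owning storage account and length. */
    static final class Split {
        final String account;
        final long length;

        Split(String account, long length) {
            this.account = account;
            this.length = length;
        }
    }

    /** Returns a copy sorted by length (largest first) with ties in random order. */
    static List<Split> sortBySizeShuffleTies(List<Split> splits, Random random) {
        List<Split> ordered = new ArrayList<>(splits);
        // Step 1: randomize the whole list.
        Collections.shuffle(ordered, random);
        // Step 2: the stable sort restores size order but keeps the shuffled order among ties.
        ordered.sort((a, b) -> Long.compare(b.length, a.length));
        return ordered;
    }

    /** Checks that lengths never increase from one split to the next. */
    static boolean isSortedLargestFirst(List<Split> splits) {
        for (int i = 1; i < splits.size(); i++) {
            if (splits.get(i - 1).length < splits.get(i).length) {
                return false;
            }
        }
        return true;
    }

    /** Three block-size splits per account plus one small remainder split. */
    static List<Split> sampleSplits() {
        List<Split> splits = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            splits.add(new Split("A", 64));
        }
        for (int i = 0; i < 3; i++) {
            splits.add(new Split("B", 64));
        }
        splits.add(new Split("A", 8));
        return splits;
    }

    public static void main(String[] args) {
        for (Split s : sortBySizeShuffleTies(sampleSplits(), new Random())) {
            System.out.println(s.account + " " + s.length);
        }
    }
}
```

The size ordering is the same as before, so only the sequence among equal-size splits changes; the accounts "A" and "B" now tend to be interleaved among the 64-unit splits.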
> Some rare scenarios will still suffer from bad access patterns. For example, if two storage
accounts are used and the files from one storage account are all smaller than from the other
then problems can arise. Addressing these scenarios would be further work, perhaps by completely
randomizing the split order. These problematic scenarios are considered rare and not requiring
immediate attention.
> If further algorithms for split ordering are necessary, the implementation in JobClient
will change to being interface-based (e.g. an interface splitOrderer) with various standard implementations.
At this time there is only the need for two implementations, so a simple Boolean flag and
if/then logic is used.
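
For comparison, the interface-based variant mentioned above might look roughly like the following Java sketch. Everything here is hypothetical: the `SplitOrderer` interface, the factory methods, and the `forFlag` helper that mirrors the Boolean flag are illustrative names, not part of the patch or of Hadoop.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class SplitOrdering {
    /** Hypothetical stand-in for an input split: owning storage account and length. */
    static final class Split {
        final String account;
        final long length;

        Split(String account, long length) {
            this.account = account;
            this.length = length;
        }
    }

    /** Hypothetical strategy interface for ordering splits. */
    interface SplitOrderer {
        List<Split> order(List<Split> splits);
    }

    /** Old behavior: stable sort by size, largest first (direction assumed). */
    static SplitOrderer bySize() {
        return splits -> {
            List<Split> ordered = new ArrayList<>(splits);
            ordered.sort((a, b) -> Long.compare(b.length, a.length));
            return ordered;
        };
    }

    /** New behavior: same size order, but equal-size splits end up in random order. */
    static SplitOrderer bySizeShufflingTies(Random random) {
        return splits -> {
            List<Split> shuffled = new ArrayList<>(splits);
            Collections.shuffle(shuffled, random);
            return bySize().order(shuffled);
        };
    }

    /** Mirrors the Boolean flag and if/then logic: picks one of the two orderers. */
    static SplitOrderer forFlag(boolean shuffleEqualSizes, Random random) {
        return shuffleEqualSizes ? bySizeShufflingTies(random) : bySize();
    }

    /** Length of the first split in the list. */
    static long firstLength(List<Split> splits) {
        return splits.get(0).length;
    }

    static List<Split> sampleSplits() {
        List<Split> splits = new ArrayList<>();
        splits.add(new Split("A", 8));
        splits.add(new Split("B", 64));
        splits.add(new Split("A", 64));
        return splits;
    }

    public static void main(String[] args) {
        SplitOrderer orderer = forFlag(true, new Random());
        for (Split s : orderer.order(sampleSplits())) {
            System.out.println(s.account + " " + s.length);
        }
    }
}
```

A further ordering, such as the fully random one, would then be one more `SplitOrderer` implementation instead of another branch in the if/then logic.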

This message was sent by Atlassian JIRA
