hadoop-mapreduce-issues mailing list archives

From "Siddharth Seth (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-5352) Optimize node local splits generated by CombineFileInputFormat
Date Thu, 01 Aug 2013 01:47:48 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725954#comment-13725954 ]

Siddharth Seth commented on MAPREDUCE-5352:
-------------------------------------------

bq. blockToNodes doesnt look like it needs to be a map?
Could likely be replaced by a set. That map existed before this patch; I haven't tried to clean up more than necessary for this specific jira.
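
For illustration only - this is not the patch's code, and Block below is just a placeholder for OneBlockInfo: if the values in blockToNodes are never read on this path, a plain set expresses the membership check more directly.

{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; "Block" stands in for OneBlockInfo.
class Block {}

class MapVersusSet {
  // Before: a map whose values are never read back on this code path.
  Map<Block, String[]> blockToNodes = new HashMap<Block, String[]>();
  // After: a set captures the same "is this block still unclaimed?" question.
  Set<Block> unclaimedBlocks = new HashSet<Block>();

  boolean isUnclaimed(Block b) {
    return unclaimedBlocks.contains(b); // same lookup, clearer intent
  }
}
{code}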

bq. What are the results of running the new test with the old code. From what I see, the test
has a uniform distribution of blocks and the old code should pass the test too. The test by
itself is a good test to have. Distribution fixes like the one in this patch are not easy
to test anyways.
The test fails with the old code - it ends up generating an additional split. A uniform distribution doesn't necessarily mean the old code would generate uniform splits, since it leaves fewer options for the last few nodes being processed. That said, tweaking variables in the test can get the new code to generate additional splits as well. In general, though, the new code should generate better splits more consistently. Agreed, verifying the distribution is difficult.

bq. Would be great if we can ascertain how the performance of the new algo compares to the
earlier one. e.g. how much time does it take to create splits for 1 million blocks on 10000
machines with 4 blocks per split for example. I am expecting that looping once for every split
will be slower though not quite sure how much.
I don't expect this to cause a big difference. The code removes entries from the nodeToBlocks map so that it does not walk those entries for each split generated on a node - that seems to be most of the additional cost. Otherwise it just walks the block list for each node, which is similar to the old code.
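
To make the round-robin idea concrete, here is a rough sketch - simplified types, no replica sharing across nodes, and not the actual patch: each pass over the nodes emits at most one split per node, and a node whose remaining blocks can no longer fill a split is removed from the map so later passes skip it.

{code}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Rough sketch of one-split-per-node-per-round assignment. "Block" is a
// placeholder for OneBlockInfo; the real code also has to drop claimed
// blocks from other nodes' lists (replicas), which is omitted here.
// Assumes blocksPerSplit >= 1 and mutable block lists (e.g. ArrayList).
class RoundRobinSketch {
  static class Block {}

  static List<List<Block>> assignSplits(Map<String, List<Block>> nodeToBlocks,
                                        int blocksPerSplit) {
    List<List<Block>> splits = new ArrayList<List<Block>>();
    while (!nodeToBlocks.isEmpty()) {
      Iterator<Map.Entry<String, List<Block>>> it =
          nodeToBlocks.entrySet().iterator();
      while (it.hasNext()) {
        List<Block> remaining = it.next().getValue();
        if (remaining.size() >= blocksPerSplit) {
          // At most one split for this node in this round, then move on.
          List<Block> split =
              new ArrayList<Block>(remaining.subList(0, blocksPerSplit));
          remaining.subList(0, blocksPerSplit).clear();
          splits.add(split);
        }
        if (remaining.size() < blocksPerSplit) {
          // Exhausted node: remove it so later rounds do not walk it again.
          // Leftover blocks would fall through to rack-level assignment.
          it.remove();
        }
      }
    }
    return splits;
  }
}
{code}

The per-node walk is the same as before; the extra work is mainly the map removals, which is where the additional cost mentioned above comes from.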
                
> Optimize node local splits generated by CombineFileInputFormat 
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-5352
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5352
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.0.5-alpha
>            Reporter: Siddharth Seth
>            Assignee: Siddharth Seth
>         Attachments: MAPREDUCE-5352.1.txt, MAPREDUCE-5352.2.txt, MAPREDUCE-5352.3.txt, MAPREDUCE-5352.4.txt
>
>
> CombineFileInputFormat currently walks through all available nodes and generates multiple (maxSplitsPerNode) splits on a single node before attempting to generate splits on subsequent nodes. This ends up reducing the possibility of generating splits for subsequent nodes, since those blocks will no longer be available to them. Allowing splits to go 1 block above the max-split-size makes this worse.
> Allocating a single split per node in one iteration should help increase the distribution of splits across nodes - so the subsequent nodes will have more blocks to choose from.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
