hadoop-common-dev mailing list archives

From "Jothi Padmanabhan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4396) sort on 400 nodes is now slower than in 18
Date Tue, 14 Oct 2008 17:37:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639501#action_12639501 ]

Jothi Padmanabhan commented on HADOOP-4396:
-------------------------------------------

Devaraj and I did a deep dive into this and found the following:

1. On average, the map tasks complete in less than a minute. However, we observed a few
stragglers at the end of the run that took an unduly long time to complete (~15 minutes),
and these were primarily responsible for the increased overall run time. Most of these
tasks are data-local. A few other tasks took about 4-5 minutes, but those
are expected towards the end of the run and are not the suspects.
2. The task logs for these tasks indicated that the actual map function (up to the beginning
of the first spill) took about 14 minutes, while the sort, spill, and merge parts took less
than a minute.
3. The data node log indicated that the map task first contacted the data node as soon as the
job started, but received its final data set only after 14 minutes.
4. Most of these tasks ran on a few specific nodes. For example, in one run, 4 of these ran
on node x, 3 ran on node y. 
5. However, the specific nodes x and y themselves do not have any problems. On the next invocation
of sort (on the same cluster, which was not reallocated), the problem nodes were a different
x' and y'; all the tasks on x and y ran fine.
6. While these straggler tasks were running, the task tracker appeared to be busy handling
shuffles.
7. The nodes where these tasks were running did not show any unduly high CPU or memory
usage.
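
For anyone who wants to repeat this kind of check, the straggler hunt in points 1, 4 and 5 can be sketched roughly as below. This is only an illustration and not the tooling we used: the record fields (id, node, seconds), the 10x-median threshold, and the sample durations are all made-up assumptions.

```python
from collections import Counter
from statistics import median


def find_stragglers(tasks, factor=10.0):
    """Return tasks whose duration exceeds `factor` times the median duration.

    `tasks` is a list of dicts with hypothetical keys "id", "node" and
    "seconds"; the factor of 10 is an arbitrary assumption, not a Hadoop default.
    """
    typical = median(t["seconds"] for t in tasks)
    return [t for t in tasks if t["seconds"] > factor * typical]


def stragglers_per_node(tasks, factor=10.0):
    """Count stragglers per node to see whether they cluster on a few hosts."""
    return Counter(t["node"] for t in find_stragglers(tasks, factor))


# Made-up sample shaped like the observations above: most maps finish in under
# a minute, one tail task takes a few minutes, and seven take about 15 minutes.
tasks = (
    [{"id": f"m{i}", "node": f"n{i % 20}", "seconds": 50} for i in range(100)]
    + [{"id": f"slow_x{i}", "node": "x", "seconds": 900} for i in range(4)]
    + [{"id": f"slow_y{i}", "node": "y", "seconds": 900} for i in range(3)]
    + [{"id": "tail0", "node": "n3", "seconds": 270}]
)

if __name__ == "__main__":
    for node, count in stragglers_per_node(tasks).most_common():
        print(node, count)
```

Running the same count on two consecutive sort runs and comparing the node sets is one way to confirm that the slow nodes move from run to run, as in point 5.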

Given that the 3514 patch affects the sort, spill and merge parts of the map task and not the
functionality before them, it appears that the bug is more likely a side effect.
One change in this patch that could possibly be causing this side effect is the switch to
RawLocalFileSystem instead of LocalFileSystem for the creation and handling of
the intermediate files. However, it is not very clear how this change would affect data
node performance. Thoughts?



> sort on 400 nodes is now slower than in 18
> ------------------------------------------
>
>                 Key: HADOOP-4396
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4396
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Jothi Padmanabhan
>            Assignee: Jothi Padmanabhan
>            Priority: Blocker
>             Fix For: 0.19.0
>
>
> Sort on 400 nodes on hadoop release 18 takes about 29 minutes, but with the 19 branch
> takes about 32 minutes. This behavior is consistent.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

