hadoop-common-dev mailing list archives

From "Jothi Padmanabhan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3478) The algorithm to decide map re-execution on fetch failures can be improved
Date Mon, 02 Jun 2008 12:17:45 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12601612#action_12601612 ]

Jothi Padmanabhan commented on HADOOP-3478:
-------------------------------------------

This is related to HADOOP-3327.

The current design for map re-execution on fetch failures (in the shuffle)
is to wait for three fetch-failure notifications for a given map before
re-executing it.

For load balancing, the reducers randomize the order in which maps are
fetched. If a fetch for a particular map fails, there is no guarantee that
the same map will be attempted again in the next iteration. This implies
that if a particular location is faulty, it can take a long time to detect
the fault and re-execute all the maps from that location.
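
A minimal sketch of that per-map policy, assuming a hypothetical counter
class (the names and structure below are illustrative, not the actual
JobTracker code):

import java.util.HashMap;
import java.util.Map;

// Counts fetch-failure notifications per map and asks for re-execution
// only once a single map has accumulated three of them.
class PerMapFailurePolicy {
  private static final int MAX_FETCH_FAILURES_PER_MAP = 3;
  private final Map<String, Integer> failuresByMap = new HashMap<String, Integer>();

  // Records one fetch failure; returns true if the map should be re-executed.
  boolean recordFetchFailure(String mapId) {
    int failures = failuresByMap.getOrDefault(mapId, 0) + 1;
    failuresByMap.put(mapId, failures);
    return failures >= MAX_FETCH_FAILURES_PER_MAP;
  }
}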

For example, consider the simplest case of one reducer fetching maps
M1, M2, and M3 from a single location L1. Assume that location L1
developed a hardware failure after the maps ran, so none of
M1, M2, or M3 is available.

The following could be the order of fetch attempts in the reducer:

1. Try Fetching M1 from L1  --> Failure
2. Try Fetching M2 from L1  --> Failure
3. Try Fetching M3 from L1  --> Failure

==> At this point there are three fetch failures from location L1, but
they are for different maps, so the map re-execution criterion is not met.

4. Try Fetching M1 from L1  --> Failure
5. Try Fetching M2 from L1  --> Failure
6. Try Fetching M3 from L1  --> Failure

==> Now there are six failures from L1, but the maps are still not
re-executed.

7. Try Fetching M1 from L1  --> Failure
==> Now, M1 is re-executed
8. Try Fetching M2 from L1  --> Failure
==> Now, M2 is re-executed
9. Try Fetching M3 from L1  --> Failure
==> Now, M3 is re-executed
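
Replaying these nine failed fetches against the per-map counter sketched
above (again purely illustrative) shows why re-execution starts only at
step 7:

public class CurrentPolicyTrace {
  public static void main(String[] args) {
    PerMapFailurePolicy policy = new PerMapFailurePolicy();
    String[] fetches = {"M1", "M2", "M3", "M1", "M2", "M3", "M1", "M2", "M3"};
    for (int i = 0; i < fetches.length; i++) {
      boolean reExecute = policy.recordFetchFailure(fetches[i]);
      System.out.println((i + 1) + ". fetch " + fetches[i] + " from L1 failed,"
          + " re-execute=" + reExecute);
    }
    // Prints re-execute=true only at steps 7, 8 and 9: nine failed fetches
    // before any of the three maps is re-executed.
  }
}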

A better alternative could be to force the order in which maps are fetched
from a particular location. A simple mechanism could be to sort the map Ids
for that location and then fetch them in that order. An additional heuristic
could hasten the re-execution of maps when the number of failures at a
particular location exceeds some threshold (irrespective of the number of
fetch failures for that particular map Id).
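
A sketch of such a policy follows. The class names and the per-location
threshold of 5 are assumptions for illustration, not a committed design:
pending maps for a location are kept in a sorted set and fetched in that
order, and a map is also re-executed once the total failures seen at its
location cross the threshold.

import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

class PerLocationFailurePolicy {
  private static final int MAX_FETCH_FAILURES_PER_MAP = 3;
  private static final int MAX_FETCH_FAILURES_PER_LOCATION = 5;  // assumed threshold

  private final Map<String, Integer> failuresByMap = new HashMap<String, Integer>();
  private final Map<String, Integer> failuresByLocation = new HashMap<String, Integer>();
  private final Map<String, SortedSet<String>> pendingByLocation =
      new HashMap<String, SortedSet<String>>();

  // Registers a map output that still has to be fetched from a location.
  void addPendingMap(String location, String mapId) {
    SortedSet<String> pending = pendingByLocation.get(location);
    if (pending == null) {
      pending = new TreeSet<String>();
      pendingByLocation.put(location, pending);
    }
    pending.add(mapId);
  }

  // Next map to try for a location: the smallest pending map Id, not a random one.
  String nextMapToFetch(String location) {
    SortedSet<String> pending = pendingByLocation.get(location);
    return (pending == null || pending.isEmpty()) ? null : pending.first();
  }

  // Records one fetch failure; returns true if the map should be re-executed,
  // either because the map itself failed three times or because the location
  // as a whole has crossed the failure threshold.
  boolean recordFetchFailure(String location, String mapId) {
    int mapFailures = increment(failuresByMap, mapId);
    int locationFailures = increment(failuresByLocation, location);
    boolean reExecute = mapFailures >= MAX_FETCH_FAILURES_PER_MAP
        || locationFailures >= MAX_FETCH_FAILURES_PER_LOCATION;
    if (reExecute) {
      SortedSet<String> pending = pendingByLocation.get(location);
      if (pending != null) {
        pending.remove(mapId);  // stop trying to fetch this attempt
      }
    }
    return reExecute;
  }

  private static int increment(Map<String, Integer> counts, String key) {
    Integer old = counts.get(key);
    int updated = (old == null ? 0 : old) + 1;
    counts.put(key, updated);
    return updated;
  }
}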

In the above example, let us assume that the sorted map Ids for L1 are in
the order M1, M2, M3. The same scenario then plays out as follows:

1. Try Fetching M1 from L1  --> Failure
2. Try Fetching M1 from L1  --> Failure
3. Try Fetching M1 from L1  --> Failure

==> Re-execute Map M1 now.

4. Try Fetching M2 from L1  --> Failure
5. Try Fetching M2 from L1  --> Failure

==> Assuming the threshold for max failures from a location is 5, M2 will
be re-executed now.

6. Try Fetching M3 from L1  --> Failure
==> Re-execute M3 now.
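
Replaying the same all-failures scenario against the sketch above (again
illustrative only) gives re-execution at steps 3, 5 and 6 instead of
7, 8 and 9:

public class ProposedPolicyTrace {
  public static void main(String[] args) {
    PerLocationFailurePolicy policy = new PerLocationFailurePolicy();
    for (String mapId : new String[] {"M1", "M2", "M3"}) {
      policy.addPendingMap("L1", mapId);
    }
    int step = 0;
    String mapId;
    while ((mapId = policy.nextMapToFetch("L1")) != null) {
      step++;
      boolean reExecute = policy.recordFetchFailure("L1", mapId);  // every fetch fails
      System.out.println(step + ". fetch " + mapId + " from L1 failed,"
          + " re-execute=" + reExecute);
    }
    // Re-execute fires at step 3 (M1, three map failures), step 5 (M2, five
    // location failures) and step 6 (M3, six location failures).
  }
}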


> The algorithm to decide map re-execution on fetch failures can be improved
> --------------------------------------------------------------------------
>
>                 Key: HADOOP-3478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3478
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Jothi Padmanabhan
>
> The algorithm to decide map re-execution on fetch failures can be improved.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

