hadoop-common-dev mailing list archives

From "Amar Kamat (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (HADOOP-3780) JobTracker should synchronously resolve the tasktracker's network location when the tracker registers
Date Thu, 17 Jul 2008 13:23:31 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12614307#action_12614307 ]

amar_kamat edited comment on HADOOP-3780 at 7/17/08 6:21 AM:
-------------------------------------------------------------

This issue matters for HADOOP-3245 for the following reason:

_Summary:_
HADOOP-3245 adds a new _SYNC_ operation, which directs the tasktracker to upload its local
state to the jobtracker. The whole design expects a _SYNC_ to complete in one go. A partial
update can leave the JobTracker in an inconsistent state and can cause the job to get stuck.
As of now, the only thing that can make a _SYNC_ fail is an update from an unresolved tracker,
i.e., one whose network location has not been resolved yet. In that case the JT is only
partially updated, which breaks HADOOP-3245.

_Info:_
||SYM||Stands for||Description||Used for||
|IC|Initial contact|Whether the TT is connected to the JT or not, from the TT's point of view|Re-init/Sync the TT|
|SB|Seen before|Whether there are some previous status entries|Mark a TT as lost|
|HBE|Heartbeat entry|Whether the TT is connected/registered, from the JT's point of view|Re-init/Sync the TT|
|JTR|JT restarted|Whether the JT has restarted|Re-init/Sync the TT|

_Rules:_

||IC||HBE||SB||JTR||Action||
|false|false|-|true|SYNC|
|false|false|-|false|Re-init|
|false|true|-|-|Re-send prev response|
|true|-|true|-|Mark lost (kill tasks)|
|false|-|false|-|Make SB false, i.e., clear previous status entries|
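
The rules above can be read as a small decision function. The following Python sketch is only a model of the table, not JobTracker code (which is Java). The function names and the split into a heartbeat decision (rows 1-3) and a status-entry decision (rows 4-5) are assumptions made for illustration. Combinations that the table does not list return {{None}} rather than a guessed action.

```python
from typing import Optional


def heartbeat_action(ic: bool, hbe: bool, jtr: bool) -> Optional[str]:
    """Rows 1-3 of the rules table: what the JT asks the TT to do."""
    if ic:
        return None  # rows 1-3 only cover IC == false
    if hbe:
        return "Re-send prev response"
    return "SYNC" if jtr else "Re-init"


def status_entry_action(ic: bool, sb: bool) -> Optional[str]:
    """Rows 4-5 of the rules table: how previous status entries are treated."""
    if ic and sb:
        return "Mark lost (kill tasks)"
    if not ic and not sb:
        return "Clear previous status entries"
    return None  # combination not listed in the table


if __name__ == "__main__":
    # A TT re-contacting a restarted JT that has no heartbeat entry for it.
    print(heartbeat_action(ic=False, hbe=False, jtr=True))
    # A status upload arriving with IC == true while old status entries exist.
    print(status_entry_action(ic=True, sb=True))
```

Keeping the two decisions separate mirrors the fact that rows 1-3 never consult SB and rows 4-5 never consult HBE or JTR.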

_Description:_
{noformat}

0) JT restarts and hence HBE for all TTs will be false.
1) TT connects to the restarted JT with IC = false.
2) JT sends a SYNC operation to the TT.
3) TT uploads the task statuses with IC = true.
4) JT (as part of the heartbeat) tries to update the task states/statuses.
5) If (4) succeeds: JT sets HBE = true for this TT.
6) If (4) fails: the JT has made some changes to the task states but HBE is still false.
     Consider task t being marked as SUCCEEDED before the SYNC fails.
7) TT comes back with IC = false.
8) IC == false && HBE == false && JTR == true, so JT sends a SYNC again.
9) (3) happens again.
10) (4) happens again. Since IC == true and SB == true, JT considers this TT lost.
11) This causes task t to be marked as KILLED.
12) In the same method the status updates are applied and hence t is marked as SUCCEEDED.
13) Now we have task completion events with the same task marked as both KILLED and SUCCEEDED.
14) Since task t is marked as SUCCEEDED later, the JT assumes that the TIP is completed while
      the reducers keep ignoring task t's output.
15) Job is stuck.
{noformat}
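
The sequence above can be replayed with a toy model. This Python sketch is not the real JobTracker. The class, its fields, and in particular the point at which an unresolved tracker makes the update fail (after the statuses are applied, before HBE is set) are assumptions chosen to match steps 4-12.

```python
class ToyJobTracker:
    """Toy model of steps 0-15; every name here is hypothetical."""

    def __init__(self):
        self.hbe = set()    # trackers with a heartbeat entry (HBE == true)
        self.seen = {}      # tracker -> previous status entries (SB == true if present)
        self.events = []    # task completion events, in arrival order

    def apply_sync(self, tracker, statuses, ic, resolved):
        """Step 4: apply an uploaded status report; returns True on success."""
        if ic and tracker in self.seen:
            # Steps 10-11: IC == true and SB == true, so the TT is treated as lost.
            for task in self.seen.pop(tracker):
                self.events.append((task, "KILLED"))
        # Steps 6 and 12: the status updates are applied in the same method.
        self.seen[tracker] = dict(statuses)
        for task, state in statuses.items():
            self.events.append((task, state))
        if not resolved:
            return False    # assumed failure point: HBE is never set
        self.hbe.add(tracker)   # step 5
        return True


def run_scenario():
    jt = ToyJobTracker()    # step 0: restarted JT, no heartbeat entries
    report = {"t": "SUCCEEDED"}
    jt.apply_sync("tt1", report, ic=True, resolved=False)   # steps 3-6
    jt.apply_sync("tt1", report, ic=True, resolved=False)   # steps 9-12
    return jt


if __name__ == "__main__":
    for event in run_scenario().events:
        print(event)
```

In this model the second upload emits a KILLED event followed by a SUCCEEDED event for the same task, which is the conflicting pair described in step 13.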

This problem will not occur if {{(4)}} always succeeds, i.e., if every {{SYNC}} makes
HBE = true. {{(4)}} can only fail if the tracker is not resolved. Hence inline resolution
solves the problem.
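
A minimal Python sketch of the idea, under stated assumptions: the class, the {{resolve}} callback, and the rack path format are all hypothetical and stand in for whatever resolution mechanism the JobTracker actually uses. The only point illustrated is ordering: the location is resolved synchronously at registration, so a later SYNC has no resolution-related way to stop halfway.

```python
class ToyInlineJobTracker:
    """Toy model of inline resolution; every name here is hypothetical."""

    def __init__(self, resolve):
        self.resolve = resolve  # hypothetical host -> network location function
        self.locations = {}     # host -> resolved network location
        self.hbe = set()        # trackers with a heartbeat entry
        self.task_states = {}   # task -> state

    def register(self, host):
        # Resolve synchronously, before any status from this tracker is accepted.
        self.locations[host] = self.resolve(host)

    def apply_sync(self, host, statuses):
        if host not in self.locations:
            raise RuntimeError("tracker must register before SYNC")
        # The tracker is already resolved, so the update and HBE go together.
        self.task_states.update(statuses)
        self.hbe.add(host)
        return True


if __name__ == "__main__":
    jt = ToyInlineJobTracker(lambda host: "/rack-a/" + host)
    jt.register("tt1")
    print(jt.apply_sync("tt1", {"t": "SUCCEEDED"}), jt.hbe)
```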


      was (Author: amar_kamat):
    The problem that we are facing is as follows :

||SYM||Stands for||Description||Used for||
|IC|Initial contact|Whether the TT is connected to the JT or not, from the TT's point of view|Re-init/Sync the TT|
|SB|Seen before|Whether there are some previous status entries|Mark a TT as lost|
|HBE|Heartbeat entry|Whether the TT is connected/registered, from the JT's point of view|Re-init/Sync the TT|
|JTR|JT restarted|Whether the JT has restarted|Re-init/Sync the TT|

Rules :

||IC||HBE||SB||JTR||Action||
|false|false|-|true|SYNC|
|false|false|-|false|Re-init|
|false|true|-|-|Re-send prev response|
|true|-|true|-|Mark lost (kill tasks)|
|false|-|false|-|make SB false i.e clear previous status entries|


{noformat}

0) JT restarts and hence HBE for all TT's will be false.
1) TT connects to the restarted JT with IC=false
2) JT sends a SYNC
3) TT uploads the task statuses
4) JT (as a part of heartbeat) tries to update the task states/status
5) If (4) is successful : JT makes an HBE=true for this TT
6) If (4) fails : the JT has made some changes in the task states but HBE=false.
     Consider task t being marked as SUCCEEDED before the SYNC fails.
7) TT comes back with IC = false
8) IC == false && HBE == false && JTR == true .... JT sends a SYNC again
9) TT responds back with IC = true and all updates
10) JT tries (4) again. Since IC == true and SB == true, JT consider this TT as lost.
11) This causes the task t to be marked as KILLED
12) In the same method the status updates are applied and hence t will be marked as SUCCEEDED
13) Now we have task completion events with a same task marked as KILLED and SUCCEEDED.
14) Since task t is marked as SUCCEEDED later, the JT assumes that the TIP is completed while
the reducers keep on ignoring the task t's output.
15) Job is stuck
{noformat}

This problem will not occur if {{(4)}} succeeds without any problem i.e every {{SYNC}} should
make HBE = true. {{4}} can only fail if the tracker is not resolved. Hence inline resolution
solves the problem.

  
> JobTracker should synchronously resolve the tasktracker's network location when the tracker registers
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3780
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3780
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Amar Kamat
>
> This issue is inspired by HADOOP-3620. In JobTracker, the network address of a tracker
> gets resolved asynchronously. It can now be done inline, i.e., while the trackers register. This
> is of great help for HADOOP-3245, where this enhancement makes the design simpler.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

