spark-issues mailing list archives

From "Ryan Williams (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-10551) Successful task-end event after task failed due to executor loss
Date Thu, 10 Sep 2015 21:13:48 GMT
Ryan Williams created SPARK-10551:
-------------------------------------

             Summary: Successful task-end event after task failed due to executor loss
                 Key: SPARK-10551
                 URL: https://issues.apache.org/jira/browse/SPARK-10551
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.4.1
            Reporter: Ryan Williams


While doing forensics on some failed Spark applications, I am seeing nonsensical things in the
event logs, e.g.:

{code}
$ grep -n '"Task ID":12083' application_1439224376754_5702
24578:{"Event":"SparkListenerTaskStart","Stage ID":6,"Stage Attempt ID":0,"Task Info":{"Task ID":12083,"Index":145,"Attempt":0,"Launch Time":1440703704768,"Executor ID":"232","Host":"demeter-csmaz11-11.demeter.hpc.mssm.edu","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":0,"Failed":false,"Accumulables":[]}}
28918:{"Event":"SparkListenerTaskEnd","Stage ID":6,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Reason":"ExecutorLostFailure","Executor ID":"232"},"Task Info":{"Task ID":12083,"Index":145,"Attempt":0,"Launch Time":1440703704768,"Executor ID":"232","Host":"demeter-csmaz11-11.demeter.hpc.mssm.edu","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1440703707747,"Failed":true,"Accumulables":[]}}
29062:{"Event":"SparkListenerTaskEnd","Stage ID":6,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Reason":"Success"},"Task Info":{"Task ID":12083,"Index":145,"Attempt":0,"Launch Time":1440703704768,"Executor ID":"232","Host":"demeter-csmaz11-11.demeter.hpc.mssm.edu","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1440703707747,"Failed":true,"Accumulables":[]},"Task Metrics":{"Host Name":"demeter-csmaz11-11.demeter.hpc.mssm.edu","Executor Deserialize Time":181,"Executor Run Time":1585,"Result Size":8760,"JVM GC Time":0,"Result Serialization Time":0,"Memory Bytes Spilled":0,"Disk Bytes Spilled":0,"Shuffle Write Metrics":{"Shuffle Bytes Written":454121,"Shuffle Write Time":43293396,"Shuffle Records Written":2549},"Input Metrics":{"Data Read Method":"Memory","Bytes Read":810520,"Records Read":2549}}}
{code}

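Task ID 12083 has a TaskStart event, a TaskEnd event indicating that the task failed due to
{{ExecutorLostFailure}}, and then a second TaskEnd event saying that the task succeeded, even
though that event's Task Info still has {{"Failed":true}} and the same Finish Time as the failure.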
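
One way to look for other tasks with the same contradiction is to scan the event log for task IDs that have more than one TaskEnd event with differing reasons. The following is a minimal Python sketch, assuming the log holds one JSON event per line as in the grep output above. The function name `conflicting_task_ends` is made up for illustration and is not part of Spark.

```python
import json
from collections import defaultdict


def conflicting_task_ends(lines):
    """Return {task_id: [reasons]} for tasks whose TaskEnd events disagree.

    `lines` is any iterable of strings, one JSON event per line
    (an assumption based on the log excerpt in this report).
    """
    reasons = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        task_id = event["Task Info"]["Task ID"]
        reasons[task_id].append(event["Task End Reason"]["Reason"])
    # Keep only tasks that ended with more than one distinct reason.
    return {tid: rs for tid, rs in reasons.items() if len(set(rs)) > 1}
```

Passing an open file handle for the event log as `lines` should work, since iterating over a text file yields one line at a time.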

The history server does not list this application in either its "complete" or "incomplete"
section, although its stdout contains this line (with no apparent exceptions afterwards), which
I took to mean that it parsed the file correctly:

{code}
15/09/10 17:57:56 INFO FsHistoryProvider: Replaying log path: hdfs://demeter-nn1.demeter.hpc.mssm.edu/spark/tmp/logs/willir31/application_1439224376754_5702
{code}
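
As a sanity check that is independent of the history server, the file can be scanned for lines that are not well-formed JSON events. This is a rough Python sketch, again assuming one JSON event per line. `summarize_event_log` is a hypothetical helper, and it does not reproduce what `FsHistoryProvider` does when it replays a log.

```python
import json
from collections import Counter


def summarize_event_log(lines):
    """Return (counts per "Event" type, line numbers that failed to parse).

    A line counts as bad if it is not valid JSON or lacks an "Event" key.
    """
    counts = Counter()
    bad_lines = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            event = json.loads(line)
            counts[event["Event"]] += 1
        except (ValueError, KeyError, TypeError):
            # ValueError covers malformed JSON; KeyError/TypeError cover
            # valid JSON that is not an object with an "Event" field.
            bad_lines.append(lineno)
    return counts, bad_lines
```

An empty list of bad line numbers would only show that the file is well-formed at this level, not that the history server accepted it.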

[~arahuja] ran this application originally and says that the live web UI was showing inconsistent/nonsensical
data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
