asterixdb-dev mailing list archives

From Taewoo Kim <wangs...@gmail.com>
Subject Re: Requesting node X to shutdown to ensure failure?
Date Mon, 01 Oct 2018 07:47:02 GMT
The CC node runs an ingestion from Twitter. It inserts about 20 tweets per
second. Other than that, it does the same amount of work as the other NCs.

Best,
Taewoo

On Sun, Sep 30, 2018 at 9:52 PM, Till Westmann <tillw@apache.org> wrote:

> So then it seems that all non-CC nodes have died.
> Do you have an idea of the load on the CC?
> Is it possible that it was too busy to receive the heartbeats?
>
> Cheers,
> Till
>
>
> On 30 Sep 2018, at 13:48, Taewoo Kim wrote:
>
> OK. I got that part. There are five nodes (separate NUC machines), and the
>> first node runs both the CC and an NC.
>>
>> Best,
>> Taewoo
>>
>> On Sun, Sep 30, 2018 at 1:15 PM, Till Westmann <tillw@apache.org> wrote:
>>
>> Yes, and after 10 * 8000 ms = 80s have passed without a heartbeat the node
>>> is considered dead. This seems to align with the log messages:
>>>
>>> 21:51:54.203 [Worker:ClusterController] INFO
>>> org.apache.hyracks.control.cc.cluster.NodeManager - Requesting node 2 to shutdown to ensure failure
>>> 21:51:54.206 [Worker:ClusterController] INFO
>>> org.apache.hyracks.control.cc.cluster.NodeManager - Request to shutdown failed node 2 succeeded. false positive heartbeat miss indication
>>> 21:51:54.206 [Worker:ClusterController] INFO
>>> org.apache.hyracks.control.cc.cluster.NodeManager - 2 considered dead. Last heartbeat received 83799ms ago. Max miss period: 80000ms
>>>
>>> So that looks consistent.
>>> The question seems to be (unless I missed something) why no heartbeats
>>> were received for 80s.
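
For reference, the check described above reduces to a single comparison. A
minimal Java sketch of the arithmetic (not the actual NodeManager code, just
what the two configuration values imply):

// Minimal sketch of the heartbeat-miss check, not the actual Hyracks
// NodeManager code. A node is considered dead once the time since its last
// heartbeat exceeds heartbeat.period * heartbeat.max.misses.
final class HeartbeatCheckSketch {
    static final long HEARTBEAT_PERIOD_MS = 8000; // heartbeat.period
    static final int MAX_MISSES = 10;             // heartbeat.max.misses
    static final long MAX_MISS_PERIOD_MS = MAX_MISSES * HEARTBEAT_PERIOD_MS; // 80000 ms

    static boolean consideredDead(long msSinceLastHeartbeat) {
        return msSinceLastHeartbeat > MAX_MISS_PERIOD_MS;
    }

    public static void main(String[] args) {
        // Matches the log above: last heartbeat received 83799 ms ago > 80000 ms.
        System.out.println(consideredDead(83799)); // true
    }
}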
>>>
>>> And it also looks like this happened for more than one node at the same
>>> time. Are these nodes running on a single box or on multiple boxes?
>>>
>>> Cheers,
>>> Till
>>>
>>>
>>> On 30 Sep 2018, at 13:03, Taewoo Kim wrote:
>>>
>>> P.S. We set the heartbeat settings as follows.
>>>
>>>>
>>>> [cc]
>>>> heartbeat.max.misses=10
>>>> heartbeat.period=8000
>>>>
>>>> Best,
>>>> Taewoo
>>>>
>>>> On Sun, Sep 30, 2018 at 12:50 PM, Taewoo Kim <wangsaeu@gmail.com>
>>>> wrote:
>>>>
>>>> Hi Till,
>>>>
>>>>>
>>>>> These files are indeed the same log file. I just extracted one line from
>>>>> the original log and placed it at the beginning. At 9:50 pm, the CC
>>>>> initiated a task cluster and all NCs were executing. Node 2 sent the
>>>>> task-complete message, and somehow the CC decided to shut down node 2.
>>>>> That's my understanding.
>>>>>
>>>>> Best,
>>>>> Taewoo
>>>>>
>>>>> On Sun, Sep 30, 2018 at 12:40 PM, Till Westmann <tillw@apache.org>
>>>>> wrote:
>>>>>
>>>>> Hi Taewoo,
>>>>>
>>>>>>
>>>>>> I’m a little confused as we seem to have 2 cc logs.
>>>>>>
>>>>>> The "Requesting node 2 to shutdown …" message is logged when nodes are
>>>>>> considered dead because they have missed too many heartbeats. And you can
>>>>>> see in the later logs that indeed more than 80s passed without receiving a
>>>>>> heartbeat.
>>>>>> Could it be that the machines were too busy to send/receive
>>>>>> heartbeats?
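
One way this can happen in practice: if the heartbeat sender runs as a
periodic task (an assumption about the setup, not a statement about the
Hyracks internals), a thread starved by CPU load or stalled by a long GC
pause sends nothing until it runs again. A toy, Hyracks-agnostic
illustration of that failure mode:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Toy illustration: a fixed-rate "heartbeat" on a single-threaded scheduler.
// A 90 s blocker (standing in for a long GC pause or CPU saturation) ties up
// the only thread, so no heartbeats are emitted for longer than the 80 s max
// miss period, enough for the node to be declared dead.
final class BusyHeartbeatSketch {
    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(
                () -> System.out.println("heartbeat @ " + System.currentTimeMillis()),
                0, 8000, TimeUnit.MILLISECONDS); // heartbeat.period=8000
        Thread.sleep(16_000);                    // a couple of normal heartbeats
        timer.execute(() -> {                    // blocker occupies the only thread
            try { Thread.sleep(90_000); } catch (InterruptedException ignored) { }
        });
        Thread.sleep(120_000);                   // > 80 s of silence, then catch-up
        timer.shutdownNow();
    }
}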
>>>>>>
>>>>>> Cheers,
>>>>>> Till
>>>>>>
>>>>>> On 30 Sep 2018, at 10:52, Taewoo Kim wrote:
>>>>>>
>>>>>> Hi devs,
>>>>>>
>>>>>>
>>>>>>> I saw the following log messages in the cc.log of the Cloudberry cluster.
>>>>>>> Yesterday around 9:50 pm, a task cluster was created and executed. And
>>>>>>> around 9:51 pm, the cluster showed the following message. In the nc.log of
>>>>>>> node 2, there were no noticeable log messages. I wonder what could
>>>>>>> cause this situation.
>>>>>>>
>>>>>>> -- cc.log at 9:51 pm
>>>>>>> *21:51:54.203 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.cluster.NodeManager - Requesting node 2 to shutdown to ensure failure*
>>>>>>>
>>>>>>> *-- nc-2.log around 9:50 pm*
>>>>>>>
>>>>>>> 21:50:30.314 [Worker:2] INFO
>>>>>>>  org.apache.hyracks.control.nc.work.StartTasksWork - input: 0:
>>>>>>> CDID:2
>>>>>>> 21:50:30.375 [Worker:2] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> NotifyTaskCompleteWork:TAID:TID:ANID:ODID:7:0:3:0
>>>>>>> 21:50:30.382 [Worker:2] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> NotifyTaskCompleteWork:TAID:TID:ANID:ODID:9:0:3:0
>>>>>>> 21:50:30.390 [Worker:2] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> NotifyTaskCompleteWork:TAID:TID:ANID:ODID:9:0:2:0
>>>>>>> 21:50:30.391 [Worker:2] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> NotifyTaskCompleteWork:TAID:TID:ANID:ODID:7:0:2:0
>>>>>>> 21:50:30.392 [Worker:2] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> NotifyTaskCompleteWork:TAID:TID:ANID:ODID:1:1:3:0
>>>>>>> 21:50:30.392 [Worker:2] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> NotifyTaskCompleteWork:TAID:TID:ANID:ODID:1:1:2:0
>>>>>>> 21:50:30.398 [Worker:2] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> NotifyTaskCompleteWork:TAID:TID:ANID:ODID:1:1:3:0
>>>>>>> 21:50:30.398 [Worker:2] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> NotifyTaskCompleteWork:TAID:TID:ANID:ODID:1:1:2:0
>>>>>>> 22:02:30.290 [JVM exit thread] INFO  org.apache.hyracks.util.ExitUtil - JVM exiting with status 0; bye!
>>>>>>>
>>>>>>>
>>>>>>> -- *cc.log from 9:51 pm*
>>>>>>>
>>>>>>> 21:50:30.313 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.executor.ActivityClusterPlanner - Built 1 Task Clusters
>>>>>>> 21:50:30.313 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.executor.ActivityClusterPlanner - Tasks:
>>>>>>> [TID:ANID:ODID:1:1:0, TID:ANID:ODID:1:1:1, TID:ANID:ODID:1:1:2,
>>>>>>> TID:ANID:ODID:1:1:3, TID:ANID:ODID:1:1:4, TID:ANID:ODID:1:1:5,
>>>>>>> TID:ANID:ODID:1:1:6, TID:ANID:ODID:1:1:7, TID:ANID:ODID:1:1:8,
>>>>>>> TID:ANID:ODID:1:1:9, TID:ANID:ODID:2:0:0, TID:ANID:ODID:2:0:1,
>>>>>>> TID:ANID:ODID:2:0:2, TID:ANID:ODID:2:0:3, TID:ANID:ODID:2:0:4,
>>>>>>> TID:ANID:ODID:2:0:5, TID:ANID:ODID:2:0:6, TID:ANID:ODID:2:0:7,
>>>>>>> TID:ANID:ODID:2:0:8, TID:ANID:ODID:2:0:9, TID:ANID:ODID:9:0:0,
>>>>>>> TID:ANID:ODID:9:0:1, TID:ANID:ODID:9:0:2, TID:ANID:ODID:9:0:3,
>>>>>>> TID:ANID:ODID:9:0:4, TID:ANID:ODID:9:0:5, TID:ANID:ODID:9:0:6,
>>>>>>> TID:ANID:ODID:9:0:7, TID:ANID:ODID:9:0:8, TID:ANID:ODID:9:0:9]
>>>>>>> 21:50:30.313 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.executor.JobExecutor - Runnable TC roots:
>>>>>>> [TC:[TID:ANID:ODID:1:1:0, TID:ANID:ODID:1:1:1, TID:ANID:ODID:1:1:2,
>>>>>>> TID:ANID:ODID:1:1:3, TID:ANID:ODID:1:1:4, TID:ANID:ODID:1:1:5,
>>>>>>> TID:ANID:ODID:1:1:6, TID:ANID:ODID:1:1:7, TID:ANID:ODID:1:1:8,
>>>>>>> TID:ANID:ODID:1:1:9, TID:ANID:ODID:2:0:0, TID:ANID:ODID:2:0:1,
>>>>>>> TID:ANID:ODID:2:0:2, TID:ANID:ODID:2:0:3, TID:ANID:ODID:2:0:4,
>>>>>>> TID:ANID:ODID:2:0:5, TID:ANID:ODID:2:0:6, TID:ANID:ODID:2:0:7,
>>>>>>> TID:ANID:ODID:2:0:8, TID:ANID:ODID:2:0:9, TID:ANID:ODID:9:0:0,
>>>>>>> TID:ANID:ODID:9:0:1, TID:ANID:ODID:9:0:2, TID:ANID:ODID:9:0:3,
>>>>>>> TID:ANID:ODID:9:0:4, TID:ANID:ODID:9:0:5, TID:ANID:ODID:9:0:6,
>>>>>>> TID:ANID:ODID:9:0:7, TID:ANID:ODID:9:0:8, TID:ANID:ODID:9:0:9]],
>>>>>>> inProgressTaskClusters: []
>>>>>>> 21:50:30.366 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:7:0:5:0]
>>>>>>> 21:50:30.374 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [5[JID:0.700:TAID:TID:ANID:ODID:7:0:9:0]
>>>>>>> 21:50:30.376 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [2[JID:0.700:TAID:TID:ANID:ODID:7:0:3:0]
>>>>>>> 21:50:30.376 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [5[JID:0.700:TAID:TID:ANID:ODID:7:0:8:0]
>>>>>>> 21:50:30.377 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [1[JID:0.700:TAID:TID:ANID:ODID:7:0:0:0]
>>>>>>> 21:50:30.378 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:7:0:4:0]
>>>>>>> 21:50:30.381 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [1[JID:0.700:TAID:TID:ANID:ODID:7:0:1:0]
>>>>>>> 21:50:30.382 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [5[JID:0.701:TAID:TID:ANID:ODID:9:0:8:0]
>>>>>>> 21:50:30.383 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [2[JID:0.701:TAID:TID:ANID:ODID:9:0:3:0]
>>>>>>> 21:50:30.384 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [3[JID:0.701:TAID:TID:ANID:ODID:9:0:4:0]
>>>>>>> 21:50:30.386 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [3[JID:0.701:TAID:TID:ANID:ODID:9:0:5:0]
>>>>>>> 21:50:30.386 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [5[JID:0.701:TAID:TID:ANID:ODID:9:0:9:0]
>>>>>>> 21:50:30.389 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [4[JID:0.701:TAID:TID:ANID:ODID:9:0:6:0]
>>>>>>> 21:50:30.389 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [1[JID:0.701:TAID:TID:ANID:ODID:9:0:0:0]
>>>>>>> 21:50:30.392 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [2[JID:0.701:TAID:TID:ANID:ODID:9:0:2:0]
>>>>>>> 21:50:30.392 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [4[JID:0.700:TAID:TID:ANID:ODID:7:0:7:0]
>>>>>>> 21:50:30.392 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [4[JID:0.700:TAID:TID:ANID:ODID:7:0:6:0]
>>>>>>> 21:50:30.394 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [1[JID:0.701:TAID:TID:ANID:ODID:9:0:1:0]
>>>>>>> 21:50:30.395 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [2[JID:0.700:TAID:TID:ANID:ODID:7:0:2:0]
>>>>>>> 21:50:30.395 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [4[JID:0.701:TAID:TID:ANID:ODID:9:0:7:0]
>>>>>>> 21:50:30.396 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [1[JID:0.700:TAID:TID:ANID:ODID:1:1:1:0]
>>>>>>> 21:50:30.396 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [1[JID:0.701:TAID:TID:ANID:ODID:1:1:0:0]
>>>>>>> 21:50:30.396 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [1[JID:0.700:TAID:TID:ANID:ODID:1:1:0:0]
>>>>>>> 21:50:30.396 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [1[JID:0.701:TAID:TID:ANID:ODID:1:1:1:0]
>>>>>>> 21:50:30.397 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [5[JID:0.700:TAID:TID:ANID:ODID:1:1:8:0]
>>>>>>> 21:50:30.398 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:1:1:4:0]
>>>>>>> 21:50:30.398 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:1:1:5:0]
>>>>>>> 21:50:30.398 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [3[JID:0.701:TAID:TID:ANID:ODID:1:1:4:0]
>>>>>>> 21:50:30.398 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [3[JID:0.701:TAID:TID:ANID:ODID:1:1:5:0]
>>>>>>> 21:50:30.399 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [4[JID:0.700:TAID:TID:ANID:ODID:1:1:6:0]
>>>>>>> 21:50:30.399 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [2[JID:0.700:TAID:TID:ANID:ODID:1:1:3:0]
>>>>>>> 21:50:30.399 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [2[JID:0.700:TAID:TID:ANID:ODID:1:1:2:0]
>>>>>>> 21:50:30.399 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [5[JID:0.700:TAID:TID:ANID:ODID:1:1:9:0]
>>>>>>> 21:50:30.399 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [5[JID:0.701:TAID:TID:ANID:ODID:1:1:8:0]
>>>>>>> 21:50:30.399 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [5[JID:0.701:TAID:TID:ANID:ODID:1:1:9:0]
>>>>>>> 21:50:30.402 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [4[JID:0.700:TAID:TID:ANID:ODID:1:1:7:0]
>>>>>>> 21:50:30.402 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [4[JID:0.701:TAID:TID:ANID:ODID:1:1:6:0]
>>>>>>> 21:50:30.402 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [4[JID:0.701:TAID:TID:ANID:ODID:1:1:7:0]
>>>>>>> 21:50:30.402 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:2:0:5:0]
>>>>>>> 21:50:30.402 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:2:0:4:0]
>>>>>>> 21:50:30.402 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:2:0:5:0]
>>>>>>> 21:50:30.402 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:2:0:4:0]
>>>>>>> 21:50:30.404 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [2[JID:0.701:TAID:TID:ANID:ODID:1:1:3:0]
>>>>>>> 21:50:30.404 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
>>>>>>> TaskComplete: [2[JID:0.701:TAID:TID:ANID:ODID:1:1:2:0]
>>>>>>> *21:51:54.203 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.cluster.NodeManager - Requesting node 2 to shutdown to ensure failure*
>>>>>>>
>>>>>>> 21:51:54.206 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.cluster.NodeManager - Request to shutdown failed node 2 succeeded. false positive heartbeat miss indication
>>>>>>> 21:51:54.206 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.cluster.NodeManager - 2 considered dead. Last heartbeat received 83799ms ago. Max miss period: 80000ms
>>>>>>> 21:51:54.206 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.cluster.NodeManager - Requesting node 5 to shutdown to ensure failure
>>>>>>> 21:51:54.206 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.cluster.NodeManager - Request to shutdown failed node 5 succeeded. false positive heartbeat miss indication
>>>>>>> 21:51:54.206 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.cluster.NodeManager - 5 considered dead. Last heartbeat received 83806ms ago. Max miss period: 80000ms
>>>>>>> 21:51:54.207 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.cluster.NodeManager - Requesting node 3 to shutdown to ensure failure
>>>>>>> 21:51:54.207 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.cluster.NodeManager - Request to shutdown failed node 3 succeeded. false positive heartbeat miss indication
>>>>>>> 21:51:54.207 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.cluster.NodeManager - 3 considered dead. Last heartbeat received 83804ms ago. Max miss period: 80000ms
>>>>>>> 21:51:54.207 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.cluster.NodeManager - Requesting node 4 to shutdown to ensure failure
>>>>>>> 21:51:54.207 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.cluster.NodeManager - Request to shutdown failed node 4 succeeded. false positive heartbeat miss indication
>>>>>>> 21:51:54.207 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.cluster.NodeManager - 4 considered dead. Last heartbeat received 83804ms ago. Max miss period: 80000ms
>>>>>>> 21:51:54.207 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.work.RemoveDeadNodesWork - Number of affected jobs: 3
>>>>>>> 21:51:54.216 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.executor.JobExecutor - Abort map for job:
>>>>>>> JID:0.1: {1=[TAID:TID:ANID:ODID:0:0:0:0, TAID:TID:ANID:ODID:3:0:0:0,
>>>>>>> TAID:TID:ANID:ODID:3:0:1:0]}
>>>>>>> 21:51:54.216 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.executor.JobExecutor - Aborting:
>>>>>>> [TAID:TID:ANID:ODID:0:0:0:0, TAID:TID:ANID:ODID:3:0:0:0,
>>>>>>> TAID:TID:ANID:ODID:3:0:1:0] at 1
>>>>>>> 21:51:54.217 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.partitions.PartitionMatchMaker - Removing uncommitted partitions: []
>>>>>>> 21:51:54.217 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.partitions.PartitionMatchMaker - Removing partition requests: []
>>>>>>> 21:51:54.218 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.executor.JobExecutor - aborting doomed task clusters
>>>>>>> 21:51:54.218 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.executor.JobExecutor - number of doomed task clusters found = 0
>>>>>>> 21:51:54.218 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.executor.JobExecutor - Runnable TC roots:
>>>>>>> [TC:[TID:ANID:ODID:0:0:0, TID:ANID:ODID:3:0:0, TID:ANID:ODID:3:0:1,
>>>>>>> TID:ANID:ODID:3:0:2, TID:ANID:ODID:3:0:3, TID:ANID:ODID:3:0:4,
>>>>>>> TID:ANID:ODID:3:0:5, TID:ANID:ODID:3:0:6, TID:ANID:ODID:3:0:7,
>>>>>>> TID:ANID:ODID:3:0:8, TID:ANID:ODID:3:0:9]], inProgressTaskClusters:
>>>>>>> []
>>>>>>> 21:51:54.218 [Worker:ClusterController] ERROR
>>>>>>> org.apache.hyracks.control.cc.executor.JobExecutor - Unexpected failure. Aborting job JID:0.1
>>>>>>> org.apache.hyracks.api.exceptions.HyracksException: Node 2 not live
>>>>>>>         at org.apache.hyracks.control.cc.executor.JobExecutor.assignLocation(JobExecutor.java:473)
>>>>>>> ~[hyracks-control-cc-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
>>>>>>>         at org.apache.hyracks.control.cc.executor.JobExecutor.assignTaskLocations(JobExecutor.java:365)
>>>>>>> ~[hyracks-control-cc-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
>>>>>>>         at org.apache.hyracks.control.cc.executor.JobExecutor.startRunnableTaskClusters(JobExecutor.java:245)
>>>>>>> ~[hyracks-control-cc-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
>>>>>>>         at org.apache.hyracks.control.cc.executor.JobExecutor.startRunnableActivityClusters(JobExecutor.java:209)
>>>>>>> ~[hyracks-control-cc-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
>>>>>>>         at org.apache.hyracks.control.cc.executor.JobExecutor.notifyNodeFailures(JobExecutor.java:731)
>>>>>>> [hyracks-control-cc-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
>>>>>>>         at org.apache.hyracks.control.cc.work.RemoveDeadNodesWork.run(RemoveDeadNodesWork.java:60)
>>>>>>> [hyracks-control-cc-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
>>>>>>>         at org.apache.hyracks.control.common.work.WorkQueue$WorkerThread.run(WorkQueue.java:127)
>>>>>>> [hyracks-control-common-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
>>>>>>> 21:51:54.218 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.executor.JobExecutor - Abort map for job: JID:0.700: {}
>>>>>>> 21:51:54.218 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.partitions.PartitionMatchMaker - Removing uncommitted partitions: []
>>>>>>> 21:51:54.218 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.partitions.PartitionMatchMaker - Removing partition requests: []
>>>>>>> 21:51:54.218 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.executor.JobExecutor - aborting doomed task clusters
>>>>>>> 21:51:54.219 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.executor.JobExecutor - number of doomed task clusters found = 0
>>>>>>> 21:51:54.219 [Worker:ClusterController] INFO
>>>>>>>  org.apache.hyracks.control.cc.executor.JobExecutor - Abort map for job:
>>>>>>> JID:0.700: {1=[TAID:TID:ANID:ODID:2:0:0:0, TAID:TID:ANID:ODID:2:0:1:0]}
>>>>>>>
>>>>>>> Best,
>>>>>>> Taewoo
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
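
A note on the three-line NodeManager sequence that recurs in the cc.log
above: the messages themselves suggest that, once a node exceeds the max
miss period, the CC asks it to shut down so the failure becomes definite,
and a shutdown request that succeeds means the node was in fact reachable,
i.e. a false positive heartbeat miss. A hedged sketch of that flow; the
names here (NodeProxy, requestShutdown, handleMissedHeartbeats) are
hypothetical and not the Hyracks API:

// Hedged sketch of the "ensure failure" flow the log messages suggest.
interface NodeProxy {
    // Returns true if the node acknowledged the shutdown request,
    // i.e. it was still reachable when asked.
    boolean requestShutdown() throws Exception;
}

final class EnsureFailureSketch {
    static void handleMissedHeartbeats(String nodeId, NodeProxy node,
                                       long msSinceLastHeartbeat, long maxMissPeriodMs) {
        System.out.println("Requesting node " + nodeId + " to shutdown to ensure failure");
        boolean acked = false;
        try {
            acked = node.requestShutdown();
        } catch (Exception e) {
            // Unreachable node: the heartbeat miss was a genuine failure.
        }
        if (acked) {
            // The node was alive after all (a false positive miss), but it has
            // now been shut down, so the failure is definite either way.
            System.out.println("Request to shutdown failed node " + nodeId
                    + " succeeded. false positive heartbeat miss indication");
        }
        System.out.println(nodeId + " considered dead. Last heartbeat received "
                + msSinceLastHeartbeat + "ms ago. Max miss period: " + maxMissPeriodMs + "ms");
        // The CC would then remove the node and abort its jobs
        // (cf. RemoveDeadNodesWork in the log above).
    }
}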
