asterixdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Taewoo Kim <wangs...@gmail.com>
Subject Re: One of NC node is not stopping for a hash join.
Date Fri, 02 Dec 2016 18:22:40 GMT
Additional note: @Till: Yes. It happened again for the same hash-join
query. As we can see in the bold part of the following CC.log, one node
alone was executing for two hours.


Dec 01, 2016 10:41:56 PM
org.apache.hyracks.control.cc.scheduler.ActivityClusterPlanner
planActivityCluster
INFO: Plan for org.apache.hyracks.api.job.ActivityCluster@383ecfdd
Dec 01, 2016 10:41:56 PM
org.apache.hyracks.control.cc.scheduler.ActivityClusterPlanner
planActivityCluster
INFO: Built 1 Task Clusters
Dec 01, 2016 10:41:56 PM
org.apache.hyracks.control.cc.scheduler.ActivityClusterPlanner
planActivityCluster
INFO: Tasks: [TID:ANID:ODID:1:1:0, TID:ANID:ODID:1:1:1,
TID:ANID:ODID:1:1:2, TID:ANID:ODID:1:1:3, TID:ANID:ODID:1:1:4,
TID:ANID:ODID:1:1:5, TID:ANID:ODID:1:1:6, TID:ANID:ODID:1:1:7,
TID:ANID:ODID:1:1:8, TID:ANID:ODID:1:1:9, TID:ANID:ODID:1:1:10,
TID:ANID:ODID:1:1:11, TID:ANID:ODID:1:1:12, TID:ANID:ODID:1:1:13,
TID:ANID:ODID:1:1:14, TID:ANID:ODID:1:1:15, TID:ANID:ODID:1:1:16,
TID:ANID:ODID:1:1:17, TID:ANID:ODID:4:0:0]
Dec 01, 2016 10:43:18 PM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: TaskComplete: [ss1120_nc3[JID:5:TAID:TID:ANID:ODID:1:1:5:0]
Dec 01, 2016 10:43:22 PM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: TaskComplete: [ss1120_nc4[JID:5:TAID:TID:ANID:ODID:1:1:7:0]
Dec 01, 2016 10:43:23 PM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: TaskComplete:
[ss1120_nc9[JID:5:TAID:TID:ANID:ODID:1:1:16:0]
Dec 01, 2016 10:43:28 PM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: TaskComplete: [ss1120_nc2[JID:5:TAID:TID:ANID:ODID:1:1:2:0]
Dec 01, 2016 10:43:31 PM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: TaskComplete: [ss1120_nc2[JID:5:TAID:TID:ANID:ODID:1:1:3:0]
Dec 01, 2016 10:43:34 PM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: TaskComplete: [ss1120_nc5[JID:5:TAID:TID:ANID:ODID:1:1:8:0]
Dec 01, 2016 10:43:40 PM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: TaskComplete:
[ss1120_nc9[JID:5:TAID:TID:ANID:ODID:1:1:17:0]
Dec 01, 2016 10:43:41 PM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: TaskComplete: [ss1120_nc4[JID:5:TAID:TID:ANID:ODID:1:1:6:0]
Dec 01, 2016 10:43:49 PM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: TaskComplete:
[ss1120_nc7[JID:5:TAID:TID:ANID:ODID:1:1:12:0]
Dec 01, 2016 10:43:51 PM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: TaskComplete: [ss1120_nc1[JID:5:TAID:TID:ANID:ODID:1:1:1:0]
Dec 01, 2016 10:43:53 PM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: TaskComplete: [ss1120_nc5[JID:5:TAID:TID:ANID:ODID:1:1:9:0]
Dec 01, 2016 10:43:58 PM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: TaskComplete:
[ss1120_nc7[JID:5:TAID:TID:ANID:ODID:1:1:13:0]
Dec 01, 2016 10:44:25 PM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: TaskComplete:
[ss1120_nc8[JID:5:TAID:TID:ANID:ODID:1:1:14:0]
Dec 01, 2016 10:44:29 PM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: TaskComplete:
[ss1120_nc6[JID:5:TAID:TID:ANID:ODID:1:1:11:0]
Dec 01, 2016 10:44:51 PM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: TaskComplete:
[ss1120_nc8[JID:5:TAID:TID:ANID:ODID:1:1:15:0]
Dec 01, 2016 10:45:19 PM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: TaskComplete: [ss1120_nc3[JID:5:TAID:TID:ANID:ODID:1:1:4:0]
*Dec 01, 2016 10:46:31 PM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run*
*INFO: Executing: TaskComplete:
[ss1120_nc1[JID:5:TAID:TID:ANID:ODID:1:1:0:0]*
*Dec 02, 2016 12:30:19 AM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run*
*INFO: Executing: RegisterResultPartitionLocation: JobId@JID:5
ResultSetId@RSID:0 Partition@0 NPartitions@1
ResultPartitionLocation@128.195.11.47:45470
<http://ResultPartitionLocation@128.195.11.47:45470> OrderedResult@true
EmptyResult@false*
Dec 02, 2016 12:30:19 AM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: ReportResultPartitionWriteCompletion: JobId@JID:5
ResultSetId@RSID:0 Partition@0
Dec 02, 2016 12:30:19 AM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: TaskComplete: [ss1120_nc6[JID:5:TAID:TID:ANID:ODID:4:0:0:0]
Dec 02, 2016 12:30:19 AM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: TaskComplete:
[ss1120_nc6[JID:5:TAID:TID:ANID:ODID:1:1:10:0]
Dec 02, 2016 12:30:19 AM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: JobCleanup: JobId@JID:5 Status@TERMINATED
Dec 02, 2016 12:30:19 AM org.apache.hyracks.control.cc.work.JobCleanupWork
run
INFO: Cleanup for JobRun with id: JID:5
Dec 02, 2016 12:30:19 AM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: JobletCleanupNotification
Dec 02, 2016 12:30:19 AM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: JobletCleanupNotification
Dec 02, 2016 12:30:19 AM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: JobletCleanupNotification
Dec 02, 2016 12:30:19 AM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: JobletCleanupNotification
Dec 02, 2016 12:30:19 AM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: JobletCleanupNotification
Dec 02, 2016 12:30:19 AM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: JobletCleanupNotification
Dec 02, 2016 12:30:19 AM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: JobletCleanupNotification
Dec 02, 2016 12:30:19 AM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: JobletCleanupNotification
Dec 02, 2016 12:30:19 AM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: JobletCleanupNotification
Dec 02, 2016 12:30:19 AM org.apache.asterix.active.ActiveLifecycleListener
notifyJobFinish
INFO: NO NEED TO NOTIFY JOB FINISH!
Dec 02, 2016 12:30:19 AM
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
INFO: Executing: GetResultPartitionLocations: JobId@JID:5 ResultSetId@RSID:0
Known@null

Best,
Taewoo

On Thu, Dec 1, 2016 at 10:43 PM, Taewoo Kim <wangsaeu@gmail.com> wrote:

> @Ian: I have a separate CC on one node that doesn't have a NC. Yourkit
> might be a good way to find the reason. Thanks.
>
> @Till: I think so. I am sending the same query now to see what happens
> this time.
>
> Best,
> Taewoo
>
> On Thu, Dec 1, 2016 at 10:41 PM, Till Westmann <tillw@apache.org> wrote:
>
>> Hi Taewoo,
>>
>> is this behavior reproducible?
>>
>> Cheers,
>> Till
>>
>>
>> On 1 Dec 2016, at 22:14, Taewoo Kim wrote:
>>
>> PS: It took 2 more hours to finish the job on one NC. I wonder why this
>>> happens.
>>>
>>> Dec 01, 2016 7:19:35 PM
>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>> INFO: Executing: NotifyTaskComplete
>>> Dec 01, 2016 9:11:23 PM
>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>> INFO: Executing: CleanupJoblet
>>> Dec 01, 2016 9:11:23 PM
>>> org.apache.hyracks.control.nc.work.CleanupJobletWork run
>>> INFO: Cleaning up after job: JID:4
>>> Dec 01, 2016 9:11:23 PM org.apache.hyracks.control.nc.Joblet close
>>> WARNING: Freeing leaked 54919521 bytes
>>>
>>> Best,
>>> Taewoo
>>>
>>> On Thu, Dec 1, 2016 at 8:39 PM, Taewoo Kim <wangsaeu@gmail.com> wrote:
>>>
>>> Hi All,
>>>>
>>>> Have you experienced this case?
>>>>
>>>> I have 9 NCs and the CPU utilization of one NC shows 100% for 1 hour and
>>>> 30 minutes while other NCs have finished their job about 1 hour ago.
>>>> Even
>>>> the problematic NC shows the following log at the end. So, looks like
>>>> it's
>>>> done but I'm not sure why this job never finishes. It's a simple hash
>>>> join
>>>> for 9M records on 9 nodes.
>>>>
>>>> Dec 01, 2016 7:18:02 PM org.apache.hyracks.control.com
>>>> mon.work.WorkQueue$WorkerThread
>>>> run
>>>> INFO: Executing: NotifyTaskComplete
>>>>
>>>> Best,
>>>> Taewoo
>>>>
>>>>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message