asterixdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Taewoo Kim <wangs...@gmail.com>
Subject Re: One of NC node is not stopping for a hash join.
Date Tue, 06 Dec 2016 05:39:47 GMT
@Abdullah: Thanks. I missed your e-mail and just checked that. Will try.

Best,
Taewoo

On Fri, Dec 2, 2016 at 10:32 AM, abdullah alamoudi <bamousaa@gmail.com>
wrote:

> Taewoo,
> You can use the diagnostics end point (/admin/diagnostics) to look at all
> the stack traces from a single interface when that happens. This could give
> an idea on what is happening in such case.
> Although, from what you described, it could be that we have some skewness
> during query execution? (could be nulls,missing? any special values?). That
> is also worth considering.
>
> Trying to help without enough context :-). Cheers,
> Abdullah.
>
> > On Dec 2, 2016, at 10:22 AM, Taewoo Kim <wangsaeu@gmail.com> wrote:
> >
> > Additional note: @Till: Yes. It happened again for the same hash-join
> > query. As we can see in the bold part of the following CC.log, one node
> > alone was executing for two hours.
> >
> >
> > Dec 01, 2016 10:41:56 PM
> > org.apache.hyracks.control.cc.scheduler.ActivityClusterPlanner
> > planActivityCluster
> > INFO: Plan for org.apache.hyracks.api.job.ActivityCluster@383ecfdd
> > Dec 01, 2016 10:41:56 PM
> > org.apache.hyracks.control.cc.scheduler.ActivityClusterPlanner
> > planActivityCluster
> > INFO: Built 1 Task Clusters
> > Dec 01, 2016 10:41:56 PM
> > org.apache.hyracks.control.cc.scheduler.ActivityClusterPlanner
> > planActivityCluster
> > INFO: Tasks: [TID:ANID:ODID:1:1:0, TID:ANID:ODID:1:1:1,
> > TID:ANID:ODID:1:1:2, TID:ANID:ODID:1:1:3, TID:ANID:ODID:1:1:4,
> > TID:ANID:ODID:1:1:5, TID:ANID:ODID:1:1:6, TID:ANID:ODID:1:1:7,
> > TID:ANID:ODID:1:1:8, TID:ANID:ODID:1:1:9, TID:ANID:ODID:1:1:10,
> > TID:ANID:ODID:1:1:11, TID:ANID:ODID:1:1:12, TID:ANID:ODID:1:1:13,
> > TID:ANID:ODID:1:1:14, TID:ANID:ODID:1:1:15, TID:ANID:ODID:1:1:16,
> > TID:ANID:ODID:1:1:17, TID:ANID:ODID:4:0:0]
> > Dec 01, 2016 10:43:18 PM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: TaskComplete: [ss1120_nc3[JID:5:TAID:TID:
> ANID:ODID:1:1:5:0]
> > Dec 01, 2016 10:43:22 PM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: TaskComplete: [ss1120_nc4[JID:5:TAID:TID:
> ANID:ODID:1:1:7:0]
> > Dec 01, 2016 10:43:23 PM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: TaskComplete:
> > [ss1120_nc9[JID:5:TAID:TID:ANID:ODID:1:1:16:0]
> > Dec 01, 2016 10:43:28 PM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: TaskComplete: [ss1120_nc2[JID:5:TAID:TID:
> ANID:ODID:1:1:2:0]
> > Dec 01, 2016 10:43:31 PM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: TaskComplete: [ss1120_nc2[JID:5:TAID:TID:
> ANID:ODID:1:1:3:0]
> > Dec 01, 2016 10:43:34 PM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: TaskComplete: [ss1120_nc5[JID:5:TAID:TID:
> ANID:ODID:1:1:8:0]
> > Dec 01, 2016 10:43:40 PM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: TaskComplete:
> > [ss1120_nc9[JID:5:TAID:TID:ANID:ODID:1:1:17:0]
> > Dec 01, 2016 10:43:41 PM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: TaskComplete: [ss1120_nc4[JID:5:TAID:TID:
> ANID:ODID:1:1:6:0]
> > Dec 01, 2016 10:43:49 PM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: TaskComplete:
> > [ss1120_nc7[JID:5:TAID:TID:ANID:ODID:1:1:12:0]
> > Dec 01, 2016 10:43:51 PM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: TaskComplete: [ss1120_nc1[JID:5:TAID:TID:
> ANID:ODID:1:1:1:0]
> > Dec 01, 2016 10:43:53 PM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: TaskComplete: [ss1120_nc5[JID:5:TAID:TID:
> ANID:ODID:1:1:9:0]
> > Dec 01, 2016 10:43:58 PM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: TaskComplete:
> > [ss1120_nc7[JID:5:TAID:TID:ANID:ODID:1:1:13:0]
> > Dec 01, 2016 10:44:25 PM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: TaskComplete:
> > [ss1120_nc8[JID:5:TAID:TID:ANID:ODID:1:1:14:0]
> > Dec 01, 2016 10:44:29 PM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: TaskComplete:
> > [ss1120_nc6[JID:5:TAID:TID:ANID:ODID:1:1:11:0]
> > Dec 01, 2016 10:44:51 PM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: TaskComplete:
> > [ss1120_nc8[JID:5:TAID:TID:ANID:ODID:1:1:15:0]
> > Dec 01, 2016 10:45:19 PM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: TaskComplete: [ss1120_nc3[JID:5:TAID:TID:
> ANID:ODID:1:1:4:0]
> > *Dec 01, 2016 10:46:31 PM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run*
> > *INFO: Executing: TaskComplete:
> > [ss1120_nc1[JID:5:TAID:TID:ANID:ODID:1:1:0:0]*
> > *Dec 02, 2016 12:30:19 AM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run*
> > *INFO: Executing: RegisterResultPartitionLocation: JobId@JID:5
> > ResultSetId@RSID:0 Partition@0 NPartitions@1
> > ResultPartitionLocation@128.195.11.47:45470
> > <http://ResultPartitionLocation@128.195.11.47:45470> OrderedResult@true
> > EmptyResult@false*
> > Dec 02, 2016 12:30:19 AM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: ReportResultPartitionWriteCompletion: JobId@JID:5
> > ResultSetId@RSID:0 Partition@0
> > Dec 02, 2016 12:30:19 AM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: TaskComplete: [ss1120_nc6[JID:5:TAID:TID:
> ANID:ODID:4:0:0:0]
> > Dec 02, 2016 12:30:19 AM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: TaskComplete:
> > [ss1120_nc6[JID:5:TAID:TID:ANID:ODID:1:1:10:0]
> > Dec 02, 2016 12:30:19 AM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: JobCleanup: JobId@JID:5 Status@TERMINATED
> > Dec 02, 2016 12:30:19 AM org.apache.hyracks.control.cc.
> work.JobCleanupWork
> > run
> > INFO: Cleanup for JobRun with id: JID:5
> > Dec 02, 2016 12:30:19 AM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: JobletCleanupNotification
> > Dec 02, 2016 12:30:19 AM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: JobletCleanupNotification
> > Dec 02, 2016 12:30:19 AM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: JobletCleanupNotification
> > Dec 02, 2016 12:30:19 AM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: JobletCleanupNotification
> > Dec 02, 2016 12:30:19 AM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: JobletCleanupNotification
> > Dec 02, 2016 12:30:19 AM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: JobletCleanupNotification
> > Dec 02, 2016 12:30:19 AM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: JobletCleanupNotification
> > Dec 02, 2016 12:30:19 AM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: JobletCleanupNotification
> > Dec 02, 2016 12:30:19 AM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: JobletCleanupNotification
> > Dec 02, 2016 12:30:19 AM org.apache.asterix.active.
> ActiveLifecycleListener
> > notifyJobFinish
> > INFO: NO NEED TO NOTIFY JOB FINISH!
> > Dec 02, 2016 12:30:19 AM
> > org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> > INFO: Executing: GetResultPartitionLocations: JobId@JID:5
> ResultSetId@RSID:0
> > Known@null
> >
> > Best,
> > Taewoo
> >
> > On Thu, Dec 1, 2016 at 10:43 PM, Taewoo Kim <wangsaeu@gmail.com> wrote:
> >
> >> @Ian: I have a separate CC on one node that doesn't have a NC. Yourkit
> >> might be a good way to find the reason. Thanks.
> >>
> >> @Till: I think so. I am sending the same query now to see what happens
> >> this time.
> >>
> >> Best,
> >> Taewoo
> >>
> >> On Thu, Dec 1, 2016 at 10:41 PM, Till Westmann <tillw@apache.org>
> wrote:
> >>
> >>> Hi Taewoo,
> >>>
> >>> is this behavior reproducible?
> >>>
> >>> Cheers,
> >>> Till
> >>>
> >>>
> >>> On 1 Dec 2016, at 22:14, Taewoo Kim wrote:
> >>>
> >>> PS: It took 2 more hours to finish the job on one NC. I wonder why this
> >>>> happens.
> >>>>
> >>>> Dec 01, 2016 7:19:35 PM
> >>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> >>>> INFO: Executing: NotifyTaskComplete
> >>>> Dec 01, 2016 9:11:23 PM
> >>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> >>>> INFO: Executing: CleanupJoblet
> >>>> Dec 01, 2016 9:11:23 PM
> >>>> org.apache.hyracks.control.nc.work.CleanupJobletWork run
> >>>> INFO: Cleaning up after job: JID:4
> >>>> Dec 01, 2016 9:11:23 PM org.apache.hyracks.control.nc.Joblet close
> >>>> WARNING: Freeing leaked 54919521 bytes
> >>>>
> >>>> Best,
> >>>> Taewoo
> >>>>
> >>>> On Thu, Dec 1, 2016 at 8:39 PM, Taewoo Kim <wangsaeu@gmail.com>
> wrote:
> >>>>
> >>>> Hi All,
> >>>>>
> >>>>> Have you experienced this case?
> >>>>>
> >>>>> I have 9 NCs and the CPU utilization of one NC shows 100% for 1
hour
> and
> >>>>> 30 minutes while other NCs have finished their job about 1 hour
ago.
> >>>>> Even
> >>>>> the problematic NC shows the following log at the end. So, looks
like
> >>>>> it's
> >>>>> done but I'm not sure why this job never finishes. It's a simple
hash
> >>>>> join
> >>>>> for 9M records on 9 nodes.
> >>>>>
> >>>>> Dec 01, 2016 7:18:02 PM org.apache.hyracks.control.com
> >>>>> mon.work.WorkQueue$WorkerThread
> >>>>> run
> >>>>> INFO: Executing: NotifyTaskComplete
> >>>>>
> >>>>> Best,
> >>>>> Taewoo
> >>>>>
> >>>>>
> >>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message