asterixdb-dev mailing list archives

From "Till Westmann" <ti...@apache.org>
Subject Re: One of NC node is not stopping for a hash join.
Date Thu, 08 Dec 2016 03:30:22 GMT
Interesting. Thanks for filing the issue as well!

Cheers,
Till

On 7 Dec 2016, at 15:08, Taewoo Kim wrote:

> In short, the reason one specific node among the 9 didn't stop its
> hash-join job was skewness (out of 9M records, 40,000 records
> contained the same join key), as Abdullah suggested. Thanks all for
> the information. Our system works as expected for this matter! Along
> the way, I found some optimization opportunities and excessive byte[]
> allocation issues; fixing these will improve performance. :-)
>
> Best,
> Taewoo
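[Editorial illustration of the skew described above: with hash partitioning, every record sharing a join key hashes to the same partition, so all 40,000 hot-key records land on a single node. A minimal sketch; the partitioning function is a stand-in, not Hyracks' actual hash family.]

```python
# Records with equal join keys always hash to the same partition, so a
# heavily duplicated key concentrates all of its records on one node.
NUM_NODES = 9

def partition(join_key: str) -> int:
    # Stand-in partitioner; Hyracks uses its own hash functions.
    return hash(join_key) % NUM_NODES

# 40,000 records sharing one join key all map to the same node.
targets = {partition("hot-key") for _ in range(40_000)}
print(len(targets))  # 1
```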
>
> On Mon, Dec 5, 2016 at 9:39 PM, Taewoo Kim <wangsaeu@gmail.com> wrote:
>
>> @Abdullah: Thanks. I missed your e-mail and just checked that. Will 
>> try.
>>
>> Best,
>> Taewoo
>>
>> On Fri, Dec 2, 2016 at 10:32 AM, abdullah alamoudi <bamousaa@gmail.com> wrote:
>>
>>> Taewoo,
>>> You can use the diagnostics endpoint (/admin/diagnostics) to look at
>>> all the stack traces from a single interface when that happens. This
>>> could give an idea of what is happening in such a case.
>>> Also, from what you described, it could be that we have some skewness
>>> during query execution (could it be nulls or missing values? any
>>> special values?). That is also worth considering.
>>>
>>> Trying to help without enough context :-). Cheers,
>>> Abdullah.
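[Editorial sketch of pulling the diagnostics dump mentioned above over HTTP; the host and port below are placeholders for your CC's API address, not values from this thread.]

```python
# Build the diagnostics URL and fetch the JSON dump of stack traces.
import json
import urllib.request

def diagnostics_url(base_url: str) -> str:
    # Normalize the base URL and append the endpoint path.
    return base_url.rstrip("/") + "/admin/diagnostics"

def fetch_diagnostics(base_url: str) -> dict:
    with urllib.request.urlopen(diagnostics_url(base_url)) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example (hypothetical host/port):
# print(json.dumps(fetch_diagnostics("http://cc-host:19002"), indent=2))
```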
>>>
>>>> On Dec 2, 2016, at 10:22 AM, Taewoo Kim <wangsaeu@gmail.com> wrote:
>>>>
>>>> Additional note: @Till: Yes. It happened again for the same 
>>>> hash-join
>>>> query. As we can see in the bold part of the following CC.log, one 
>>>> node
>>>> alone was executing for two hours.
>>>>
>>>>
>>>> Dec 01, 2016 10:41:56 PM
>>>> org.apache.hyracks.control.cc.scheduler.ActivityClusterPlanner
>>>> planActivityCluster
>>>> INFO: Plan for org.apache.hyracks.api.job.ActivityCluster@383ecfdd
>>>> Dec 01, 2016 10:41:56 PM
>>>> org.apache.hyracks.control.cc.scheduler.ActivityClusterPlanner
>>>> planActivityCluster
>>>> INFO: Built 1 Task Clusters
>>>> Dec 01, 2016 10:41:56 PM
>>>> org.apache.hyracks.control.cc.scheduler.ActivityClusterPlanner
>>>> planActivityCluster
>>>> INFO: Tasks: [TID:ANID:ODID:1:1:0, TID:ANID:ODID:1:1:1,
>>>> TID:ANID:ODID:1:1:2, TID:ANID:ODID:1:1:3, TID:ANID:ODID:1:1:4,
>>>> TID:ANID:ODID:1:1:5, TID:ANID:ODID:1:1:6, TID:ANID:ODID:1:1:7,
>>>> TID:ANID:ODID:1:1:8, TID:ANID:ODID:1:1:9, TID:ANID:ODID:1:1:10,
>>>> TID:ANID:ODID:1:1:11, TID:ANID:ODID:1:1:12, TID:ANID:ODID:1:1:13,
>>>> TID:ANID:ODID:1:1:14, TID:ANID:ODID:1:1:15, TID:ANID:ODID:1:1:16,
>>>> TID:ANID:ODID:1:1:17, TID:ANID:ODID:4:0:0]
>>>> Dec 01, 2016 10:43:18 PM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: TaskComplete: [ss1120_nc3[JID:5:TAID:TID:ANID:ODID:1:1:5:0]
>>>> Dec 01, 2016 10:43:22 PM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: TaskComplete: [ss1120_nc4[JID:5:TAID:TID:ANID:ODID:1:1:7:0]
>>>> Dec 01, 2016 10:43:23 PM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: TaskComplete:
>>>> [ss1120_nc9[JID:5:TAID:TID:ANID:ODID:1:1:16:0]
>>>> Dec 01, 2016 10:43:28 PM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: TaskComplete: [ss1120_nc2[JID:5:TAID:TID:ANID:ODID:1:1:2:0]
>>>> Dec 01, 2016 10:43:31 PM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: TaskComplete: [ss1120_nc2[JID:5:TAID:TID:ANID:ODID:1:1:3:0]
>>>> Dec 01, 2016 10:43:34 PM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: TaskComplete: [ss1120_nc5[JID:5:TAID:TID:ANID:ODID:1:1:8:0]
>>>> Dec 01, 2016 10:43:40 PM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: TaskComplete:
>>>> [ss1120_nc9[JID:5:TAID:TID:ANID:ODID:1:1:17:0]
>>>> Dec 01, 2016 10:43:41 PM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: TaskComplete: [ss1120_nc4[JID:5:TAID:TID:ANID:ODID:1:1:6:0]
>>>> Dec 01, 2016 10:43:49 PM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: TaskComplete:
>>>> [ss1120_nc7[JID:5:TAID:TID:ANID:ODID:1:1:12:0]
>>>> Dec 01, 2016 10:43:51 PM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: TaskComplete: [ss1120_nc1[JID:5:TAID:TID:ANID:ODID:1:1:1:0]
>>>> Dec 01, 2016 10:43:53 PM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: TaskComplete: [ss1120_nc5[JID:5:TAID:TID:ANID:ODID:1:1:9:0]
>>>> Dec 01, 2016 10:43:58 PM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: TaskComplete:
>>>> [ss1120_nc7[JID:5:TAID:TID:ANID:ODID:1:1:13:0]
>>>> Dec 01, 2016 10:44:25 PM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: TaskComplete:
>>>> [ss1120_nc8[JID:5:TAID:TID:ANID:ODID:1:1:14:0]
>>>> Dec 01, 2016 10:44:29 PM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: TaskComplete:
>>>> [ss1120_nc6[JID:5:TAID:TID:ANID:ODID:1:1:11:0]
>>>> Dec 01, 2016 10:44:51 PM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: TaskComplete:
>>>> [ss1120_nc8[JID:5:TAID:TID:ANID:ODID:1:1:15:0]
>>>> Dec 01, 2016 10:45:19 PM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: TaskComplete: [ss1120_nc3[JID:5:TAID:TID:ANID:ODID:1:1:4:0]
>>>> *Dec 01, 2016 10:46:31 PM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run*
>>>> *INFO: Executing: TaskComplete:
>>>> [ss1120_nc1[JID:5:TAID:TID:ANID:ODID:1:1:0:0]*
>>>> *Dec 02, 2016 12:30:19 AM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run*
>>>> *INFO: Executing: RegisterResultPartitionLocation: JobId@JID:5
>>>> ResultSetId@RSID:0 Partition@0 NPartitions@1
>>>> ResultPartitionLocation@128.195.11.47:45470 OrderedResult@true
>>>> EmptyResult@false*
>>>> Dec 02, 2016 12:30:19 AM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: ReportResultPartitionWriteCompletion: JobId@JID:5
>>>> ResultSetId@RSID:0 Partition@0
>>>> Dec 02, 2016 12:30:19 AM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: TaskComplete: [ss1120_nc6[JID:5:TAID:TID:ANID:ODID:4:0:0:0]
>>>> Dec 02, 2016 12:30:19 AM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: TaskComplete:
>>>> [ss1120_nc6[JID:5:TAID:TID:ANID:ODID:1:1:10:0]
>>>> Dec 02, 2016 12:30:19 AM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: JobCleanup: JobId@JID:5 Status@TERMINATED
>>>> Dec 02, 2016 12:30:19 AM
>>>> org.apache.hyracks.control.cc.work.JobCleanupWork run
>>>> INFO: Cleanup for JobRun with id: JID:5
>>>> Dec 02, 2016 12:30:19 AM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: JobletCleanupNotification
>>>> Dec 02, 2016 12:30:19 AM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: JobletCleanupNotification
>>>> Dec 02, 2016 12:30:19 AM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: JobletCleanupNotification
>>>> Dec 02, 2016 12:30:19 AM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: JobletCleanupNotification
>>>> Dec 02, 2016 12:30:19 AM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: JobletCleanupNotification
>>>> Dec 02, 2016 12:30:19 AM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: JobletCleanupNotification
>>>> Dec 02, 2016 12:30:19 AM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: JobletCleanupNotification
>>>> Dec 02, 2016 12:30:19 AM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: JobletCleanupNotification
>>>> Dec 02, 2016 12:30:19 AM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: JobletCleanupNotification
>>>> Dec 02, 2016 12:30:19 AM
>>>> org.apache.asterix.active.ActiveLifecycleListener notifyJobFinish
>>>> INFO: NO NEED TO NOTIFY JOB FINISH!
>>>> Dec 02, 2016 12:30:19 AM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: GetResultPartitionLocations: JobId@JID:5 ResultSetId@RSID:0 Known@null
>>>>
>>>> Best,
>>>> Taewoo
>>>>
>>>> On Thu, Dec 1, 2016 at 10:43 PM, Taewoo Kim <wangsaeu@gmail.com> 
>>>> wrote:
>>>>
>>>>> @Ian: I have a separate CC on one node that doesn't have an NC. YourKit
>>>>> might be a good way to find the reason. Thanks.
>>>>>
>>>>> @Till: I think so. I am sending the same query now to see what 
>>>>> happens
>>>>> this time.
>>>>>
>>>>> Best,
>>>>> Taewoo
>>>>>
>>>>> On Thu, Dec 1, 2016 at 10:41 PM, Till Westmann <tillw@apache.org> wrote:
>>>>>
>>>>>> Hi Taewoo,
>>>>>>
>>>>>> is this behavior reproducible?
>>>>>>
>>>>>> Cheers,
>>>>>> Till
>>>>>>
>>>>>>
>>>>>> On 1 Dec 2016, at 22:14, Taewoo Kim wrote:
>>>>>>
>>>>>>> PS: It took 2 more hours to finish the job on one NC. I wonder why
>>>>>>> this happens.
>>>>>>>
>>>>>>> Dec 01, 2016 7:19:35 PM
>>>>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>>>>> INFO: Executing: NotifyTaskComplete
>>>>>>> Dec 01, 2016 9:11:23 PM
>>>>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>>>>> INFO: Executing: CleanupJoblet
>>>>>>> Dec 01, 2016 9:11:23 PM
>>>>>>> org.apache.hyracks.control.nc.work.CleanupJobletWork run
>>>>>>> INFO: Cleaning up after job: JID:4
>>>>>>> Dec 01, 2016 9:11:23 PM org.apache.hyracks.control.nc.Joblet close
>>>>>>> WARNING: Freeing leaked 54919521 bytes
>>>>>>>
>>>>>>> Best,
>>>>>>> Taewoo
>>>>>>>
>>>>>>> On Thu, Dec 1, 2016 at 8:39 PM, Taewoo Kim <wangsaeu@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> Have you experienced this case?
>>>>>>>>
>>>>>>>> I have 9 NCs, and the CPU utilization of one NC has been at 100% for
>>>>>>>> 1 hour and 30 minutes while the other NCs finished their work about
>>>>>>>> an hour ago. Even the problematic NC shows the following log at the
>>>>>>>> end, so it looks like it's done, but I'm not sure why this job never
>>>>>>>> finishes. It's a simple hash join of 9M records on 9 nodes.
>>>>>>>>
>>>>>>>> Dec 01, 2016 7:18:02 PM
>>>>>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>>>>>> INFO: Executing: NotifyTaskComplete
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Taewoo
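[Editorial sketch of why one node can stay busy long after the others finish: if a single join key is heavily duplicated on both sides, the node owning that key performs a number of comparisons proportional to the product of the duplicate counts. A toy in-memory hash join, not Hyracks' hybrid hash join.]

```python
from collections import defaultdict

def hash_join(build, probe):
    """Toy hash join over (key, payload) pairs; returns the matched
    pairs and the number of key comparisons performed."""
    table = defaultdict(list)
    for key, payload in build:
        table[key].append(payload)
    matches, comparisons = [], 0
    for key, payload in probe:
        # Every probe record with a hot key scans all build duplicates.
        for b in table.get(key, []):
            comparisons += 1
            matches.append((b, payload))
    return matches, comparisons

# 1,000 duplicates of one key on each side -> 1,000,000 comparisons,
# whereas 1,000 distinct matching keys would need only 1,000.
skewed_build = [("hot", i) for i in range(1000)]
skewed_probe = [("hot", i) for i in range(1000)]
_, n = hash_join(skewed_build, skewed_probe)
print(n)  # 1000000
```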
>>>>>>>>
>>>>>>>>
>>>>>
>>>
>>>
>>
