From issues-return-43652-archive-asf-public=cust-asf.ponee.io@tez.apache.org Fri Jan 29 16:26:03 2021 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mxout1-ec2-va.apache.org (mxout1-ec2-va.apache.org [3.227.148.255]) by mx-eu-01.ponee.io (Postfix) with ESMTPS id 6822918064F for ; Fri, 29 Jan 2021 17:26:03 +0100 (CET) Received: from mail.apache.org (mailroute1-lw-us.apache.org [207.244.88.153]) by mxout1-ec2-va.apache.org (ASF Mail Server at mxout1-ec2-va.apache.org) with SMTP id 6EC2243B01 for ; Fri, 29 Jan 2021 16:26:02 +0000 (UTC) Received: (qmail 32041 invoked by uid 500); 29 Jan 2021 16:26:02 -0000 Mailing-List: contact issues-help@tez.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@tez.apache.org Delivered-To: mailing list issues@tez.apache.org Received: (qmail 32029 invoked by uid 99); 29 Jan 2021 16:26:02 -0000 Received: from ec2-52-204-25-47.compute-1.amazonaws.com (HELO mailrelay1-ec2-va.apache.org) (52.204.25.47) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 29 Jan 2021 16:26:02 +0000 Received: from jira2-he-de.apache.org (jira2-he-de.apache.org [168.119.33.54]) by mailrelay1-ec2-va.apache.org (ASF Mail Server at mailrelay1-ec2-va.apache.org) with ESMTPS id D769E3EB01 for ; Fri, 29 Jan 2021 16:26:01 +0000 (UTC) Received: from jira2-he-de.apache.org (localhost.localdomain [127.0.0.1]) by jira2-he-de.apache.org (ASF Mail Server at jira2-he-de.apache.org) with ESMTP id 2A5AEC802CD for ; Fri, 29 Jan 2021 16:26:01 +0000 (UTC) Date: Fri, 29 Jan 2021 16:26:01 +0000 (UTC) From: =?utf-8?Q?L=C3=A1szl=C3=B3_Bodor_=28Jira=29?= To: issues@tez.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (TEZ-4277) AsyncDispatcher can hang on serviceStop if the eventhandling thread is in BLOCKED state MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/TEZ-4277?page=3Dcom.atlassian.j= ira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=3D172745= 53#comment-17274553 ]=20 L=C3=A1szl=C3=B3 Bodor commented on TEZ-4277: ----------------------------------- https://github.com/apache/tez/pull/97 > AsyncDispatcher can hang on serviceStop if the eventhandling thread is in= BLOCKED state > -------------------------------------------------------------------------= -------------- > > Key: TEZ-4277 > URL: https://issues.apache.org/jira/browse/TEZ-4277 > Project: Apache Tez > Issue Type: Bug > Reporter: L=C3=A1szl=C3=B3 Bodor > Assignee: L=C3=A1szl=C3=B3 Bodor > Priority: Major > Fix For: 0.10.1 > > > While further testing TEZ-4236, I discovered a strange hang in AsyncDispa= tcher. The symptom was a BLOCKED state event-handling thread. This block is= most likely caused by the dispatcher itself, so instead of making yet anot= her sync hack in the dispatcher, or even DrainDispatcher (however they have= a common codebase for serviceStop at the moment), I'm about to put a safet= y check, which solved my issue. > jstack of a hang (it hangs [here|https://github.com/apache/tez/blob/maste= r/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L157]= ) > {code} > "Thread-1212" #1581 prio=3D5 os_prio=3D31 tid=3D0x00007fdb4c628800 nid=3D= 0x1b84f in Object.wait() [0x000070000997b000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Thread.join(Thread.java:1252) > - locked <0x00000007ae967340> (a java.lang.Thread) > at java.lang.Thread.join(Thread.java:1326) > at org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java= :157) > at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:22= 0) > - locked <0x00000007ae642cc8> (a java.lang.Object) > at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.jav= a:54) > at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperati= ons.java:102) > at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperati= ons.java:67) > at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:193= 3) > at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:2155= ) > - locked <0x00000007ae53d3c0> (a org.apache.tez.dag.app.LocalDAGAppMaster= ) > at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:22= 0) > - locked <0x00000007ae53d5f0> (a java.lang.Object) > at org.apache.tez.client.LocalClient.close(LocalClient.java:135) > at org.apache.tez.client.TezClient.stop(TezClient.java:775) > - locked <0x00000007ade485a0> (a org.apache.tez.client.TezClient) > at org.apache.tez.test.TestLocalMode.testMultipleClientsWithSession(TestL= ocalMode.java:147) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.j= ava:62) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccess= orImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkM= ethod.java:47) > at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCall= able.java:12) > at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMet= hod.java:44) > at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMeth= od.java:17) > at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.ru= n(FailOnTimeout.java:74) > {code} > during the investigation I added some log messages, that lead to this sol= ution: > 1. there is not a single event in the queue > 2. event handling thread is in BLOCKED state > 3. AM is already shut down, but test hangs as LocalClient.stop waits to A= M services to be stopped > {code} > 2021-01-29 16:16:44,350 INFO [Thread-2198] common.AsyncDispatcher (Async= Dispatcher.java:serviceStop(148)) - AsyncDispatcher is draining to stop, ig= noring any new events. > ... > 2021-01-29 16:16:45,354 INFO [Thread-2198] common.AsyncDispatcher (Async= Dispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to drain...= current events count: 0, handler thread state: BLOCKED > 2021-01-29 16:16:46,358 INFO [Thread-2198] common.AsyncDispatcher (Async= Dispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to drain...= current events count: 0, handler thread state: BLOCKED > 2021-01-29 16:16:47,362 INFO [Thread-2198] common.AsyncDispatcher (Async= Dispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to drain...= current events count: 0, handler thread state: BLOCKED > ... > 2021-01-29 16:16:48,365 INFO [Thread-2198] common.AsyncDispatcher (Async= Dispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to drain...= current events count: 0, handler thread state: BLOCKED > 2021-01-29 16:16:49,349 INFO [AMShutdownThread] app.DAGAppMaster (DAGApp= Master.java:run(1002)) - Calling stop for all the services > 2021-01-29 16:16:49,349 INFO [AMShutdownThread] app.DAGAppMaster (DAGApp= Master.java:run(1013)) - Exiting DAGAppMaster..GoodBye! > 2021-01-29 16:16:49,370 INFO [Thread-2198] common.AsyncDispatcher (Async= Dispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to drain...= current events count: 0, handler thread state: BLOCKED > 2021-01-29 16:16:49,370 WARN [Thread-2198] common.AsyncDispatcher (Async= Dispatcher.java:serviceStop(153)) - Cannot wait for drain events anymore, l= ogging current events if any... > 2021-01-29 16:16:49,370 INFO [Thread-2198] common.AsyncDispatcher (Async= Dispatcher.java:logCurrentEvents(186)) - Current events in the queue: 0 > 2021-01-29 16:16:49,371 INFO [Thread-2198] common.AsyncDispatcher (Async= Dispatcher.java:serviceStop(168)) - interrupting event handling thread, sta= te: BLOCKED > 2021-01-29 16:16:49,371 INFO [Thread-2198] common.AsyncDispatcher (Async= Dispatcher.java:serviceStop(170)) - interrupted event handling thread, stat= e: BLOCKED > 2021-01-29 16:16:49,371 INFO [Thread-2198] common.AsyncDispatcher (Async= Dispatcher.java:serviceStop(172)) - joining event handling thread, state: B= LOCKED > {code} > under normal circumstances the event handling thread is in WAITING state = (which I think is because of queue.take) > + I also included the timeout logic of YARN-3999 here -- This message was sent by Atlassian Jira (v8.3.4#803005)