hadoop-common-issues mailing list archives

From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-14214) DomainSocketWatcher::add()/delete() should not self interrupt while looping await()
Date Thu, 23 Mar 2017 02:37:41 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-14214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937587#comment-15937587 ]

Hadoop QA commented on HADOOP-14214:
------------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 25s{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 12m 39s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 20m 17s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 35s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 19s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 25s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 50s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 15m 58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 15m 58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 19s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 49s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  7m 44s{color} | {color:red} hadoop-common in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 33s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 68m 36s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.security.TestRaceWhenRelogin |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:a9ad5d6 |
| JIRA Issue | HADOOP-14214 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12860055/HADOOP-14214.000.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  findbugs  checkstyle  |
| uname | Linux a20fbdc56866 3.13.0-103-generic #150-Ubuntu SMP Thu Nov 24 10:34:17 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / f462e1f |
| Default Java | 1.8.0_121 |
| findbugs | v3.0.0 |
| unit | https://builds.apache.org/job/PreCommit-HADOOP-Build/11889/artifact/patchprocess/patch-unit-hadoop-common-project_hadoop-common.txt |
|  Test Results | https://builds.apache.org/job/PreCommit-HADOOP-Build/11889/testReport/ |
| modules | C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common |
| Console output | https://builds.apache.org/job/PreCommit-HADOOP-Build/11889/console |
| Powered by | Apache Yetus 0.5.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> DomainSocketWatcher::add()/delete() should not self interrupt while looping await()
> -----------------------------------------------------------------------------------
>
>                 Key: HADOOP-14214
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14214
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: hdfs-client
>            Reporter: Mingliang Liu
>            Assignee: Mingliang Liu
>            Priority: Critical
>         Attachments: HADOOP-14214.000.patch
>
>
> Our Hive team found a TPC-DS job whose queries, running on LLAP, appeared to be stuck. Dozens of threads were waiting on the {{DfsClientShmManager::lock}}, as shown in the following jstack:
> {code}
> Thread 251 (IO-Elevator-Thread-5):
>   State: WAITING
>   Blocked count: 3871
>   Waited count: 4565
>   Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@16ead198
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976)
>     org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:255)
>     org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434)
>     org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1017)
>     org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:476)
>     org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:784)
>     org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:718)
>     org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:422)
>     org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:333)
>     org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1181)
>     org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1118)
>     org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1478)
>     org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1441)
>     org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:121)
>     org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111)
>     org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:166)
>     org.apache.hadoop.hive.llap.io.metadata.OrcStripeMetadata.<init>(OrcStripeMetadata.java:64)
>     org.apache.hadoop.hive.llap.io.encoded.OrcEncodedDataReader.readStripesMetadata(OrcEncodedDataReader.java:622)
> {code}
> The thread that is expected to signal those waiting threads is calling the {{DomainSocketWatcher::add()}} method, but it gets stuck there handling InterruptedException endlessly. Its jstack looks like:
> {code}
> Thread 44417 (TezTR-257387_2840_12_10_52_0):
>   State: RUNNABLE
>   Blocked count: 3
>   Waited count: 5
>   Stack:
>     java.lang.Throwable.fillInStackTrace(Native Method)
>     java.lang.Throwable.fillInStackTrace(Throwable.java:783)
>     java.lang.Throwable.<init>(Throwable.java:250)
>     java.lang.Exception.<init>(Exception.java:54)
>     java.lang.InterruptedException.<init>(InterruptedException.java:57)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2034)
>     org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:325)
>     org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:266)
>     org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434)
>     org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1017)
>     org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:476)
>     org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:784)
>     org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:718)
>     org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:422)
>     org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:333)
>     org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1181)
>     org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1118)
>     org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1478)
>     org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1441)
>     org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:121)
> {code}
> The whole job makes no progress because of this.
> The thread in {{DomainSocketWatcher::add()}} is expected to eventually break out of the while loop, where it waits for the newly added entry to be removed by another thread. However, if this thread is ever interrupted, chances are it will hold the lock forever, so {{if (!toAdd.contains(entry))}} will always be false and the loop will never exit.
> {code:title=DomainSocketWatcher::add()}
>   public void add(DomainSocket sock, Handler handler) {
>     lock.lock();
>     try {
>       ......
>       toAdd.add(entry);
>       kick();
>       while (true) {
>         try {
>           processedCond.await();
>         } catch (InterruptedException e) {
>           Thread.currentThread().interrupt();
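>           // BUG: re-setting the interrupt status here makes the next
>           // await() throw InterruptedException immediately, so the loop
>           // spins without ever releasing the lock (see the analysis below).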
>         }
>         if (!toAdd.contains(entry)) {
>           break;
>         }
>       }
>     } finally {
>       lock.unlock();
>     }
>   }
> {code}
> The reason is that this method catches the InterruptedException and self-interrupts during await(). The await() method internally calls {{AbstractQueuedSynchronizer::await()}}, which throws a new InterruptedException right away if the calling thread's interrupt status is already set.
> {code:title=AbstractQueuedSynchronizer::await()}
>         public final void await() throws InterruptedException {
>             if (Thread.interrupted())
>                 throw new InterruptedException();
>             Node node = addConditionWaiter();
>             ...
> {code}
> Our code in {{DomainSocketWatcher::add()}} catches this exception (again) and self-interrupts (again). Note that throughout this process the associated lock is never released, so the other thread, which is supposed to make {{if (!toAdd.contains(entry))}} true, is still blocked waiting for the lock.
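> To make the spin concrete, here is a minimal standalone sketch (illustrative only, not Hadoop code; the class and variable names are made up): a waiter that re-interrupts itself inside the catch never parks in await() again and never releases the lock, so the thread that is supposed to signal it can never acquire the lock.
> {code:title=Standalone illustration of the spin (hypothetical example)}
> import java.util.concurrent.locks.Condition;
> import java.util.concurrent.locks.ReentrantLock;
>
> public class SelfInterruptSpin {
>   private static final ReentrantLock lock = new ReentrantLock();
>   private static final Condition cond = lock.newCondition();
>   private static boolean done = false;   // guarded by lock
>
>   public static void main(String[] args) throws Exception {
>     Thread waiter = new Thread(() -> {
>       lock.lock();
>       try {
>         while (!done) {
>           try {
>             cond.await();                         // throws at once while the flag is set
>           } catch (InterruptedException e) {
>             Thread.currentThread().interrupt();   // same anti-pattern as add()
>           }
>         }
>       } finally {
>         lock.unlock();
>       }
>     }, "waiter");
>     waiter.start();
>
>     Thread.sleep(100);    // let the waiter park in await()
>     waiter.interrupt();   // from now on it spins while holding the lock
>     Thread.sleep(100);
>
>     // Plays the role of the watcher thread: it should set done and signal,
>     // but it blocks here forever because the spinning waiter never unlocks.
>     lock.lock();
>     try {
>       done = true;
>       cond.signalAll();
>     } finally {
>       lock.unlock();
>     }
>   }
> }
> {code}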
> {{DomainSocketWatcher::delete()}} has similar logic and should suffer from the same problem.
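> One possible shape of a fix, sketched below purely for illustration (the attached HADOOP-14214.000.patch may do this differently), mirrors the quoted snippet: remember that an interrupt happened, keep looping without re-setting the interrupt status, and restore the status only after the entry has been processed, so that every await() call can actually block and release the lock. Using {{awaitUninterruptibly()}} plus a deferred re-interrupt would be another variant.
> {code:title=Sketch of a possible remedy (illustrative, not the actual patch)}
>   public void add(DomainSocket sock, Handler handler) {
>     lock.lock();
>     boolean interrupted = false;
>     try {
>       ......
>       toAdd.add(entry);
>       kick();
>       while (true) {
>         try {
>           processedCond.await();
>         } catch (InterruptedException e) {
>           interrupted = true;   // remember it, but do not re-interrupt yet
>         }
>         if (!toAdd.contains(entry)) {
>           break;
>         }
>       }
>     } finally {
>       lock.unlock();
>       if (interrupted) {
>         Thread.currentThread().interrupt();   // restore status for callers
>       }
>     }
>   }
> {code}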
> Thanks [~jdere] for testing and reporting this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

