hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-18628) ZKPermissionWatcher blocks all ZK notifications
Date Tue, 22 Aug 2017 01:07:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-18628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136118#comment-16136118
] 

Hadoop QA commented on HBASE-18628:
-----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 22s{color} | {color:blue}
Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  0s{color}
| {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} |
{color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  0s{color} | {color:red}
The patch doesn't appear to include any new or modified tests. Please justify why no new tests
are needed for this patch. Also please list what manual steps were performed to verify this
patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m 28s{color}
| {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 37s{color} |
{color:green} branch-1 passed with JDK v1.8.0_144 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 41s{color} |
{color:green} branch-1 passed with JDK v1.7.0_131 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m  8s{color}
| {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 26s{color}
| {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 15s{color} |
{color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 44s{color} |
{color:green} branch-1 passed with JDK v1.8.0_144 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 41s{color} |
{color:green} branch-1 passed with JDK v1.7.0_131 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 49s{color}
| {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 40s{color} |
{color:green} the patch passed with JDK v1.8.0_144 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 40s{color} | {color:green}
the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 41s{color} |
{color:green} the patch passed with JDK v1.7.0_131 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 41s{color} | {color:green}
the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m  3s{color}
| {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 20s{color}
| {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color}
| {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 18m 26s{color}
| {color:green} The patch does not cause any errors with Hadoop 2.4.0 2.4.1 2.5.0 2.5.1 2.5.2
2.6.1 2.6.2 2.6.3 2.7.1. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 38s{color} |
{color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 37s{color} |
{color:green} the patch passed with JDK v1.8.0_144 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 36s{color} |
{color:green} the patch passed with JDK v1.7.0_131 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}113m 46s{color} | {color:red}
hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 29s{color}
| {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}155m  9s{color} | {color:black}
{color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.client.TestClientScannerRPCTimeout |
|   | hadoop.hbase.security.access.TestAccessControlFilter |
|   | hadoop.hbase.security.access.TestTablePermissions |
|   | hadoop.hbase.client.TestReplicasClient |
|   | hadoop.hbase.master.TestMasterFailover |
| Timed out junit tests | org.apache.hadoop.hbase.regionserver.TestHRegion |
|   | org.apache.hadoop.hbase.security.access.TestScanEarlyTermination |
|   | org.apache.hadoop.hbase.mapreduce.TestTableSnapshotInputFormat |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.13.1 Server=1.13.1 Image:yetus/hbase:6f1cc2c |
| JIRA Issue | HBASE-18628 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12882989/HBASE-18628.branch-1.v5.patch
|
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  hadoopcheck  hbaseanti  checkstyle
 compile  |
| uname | Linux 0912d6c1b47f 3.13.0-117-generic #164-Ubuntu SMP Fri Apr 7 11:05:26 UTC 2017
x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/hbase.sh |
| git revision | branch-1 / 19fffac |
| Default Java | 1.7.0_131 |
| Multi-JDK versions |  /usr/lib/jvm/java-8-oracle:1.8.0_144 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_131
|
| findbugs | v3.0.0 |
| unit | https://builds.apache.org/job/PreCommit-HBASE-Build/8210/artifact/patchprocess/patch-unit-hbase-server.txt
|
|  Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/8210/testReport/ |
| modules | C: hbase-server U: hbase-server |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/8210/console |
| Powered by | Apache Yetus 0.4.0   http://yetus.apache.org |


This message was automatically generated.



> ZKPermissionWatcher blocks all ZK notifications
> -----------------------------------------------
>
>                 Key: HBASE-18628
>                 URL: https://issues.apache.org/jira/browse/HBASE-18628
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>            Reporter: Mike Drob
>            Assignee: Mike Drob
>            Priority: Critical
>             Fix For: 3.0.0, 2.0.0-alpha-3
>
>         Attachments: HBASE-18628.branch-1.v5.patch, HBASE-18628.patch, HBASE-18628.v2.patch,
HBASE-18628.v3.patch, HBASE-18628.v4.patch, HBASE-18628.v5.patch, jstack
>
>
> Buckle up folks, we're going for a ride here. I've seeing this on a branch-2 based build,
but I think the problem will affect branch-1 as well. I'm not able to easily reproduce the
issue, but it will usually come up within an hour on a given cluster that I have, at which
point the problem persists until an RS restart. I've been seeing the problem and paying attention
for maybe two months, but I suspect it's been happening much longer than that.
> h3. Problem
> When running in a secure cluster, sometimes the ZK EventThread will get stuck on a permissions
update and not be able to process new notifications. This happens to also block flush and
snapshot, which is how we found it.
> h3. Analysis
> The main smoking gun is seeing this in repeated jstacks:
> {noformat}
> "main-EventThread" #43 daemon prio=5 os_prio=0 tid=0x00007f0b92644000 nid=0x6e69 waiting
on condition [0x00007f0b6730f000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.hbase.security.access.ZKPermissionWatcher.nodeChildrenChanged(ZKPermissionWatcher.java:191)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:503)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> {noformat}
> That sleep is a 20ms sleep in an {{AtomicReference.compareAndSet}} loop - but it never
gets past the condition.
> {code}
>         while (!nodes.compareAndSet(null, nodeList)) {
>           try {
>             Thread.sleep(20);
>           } catch (InterruptedException e) {
>             LOG.warn("Interrupted while setting node list", e);
>             Thread.currentThread().interrupt();
>           }
>         }
> {code}
> The warning never shows up in the logs, it just keeps looping and looping. The last relevant
line from the watcher in logs is:
> {noformat}
> 2017-08-17 21:25:12,379 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:22101-0x15df38884c80024,
quorum=zk1:2181,zk2:2181,zk3:2181, baseZNode=/hbase Received ZooKeeper Event, type=NodeChildrenChanged,
state=SyncConnected, path=/hbase/acl
> {noformat}
> Which makes sense, because the code snippet is from permission watcher's {{nodeChildrenChanged}}
handler.
> The separate thread introduced in HBASE-14370 is present, but not doing anything. And
this event hasn't gotten to the part where it splits off into a thread:
> {noformat}
> "zk-permission-watcher4-thread-1" #160 daemon prio=5 os_prio=0 tid=0x0000000001750800
nid=0x6fd9 waiting on condition [0x00007f0b5dce5000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00000007436ecea0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> h3. Solutions
> There's a few approaches we can take to fix this, I think they are all complimentary.
It might be useful to file subtasks or new issues for some of the solutions if they are longer
term.
> # Move flush and snapshot to ProcedureV2. This makes my proximate problem go away, but
it's only relevant to branch-2 and master, and doesn't fix anything on branch-1. Also, Permissions
updates would still get stuck, preventing future permissions updates. I think this is important
long term for the robustness of the system, but not a viable short term fix.
> # Add an Executor to ZookeeperWatcher and launch threads from there. Maybe we'd want
to pull the Executor out of ZKPW, but that's not strictly necessary and can be optimized later
-- if we're already threading, then adding another layer isn't a huge cost.
> # Figure out the race condition or logic problem that causes {{nodes}} to be non-null
above. I've tried looking at this and visual inspection isn't getting me anywhere.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message