Date: Mon, 14 Mar 2016 16:05:33 +0000 (UTC)
From: "Nate McCall (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-11340) Heavy read activity on system_auth tables can cause apparent livelock

    [ https://issues.apache.org/jira/browse/CASSANDRA-11340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193529#comment-15193529 ]

Nate McCall commented on CASSANDRA-11340:
-----------------------------------------

[~jjirsa] I think the barrage of RRs triggered on a big cluster backing up the SimpleCondition's queue is the main issue.

> Heavy read activity on system_auth tables can cause apparent livelock
> ---------------------------------------------------------------------
>
>                 Key: CASSANDRA-11340
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11340
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jeff Jirsa
>            Assignee: Aleksey Yeschenko
>
> Reproduced in at least 2.1.9.
> It appears possible for queries against system_auth tables to trigger speculative retry, which causes auth to block on traffic going off node. In some cases, it appears possible for threads to become deadlocked, causing load on the nodes to increase sharply. This happens even in clusters with RF of system_auth == N, as all requests being served locally puts the bar for 99% SR pretty low.
> Incomplete stack trace below, but we haven't yet figured out what exactly is blocking:
> {code}
> Thread 82291: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
>  - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 (Compiled frame)
>  - org.apache.cassandra.utils.concurrent.WaitQueue$AbstractSignal.awaitUntil(long) @bci=28, line=307 (Compiled frame)
>  - org.apache.cassandra.utils.concurrent.SimpleCondition.await(long, java.util.concurrent.TimeUnit) @bci=76, line=63 (Compiled frame)
>  - org.apache.cassandra.service.ReadCallback.await(long, java.util.concurrent.TimeUnit) @bci=25, line=92 (Compiled frame)
>  - org.apache.cassandra.service.AbstractReadExecutor$SpeculatingReadExecutor.maybeTryAdditionalReplicas() @bci=39, line=281 (Compiled frame)
>  - org.apache.cassandra.service.StorageProxy.fetchRows(java.util.List, org.apache.cassandra.db.ConsistencyLevel) @bci=175, line=1338 (Compiled frame)
>  - org.apache.cassandra.service.StorageProxy.readRegular(java.util.List, org.apache.cassandra.db.ConsistencyLevel) @bci=9, line=1274 (Compiled frame)
>  - org.apache.cassandra.service.StorageProxy.read(java.util.List, org.apache.cassandra.db.ConsistencyLevel, org.apache.cassandra.service.ClientState) @bci=57, line=1199 (Compiled frame)
>  - org.apache.cassandra.cql3.statements.SelectStatement.execute(org.apache.cassandra.service.pager.Pageable, org.apache.cassandra.cql3.QueryOptions, int, long, org.apache.cassandra.service.QueryState) @bci=35, line=272 (Compiled frame)
>  - org.apache.cassandra.cql3.statements.SelectStatement.execute(org.apache.cassandra.service.QueryState, org.apache.cassandra.cql3.QueryOptions) @bci=105, line=224 (Compiled frame)
>  - org.apache.cassandra.auth.Auth.selectUser(java.lang.String) @bci=27, line=265 (Compiled frame)
>  - org.apache.cassandra.auth.Auth.isExistingUser(java.lang.String) @bci=1, line=86 (Compiled frame)
>  - org.apache.cassandra.service.ClientState.login(org.apache.cassandra.auth.AuthenticatedUser) @bci=11, line=206 (Compiled frame)
>  - org.apache.cassandra.transport.messages.AuthResponse.execute(org.apache.cassandra.service.QueryState) @bci=58, line=82 (Compiled frame)
>  - org.apache.cassandra.transport.Message$Dispatcher.channelRead0(io.netty.channel.ChannelHandlerContext, org.apache.cassandra.transport.Message$Request) @bci=75, line=439 (Compiled frame)
>  - org.apache.cassandra.transport.Message$Dispatcher.channelRead0(io.netty.channel.ChannelHandlerContext, java.lang.Object) @bci=6, line=335 (Compiled frame)
>  - io.netty.channel.SimpleChannelInboundHandler.channelRead(io.netty.channel.ChannelHandlerContext, java.lang.Object) @bci=17, line=105 (Compiled frame)
>  - io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(java.lang.Object) @bci=9, line=333 (Compiled frame)
>  - io.netty.channel.AbstractChannelHandlerContext.access$700(io.netty.channel.AbstractChannelHandlerContext, java.lang.Object) @bci=2, line=32 (Compiled frame)
>  - io.netty.channel.AbstractChannelHandlerContext$8.run() @bci=8, line=324 (Compiled frame)
>  - java.util.concurrent.Executors$RunnableAdapter.call() @bci=4, line=511 (Compiled frame)
>  - org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run() @bci=5, line=164 (Compiled frame)
>  - org.apache.cassandra.concurrent.SEPWorker.run() @bci=87, line=105 (Interpreted frame)
>  - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
> {code}
> In a cluster with many connected clients (potentially thousands), a reconnection flood (for example, restarting all at once) is likely to trigger this bug. However, it is unlikely to be seen in normal operation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
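
The description's remark that serving every request locally "puts the bar for 99% SR pretty low" can be illustrated with a small sketch. This is not Cassandra's implementation: the class, the `percentile` and `shouldSpeculate` helpers, and the latency figures are all hypothetical, and the nearest-rank percentile is only an assumed stand-in for however the threshold is really computed. It shows only that when recorded latencies are uniformly small, the 99th-percentile threshold is small too, so a read that is merely a little slow already exceeds it.

```java
import java.util.Arrays;

public class SpeculativeRetrySketch {

    // Hypothetical helper: nearest-rank percentile over recorded read latencies (microseconds).
    static long percentile(long[] latenciesMicros, double pct) {
        long[] sorted = latenciesMicros.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    // Hypothetical decision rule: speculate once a read has waited longer than the threshold.
    static boolean shouldSpeculate(long elapsedMicros, long thresholdMicros) {
        return elapsedMicros > thresholdMicros;
    }

    public static void main(String[] args) {
        // Assumed history: 1000 reads, all served locally in about 200 microseconds.
        long[] localReads = new long[1000];
        Arrays.fill(localReads, 200L);

        long threshold = percentile(localReads, 99.0);
        System.out.println("p99 threshold (microseconds): " + threshold);

        // A read delayed to 5 ms during a reconnection flood is well past that threshold.
        System.out.println("speculate on a 5 ms read: " + shouldSpeculate(5_000L, threshold));
    }
}
```

Under these assumptions, a burst of logins that slows local reads even slightly would push many of them past the threshold at once, which is consistent with the reconnection-flood scenario in the description.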